0 votes
in Azure by (17.6k points)

I'm building an Azure data lake using Data Factory at the moment, and I'm after some advice on having multiple data factories versus just one.

I have one data factory at the moment, that is sourcing data from one EBS instance, for one specific company under an enterprise. In the future though there might be other EBS instances, and other companies (with other applications as sources) to incorporate into the factory - and I'm thinking the diagram might get a bit messy.

I've searched around and found this site, which recommends keeping everything in a single data factory so that linked services can be reused. I guess that is a good thing; however, since I have scripted the build for one data factory, it would be pretty easy to build the linked services again to point at the same data lake, for instance.
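For context, re-creating a linked service in a second factory is mostly a matter of redeploying the same JSON definition. A minimal sketch of an ADLS Gen2 linked service follows; the name and URL are placeholders, and a real definition would also need a credential (e.g. managed identity or a service principal):

```json
{
    "name": "LS_DataLake",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://mydatalake.dfs.core.windows.net"
        }
    }
}
```

Because the definition is just JSON, the same file can be deployed to any number of factories, which weakens the "reuse linked services" argument for a single factory.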

Pros for having only one instance of data factory:

  • Have to only create the data sets, linked services once
  • Can see overall lineage in one diagram


Cons for having only one instance of data factory:

  • Could get messy over time
  • Could get quite big, making it hard to even find the pipeline you are after

Has anyone got some large deployments of Azure Data Factories out there, that bring in potentially thousands of data sources, mix them together and transform? Would be interested in hearing your thoughts.

1 Answer

0 votes
by (47.2k points)
  • Having one Data Factory makes it easier to configure multiple integration runtimes.

  • If we have more than one, we need to consider that a machine can have only one self-hosted integration runtime installed, and that runtime can be registered to only one data factory instance.

  • It depends on the complexity of the data factory and inter-dependencies between the various sources and destinations.

  • The UI, particularly in V2, makes managing a large data factory easy.

  • Both of those concerns can be addressed with naming conventions. It is not messy to find the pipeline you want if you name them like Pipeline_[DatabaseName]_[SchemaName]_[TableName], for example.

  • Even a project with thousands of datasets and pipelines is manageable with that approach, so smaller projects are not a big deal to handle.
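The naming convention suggested above can be sketched as a tiny helper. This is purely illustrative: the underscore-separated pattern and the example names are assumptions, not anything Azure Data Factory requires:

```python
def pipeline_name(database: str, schema: str, table: str) -> str:
    """Build a predictable pipeline name so large factories stay searchable.

    Follows the Pipeline_[DatabaseName]_[SchemaName]_[TableName] convention.
    """
    return f"Pipeline_{database}_{schema}_{table}"


# With a convention like this, finding the right pipeline among thousands
# is a substring search rather than a hunt through a diagram.
print(pipeline_name("EBS1", "AR", "CUSTOMERS"))  # Pipeline_EBS1_AR_CUSTOMERS
```

A deployment script can also use the same helper when generating pipelines, so the names in the factory and in source control never drift apart.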
