Today’s information technology landscape is heavily dependent on data coming in from different sources. This data can be structured or unstructured, stored on-premises or in the cloud, and it requires processing to organize it and make it usable. Bringing all of this data together into a uniform data pipeline is a herculean and costly task. This is where Azure Data Factory comes into the picture.
In this blog, we will take a closer look at Azure Data Factory.
What is Azure Data Factory (ADF)?
Data Factory in Azure is a data integration system that allows users to move data between on-premises and cloud systems, as well as schedule data flows.
Conventionally, SQL Server Integration Services (SSIS) is used for integrating data from databases stored in on-premises infrastructure, but it cannot handle data in the cloud. Azure Data Factory, by contrast, works with both cloud and on-premises data and has superior job scheduling features, which makes it a better fit than SSIS for hybrid scenarios.
Microsoft Azure created this platform to enable users to construct workflows that can import data from both on-premises and cloud data stores, and transform and process that data using existing compute services such as Hadoop. The results can then be published to an on-premises or cloud data store for consumption by Business Intelligence (BI) applications.
To know more about the Azure Data Science Certification, check out our blog on the DP-100 Certification preparation guide.
Why Azure Data Factory?
The most commonly used tool for on-premises data integration is SSIS, but there are challenges to overcome when dealing with data in the cloud. Azure Data Factory tackles the challenges of moving data to or from the cloud in the following ways:
- Job scheduling and orchestration: Few services can trigger and orchestrate data integration in the cloud. Although services like Azure Scheduler, Azure Automation, and SQL VM are available for data movement, the job scheduling capabilities of Azure Data Factory are superior to them.
- Security: Every piece of data in transit between cloud and on-premises is always automatically encrypted by Azure Data Factory.
- Continuous integration and delivery: The Azure Data Factory integration with GitHub allows you to develop, build, and deploy to Azure effortlessly.
- Scalability: Azure Data Factory was designed to be capable of handling large volumes of data.
How does it work?
Azure Data Factory can connect to all of the data and processing sources you’ll need, including SaaS services, file shares, and other online services. You can use the Data Factory service to design data pipelines that move data and then schedule them to run at specific intervals. This means you can choose between a scheduled or one-time pipeline run.
Copy Activity in a data pipeline can be used to move data from both on-premises and cloud sources to a centralized data store, in the cloud or on-premises, for further analysis and processing.
Once the data is stored in a centralized location, it is transformed using services such as HDInsight Hadoop, Azure Data Lake Analytics, and Machine Learning.
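The move-then-transform flow described above is defined in ADF as a pipeline containing a Copy Activity. As a sketch, the Python snippet below builds the JSON for a minimal copy pipeline; the dataset names (`InputBlobDataset`, `OutputSqlDataset`) and the pipeline name are hypothetical placeholders, not part of any real factory:

```python
import json

# A minimal ADF pipeline definition with a single Copy Activity.
# "InputBlobDataset" and "OutputSqlDataset" are hypothetical dataset
# names that would already have to exist in the factory.
pipeline = {
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                # Input and output datasets are referenced by name.
                "inputs": [
                    {"referenceName": "InputBlobDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "OutputSqlDataset", "type": "DatasetReference"}
                ],
                # Source and sink types depend on the underlying data stores.
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The resulting JSON could then be pasted into the portal’s pipeline editor or deployed with the tooling of your choice.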
Go through our set of Azure Data Factory Interview Questions to crack your interview.
Key Azure Data Factory Components
Understanding the key components of Azure Data Factory is important in understanding how it works. They are:
- Datasets: Datasets hold data source configuration at a finer level than linked services: a table name or file name, and optionally a structure (schema), can all be found in a dataset. Each dataset refers to a specific linked service, which determines the set of attributes the dataset can have.
- Activities: Data transfer, transformations, and control flow operations are all examples of activities. Activity configurations contain settings such as a database query, a stored procedure name, arguments, a script location, and so on. An activity can take one or more input datasets and produce one or more output datasets.
- Linked Services: Linked services store the configuration parameters for specific data sources, such as the server/database name, file folder, credentials, and so on. Each data flow may use one or more linked services, depending on the nature of the job.
- Pipelines: Pipelines are logical groupings of activities. Each pipeline in a data factory can contain one or more activities. Pipelines make it much easier to schedule and monitor several logically related activities together.
- Triggers: Triggers are pipeline scheduling configurations that contain configuration settings such as start/end dates, execution frequency, and so on. Triggers aren’t required for ADF implementation; they’re only required if you want pipelines to run automatically and on a set schedule.
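To make these components concrete, here is a sketch of how a linked service, a dataset, and a schedule trigger are typically expressed in ADF’s JSON format. All names, the connection string, and the referenced `CopyPipeline` are illustrative placeholders:

```python
import json

# Linked service: connection details for a data store (placeholder values).
linked_service = {
    "name": "MyStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<storage-connection-string>"},
    },
}

# Dataset: a finer-grained pointer (folder/file) on top of that linked service.
dataset = {
    "name": "InputBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"folderPath": "input-data/", "fileName": "sales.csv"},
    },
}

# Trigger: runs a (hypothetical) pipeline named "CopyPipeline" once a day.
trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyPipeline",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}

for entity in (linked_service, dataset, trigger):
    print(entity["name"], "->", entity["properties"]["type"])
```

Note how each layer references the one below it by name: the dataset points at the linked service, and the trigger points at a pipeline.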
Want to learn concepts such as Flow process, Data lake, Analytics & Loading Data to Power BI, then go through our Azure Data Factory tutorial.
Creating a Data Factory
Before building a new Data Factory to orchestrate the data copying and transformation, make sure that you have an Azure subscription and are signed in with a user account that has the Contributor, Owner, or Administrator role on that subscription.
Open the Microsoft Azure Portal in your web browser, sign in with an authorized user account, then search for Data Factory in the portal search box and select the Data Factories option.
To create a new data factory, click the + Create option in the Data Factories window.
From the Basics tab of the Create Data Factory window, provide the subscription you prefer for the service, then select an existing resource group or create a new one. Choose the Azure region nearest to you to host the ADF, provide a unique name for the Data Factory, and choose whether to create a V1 or V2 data factory.
The setup will then ask you, in the Git Configuration tab, to configure a repository for your Data Factory CI/CD process, which lets you promote changes between the Development and Production environments. You can configure Git during ADF creation or postpone it until later.
From the Networking tab of the Create Data Factory window, decide whether to use a Managed VNET for the ADF and which type of endpoint will be used for the Data Factory connection.
After specifying the Data Factory network options, click the Review + Create option to review your selections before creating the Data Factory.
After you’ve double-checked your choices, click the Create button to begin creating the Data Factory. You can monitor the progress from the Notifications button of the Azure Portal, and a new window is displayed once the Data Factory is created successfully.
To open the newly created Data Factory, click the Go to resource option in that window. Under the Overview pane, you’ll see that the new Data Factory has been built, and you can review its key information, the Azure Data Factory documentation, and the pipelines and activities summary.
You can also check the Activity Log for different activities performed on the Data Factory, control ADF permissions under Access Control, diagnose and solve problems under Diagnose and Solve Problems, configure ADF networking, lock the ADF to prevent changes or deletion of the ADF resource, and perform other monitoring, automation, and troubleshooting options.
Get certified in Microsoft Azure with this course: Microsoft Azure Training Course for Azure Administrator Certification
Moving Data with the Data Copy Wizard

The most straightforward way to begin transferring data is to use the Data Copy Wizard. It lets you easily build a data pipeline that transfers data from a supported source data store to a supported destination data store.
In addition to using the Data Copy Wizard, you can customize your activities by manually constructing each of the major components. Data Factory entities are in JSON format, so you can author these files in your favorite editor and then copy them into the Azure portal. The input and output datasets and the pipelines for migrating data can therefore all be created in JSON.
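Because every Data Factory entity shares the same outer JSON shape, a quick local sanity check before pasting a hand-written file into the portal can catch typos early. The sketch below checks only the common `name`/`properties` envelope; the sample entity is a made-up dataset used purely as example input:

```python
import json

def validate_entity(text: str) -> list:
    """Return a list of problems with an ADF entity JSON document.

    Only the common envelope shared by all entities is checked:
    a string 'name' and an object 'properties'.
    """
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top-level value is not an object"]
    problems = []
    if not isinstance(doc.get("name"), str):
        problems.append("missing or non-string 'name'")
    if not isinstance(doc.get("properties"), dict):
        problems.append("missing 'properties' object")
    return problems

# A made-up dataset definition used purely as an example.
sample = '{"name": "OutputSqlDataset", "properties": {"type": "AzureSqlTable"}}'
print(validate_entity(sample))  # an empty list means the envelope looks fine
```

This does not replace the validation the portal performs on deployment; it only catches structural slips before you copy the file over.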
Data Factory is a great tool for rapidly moving data to the cloud, and it makes data integration easy both on-premises and in the cloud.
We hope this Azure Data Factory overview clarified the concepts. If you have more questions, reach out to us at our Azure Community.