Today’s information technology landscape is heavily dependent on data. Data arriving from many different sources must be processed to organize it and make it usable. This data can be structured or unstructured, and it may be stored on-premises or in the cloud. Building a uniform pipeline to process all of this data is a herculean and costly task. This is where Azure Data Factory comes into the picture.
In this blog, we will find out more about Azure Data Factory.
What is Azure Data Factory (ADF)?
Data Factory in Azure is a cloud-based data integration service that allows users to move data between on-premises and cloud systems, as well as schedule and orchestrate complex data flows. Conventionally, SQL Server Integration Services (SSIS) is used for data integration from databases stored in on-premises infrastructure, but it cannot handle data in the cloud. Azure Data Factory, however, works both in the cloud and on-premises, and its superior job scheduling, security, and scalability features make it a better choice than SSIS.
Microsoft created this platform to enable users to construct workflows that can ingest data from both on-premises and cloud data stores, and to transform and process that data using existing compute services such as Hadoop. The results can then be published to an on-premises or cloud data store for consumption by business intelligence (BI) applications.
To know more about Azure Data Science certification, check out our blog on the DP-100 Certification preparation guide.
Why Azure Data Factory?
The most commonly used tool for on-premises data integration is SSIS, but some challenges have to be overcome when dealing with data in the cloud. Azure Data Factory can tackle these challenges faced while moving data to or from the cloud:
- Job scheduling and orchestration: Few services can trigger and orchestrate data integration in the cloud. Although services such as Azure Scheduler, Azure Automation, and SQL VM are available for data movement, Azure Data Factory’s job scheduling capabilities are superior to them.
- Security: Azure Data Factory automatically encrypts all data in transit between cloud and on-premises systems.
- Continuous integration and delivery: The Azure Data Factory integration with GitHub allows you to develop, build, and deploy to Azure effortlessly.
- Scalability: Azure Data Factory was designed to be capable of handling large volumes of data.
How does it work?
You can use the Data Factory service to design data pipelines that move and transform data and then schedule them to run at specified intervals. Pipelines consume and produce time-sliced data, and you can choose between a scheduled or a one-time pipeline mode.
The workflows in Azure Data Factory can connect to all of the data and processing sources you’ll need, including SaaS services, file shares, FTP, and web services. The Copy Activity in a data pipeline can be used to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis or later processing. Once the data is in a consolidated cloud data store, it is transformed using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. Azure Data Factory then lets you deliver the transformed data to on-premises sources such as SQL Server, or keep it in your cloud storage for consumption by BI tools, analytics tools, and other applications.
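To make the Copy Activity concrete, here is a minimal sketch of what a Copy Activity pipeline definition looks like in Data Factory’s JSON format, built as a Python dictionary. All names here (the pipeline, dataset references, and source/sink types) are illustrative, not taken from a real deployment:

```python
import json

# Hypothetical Copy Activity pipeline: moves data from a blob dataset
# to a SQL dataset. All names are illustrative.
copy_pipeline = {
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                # Input/output datasets are referenced by name.
                "inputs": [{"referenceName": "InputBlobDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OutputSqlDataset",
                             "type": "DatasetReference"}],
                # Source and sink describe how data is read and written.
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

print(json.dumps(copy_pipeline, indent=2))
```

The key idea is that the activity itself only references datasets by name; the datasets (and the linked services behind them) carry the actual connection details.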
Key Components of Azure Data Factory
Knowing the key components of Azure Data Factory is important for understanding how it works. They are:
- Datasets: Datasets contain data source configuration parameters at a finer level, such as a table name or file name and a structure. Each dataset is linked to a specific linked service, which determines the set of possible dataset properties.
- Activities: Data transfers, transformations, and control flow operations are all examples of activities. Activity configurations can contain settings such as a database query, a stored procedure name, arguments, and a script location. An activity can take one or more input datasets and produce one or more output datasets.
- Linked Services: Linked services store configuration parameters for specific data sources, such as the server/database name, file folder, and credentials. Each data flow may use one or more linked services, depending on the nature of the job.
- Pipelines: Pipelines are logical groupings of activities. Each pipeline in a data factory can contain one or more activities. Pipelines make scheduling and monitoring several logically related activities much easier.
- Triggers: Triggers are pipeline scheduling configurations that contain settings such as start/end dates and execution frequency. Triggers aren’t required for an ADF implementation; they’re only needed if you want pipelines to run automatically on a set schedule.
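The components above form a reference chain: a trigger schedules a pipeline, a pipeline’s activities operate on datasets, and each dataset points at a linked service. The sketch below shows that chain using JSON-style definitions as Python dictionaries; every name and connection string placeholder is hypothetical:

```python
import json

# A linked service stores connection details for a data store.
linked_service = {
    "name": "MyStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<storage-connection-string>"},
    },
}

# A dataset points at data within the store the linked service defines.
dataset = {
    "name": "InputBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {"referenceName": "MyStorageLinkedService",
                              "type": "LinkedServiceReference"},
        "typeProperties": {"folderPath": "input/", "fileName": "data.csv"},
    },
}

# A pipeline groups activities that consume the dataset.
pipeline = {
    "name": "DemoPipeline",
    "properties": {
        "activities": [{
            "name": "CopyInputData",
            "type": "Copy",
            "inputs": [{"referenceName": "InputBlobDataset",
                        "type": "DatasetReference"}],
        }],
    },
}

# A trigger (optional) runs the pipeline on a schedule, e.g. daily.
trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {"recurrence": {"frequency": "Day", "interval": 1}},
        "pipelines": [{"pipelineReference": {"referenceName": "DemoPipeline",
                                             "type": "PipelineReference"}}],
    },
}

# Verify the chain: trigger -> pipeline -> dataset -> linked service.
assert dataset["properties"]["linkedServiceName"]["referenceName"] == linked_service["name"]
assert pipeline["properties"]["activities"][0]["inputs"][0]["referenceName"] == dataset["name"]
print(json.dumps([e["name"] for e in (linked_service, dataset, pipeline, trigger)]))
```

Notice that each component references the next only by name, which is what lets you reuse one linked service across many datasets and one dataset across many pipelines.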
Want to learn concepts such as Flow process, Data lake, Analytics & Loading Data to Power BI, then go through our Azure Data Factory tutorial.
Creating a Data Factory
Before building a new Data Factory that will orchestrate data copying and transformation, make sure that you have an Azure subscription and are signed in with a user account that is a member of the Contributor, Owner, or Administrator role on that subscription.
Open the Microsoft Azure Portal in your web browser, log in with an authorized user account, then search for Data Factory in the portal search panel and select the Data Factories option, as shown below:
To create a new data factory, click the + Create option in the Data Factories window, as shown below:
From the Basics tab of the Create Data Factory window, provide the Subscription under which the Azure Data Factory will be created, an existing or new Resource Group where the ADF will be created, the Azure region nearest to you to host the ADF, a unique and indicative name for the Data Factory, and whether to create a V1 or V2 data factory, as shown:
In the Git Configuration tab of the Create Data Factory window, you will be asked to configure a repository for your Data Factory CI/CD process, which helps you incrementally move changes between the Development and Production environments. You can choose to configure Git during the ADF creation or later, after the Data Factory is created.
From the Networking tab of the Create Data Factory window, you must decide whether to use a Managed VNET for the ADF and which type of endpoint will be used for the Data Factory connection, as shown below:
Click the Review + Create option after specifying the Data Factory network options to review the selected options before creating the Data Factory, as illustrated below:
After you’ve double-checked your choices, click the Create button to begin creating the Data Factory. You can monitor the progress of the Data Factory creation from the Notifications button of the Azure Portal, and a new window will be displayed once the Data Factory is created successfully, as shown below:
To open the newly created Data Factory, click the Go to Resource option in that window. Under the Overview pane, you’ll see that the new Data Factory has been built, and you’ll be able to review the Data Factory’s important information, the Azure Data Factory documentation, and the pipeline and activity summary.
You can also check the Activity Log for different activities performed on the Data Factory, control ADF permissions under Access Control, diagnose and solve problems under Diagnose and Solve Problems, configure ADF networking, lock the ADF to prevent changes or deletion of the ADF resource, and perform other monitoring, automation, and troubleshooting options.
Get certified in Microsoft Azure with this course: Microsoft Azure Training Course for Azure Administrator Certification
The most straightforward way to begin transferring data is to use the Data Copy Wizard. It lets you easily build a data pipeline that transfers data from a supported source data store to a supported destination data store.
In addition to using the Data Copy Wizard, you may customise your activities by manually constructing each of the major components. Data Factory entities (linked services, datasets, and pipelines) are defined in JSON format, so you may build these files in your favourite editor and then copy them to the Azure portal (by selecting Author and deploy) or continue in the Data Factory project. The input and output datasets and the pipelines can thus be created in JSON for migrating data.
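Since these entities are plain JSON files, they can be generated and validated with a short script before pasting them into the portal. The sketch below writes out a hypothetical input dataset, output dataset, and copy pipeline; all entity names, folder paths, and table names are illustrative:

```python
import json
from pathlib import Path

# Hypothetical entity definitions; every name and path is illustrative.
entities = {
    "InputDataset.json": {
        "name": "InputDataset",
        "properties": {"type": "AzureBlob",
                       "typeProperties": {"folderPath": "raw/"}},
    },
    "OutputDataset.json": {
        "name": "OutputDataset",
        "properties": {"type": "AzureSqlTable",
                       "typeProperties": {"tableName": "ProcessedData"}},
    },
    "CopyPipeline.json": {
        "name": "CopyPipeline",
        "properties": {"activities": [{
            "name": "CopyRawToSql",
            "type": "Copy",
            "inputs": [{"referenceName": "InputDataset",
                        "type": "DatasetReference"}],
            "outputs": [{"referenceName": "OutputDataset",
                         "type": "DatasetReference"}],
        }]},
    },
}

# Write each entity to its own JSON file, ready to paste into the portal.
out_dir = Path("adf_entities")
out_dir.mkdir(exist_ok=True)
for filename, body in entities.items():
    (out_dir / filename).write_text(json.dumps(body, indent=2))

# Read one back to confirm the files round-trip as valid JSON.
reloaded = json.loads((out_dir / "CopyPipeline.json").read_text())
print(reloaded["name"])
```

Keeping the definitions as files like this also fits the Git-backed CI/CD setup described earlier, since each entity can be versioned and reviewed like any other source file.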
Go through our set of Azure Data Factory Interview Questions to crack your interview.
Data Factory is a great tool for rapidly moving data to the cloud. It provides easy data integration both on-premises and in the cloud.
If you have any queries regarding Microsoft Azure, reach out to us in our Azure Community.