
Azure Data Factory Tutorial


What is Azure Data Factory?

Data generated by various applications and products is increasing exponentially day by day. Because this data comes from many different sources, it is very difficult to manage.


To analyze and store all this data, we can use Data Factory which:

  • Stores data with the help of Azure Data Lake Storage
  • Analyzes the data
  • Transforms the data with the help of pipelines (a logical grouping of activities that together perform a task)
  • Publishes the organized data
  • Visualizes the data with third-party applications like Apache Spark or Hadoop.


Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. We can use Azure Data Factory to create and schedule data-driven workflows that ingest data from various data stores. It can process and transform the data by using compute services such as Azure Data Lake Analytics, Azure Machine Learning, and Azure HDInsight Hadoop. We can publish the output data to data stores such as Azure Data Lake, where Business Intelligence (BI) applications can perform visualization or analytics. For better business decisions, we can organize the raw data into meaningful data stores.
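
For readers who prefer code to the portal, the following is a minimal sketch of creating a data factory with the azure-mgmt-datafactory Python SDK. The subscription ID, resource group, region, and service principal values are placeholders (they are not from this tutorial), and exact constructor and method names can differ slightly between SDK versions.

# Minimal sketch: create a data factory with the Python SDK
# (assumes: pip install azure-identity azure-mgmt-datafactory)
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values; replace with your own subscription and app registration
subscription_id = "<subscription-id>"
resource_group = "adf-tutorial-rg"
factory_name = "adf-tutorial-factory"

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-id>",
    client_secret="<service-principal-key>",
)
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the data factory in the chosen region
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.name, factory.provisioning_state)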

Interested in learning Azure Data Factory? Click here to learn more about this Azure Data Factory Training.

Flow Process of Data Factory

In this Azure Data Factory tutorial for beginners, we will now discuss the working process of Azure Data Factory. The Data Factory service allows us to create pipelines that help us move and transform data and then run those pipelines on a specified schedule, which can be daily, hourly, or weekly. The data consumed and produced by workflows is time-sliced, and we can specify the pipeline mode as scheduled or one-time.


Input dataset: It is the data we have within our data store, which needs to be processed and then passed through a pipeline.

Pipeline: A pipeline operates on the data to transform it. The work it performs can be anything from simple data movement to more involved transformation.

Data transformation is possible with the help of USQL, stored procedures, or Hive.

Output dataset: It contains data in a structured form, because the data has already been transformed and structured in the pipeline. It is then handed to linked services such as Azure Data Lake, Blob storage, or SQL.

Linked services: These store the connection information that Data Factory needs to connect to external sources.

For example, consider SQL Server: we need a connection string to connect to that external data source, and we need to specify the source and the destination of our data.
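
To make linked services and datasets concrete, here is a hedged sketch that continues the snippet above (same adf_client, resource_group, and factory_name) and registers an Azure SQL Database linked service from a connection string, plus a dataset that exposes one table through it. The connection string, table name, and resource names are placeholder assumptions, and constructor arguments may differ slightly across SDK versions.

from azure.mgmt.datafactory.models import (
    AzureSqlDatabaseLinkedService, AzureSqlTableDataset,
    DatasetResource, LinkedServiceReference, LinkedServiceResource, SecureString,
)

# Linked service: holds the connection information for the external store
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(
            value="Server=tcp:<server>.database.windows.net,1433;"
                  "Database=<db>;User ID=<user>;Password=<password>;"
        )
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "AzureSqlLinkedService", sql_ls
)

# Dataset: a named view of data (here, one table) exposed through the linked service
student_ds = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="AzureSqlLinkedService", type="LinkedServiceReference"
        ),
        table_name="student",
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "StudentTableDataset", student_ds
)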

Gateway: The gateway (now known as the self-hosted integration runtime) connects our on-premises data to the cloud. We need a client installed on the on-premises system so that it can connect to the Azure cloud.

Cloud: Our data can be analyzed and visualized with many different analytics tools such as Apache Spark, R, Hadoop, and so on.

Read our blog on DP-200 certification to crack the exam on the first attempt!

After completing this Azure Data Factory tutorial, you will be able to use Data Factory to automate the movement and transformation of data by creating linked services, datasets, and pipelines, and by scheduling those pipelines.


Azure Data Lake

As Azure Data Lake is part of this Azure Data Factory tutorial, let us get introduced to it. Azure Data Lake is a data storage or file system that is highly scalable and distributed. It is located in the cloud and works with multiple external analytics frameworks such as Hadoop, Apache Spark, and so on.


We can ingest data from the web, mobile applications, or social media. It is sent to Azure Data Lake Store and then provided to external frameworks like Apache Spark, Hive, etc.

There are two main concepts when it comes to Azure Data Lake: Storage and Analytics.

Want to read more about Azure? Go through this Azure Tutorial!

Storage is virtually unlimited in size; it can hold gigabytes, terabytes, and much more. Azure Data Lake Store holds a wide variety of data, both structured and unstructured, and it can store really large individual files.

The other main concept is analytics.

How does analytics work?

We can monitor and diagnose data in real time. For example, let us take the data we get from vehicles or buildings.


This data can be used to optimize how a system responds to certain events or to generate alerts in case something goes wrong. We can also identify fraudulent transactions on a credit card, monitor the card's current geographical location, check how many transactions have been taking place on it, and so on.


You can find interview questions related to this Azure Data Factory tutorial in our Azure Data Factory Interview Questions.


Copying Data from Azure SQL to Azure Data Lake

As we got introduced to Azure Data Lake in the Azure Data Factory tutorial for beginners, let us see how to copy data from Azure SQL to Azure Data Lake.


To create and connect to a database in Azure SQL Data Warehouse, we need SQL Server Management Studio (SSMS). Let's have a look at the process; you can implement it alongside this Azure Data Factory tutorial. We will start by downloading SSMS.

  • After the installation of SSMS, open the dashboard in Microsoft Azure. We need to create a data warehouse.
    • Click on Create a resource
    • Select Databases
    • Select SQL Data Warehouse
  • Now, create a new server
  • Configure the performance
  • Deploy
  • After the successful deployment, click on Go to resource
  • Open SSMS
  • Connect to the server created in the Azure SQL Data Warehouse using SQL authentication
  • We can find the server name in the overview of the SQL warehouse.
  • Create a table under the created database
    • Right-click on the database, select New Query, and type in the following query (a Python alternative using pyodbc is sketched after this list):
CREATE TABLE student (namee varchar(20), groupp varchar(20), reg int);
INSERT INTO student (namee, groupp, reg)
VALUES
('ASHOK', 'CSE', 11603529),
('TARUN', 'ECE', 11608343),
('SABID', 'CIVIL', 11603530),
('CHARAN', 'MECH', 11603738),
('ADARSH', 'BBA', 11603540);
  • Go to the dashboard of Microsoft Azure and create a new storage account (Azure Data Lake Store)
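
If you prefer scripting to SSMS, the same student table from the query above can also be created and populated from Python with pyodbc, as sketched below. The server, database, and login values are placeholders, and this assumes the ODBC Driver 17 for SQL Server is installed locally.

# Hedged sketch: create and populate the student table without SSMS
# (assumes: pip install pyodbc, and ODBC Driver 17 for SQL Server installed)
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<sql-admin-user>;PWD=<password>"
)
cursor = conn.cursor()

# Same schema and rows as the T-SQL shown above
cursor.execute(
    "CREATE TABLE student (namee varchar(20), groupp varchar(20), reg int);"
)
cursor.executemany(
    "INSERT INTO student (namee, groupp, reg) VALUES (?, ?, ?);",
    [
        ("ASHOK", "CSE", 11603529),
        ("TARUN", "ECE", 11608343),
        ("SABID", "CIVIL", 11603530),
        ("CHARAN", "MECH", 11603738),
        ("ADARSH", "BBA", 11603540),
    ],
)
conn.commit()
conn.close()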

Check out Intellipaat's blog on Azure Data Factory vs SSIS vs Azure Databricks to learn the differences.

  • Go to the dashboard of Microsoft Azure and create a Data Factory
  • After the successful deployment of the Data Factory, click on Go to resource
  • Click on Author & Monitor from the quick links
  • Select Overview in the left-side panel
  • Select Copy data
  • Mention the source data store as SQL Server and create a new linked service (SQL Server)
  • Select the dataset
  • Mention the destination datastore as Azure Data Lake and create a new linked service

To get the service principal ID and service principal key, do the following steps:

  • Go to Azure dashboard
  • Click on Azure active directory
  • Click on App registrations
  • Create a new web application, and the application ID which gets generated is the service principal ID
  • Go to Certificates and secrets, create a new client secret, and then a password is generated which is the service principal key
  • Go to Azure Data Lake Store and give all access to the application which is created for generating the service principal key
  • Copying or moving data can also be done the other way around, i.e., the source and destination can be interchanged
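
The service principal ID and key gathered above are exactly what an Azure Data Lake Store linked service needs. Below is a hedged sketch that registers such a linked service with the same Python SDK and adf_client used earlier; the store URI, tenant ID, and credential values are placeholders, and argument names may vary slightly between SDK versions.

from azure.mgmt.datafactory.models import (
    AzureDataLakeStoreLinkedService, LinkedServiceResource, SecureString,
)

# Data Lake Store (Gen1) linked service authenticated with the service principal
adls_ls = LinkedServiceResource(
    properties=AzureDataLakeStoreLinkedService(
        data_lake_store_uri="adl://<your-store>.azuredatalakestore.net/webhdfs/v1",
        service_principal_id="<service-principal-id>",            # the application ID
        service_principal_key=SecureString(value="<service-principal-key>"),
        tenant="<tenant-id>",
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "AzureDataLakeStoreLinkedService", adls_ls
)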

Want to become an Azure developer? Check out this Azure Developer Certification in Toronto to enhance your career!

Loading Data from Data Lake to Power BI

As visualization is a part of the Azure Data Factory tutorial, let us see how to load data from Data Lake to Power BI for doing visualization and performing analytics.

Microsoft’s Power BI is a cloud-based business analytics service for analyzing and visualizing data. Various processes in Power BI are as follows:

  • Connecting to our data
  • Shaping of data
  • Modeling of data
  • Data visualization
  • Publishing data reports


Let’s now see how to load data from Data Lake to Power BI

  • Download Power BI Desktop from the below-mentioned link:
    • https://powerbi.microsoft.com/en-us/desktop/
  • Go to Microsoft Azure Dashboard and create a new Azure Data Lake Store
  • Upload the dataset to Azure Data Lake Store (a short Python check of the uploaded file is sketched after this list)
    • The following is the dataset we are using:
  • Click on Data explorer and then upload the dataset
  • Open Power BI
  • Connect to Azure Data Lake Store in Power BI with the URL provided by Data Explorer in Azure Data Lake Store 
    • Go to Azure dashboard and open Data Lake Store which we have created
    • Click on Data explorer
    • Click on Properties to find the path, copy the link, and share it with Power BI
    • Open Power BI and click on Get Data
    • Select Data Lake Storage as source
    • Provide the copied path
  • Edit query to import data from a binary column

  • Click on edit query
  • Select binary
  • Click on transform
  • Use the first row as the header if needed
  • Select columns from the dataset 
  • Create the visualization
  • Delete the resource group to stop paying for the Data Lake
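
As noted in the upload step above, you can optionally confirm from code that the uploaded file is reachable before pointing Power BI at it. This sketch uses the azure-datalake-store package (for Data Lake Store Gen1); the store name, file path, and service principal values are placeholder assumptions.

# Optional sanity check: read the uploaded file from Azure Data Lake Store (Gen1)
# (assumes: pip install azure-datalake-store)
from azure.datalake.store import core, lib

token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<service-principal-id>",
    client_secret="<service-principal-key>",
)
adls = core.AzureDLFileSystem(token, store_name="<your-store>")

print(adls.ls("/"))                              # list files at the root of the store
with adls.open("/students.csv", "rb") as f:      # hypothetical path to the uploaded dataset
    print(f.read().decode("utf-8")[:300])        # preview the first few rows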

Go through these Microsoft Azure Interview Questions and Answers to excel in your interview.

Creating a Pipeline Using Data Factory – ETL (Extract, Transform, Load) Solution


Now, in this Azure Data Factory tutorial, let us see how to create a pipeline using Data Factory and learn the steps for creating an ETL solution.
 

We have a SQL database on our Azure SQL server, and we want to extract some data from it. While extracting, if something has to be processed, it is processed and then stored in the Data Lake Store.

Power BI has self-service ETL built into it. Azure Data Factory Data Flows are a feature of Azure Data Factory for visually creating ETL flows.

Steps for Creating ETL

  • Create a Linked Service for SQL Server Database
  • Create a Linked Service for Azure Data Lake Store
  • Create a dataset for Data Extraction
  • Create a dataset for Data Saving
  • Create the pipeline and add copy activity
  • Schedule the pipeline (by adding a trigger)
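
Continuing the earlier Python snippets (same adf_client, resource_group, factory_name, the two linked services, and the StudentTableDataset), here is a hedged sketch of the last three steps: a pipeline with a copy activity, an on-demand run, and a daily schedule trigger. StudentLakeDataset is assumed to be an Azure Data Lake Store dataset registered the same way as the SQL one, and source/sink types, reference constructors, and trigger method names can differ between SDK versions.

from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    AzureDataLakeStoreSink, CopyActivity, DatasetReference, PipelineReference,
    PipelineResource, ScheduleTrigger, ScheduleTriggerRecurrence, SqlSource,
    TriggerPipelineReference, TriggerResource,
)

# Copy activity: extract from the SQL dataset and land the data in Data Lake Store
copy_activity = CopyActivity(
    name="CopyStudentToDataLake",
    inputs=[DatasetReference(reference_name="StudentTableDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="StudentLakeDataset", type="DatasetReference")],
    source=SqlSource(),
    sink=AzureDataLakeStoreSink(),
)
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(resource_group, factory_name, "CopyStudentPipeline", pipeline)

# One-time (on-demand) run of the pipeline
run = adf_client.pipelines.create_run(resource_group, factory_name, "CopyStudentPipeline", parameters={})
print(run.run_id)

# Daily schedule trigger for the same pipeline
recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=15),
)
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="CopyStudentPipeline", type="PipelineReference")
        )],
    )
)
adf_client.triggers.create_or_update(resource_group, factory_name, "DailyCopyTrigger", trigger)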

Check out Intellipaat’s Microsoft Azure Training Now!

Introduction to Azure Data Lake Analytics


In this Azure Data Factory tutorial, let us get introduced to Azure Data Lake Analytics. If we have large data, either structured or unstructured, and we want to process it in a distributed manner without spending time on provisioning data clusters, the best service available is Azure Data Lake Analytics.


When we say Data Lake Analytics, we are talking about Big Data as a service.

Key components of Azure Data Lake are:

  • Analytics
  • Storage

In terms of analytics, we have HDInsight and Azure Data Lake Analytics. Azure Data Lake Analytics is a distributed analytics service built on Apache YARN, which makes it similar to Hadoop because Hadoop also uses YARN for distributed processing.

Distributed processing: Imagine we have a large amount of data and we want to process it in a distributed manner to speed up the processing.

All we want is to split the data, process each partition with the instruction set we have, and finally combine the results. This is distributed processing, and that's what we get from Azure Data Lake Analytics.
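
As a toy illustration only (this is plain Python multiprocessing on one machine, not how Azure Data Lake Analytics works internally), the sketch below shows that split-process-combine pattern: partition the data, process the partitions in parallel, and combine the partial results.

# Toy illustration of split -> process in parallel -> combine
from multiprocessing import Pool

def process_partition(partition):
    # The "instruction set" applied to each partition: here, just sum the squares
    return sum(x * x for x in partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the data into 4 partitions
    partitions = [data[i::4] for i in range(4)]
    # Process each partition in parallel, then combine the partial results
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_partition, partitions)
    print(sum(partial_results))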

In Hadoop, we have to spend considerable time on provisioning, yet we have been using open-source Hadoop, or HDInsight in Azure, mainly for processing data.

Azure Data Lake Analytics simplifies this. Here, we don't need to worry about provisioning clusters. We can simply define a job that processes our dataset and submit it.

We don't need to worry about the installation, configuration, and management of a big data cluster. Moreover, we are charged only per job instance, i.e., for the number of nodes we assign to the job and for how long the job runs.

Get certified from the top Azure course in London Now!

Though we compare this with Hadoop to some extent, it doesn't mean that it completely replaces the Hadoop ecosystem. When it comes to a distributed environment, scalability is a must; there should not be a limit on scaling out. Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data analytics. It gives us the ability to increase or decrease the processing power on a per-job basis. Let's say we are handling terabytes of data and we need answers fast. In this case, we can assign more nodes to that particular job and get our insights quickly. We make use of USQL for processing data. This language is based on languages that Microsoft has been using for years: TSQL and C#.NET. We can pass the processing instructions using USQL, and there are some other ways too.

Let's first discuss how to use USQL. USQL scripts can be written in Visual Studio, a familiar integrated development environment, and once written, they can be submitted to Data Lake Analytics. USQL can be used against many data sources.

Data Lake Analytics supports many enterprise features like integration, security, compliance, role-based access control, and auditing. So, we can simply say that it is enterprise-grade. If data is in Azure Data Lake Store, we can expect good performance because Azure Data Lake Analytics is optimized for working with Azure Data Lake Store.

It is also important to understand the difference between HDInsight and Azure Data Lake Analytics (ADLA). The below table will help us differentiate between the two.

| HDInsight (PaaS) | ADLA (SaaS) |
|---|---|
| HDInsight does not replace Data Lake Analytics. | Azure Data Lake Analytics does not replace HDInsight. |
| It is a Platform as a Service. | It is Software as a Service, or big data queries as a service. |
| If we want to process a dataset, we first have to configure a cluster with predefined nodes and then use a language like Pig or Hive to process the data. | We simply pass the queries written for processing data; it creates the necessary compute nodes on demand, as per our instructions, and processes the dataset. |
| Since we configure the cluster in HDInsight, we can create clusters and control data as we want. All Hadoop subprojects such as Spark and Kafka can be used without any limitation. | Azure Data Lake Analytics does not give much flexibility in terms of cluster provisioning, but Azure takes care of it. We don't need to worry about cluster creation; nodes are assigned based on the instructions we pass. In addition, we can make use of USQL, taking advantage of .NET, for processing data. |

USQL works with both structured and unstructured data. We can process images stored in Azure Data Lake Store with the help of image processing libraries. That’s how we process unstructured data using USQL.

When there is a dataset without a predefined structure, we can define the structure we want and read the data using that structure. In other words, it allows defining the schema at read time without holding the dataset in a predefined structure.

So, in this blog on Azure Data Factory basics, we discussed concepts such as the flow process, Azure Data Lake, Azure Data Lake Analytics, and loading data from Data Lake to Power BI. We hope that this guide to Azure Data Factory was helpful to you. Let us know if you have any doubts in the comments section below!

