What is Azure Data Factory?
The data generated by applications and products is increasing exponentially. Because it comes from many different sources, it is difficult to manage.
To analyze and store all this data, we can use Data Factory which:
- Stores data with the help of Azure Data Lake Storage
- Analyzes the data
- Transforms the data with the help of pipelines (a logical grouping of activities that together perform a task)
- Publishes the organized data
- Makes the data available to analytics frameworks such as Apache Spark or Hadoop for further processing and visualization
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. We can use Azure Data Factory to create and schedule data-driven workflows that ingest data from various data stores. It can process and transform the data by using compute services such as Azure Data Lake Analytics, Azure Machine Learning, and Azure HDInsight Hadoop. We can then publish the output data to data stores such as Azure Data Lake so that Business Intelligence (BI) applications can perform visualization or analytics. In short, Data Factory organizes raw data into meaningful data stores that support better business decisions.
Flow Process of Data Factory
In this Azure Data Factory tutorial for beginners, we will now discuss the working process of Azure Data Factory. The Data Factory service allows us to create pipelines that move and transform data and then run those pipelines on a specified schedule, which can be daily, hourly, or weekly. The data consumed and produced by workflows is time-sliced, and we can specify the pipeline mode as scheduled or one-time.
Input dataset: It is the data we have within our data store, which needs to be processed and then passed through a pipeline.
Pipeline: A pipeline operates on data to transform it; its activities can range from simple data movement to full data transformations.
Data transformation is possible with the help of U-SQL, stored procedures, or Hive.
Output dataset: It contains data in a structured form, because the data has already been transformed and structured in the pipeline. This output is then written to data stores such as Azure Data Lake, Blob storage, or SQL, referenced through linked services.
Linked services: These store the connection information that Data Factory needs in order to connect to external resources.
For example, to connect to a SQL Server database, we need a connection string, and we also need to mention the source and the destination of our data.
Gateway: The gateway connects our on-premises data to the cloud. A client agent must be installed on the on-premises system so that it can connect to the Azure cloud.
Cloud: In the cloud, our data can be analyzed and visualized with many different analytics tools, such as Apache Spark, R, Hadoop, and so on.
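To make these building blocks concrete, below is a minimal sketch of creating a Data Factory programmatically, assuming the azure-mgmt-datafactory and azure-identity Python packages; the subscription ID, resource group, and factory name are placeholders, and the same result can be achieved through the Azure portal as described later in this tutorial. The later sketches in this tutorial reuse the adf_client, resource_group, and factory_name defined here.

```python
# A minimal sketch of creating a Data Factory with the Python SDK.
# Assumes: pip install azure-identity azure-mgmt-datafactory
# The subscription ID, resource group, and factory name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"
resource_group = "adf-tutorial-rg"       # hypothetical resource group
factory_name = "adf-tutorial-factory"    # hypothetical factory name

# Authenticate with whatever credential is available (Azure CLI login, env vars, etc.)
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the Data Factory; linked services, datasets, and
# pipelines are then authored under this factory.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.provisioning_state)
```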
After completing this Azure Data Factory tutorial, you will be able to use Data Factory to automate the movement and transformation of data by creating linked services, datasets, and pipelines, and by scheduling those pipelines.
Azure Data Lake
As Azure Data Lake is part of this Azure Data Factory tutorial, let us get introduced to it. Azure Data Lake is a highly scalable and distributed data store or file system. It is located in the cloud and works with multiple external analytics frameworks, like Hadoop, Apache Spark, and so on.
We can ingest data from the web, mobile applications, or social media into Azure Data Lake Store and then provide it to external frameworks like Apache Spark, Hive, etc.
There are two main concepts when it comes to Azure Data Lake Storage: Storage & Analytics.
Storage is of virtually unlimited size; it can hold gigabytes, terabytes, and much more. Azure Data Lake Store accepts a wide variety of data, both structured and unstructured, and it can store really large files.
The second concept is analytics.
How does analytics work?
We can monitor and diagnose data in real time. For example, let us take the data we get from vehicles or buildings.
This data can be used to optimize operations, respond to certain events, or generate alerts in case something goes wrong. We can also identify fraudulent transactions on our credit card by monitoring the card's current geographical location, checking how many transactions have been taking place on it, and so on.
Copying Data from Azure SQL to Azure Data Lake
As we got introduced to Azure Data Lake in the Azure Data Factory tutorial for beginners, let us see how to copy data from Azure SQL to Azure Data Lake.
To create and connect to a database in Azure SQL Data Warehouse, we need SSMS (SQL Server Management Studio). Let's have a look at the process; you can implement it alongside this Azure Data Factory tutorial. We will start by downloading SSMS.
- After the installation of SSMS, open the dashboard in Microsoft Azure. We need to create a data warehouse.
- Click on Create a resource
- Select Databases
- Select SQL Data Warehouse
- Configure the performance
- After the successful deployment, click on Go to resource
- Open SSMS
- Connect to the server created in Azure data warehouse using SQL authentication
- We can find the server name in the overview of the SQL warehouse.
- Create a table under the created database
- Right-click on the database, select New Query, and run the following query:
CREATE TABLE student (
    namee  varchar(20),
    groupp varchar(20),
    reg    int
);

INSERT INTO student (namee, groupp, reg)
VALUES
    ('ASHOK',  'CSE',   11603529),
    ('TARUN',  'ECE',   11608343),
    ('SABID',  'CIVIL', 11603530),
    ('CHARAN', 'MECH',  11603738),
    ('ADARSH', 'BBA',   11603540);
- Go to the dashboard of Microsoft Azure and create new storage (Azure Data Lake)
- Go to the dashboard of Microsoft Azure and create a Data Factory
- After the Data Factory is deployed, click on Go to resource
- Click on Author & Monitor from the quick links
- Select Overview displayed on the left-side panel
- Select copy data
- Mention the source data store as SQL Server and create a new linked service (SQL Server)
- Mention the destination data store as Azure Data Lake and create a new linked service (a code sketch of both linked services follows the service-principal steps below)
To get the service principal ID and service principal key, do the following steps:
- Go to Azure dashboard
- Click on Azure Active Directory
- Click on App registrations
- Register a new web application; the application (client) ID that gets generated is the service principal ID
- Go to Certificates & secrets and create a new client secret; the generated value is the service principal key
- Go to the Azure Data Lake Store and grant the registered application access to it
- Data can also be copied in the reverse direction, i.e., the source and the destination can be interchanged
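As referenced above, here is a minimal sketch of creating the two linked services with the azure-mgmt-datafactory Python SDK, continuing from the adf_client, resource_group, and factory_name defined in the earlier sketch; the connection string, Data Lake Store URI, tenant ID, and service principal values are placeholders, and the exact model and parameter names should be checked against the SDK version you install.

```python
# A hedged sketch: the source (Azure SQL) and sink (Azure Data Lake Store Gen1)
# linked services. All identifiers and secrets below are placeholders.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureSqlDatabaseLinkedService,
    AzureDataLakeStoreLinkedService,
    SecureString,
)

# Source: the Azure SQL database created through SSMS
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(
            value="Server=tcp:<server>.database.windows.net;Database=<db>;"
                  "User ID=<user>;Password=<password>;"
        )
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "AzureSqlSourceLS", sql_ls
)

# Sink: Azure Data Lake Store, authenticated with the service principal ID/key
# generated from the app registration steps above.
adls_ls = LinkedServiceResource(
    properties=AzureDataLakeStoreLinkedService(
        data_lake_store_uri="https://<adls-account>.azuredatalakestore.net/webhdfs/v1",
        service_principal_id="<application-id>",                      # service principal ID
        service_principal_key=SecureString(value="<client-secret>"),  # service principal key
        tenant="<tenant-id>",
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "AzureDataLakeSinkLS", adls_ls
)
```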
Loading Data from Data Lake to Power BI
As visualization is a part of the Azure Data Factory tutorial, let us see how to load data from Data Lake to Power BI for doing visualization and performing analytics.
Microsoft’s Power BI is a cloud-based business analytics service for analyzing and visualizing data. Various processes in Power BI are as follows:
- Connecting to our data
- Shaping of data
- Modeling of data
- Data visualization
- Publishing data reports
Let’s now see how to load data from Data Lake to Power BI
- Download Power BI Desktop from the below-mentioned link:
- https://powerbi.microsoft.com/en-us/desktop/
- Go to Microsoft Azure Dashboard and create a new Azure Data Lake Store
- Upload the dataset to the Azure Data Lake Store: click on Data explorer and then upload the file
- Open Power BI
- Connect to Azure Data Lake Store in Power BI with the URL provided by Data Explorer in Azure Data Lake Store
- Go to Azure dashboard and open Data Lake Store which we have created
- Click on Data explorer
- Click on Properties to find the path, and copy the link; this is the path you will provide to Power BI
- Open Power BI and click on Get Data
- Select Azure Data Lake Storage as the source
- Provide the copied path (you can verify the path with the sketch after this list)
- Edit the query to import data from the binary column
- Click on Edit query
- Select binary
- Click on transform
- Use the first row as the header if needed
- Select columns from the dataset
- Create the visualization
- Delete the resource group to stop paying for the Data Lake
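If you want to verify the Data Lake path before pasting it into Power BI (as noted in the steps above), here is a minimal sketch using the azure-datalake-store Python package; the account name, tenant ID, and service principal credentials are placeholders.

```python
# A hedged sketch: listing files in an Azure Data Lake Store (Gen1) account to
# confirm the path used in Power BI. Assumes: pip install azure-datalake-store
from azure.datalake.store import core, lib

# Service principal credentials (placeholders) for the Data Lake Store account
token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<application-id>",
    client_secret="<client-secret>",
)
adls = core.AzureDLFileSystem(token, store_name="<adls-account>")

# List what was uploaded; the path given to Power BI typically looks like
# adl://<adls-account>.azuredatalakestore.net/<folder>/<file>
print(adls.ls("/"))
```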
Creating a Pipeline Using Data Factory – ETL (Extract, Transform, Load) Solution
Now, in this Azure Data Factory tutorial, let us see how to create a pipeline using Data Factory and learn the steps for creating an ETL solution.
We have a database on our Azure SQL server, and we want to extract some data from it. While extracting, if something needs to be processed, it is processed and then stored in the Data Lake Store.
Power BI has self-service ETL built into it. Azure Data Factory Data Flows are a feature of Data Factory for visually creating ETL flows.
Steps for Creating ETL
- Create a Linked Service for SQL Server Database
- Create a Linked Service for Azure Data Lake Store
- Create a dataset for Data Extraction
- Create a dataset for Data Saving
- Create the pipeline and add copy activity
- Schedule the pipeline by adding a trigger (a code sketch of these steps follows)
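Below is a minimal sketch of these steps with the azure-mgmt-datafactory Python SDK, continuing from the adf_client and the linked services defined in the earlier sketches; the dataset, table, folder, and trigger names are placeholders, and model names such as AzureSqlSource and AzureDataLakeStoreSink may vary slightly between SDK versions.

```python
# A hedged sketch of the ETL steps: two datasets, a copy-activity pipeline,
# and a daily schedule trigger. Names, paths, and times are placeholders.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    DatasetResource, DatasetReference, LinkedServiceReference,
    AzureSqlTableDataset, AzureDataLakeStoreDataset,
    PipelineResource, CopyActivity, AzureSqlSource, AzureDataLakeStoreSink,
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

# Dataset for data extraction: the 'student' table in the Azure SQL database
source_ds = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureSqlSourceLS"),
        table_name="student",
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "StudentSqlDS", source_ds)

# Dataset for data saving: a folder in Azure Data Lake Store
sink_ds = DatasetResource(
    properties=AzureDataLakeStoreDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureDataLakeSinkLS"),
        folder_path="output/student",
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "StudentLakeDS", sink_ds)

# Pipeline with a single copy activity
copy_activity = CopyActivity(
    name="CopyStudentToLake",
    inputs=[DatasetReference(type="DatasetReference", reference_name="StudentSqlDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StudentLakeDS")],
    source=AzureSqlSource(),
    sink=AzureDataLakeStoreSink(),
)
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    resource_group, factory_name, "StudentCopyPipeline", pipeline
)

# Schedule the pipeline by adding a daily trigger
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day", interval=1,
            start_time=datetime.utcnow() + timedelta(minutes=15),
        ),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="StudentCopyPipeline")
        )],
    )
)
adf_client.triggers.create_or_update(resource_group, factory_name, "DailyTrigger", trigger)
```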
Introduction to Azure Data Lake Analytics
In this Azure Data Factory tutorial, let us get introduced to Azure Data Lake Analytics. If we have large data, either structured or unstructured, and we want to process it in a distributed manner without spending time on provisioning data clusters, the best service available is Azure Data Lake Analytics.
When we say Data Lake Analytics, we are talking about Big Data as a service.
The key analytics components of Azure Data Lake are HDInsight Analytics and Azure Data Lake Analytics. Azure Data Lake Analytics is a distributed analytics service built on Apache YARN, which makes it similar to Hadoop because Hadoop also uses YARN for distributed processing.
Distributed processing: Imagine we have a large amount of data and we want to process it in a distributed manner to speed things up.
All we want is to split the data, process each partition with the instruction set we have, and finally combine the results. This is distributed processing, and that is what we get from Azure Data Lake Analytics.
With Hadoop, we have to spend considerable time on provisioning clusters, yet we have traditionally used open-source Hadoop, or HDInsight on Azure, mainly for processing data.
Azure Data Lake Analytics simplifies this. Here, we don’t need to worry about provisioning clusters. We can simply make a job of processing our dataset and submit it.
We don't need to worry about the installation, configuration, and management of a big data cluster. Moreover, we are charged per job, i.e., for how many nodes we assign to the job and for how long the job runs.
Although we compare this with Hadoop to some extent, it does not completely replace the Hadoop ecosystem. In a distributed environment, scalability is a must, and there should not be a limit on scaling out. Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data analytics, and it lets us increase or decrease the processing power on a per-job basis. Say we are handling terabytes of data and need answers fast; we can assign more nodes to that particular job and get our insights quickly. We use U-SQL for processing data, a language based on two languages Microsoft has used for years: T-SQL and C#.NET. We can pass the processing instructions using U-SQL, and there are other ways too.
Let's first discuss how to use U-SQL. U-SQL can be written in Visual Studio, a familiar integrated development environment, and once written, the script can be submitted to Data Lake Analytics. U-SQL can be used against many data sources.
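To illustrate, here is a minimal sketch of submitting a U-SQL script from Python, assuming the older azure-mgmt-datalake-analytics package (which targets Data Lake Analytics Gen1); the account name, file paths, and credentials are placeholders, and the U-SQL script itself is embedded as a string.

```python
# A hedged sketch: submitting a U-SQL job to Azure Data Lake Analytics.
# Assumes: pip install azure-mgmt-datalake-analytics azure-common
# The account name, tenant/client IDs, and file paths are placeholders.
import time
import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import (
    JobInformation, JobState, USqlJobProperties,
)

credentials = ServicePrincipalCredentials(
    client_id="<application-id>", secret="<client-secret>", tenant="<tenant-id>"
)
adla_job_client = DataLakeAnalyticsJobManagementClient(
    credentials, "azuredatalakeanalytics.net"
)

# The U-SQL script: read the copied student data and write it back out as CSV.
# U-SQL defines the schema at read time (schema-on-read) in the EXTRACT statement.
script = """
@students =
    EXTRACT namee string, groupp string, reg int
    FROM "/output/student/student.csv"
    USING Extractors.Csv();

OUTPUT @students
    TO "/output/student_copy.csv"
    USING Outputters.Csv();
"""

job_id = str(uuid.uuid4())
adla_job_client.job.create(
    "<adla-account>", job_id,
    JobInformation(name="StudentUSqlJob", type="USql",
                   properties=USqlJobProperties(script=script)),
)

# Poll until the job ends, then print the result (Succeeded/Failed)
job = adla_job_client.job.get("<adla-account>", job_id)
while job.state != JobState.ended:
    time.sleep(5)
    job = adla_job_client.job.get("<adla-account>", job_id)
print(job.result)
```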
Data Lake Analytics supports many enterprise features like integration, security, compliance, role-based access control, and auditing. So, we can simply say that it is enterprise-grade. If data is in Azure Data Lake Store, we can expect good performance because Azure Data Lake Analytics is optimized for working with Azure Data Lake Store.
It is also important to understand the difference between HDInsight and Azure Data Lake Analytics (ADLA). The below table will help us differentiate between the two.
| HDInsight (PaaS) | ADLA (SaaS) |
| --- | --- |
| HDInsight does not replace Data Lake Analytics. | Azure Data Lake Analytics does not replace HDInsight. |
| It is a Platform as a Service. | It is Software as a Service, or big data queries as a service. |
| If we want to process a dataset, we first have to configure a cluster with predefined nodes and then use a language such as Pig or Hive to process the data. | We simply pass the queries written for processing data; the service creates the necessary compute nodes on demand, as per our instructions, and processes the dataset. |
| Since we configure the cluster in HDInsight, we can create clusters and control data as we want. All Hadoop subprojects such as Spark and Kafka can be used without limitation. | Azure Data Lake Analytics does not give much flexibility in cluster provisioning; Azure takes care of it. We don't need to worry about cluster creation, and nodes are assigned based on the instructions we pass. In addition, we can make use of U-SQL, taking advantage of .NET, for processing data. |
U-SQL works with both structured and unstructured data. For example, we can process images stored in Azure Data Lake Store with the help of image-processing libraries; that is how we process unstructured data using U-SQL.
When a dataset has no predefined structure, we can define the structure we want and read the data using that definition, as in the EXTRACT statement sketched earlier. This schema-on-read approach lets us define the schema at read time without holding the dataset in a predefined structure.
In this blog on Azure Data Factory basics, we discussed concepts such as the flow process of Data Factory, Azure Data Lake, Azure Data Lake Analytics, and loading data from Data Lake into Power BI. We hope that this guide to Azure Data Factory was helpful to you. Let us know if you have any doubts in the comments section below!