- The amount of data generated these days is huge and this data comes from different sources. When we move this particular data to the cloud, there are few things needed to be taken care of.
- Data can be in any form as it comes from different sources and these different sources will transfer or channelize the data in different ways and it can be in a different format. When we bring this data to the cloud or particular storage we need to make sure that this data is well managed. i.e you need to transform the data, delete unnecessary parts. As per moving the data is concerned, we need to make sure that data is picked from different sources and bring it at one common place then store it and if required we should transform into more meaningful.
- This can be also done by traditional data warehouse as well but there are certain disadvantages. Sometimes we are forced to go ahead and have custom applications that deal with all these processes individually which is time-consuming and integrating all these sources is a huge pain. we need to figure out a way to automate this process or create proper workflows.
- Data factory helps to orchestrate this complete process into more manageable or organizable manner.
Cloud-based integration service that allows creating data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
- Using Azure data factory, you can create and schedule the data-driven workflows(called pipelines) that can ingest data from disparate data stores.
- It can process and transform the data by using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
- The integration runtime is the compute infrastructure that Azure Data Factory uses to provide the following data integration capabilities across various network environments.
- 3 Types of integration runtimes:
- Azure Integration Run Time: Azure Integration Run Time can copy data between cloud data stores and it can dispatch the activity to a variety of compute services such as Azure HDinsight or SQL server where the transformation takes place
- Self Hosted Integration Run Time: Self Hosted Integration Run Time is software with essentially the same code as Azure Integration Run Time. But you install it on an on-premise machine or a virtual machine in a virtual network. A Self Hosted IR can run copy activities between a public cloud data store and a data store in a private network. It can also dispatch transformation activities against compute resources in a private network. We use Self Hosted IR because Data factory will not be able to directly access on-primitive data sources as they sit behind a firewall.It is sometimes possible to establish a direct connection between Azure and on-premises data sources by configuring the firewall in a specific way if we do that we don’t need to use a self-hosted IR.
- Azure SSIS Integration Run Time: With SSIS Integration Run Time, you can natively execute SSIS packages in a managed environment. So when we lift and shift the SSIS packages to data factory, we use Azure SSIS Integration Run TIme.
There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
Data Warehouse is a traditional way of storing data which is still used widely. Data Lake is complementary to Data Warehouse i.e if you have your data at a data lake that can be stored in data warehouse as well but there are certain rules that need to be followed.
|DATA LAKE||DATA WAREHOUSE|
|Complementary to data warehouse||Maybe sourced to the data lake|
|Data is Detailed data or Raw data. It can be in any particular form.you just need to take the data and dump it into your data lake||Data is filtered, summarised,refined|
|Schema on read (not structured, you can define your schema in n number of ways)||Schema on write(data is written in Structured form or in a particular schema)|
|One language to process data of any format(USQL)||It uses SQL|
Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob Storage to expose data publicly to the world or to store application data privately. Common uses of Blob Storage include:
- Serving images or documents directly to a browser
- Storing files for distributed access
- Streaming video and audio
- Storing data for backup and restore disaster recovery, and archiving
- Storing data for analysis by an on-premises or Azure-hosted service
|Azure Data Lake Storage Gen1||Azure Blob Storage|
|Purpose||Optimized storage for big data analytics workloads||General purpose object store for a wide variety of storage scenarios, including big data analytics|
|Structure||Hierarchical file system||Object store with flat namespace|
|Key Concepts||Data Lake Storage Gen1 account contains folders, which in turn contains data stored as files||Storage account has containers, which in turn has data in the form of blobs|
|Use Cases||Batch, interactive, streaming analytics and machine learning data such as log files, IoT data, click streams, large datasets||Any type of text or binary data, such as application back end, backup data, media storage for streaming and general purpose data. Additionally, full support for analytics workloads; batch, interactive, streaming analytics and machine learning data such as log files, IoT data, click streams, large datasets|
|Server-side API||WebHDFS-compatible REST API||Azure Blob Storage REST API|
|Data Operations – Authentication||Based on Azure Active Directory Identities||Based on shared secrets – Account Access Keys and Shared Access Signature Keys.|
While we are trying to extract some data from Azure SQL server database, if something has to be processed, then it will be processed and is stored in the Data Lake Store.
Steps for Creating ETL
- Create a Linked Service for source data store which is SQL Server Database
- Assume that we have a cars dataset
- Create a Linked Service for destination data store which is Azure Data Lake Store
- Create a dataset for Data Saving
- Create the pipeline and add copy activity
- Schedule the pipeline by adding a trigger
|With azure data lake analytics, it does not give much flexibility in terms of the provision in the cluster, but Azure takes care of it. We don’t need to worry about cluster creation. The assignment of nodes will be done based on the instruction we pass. In addition to that, we can make use of USQL taking advantage of dotnet for processing data.|
- Pipeline: It acts as a carrier in which we have various processes taking place.
This individual process is an activity.
- Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or multiple activities. It can be anything i.e process like querying a data set or moving the dataset from one source to another.
- Datasets: Sources of data. In simple words, it is a data structure that holds our data.
- Linked services: These store information that is very important when it comes to connecting an external source.
For example: Consider SQL server, you need a connection string that you can connect to an external device. you need to mention the source and the destination of your data.
- You can use the scheduler trigger or time window trigger to schedule a pipeline.
- The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
- Yes, parameters are a first-class, top-level concept in Data Factory.
- You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.
You can define default values for the parameters in the pipelines.
Each activity within the pipeline can consume the parameter value that’s passed to the pipeline and run with the @parameter construct.
An activity output can be consumed in a subsequent activity with the @activity construct.
You can use the @coalesce construct in the expressions to handle the null values gracefully.
Use the Data Factory V2 version to create data flows.
- You will no longer have to bring your own Azure Databricks clusters.
- Data Factory will manage cluster creation and tear-down.
- Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets.
- You can still use Data Lake Storage Gen2 and Blob storage to store those files. Use the appropriate linked service for those storage engines.
- The Mapping Data Flow feature currently allows Azure SQL Database, Azure SQL Data Warehouse, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively for source and sink.
- Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform data after it’s been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in source to transform that data.
The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, it is worth calling out the two levels of security because it’s a very fundamental piece to getting started with the data lake and it is confusing for many people just getting started.
- Role-Based Access Control (RBAC). RBAC includes built-in Azure roles such as reader, contributor, owner or custom roles. Typically, RBAC is assigned for two reasons. One is to specify who can manage the service itself (i.e., update settings and properties for the storage account). Another reason is to permit the use of built-in data explorer tools, which require reader permissions.
- Access Control Lists (ACLs). Access control lists specify exactly which data objects a user may read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant, thus familiar to those with a Unix or Linux background.
POSIX does not operate on a security inheritance model, which means that access ACLs are specified for every object. The concept of default ACLs is critical for new files within a directory to obtain the correct security settings, but it should not be thought of as inheritance. Because of the overhead assigning ACLs to every object, and because there is a limit of 32 ACLs for every object, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure Active Directory groups.