Azure Data Factory is Microsoft's cloud-based tool that collects raw business data and transforms it into usable information. It is a data integration ETL (extract, transform, and load) service that automates the transformation of raw data. This Azure Data Factory Interview Questions blog covers the questions most likely to be asked during Azure job interviews.
Basic Azure Data Factory Interview Questions for Freshers
1. Why do we need Azure Data Factory?
- The amount of data generated these days is huge, and this data comes from different sources. When we move this particular data to the cloud, a few things need to be taken care of.
- Data comes from different sources in different forms, and each source transfers or channels it in its own way and format. When we bring this data to the cloud or to a particular store, we need to make sure it is well managed, i.e., we need to transform the data and delete the unnecessary parts. As far as moving the data is concerned, we need to make sure it is picked up from the different sources, brought to one common place, and stored; if required, we should transform it into something more meaningful.
- This can be done with a traditional data warehouse, but there are certain disadvantages. Sometimes we are forced to build custom applications that deal with each of these processes individually, which is time-consuming, and integrating all these sources is a huge pain. We need a way to automate this process or create proper workflows.
- Azure Data Factory helps orchestrate this complete process in a more manageable and organized way.
2. What is Azure Data Factory?
It is a cloud-based integration service that allows the creation of data-driven workflows in the cloud for orchestrating and automating data movement and transformation.
- Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
- It can process and transform data using compute services such as HDInsight (Hadoop), Spark, Azure Data Lake Analytics, and Azure Machine Learning.
3. What is the integration runtime?
- The integration runtime is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments.
- Three Types of Integration Runtimes:
- Azure Integration Runtime: Azure integration runtime (IR) can copy data between cloud data stores and dispatch the activity to a variety of computing services, such as Azure HDInsight or SQL Server, where the transformation takes place.
- Self-Hosted Integration Runtime: A self-hosted integration runtime is software with essentially the same code as the Azure integration runtime, but you install it on an on-premises machine or a virtual machine in a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store on a private network, and it can dispatch transformation activities against compute resources on a private network. We use a self-hosted IR when Data Factory cannot directly access the data sources because they sit behind a firewall. It is sometimes possible to establish a direct connection between Azure and on-premises data sources by configuring the Azure Firewall in a specific way; if we do that, we don't need a self-hosted IR.
- Azure-SSIS Integration Runtime: With SSIS integration runtime, you can natively execute SSIS packages in a managed environment. So when we lift and shift the SSIS packages to the Data Factory, we use Azure SSIS IR.
4. What is the limit on the number of integration runtimes?
There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
5. What are the top-level concepts of Azure Data Factory?
- Pipeline: It acts as a carrier in which various processes take place. An individual process is an activity.
- Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or multiple activities. It can be anything, i.e., a process like querying a data set or moving the dataset from one source to another.
- Datasets: In simple words, it is a data structure that holds our data.
- Linked Services: These store the connection information that is needed to connect to an external source.
For example, for a SQL Server source, you need a connection string to connect to it, and you need to specify the source and destination of your data. The sketch below shows how these pieces fit together.
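To make these concepts concrete, here is a minimal, illustrative sketch of a pipeline with one copy activity, written as a Python dict that mirrors the JSON shape ADF uses; all names (CopyCarsPipeline, SqlCarsDataset, and so on) are hypothetical placeholders.

```python
import json

# Illustrative only: a minimal pipeline definition in the JSON shape ADF uses,
# written as a Python dict. All names are hypothetical placeholders.
pipeline = {
    "name": "CopyCarsPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCarsData",  # activity: one processing step in the pipeline
                "type": "Copy",
                "inputs":  [{"referenceName": "SqlCarsDataset",  "type": "DatasetReference"}],
                "outputs": [{"referenceName": "BlobCarsDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "AzureSqlSource"},
                    "sink":   {"type": "DelimitedTextSink"}
                }
            }
        ]
    }
}

# Datasets point at the data; linked services hold the connection information they rely on.
print(json.dumps(pipeline, indent=2))
```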
6. How can I schedule a pipeline?
- You can use the scheduler trigger or time window trigger to schedule a pipeline.
- The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM), as in the trigger sketch below.
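As an illustration of a wall-clock schedule, here is a hypothetical schedule trigger definition (JSON shape shown as a Python dict); the pipeline name and dates are placeholders.

```python
import json

# Runs the referenced pipeline every Monday and Thursday at 18:00 UTC. (The exact
# "Mondays at 6 PM and Thursdays at 9 PM" split would need two triggers, since
# weekDays and hours combine as a cross product.)
schedule_trigger = {
    "name": "WeeklyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z",
                "timeZone": "UTC",
                "schedule": {"weekDays": ["Monday", "Thursday"], "hours": [18], "minutes": [0]}
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyCarsPipeline", "type": "PipelineReference"}}
        ]
    }
}
print(json.dumps(schedule_trigger, indent=2))
```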
7. Can I pass parameters to a pipeline run?
- Yes, parameters are a first-class, top-level concept in Data Factory.
- You can define parameters at the pipeline level and pass arguments as you execute the pipeline run-on-demand or by using a trigger.
8. Can I define default values for the pipeline parameters?
You can define default values for the parameters in the pipelines.
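As a hedged illustration (assuming the azure-identity and azure-mgmt-datafactory packages), the sketch below defines two pipeline parameters with default values and then overrides them for an on-demand run; all resource names and the subscription ID are placeholders, and the call should be checked against the current SDK reference.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Pipeline-level parameters with default values (the JSON shape, as a Python dict).
pipeline_parameters = {
    "parameters": {
        "sourceFolder": {"type": "String", "defaultValue": "incoming"},
        "retryCount":   {"type": "Int",    "defaultValue": 3},
    }
}

# Overriding the defaults when starting an on-demand run (placeholders throughout).
credential = DefaultAzureCredential()
adf = DataFactoryManagementClient(credential, "<subscription-id>")
run = adf.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-adf",
    pipeline_name="CopyCarsPipeline",
    parameters={"sourceFolder": "incoming/2024", "retryCount": 5},
)
print(run.run_id)
```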
9. Can an activity’s output property be consumed in another activity?
An activity output can be consumed in a subsequent activity with the @activity construct.
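For example, a later activity's settings can reference a previous activity's output with expressions like these (the activity names are hypothetical):

```python
# Expressions referencing a previous activity's output (hypothetical activity names).
rows_read_expr = "@activity('CopyCarsData').output.rowsRead"
first_row_expr = "@activity('LookupCustomer').output.firstRow.CustomerId"
```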
10. How do I handle null values in an activity output?
You can use the @coalesce construct in the expressions to handle the null values.
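For instance, an expression can fall back to a default when an output property is missing or null (the activity name is hypothetical):

```python
# @coalesce returns the first non-null value, so a missing or null output
# property falls back to 0 here.
safe_row_count_expr = "@coalesce(activity('CopyCarsData').output.rowsCopied, 0)"
```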
11. Which Data Factory version do I use to create data flows?
Use the Data Factory version 2 to create data flows.
12. What are datasets in Azure Data Factory?
Datasets are defined as named views of data that simply point to or reference the data to be used in activities as inputs or outputs.
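As an illustration, here is a minimal delimited-text dataset definition (JSON shape shown as a Python dict); it only points at a file on a linked service, and all names are hypothetical.

```python
import json

# Illustrative dataset: it references data (a CSV file on a linked service),
# it does not hold the data itself. All names are placeholders.
dataset = {
    "name": "BlobCarsDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {"type": "AzureBlobStorageLocation", "container": "cars", "fileName": "cars.csv"},
            "columnDelimiter": ",",
            "firstRowAsHeader": True
        }
    }
}
print(json.dumps(dataset, indent=2))
```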
13. How are pipelines monitored in Azure Data Factory?
Azure Data Factory provides a monitoring user experience: you can monitor your pipelines from the “Monitor & Manage” tile in the data factory blade of the Azure portal.
14. What are the three types of integration runtime?
The three types of integration runtime are:
- Azure Integration Runtime
- Self-Hosted Integration Runtime
- Azure-SSIS Integration Runtime
15. What are the types of data integration design patterns?
There are four common data integration design patterns:
- Broadcast
- Bi-directional sync
- Correlation
- Aggregation
Intermediate Azure Data Factory Interview Questions
16. What is the difference between Azure Data Lake and Azure Data Warehouse?
A data warehouse is a traditional way of storing data that is still widely used. A data lake is complementary to the data warehouse: data that lands in the data lake can later be loaded into the warehouse, but only after it conforms to the warehouse's specific rules.
| DATA LAKE | DATA WAREHOUSE |
|---|---|
| Complementary to the data warehouse | May be sourced from the data lake |
| Data is detailed or raw; it can be in any form, and you simply land it in the data lake | Data is filtered, summarized, and refined |
| Schema on read (not structured; you can define the schema in any number of ways) | Schema on write (data is written in a structured form with a particular schema) |
| One language to process data of any format (U-SQL) | Uses SQL |
17. What is blob storage in Azure?
Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob Storage to expose data publicly to the world or to store application data privately. Common uses of Blob Storage are as follows (a short upload sketch follows the list):
- Serving images or documents directly to a browser
- Storing files for distributed access
- Streaming video and audio
- Storing data for backup and restore, disaster recovery, and archiving
- Storing data for analysis by an on-premises or Azure-hosted service
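For instance, an application might store and serve a document from Blob Storage with the azure-storage-blob package; this is a minimal sketch in which the connection string, container, and file names are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Upload a local file to a container and print its public URL (placeholders throughout).
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("documents")
with open("report.pdf", "rb") as data:
    container.upload_blob(name="reports/report.pdf", data=data, overwrite=True)
print(container.get_blob_client("reports/report.pdf").url)
```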
18. What is the difference between Azure Data Lake store and Blob storage?
| | Azure Data Lake Storage Gen1 | Azure Blob Storage |
|---|---|---|
| Purpose | Optimized storage for big data analytics workloads | General-purpose object store for a wide variety of storage scenarios, including big data analytics |
| Structure | Hierarchical file system | Object store with a flat namespace |
| Key Concepts | A Data Lake Storage Gen1 account contains folders, which in turn contain data stored as files | A storage account has containers, which in turn hold data in the form of blobs |
| Use Cases | Batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets | Any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data; additionally, full support for analytics workloads (batch, interactive, streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets) |
| Server-Side API | WebHDFS-compatible REST API | Azure Blob Storage REST API |
| Data Operations – Authentication | Based on Azure Active Directory identities | Based on shared secrets: account access keys and shared access signature (SAS) keys |
19. What are the steps for creating an ETL process in Azure Data Factory?
Suppose we want to extract data from an Azure SQL database, process whatever needs processing, and store the result in Azure Data Lake Storage.
Steps for creating the ETL process (a code sketch follows these steps):
- Create a linked service for the source data store, which is SQL Server Database
- Assume that we have a cars dataset
- Create a linked service for the destination data store, which is Azure Data Lake Storage (ADLS)
- Create a dataset for the data to be saved
- Create the pipeline and add copy activity
- Schedule the pipeline by adding a trigger
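Below is a compressed, hedged sketch of the linked-service steps using the azure-mgmt-datafactory management SDK. The class and argument names (for example AzureSqlDatabaseLinkedService, AzureBlobFSLinkedService) are assumptions based on that SDK and should be checked against the current reference; all resource names, connection strings, and keys are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,       # ADLS Gen2 linked service (assumed class name)
    AzureSqlDatabaseLinkedService,
    LinkedServiceResource,
    SecureString,
)

credential = DefaultAzureCredential()
adf = DataFactoryManagementClient(credential, "<subscription-id>")
rg, factory = "my-rg", "my-adf"

# Linked service for the source data store (Azure SQL Database).
adf.linked_services.create_or_update(
    rg, factory, "SqlCarsLinkedService",
    LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(value="<sql-connection-string>"))),
)

# Linked service for the destination data store (ADLS Gen2).
adf.linked_services.create_or_update(
    rg, factory, "AdlsLinkedService",
    LinkedServiceResource(properties=AzureBlobFSLinkedService(
        url="https://<account>.dfs.core.windows.net",
        account_key=SecureString(value="<account-key>"))),
)

# The remaining steps (datasets, the copy pipeline, and the trigger) follow the same
# pattern with adf.datasets.create_or_update, adf.pipelines.create_or_update, and
# adf.triggers.create_or_update, using definitions like the ones sketched earlier.
```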
20. What is the difference between HDInsight and Azure Data Lake Analytics?
| HDInsight | Azure Data Lake Analytics |
|---|---|
| If we want to process a data set, we first have to configure the cluster with predefined nodes and then use a language like Pig or Hive to process the data. | It is all about passing the queries written to process the data; Azure Data Lake Analytics creates the necessary compute nodes on demand, per our instructions, and processes the data set. |
| Since we configure the cluster in HDInsight, we can create it and control it as we want. All Hadoop subprojects, such as Spark and Kafka, can be used without limitation. | Azure Data Lake Analytics does not give much flexibility in provisioning the cluster, but Microsoft Azure takes care of that for us. We don't need to worry about cluster creation; nodes are assigned based on the instructions we pass. In addition, we can use U-SQL, which takes advantage of .NET, to process data. |
21. Can an activity in a pipeline consume arguments that are passed to a pipeline run?
In a pipeline, an activity can indeed consume arguments that are passed to a pipeline run. Arguments serve as input values that can be provided when triggering or scheduling a pipeline run. These arguments can be used by activities within the pipeline to customize their behavior or perform specific tasks based on the provided values. This flexibility allows for dynamic and parameterized execution of pipeline activities, enhancing the versatility and adaptability of the pipeline workflow.
Each activity within the pipeline can consume a parameter value that is passed to the pipeline run with the @parameter construct (in expressions, @pipeline().parameters.<name>).
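For example, inside an activity's settings, an expression such as the following (the parameter name is hypothetical) reads a value passed to the run:

```python
# The @pipeline().parameters construct reads a value passed to the pipeline run.
source_folder_expr = "@pipeline().parameters.sourceFolder"
```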
22. What has changed from private preview to limited public preview in regard to data flows?
- You will no longer have to bring your own Azure Databricks clusters.
- Data Factory will manage cluster creation and teardown.
- Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets.
- You can still use Data Lake Storage Gen2 and Blob Storage to store those files. Use the appropriate linked service for those storage engines.
23. How do I access data using the other 80 dataset types in Data Factory?
- The mapping data flow feature currently allows Azure SQL Database, Azure SQL Data Warehouse, delimited text files from Azure Blob Storage or Azure Data Lake Storage Gen2, and Parquet files from Blob Storage or Data Lake Storage Gen2 natively for source and sink.
- Use the copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform the data after it’s been staged. For example, your pipeline will first copy into Blob Storage, and then a Data Flow activity will use a dataset in the source to transform that data.
24. What is the Get Metadata activity in ADF?
The Get Metadata activity is utilized for getting the metadata of any data in the Synapse pipeline or ADF. To perform validation or consumption, we can utilize the output from the Get Metadata activity in conditional expressions. It takes a dataset as input and returns metadata information as output. The maximum size of the returned metadata is 4 MB.
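A minimal, hypothetical Get Metadata activity definition and a downstream expression might look like this (the dataset and field names are placeholders):

```python
# Illustrative Get Metadata activity: the chosen field list controls which
# metadata is returned (all names are placeholders).
get_metadata_activity = {
    "name": "GetFileMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "BlobCarsDataset", "type": "DatasetReference"},
        "fieldList": ["exists", "size", "lastModified"]
    }
}

# A later conditional expression could then test, for example:
file_exists_expr = "@activity('GetFileMetadata').output.exists"
```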
25. List any 5 types of data sources that Azure Data Factory supports.
Azure Data Factory supports the following data sources, among many others:
- Azure Blob Storage: Azure Blob is a cloud storage solution to store large-scale unstructured data.
- Azure SQL Database: It is a managed, secure, and intelligent service built on the SQL Server database engine.
- Azure Data Lake Storage: It is a service that can store data of any size, shape, and speed and perform all kinds of processing and analytics across platforms and languages.
- Azure Cosmos DB: It is a service that works entirely on NoSQL and relational databases for modern app development.
- Azure Table Storage: It is a service used for storing structured NoSQL data; it provides a key/attribute with no schema design.
26. How can one set up data sources and destinations in Azure Data Factory?
To connect with a data source or destination, one needs to set up a linked service. A linked service is a configuration containing the connection information required to connect to a data source or destination. The following steps show how to set linked services:
- Navigate to your Azure Data Factory Instance in Azure Portal.
- Select “Author and Monitor” to open the authoring UI.
- From the left-hand menu, select “Connections” and create a new linked service.
- Choose the type of data source you want to connect with: Azure Blob Storage, Azure SQL Database, Amazon S3, etc.
- Configure and test the connection
27. How can one set up a linked service?
To set up a linked service, follow the steps below (an example of the resulting definition appears after the list):
- Click the “Author & Monitor” tile in the ADF portal.
- Next, click the “Author” button to launch the ADF authoring interface.
- Click the “Linked Services” tab to create a new linked service.
- Select the type of service corresponding to the data source or destination you want to connect to.
- Provide the connection information, such as the server name, database name, and credentials.
- Test the connection to make sure it works.
- Save the linked service.
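The result of these steps is a linked service definition like the hypothetical sketch below (JSON shape shown as a Python dict; the server, database, and credentials are placeholders).

```python
import json

# Illustrative Azure SQL Database linked service definition; all values are placeholders.
linked_service = {
    "name": "SqlCarsLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<password>;"
        }
    }
}
print(json.dumps(linked_service, indent=2))
```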
28. What is a Synapse workspace, and where is it required?
A Synapse workspace is the central workspace of Azure Synapse Analytics (the service formerly known as Azure SQL Data Warehouse). It manages and integrates enterprise data warehousing, big data analytics, and data integration capabilities in a single platform. It supports role-based access control (RBAC), encryption, and auditing capabilities to ensure data protection and compliance with regulatory requirements.
Use cases:
- Collaboration on analytical projects by data engineers, data scientists, and business analysts, leveraging the capabilities for data querying, analysis, and visualization.
- It is used in analyzing and visualizing the data, creating reports and dashboards, and gaining insights into business performance and trends, supporting business intelligence and reporting needs.
29. What is the general connector error in Azure Data Factory? Mention the causes of the errors.
The general connector errors are:
1. UserErrorOdbcInvalidQueryString
Cause: The user submits a wrong or invalid query to fetch the data or schemas.
2. FailedToResolveParametersInExploratoryController
Cause: This error arises from a limitation on parameterized linked services: a linked service that references another linked service with parameters is not supported for test connection or data preview.
Advanced Azure Data Factory Interview Questions for Experienced
30. Explain the two levels of security in ADLS Gen2.
The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, it is worth calling out the two levels of security because they are fundamental to getting started with the data lake, and they often confuse people who are just starting out.
- Role-Based Access Control (RBAC): RBAC includes built-in Azure roles such as reader, contributor, owner, or custom roles. Typically, RBAC is assigned for two reasons. One is to specify who can manage the service itself (i.e., update settings and properties for the storage account). Another reason is to permit the use of built-in data explorer tools, which require reader permissions.
- Access Control Lists (ACLs): Access control lists specify exactly which data objects a user may read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant, thus familiar to those with a Unix or Linux background.
POSIX does not operate on a security inheritance model, which means that access ACLs are specified for every object. The concept of default ACLs is critical for new files within a directory to obtain the correct security settings, but it should not be thought of as inheritance. Because of the overhead of assigning ACLs to every object, and because there is a limit of 32 ACLs per object, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure Active Directory groups.
31. How is the performance of pipelines optimized in Azure Data Factory?
Optimizing the performance of Azure Data Factory pipelines involves strategically enhancing data movement, transformation, and overall pipeline execution. Some ways to optimize performance are:
- Choose the appropriate integration runtime for data movement activities based on the location of the data source and destination; integration runtimes help optimize performance by providing compute resources closer to the data.
- Use parallelism: break the data into smaller chunks and process them in parallel, for example across pipeline activities or within data flow activities.
- In mapping data flows, minimize unnecessary transformations and data shuffling.
32. What are triggers in ADF, and how can they be used to automate pipeline expressions? What is their significance in pipeline development?
In ADF, triggers are components that enable the automated execution of pipeline activities based on predefined conditions or schedules. In orchestrating the data workflows, triggers have played a crucial role, along with automating data integration and transformation tasks within ADF.
Significance of triggers in pipeline development:
- Automation: Triggers enable automated execution of pipelines, eliminating manual intervention in scheduling and running tasks.
- Scheduling: Scheduling triggers help users define recurring schedules for pipeline execution, ensuring that the tasks are performed and integrated at regular intervals.
- Event-Driven Architecture: Event triggers enable an event-driven architecture in ADF, where pipelines are triggered in response to specific data events or business events (see the trigger sketch below).
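As an illustration of the event-driven case, here is a hypothetical blob event trigger definition (the storage account resource ID, paths, and pipeline name are placeholders).

```python
# Illustrative blob event trigger: the referenced pipeline runs whenever a new
# blob lands under the given path (all values are placeholders).
event_trigger = {
    "name": "NewFileTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
            "blobPathBeginsWith": "/landing/blobs/",
            "events": ["Microsoft.Storage.BlobCreated"]
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyCarsPipeline", "type": "PipelineReference"}}
        ]
    }
}
```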
33. How many types of datasets are supported in ADF?
ADF supports datasets in the following file formats:
- CSV
- Excel
- Binary
- Avro
- JSON
- ORC
- XML
- Parquet
34. What are the prerequisites for Data Factory SSIS execution?
The prerequisites include an Azure SQL managed instance or an Azure SQL Database for hosting the SSIS IR and SSISDB catalog.
35. What are the differences between the transformation procedures called mapping data flows and wrangling data flow in ADF?
The mapping data flow is a way of graphically designing data transformations. It allows the user to design data transformation logic in a graphical interface without the need for a professional programmer, which makes it cost-effective.
The wrangling data flow, by contrast, is a code-free method of data preparation. It integrates with Power Query Online and makes the Power Query M data manipulation capabilities available, executing them at scale on Spark.
36. What is an ARM template? Where is it used?
ARM stands for Azure Resource Manager. An ARM template lets you define and deploy Azure infrastructure declaratively, covering not only virtual machines but also storage systems and other resources. In ADF, ARM templates are commonly used to move a factory's resources (pipelines, datasets, linked services, triggers) from one environment to another.
37. Mention a few functionalities of the ARM template in ADF.
The ARM template provides a number of functions that can be used in deployments to a resource group, subscription, or management group, such as:
- CIDR
- Array functions
- Comparison functions
- Resource functions
- Deployment value functions
- Subscription scope function
- Logical function
38. What are the functions in ARM templates with respect to CIDR and comparison?
A few functionalities of ARM are as follows:
CIDR: These are functions in the sys namespace, such as:
- parseCidr
- cidrSubnet
- cidrHost
Comparison functions: In ARM templates, these help compare values, such as:
- coalesce
- equals
- less
- lessOrEquals
- greater
- greaterOrEquals
39. What is a Bicep in ADF?
Bicep is a domain-specific language that utilizes declarative syntax to deploy Azure resources. A Bicep file consists of the infrastructure to be deployed in Azure. The file gets used throughout the development lifecycle to repeatedly deploy the infrastructure.
40. What is ETL in ADF?
ETL stands for Extract, Transform, and Load process. It is a data pipeline used for the collection of data from various resources. The data is then transformed according to the business rules. After transforming the data, it is then loaded into the destination data store.
Data transformation involves filtering, sorting, aggregating, joining, cleaning, deduplicating, and validating data.
41. What is a V2 data factory?
- V2 in ADF is Azure Data Factory version 2, which allows one to create and schedule pipelines, which are data-driven workflows that can ingest data from disparate data stores.
- It processes or transforms data by using compute services such as Azure HDInsight Hadoop, Spark, and Azure Data Lake Analytics.
- It publishes output data to data stores such as Azure SQL Data Warehouse for BI applications to consume.
42. What are the three most important tasks that you can complete with ADF?
The three most important types of task you can complete with ADF are data movement, data transformation, and control (orchestration).
- Data movement activities facilitate the flow of data from one data store to another, using Data Factory's copy activity.
- Data transformation activities modify the loaded data as it moves towards its final destination; examples are stored procedures, U-SQL, and Azure Functions.
- Control activities orchestrate the pipeline itself; examples are ForEach, If Condition, Execute Pipeline, and Wait.
43. How is Azure Data Factory different from Data Bricks?
Azure Data Factory mainly excels at ETL (extract, transform, load) workflows, streamlining data movement and data transformation. Databricks, which is built on Apache Spark, focuses on advanced analytics and large-scale data processing.
44. What do you mean by Azure SSIS integration runtime?
The cluster of virtual machines hosted in Azure and dedicated to running SSIS packages in the data factory is termed the Azure-SSIS integration runtime (IR). The size of the nodes can be configured to scale up, while the number of nodes in the virtual machine cluster can be configured to scale out.
45. What is Data Flow Debug?
Data Flow Debug is a feature Azure Data Factory provides to developers that lets them observe and analyze the transformations applied to the data while a data flow is being built, during both the design and debugging phases. It gives the user real-time feedback on the shape of the data at each stage of execution as it flows through the pipeline.
46. How are email notifications sent on a Pipeline Failure?
There are multiple options to send an email notification to the developer in case of a Pipeline Failure:
- Logic App with a Web/Webhook activity: A Logic App can be configured so that, upon receiving an HTTP request, it quickly notifies the required set of people about the failure.
- Alerts and metrics: These can be set up on the pipeline itself, where a number of options are available for sending an email when a failure is detected.
47. What do you mean by an Azure SQL database?
Azure SQL Database is an integral part of the Azure family: a fully managed, secure, and intelligent service that uses the SQL Server database engine to store data in the Azure cloud.
48. What do you understand from a data flow map?
A data flow map, also called a data flow diagram (DFD), depicts the flow of data inside a system or organization. It shows the movement of data from one process or entity to another, highlighting the source, the destination, and the transformations applied to the data along the way. Data flow diagrams come in handy in system analysis and design to visualize and understand how data moves.
49. What is the capability of Lookup Activity in ADF?
Lookup Activity can retrieve a dataset from any of the data sources supported by Data Factory and Synapse pipelines. Some of its capabilities are listed below (a sketch follows the list):
- It can return up to 5,000 rows at once; if the result contains more than 5,000 rows, only the first 5,000 are returned.
- The supported activity size for lookup is 4 MB; the activity fails if the size exceeds the limit.
- The longest duration before timeout for lookup activity is 24 hours.
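A minimal, hypothetical Lookup activity and a downstream expression that consumes its result might look like this (all names are placeholders):

```python
# Illustrative Lookup activity: runs a query against a dataset and, with
# firstRowOnly, exposes a single row to downstream activities.
lookup_activity = {
    "name": "LookupWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT MAX(ModifiedDate) AS Watermark FROM dbo.Cars"
        },
        "dataset": {"referenceName": "SqlCarsDataset", "type": "DatasetReference"},
        "firstRowOnly": True
    }
}

# A later activity reads the looked-up value:
watermark_expr = "@activity('LookupWatermark').output.firstRow.Watermark"
```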
Azure Data Factory Salary Trends
Today's tech arena is data-driven and cloud-dominant, which has catalyzed the demand for cloud architects and, in turn, created high-paying jobs across the globe.
According to recent reports from AmbitionBox for India, Data Engineer salaries range from INR 4 LPA to INR 60 LPA, with an average of INR 10 LPA.
Figures from the USA suggest that Data Engineer base pay ranges from $40K up to $531K, as per Built In.
Azure Data Factory Job Trends
Azure Data Factory, part of the Microsoft cloud computing platform, offers a comprehensive suite of data services, which has made it a valuable choice for firms seeking scalable, secure, and advanced data solutions. These factors have increased the demand for professionals skilled in Azure data engineering.
According to the latest statistics available on LinkedIn, there are more than 22,000 jobs available across India, with the Delhi NCR region alone having more than 2000 job openings.
Global demand for data engineering has also surged in countries like the USA, where tech companies are keen to establish well-built cloud infrastructure; this has resulted in around 4,000 Azure Data Engineer job openings in the USA, as cited by Glassdoor.
Azure Data Factory Roles & Responsibilities
Developers skilled in data engineering are the real architects of the data ecosystem: they build the infrastructure, ensure the quality and security of the data, and thereby empower data-driven decision-making. According to a recent job description released by Walmart Global Tech India, Data Engineers are required to fulfill the following key roles and responsibilities:
Technical Skills:
- Develop the conceptual, logical, and physical data models
- Implement RDBMS and Operational Data Store
- Decent knowledge of GBQ or Azure Synapse
- Strong understanding of data warehousing concepts
- Knowledge of NoSQL and ETL tools
Soft Skills:
- Excellent oral and written communication skills
- Ability to work in diverse and different team environments
- Strong problem-solving, critical thinking, and analytical skills
We hope this set of Azure Data Factory interview questions will help you prepare for your interviews. All the best!