Process Advisors

ey-logo
*Subject to Terms and Condition

Categories

Top Answers to ETL Interview Questions

CTA

ETL stands for extract, transform, and load. These are the three functions of databases that are combined into a single tool such that you can take out data from a particular database and store or keep it in another. This ETL Interview Questions blog has a compiled list of questions that are most generally asked during interviews. Prepare the ETL interview questions listed below and get ready to crack your job interview:

Q1. Compare between ETL and ELT.
Q2. What is an ETL process?
Q3. How many steps are there in an ETL process?
Q4. What are the steps involved in an ETL process?
Q5. Can there be sub-steps for each of the ETL steps?
Q6. What are initial load and full load?
Q7. What is meant by incremental load?
Q8. What are the names of the layers in the three-tier system of ETL and how does it function?
Q9. What are the characteristics of snapshots and what is their meaning?
Q10. What are views?

This ETL Interview Questions blog is broadly divided into the categories mentioned below:
1. Basic

2. Intermediate

3. Advanced

Watch this ETL Pipeline video:

Youtube subscribe

Basic Interview Questions

1. Compare between ETL and ELT.

Criteria ETL ELT
Working methodology Data from the source system to the data warehouse Leverages the target system to transform data
Privacy Pre-loading transformation has the potential to eliminate PII, which provides assistance in complying with HIPAA regulations. More privacy safeguards are necessary for directly loading data.
Costs Cost issues can arise from the use of separate servers. The simplified data stack has a lower cost.
Performance Average Good
Flexibility High Low

2. What is an ETL process?

ETL is the process of Extraction, Transformation, and Loading.

Learn more about Business Objects vs Informatica in this insightful blog!

3. How many steps are there in an ETL process?

In its fundamental form, the ETL process involves the extraction, transformation, and loading of data. Although the acronym suggests a concise and orderly three-step procedure – extract, transform, load.

Go through this insightful blog to get a detailed understanding of the ETL process!

4. What are the steps involved in an ETL process?

The steps involved are defining the source and the target, creating the mapping, creating the session, and creating the workflow.

5. Can there be sub-steps for each of the ETL steps?

Yes, each step in the ETL (Extract, Transform, Load) process can have sub-steps that further refine and enhance the data. Here are some examples of sub-steps for each of the ETL steps:

Extract:

  • Connect to the data source
  • Retrieve data based on specific criteria or filters
  • Perform data sampling or profiling to understand the structure and quality of the data
  • Handle data extraction errors or exceptions

Transform:

  • Filtering
  • Cleansing
  • De-duplicating
  • Validating
  • Standardizing
  • Normalizing
  • Aggregating
  • Enriching
  • Deriving

Load:

  • Create or configure the destination database or data warehouse
  • Define the schema or structure of the target data tables
  • Map and transform the transformed data to match the target schema
  • Handle data loading errors or conflicts
  • Optimize the loading process for performance and efficiency
  • Ensure data consistency and referential integrity during the loading process

These sub-steps provide a more detailed breakdown of the tasks and operations that can be performed within each step of the ETL process. They help in ensuring data quality, consistency, and alignment with the desired data model or requirements.

6. What are initial load and full load?

In ETL, the initial load is the process for populating all data warehousing tables for the very first time. In full load, when the data is loaded for the first time, all set records are loaded at a stretch depending on its volume. It would erase all contents from the table and would reload the fresh data.

7. What is meant by incremental load?

Incremental load refers to applying dynamic changes as and when required in a specific period and predefined schedules.

Get 100% Hike!

Master Most in Demand Skills Now !

8. What are the names of the layers in the three-tier system of ETL and how does it function?

The data warehouse is considered to be the 3-tier system in ETL.
It is known as a 3-tier system because data warehouses generally have a three-level (tier) architecture:

  1. Bottom Tier (Data Warehouse Server)
  2. Middle Tier (OLAP Server)
  3. Top Tier (Front end Tools)

The middle tier in ETL provides end users the data that is usable in a secure way. the other two layers are on either side of the middle tier, the end users and the back-end data storage.

The first layer in ETL is the source layer, and it is the layer where data lands. The second layer is the integration layer where the data is stored after transformation. The third layer is the dimension layer where the actual presentation layer is present.

9. What are the characteristics of snapshots and what is their meaning?

Snapshots are copies of the read-only data that is stored in the master table.

Snapshots are located on remote nodes and refreshed periodically so that the changes in the master table can be recorded. They are also replicas of tables.

10. What are views?

Views are built using the attributes of one or more tables. View with a single table can be updated, but those with multiple tables cannot be updated.

Intermediate Interview Questions

11. What is the meaning of a materialized view log, and how would you define a materialized view?

A materialized view log is a table that stores changes to the base tables used in a materialized view. A materialized view is a pre-computed aggregate table that contains summarized or joined data from fact and dimension tables.

Learn more about Informatica from this in-depth Informatica Tutorial!

12. What is the difference between PowerCenter and PowerMart?

PowerCenter processes large volumes of data, whereas Power Mart processes small volumes of data.

13. Mention a few ETL flaws.

Following are a few common ETL bugs:

User Interface Bug: GUI bugs encompass issues with color selection, font style, navigation, spelling check, etc., in the user interface.

Input/Output Bug: This type of bug causes the application to accept invalid values instead of valid ones.

Boundary Value Analysis Bug: Bugs in this section verify both the minimum and maximum values.

Calculation bugs: Calculation bugs usually result in mathematical errors that lead to incorrect results.

Load Condition Bugs: This software defect prevents the inclusion of multiple users and prohibits the utilization of user-accepted data.

Race Condition Bugs: This type of bug disrupts the proper functioning of your system, causing it to crash or hang.

ECP (Equivalence Class Partitioning) Bug: A bug of this type leads to the presence of invalid types.

Version Control Bugs: Regression testing is where you commonly encounter these types of bugs, and it lacks version details.

Hardware Bugs: This type of bug hinders the device from responding to an application as expected.

Help Source Bugs: Due to this bug, the help documentation becomes incorrect.

14. With which apps can PowerCenter be connected?

PowerCenter can be connected with ERP sources such as SAP, Oracle Apps, PeopleSoft, etc.

15. Which partition is used to improve the performance of ETL transactions?

To improve the performance of ETL (Extract, Transform, Load) transactions, partitioning is often employed. Partitioning involves dividing a large dataset into smaller, more manageable sections based on a specified criterion. 

The partitioning strategy used in ETL depends on the specific needs and characteristics of the data. However, a common approach is to utilize horizontal partitioning. In horizontal partitioning, the dataset is divided into distinct partitions based on a specific attribute or range of values.

By partitioning the data, several performance benefits can be achieved:

  1. Parallel processing: Partitioning allows different partitions to be processed simultaneously by multiple threads or processes. This parallelism enhances overall ETL performance by utilizing the available computing resources efficiently.
  2. Scalability: As the dataset grows in size, partitioning enables the ETL process to scale effectively. Each partition can be processed independently, enabling horizontal scalability by distributing the workload across multiple processing units or nodes.
  3. Reduced I/O operations: By isolating data within partitions, I/O operations can be minimized. During extraction and transformation, only the relevant partitions need to be accessed, reducing the amount of data read or written.

4. Enhanced query performance: Partitioning can improve query performance when accessing or querying the data. By partitioning based on attributes commonly used in queries, the database optimizer can efficiently navigate to the relevant partition, resulting in faster query execution times.

16. Does PowerMart provide connections to ERP sources?

No! PowerMart does not provide connections to any of the ERP sources.

17. What is meant by partitioning in ETL?

Partitioning in ETL refers to the sub-division of the transactions in order to improve their performance.

18. What is the benefit of increasing the number of partitions in ETL?

An increase in the number of partitions enables the Informatica server to create multiple connections to a host of sources.

19. What are the types of partitions in ETL?

Types of partitions in ETL are Round-Robin partition and Hash partition.

Career Transition

Advanced Interview Questions

20. What is Round-Robin partitioning?

In Round-Robin partitioning, the data is evenly distributed by Informatica among all partitions. It is used when the number of rows in the process in each of the partitions is nearly the same.

21. What is Hash partitioning?

In Hash partitioning, the Informatica server would apply a hash function in order to partition keys to group data among the partitions. It is used to ensure the processing of a group of rows with the same partitioning key in the same partition.

22. What is mapping in ETL?

Mapping refers to the flow of data from the source to the destination.

23. What is a session in ETL?

A session is a set of instructions that describe the data movement from the source to the destination.

24. What is meant by Worklet in ETL?

Worklet is a set of tasks in ETL. It can be any set of tasks in the program.

25. What is Workflow in ETL?

Workflow is a set of instructions that specify the way of executing the tasks to the Informatica.

27. What is meant by Worklet in ETL?

Worklet is a set of tasks in ETL. It can be any set of tasks in the program.

26. What is the use of Mapplet in ETL?

Mapplet in ETL is used for the purpose of creation as well as the configuration of a group of transformations.

Also, prepare yourself for the Informatica Interview Questions and Answers!

27. What is meant by operational data store?

The operational data store (ODS) is the repository that exists between the staging area and the data warehouse. The data stored in ODS has low granularity.

28. How does the operational data store work?

Aggregated data is loaded into the enterprise data warehouse (EDW) after it is populated in the operational data store (ODS). Basically, ODS is a semi-data warehouse (DWH) that allows analysts to analyze business data. The data persistence period in ODS is usually in the range of 30–45 days and not more.

29. What does the ODS in ETL generate?

Operational data stores function similarly to the extract, transform, and load (ETL) process. In the ETL process, data is extracted from specific sources, transformed, and loaded to its designated destination. Conversely, in the ODS process, raw data from production systems is imported and stored in its original state, without undergoing any transformations. The purpose of this approach is to present the data as-is to business intelligence (BI) applications for analysis and operational decision-making.

30. When are the tables in ETL analyzed?

To find a cost-efficient plan for data retrieval, we analyze tables in ETL.

The use of the ANALYZE statement allows the validation and computing of statistics for either the index, table, or cluster.

31. How are the tables analyzed in ETL?

Statistics generated by the ANALYZE statement is reused by a cost-based optimizer in order to calculate the most efficient plan for data retrieval. The ANALYZE statement can support the validation of structures of objects, as well as space management, in the system. Operations include COMPUTER, ESTIMATE, and DELETE.

Example for Oracle 7:

select OWNER,

sum(decode(nvl(NUM_ROWS,9999), 9999,0,1)) analyzed,

sum(decode(nvl(NUM_ROWS,9999), 9999,1,0)) not_analyzed,

count(TABLE_NAME) total

from dba_tables

where OWNER not in ('SYS', 'SYSTEM')

group by OWNER

This program executes a SQL query against the database’s `dba_tables` view to retrieve information about tables. The purpose is to analyze the data in the tables and determine the number of analyzed and not analyzed tables for each owner.

Here’s what the program does step by step:

  1. It selects the following columns from the `dba_tables` view: `OWNER` (representing the owner of the table), `analyzed` (representing the sum of tables that are analyzed), `not_analyzed` (representing the sum of tables that are not analyzed), and `total` (representing the count of all tables for the owner).
  2. The `decode` function is used to conditionally sum the tables based on the `NUM_ROWS` column. If the `NUM_ROWS` value is null or equal to 9999, it indicates that the table is not analyzed. In that case, the `decode` function increments the `not_analyzed` count by 1, and `analyzed` count by 0. Otherwise, if `NUM_ROWS` is not null and not equal to 9999, it indicates an analyzed table, so the `decode` function increments the `analyzed` count by 1 and `not_analyzed` count by 0.
  3. The `where` clause filters out tables owned by the ‘SYS‘ and ‘SYSTEM‘ users. It ensures that only tables owned by other users are considered for analysis.

4. Finally, the results are grouped by the `OWNER` column, aggregating the counts for each owner.

Become a Business Intelligence Architect

32. How can the mapping be fine-tuned in ETL?

Steps for fine-tuning the mapping involves using the condition for filter in the source qualifying the data without the use of filter, utilizing persistence as well as cache store in lookup t/r, using the aggregations t/r in sorted i/p group by different ports, using operators in expressions instead of functions, and increasing the cache size and commit interval.

33. What are the differences between connected and unconnected lookups in ETL?

Connected Lookup Transformation Unconnected Lookup Transformation
The upstream transformations in the pipeline directly provide input values to Connected Lookup. Another transformation provides input values to the Unconnected Lookup through its lookup expression.
The multiple return ports allow for retrieving multiple columns from the same row. It has just a single return port and it returns one column from every row.
The lookup cache holds both the lookup condition column and the output ports of the lookup source columns. The lookup cache includes the lookup condition and the lookup/return port, which comprise all the lookup/output ports.
Users can define values and it supports them. User-defined values are not supported.
Multiple values are sent to downstream transformations. One transformation receives a single output value.

This blog will help you get a better understanding of ETL Reporting Tools!

34. Can you define cubes and OLAP cubes?

The cube plays a crucial role in data processing, serving as a fundamental component. Essentially, cubes function as data processing units within the Data Warehouse, encompassing dimensions and fact tables. They facilitate clients by offering a multidimensional perspective on data, along with capabilities for querying and analysis.

On the other hand,, Online Analytical Processing (OLAP) is software designed to enable the analysis of data from multiple databases simultaneously. To serve reporting purposes, an OLAP cube can be employed to store data in a multidimensional format. By utilizing cubes, the creation and viewing of reports are simplified, thereby enhancing and streamlining the reporting process. It is the responsibility of end users to manage and maintain these cubes, requiring them to manually update the data contained within.

35. Describe the facts and their sources.

An integral component of data warehousing pertains to the concept of a fact table. Essentially, a fact table serves as a representation of the measurements, metrics, or factual information pertaining to a business process. Within these fact tables, the facts themselves are stored and connected to multiple dimension tables through the use of foreign keys. Facts typically consist of detailed or aggregated measurements related to a business process that can be calculated and grouped together to address specific business inquiries. Data schemas such as the star schema or snowflake schema comprise a central fact table surrounded by several dimension tables. Examples of facts include figures such as sales, cost, profit and loss, among others.

Fact tables contain two types of columns: foreign keys and measure columns. Foreign keys are used to store references to dimensions, while measure columns contain numerical facts. Additional attributes may be incorporated based on the specific requirements and needs of the business.

Various types of facts exist, including the following three fundamental categories:

  1. Additive: Fully additive facts offer the greatest flexibility and utility, as they can be summed across any dimension associated with the fact table.
  2. Semi-additive: Semi-additive facts can be summed across certain dimensions associated with the fact table, but not all.

3. Non-additive: Non-additive facts, found within the Fact table, cannot be summed across any dimension. An example of a non-additive fact is a ratio.

Course Schedule

Name Date Details
ETL Certification Course 23 Sep 2023(Sat-Sun) Weekend Batch
View Details
ETL Certification Course 30 Sep 2023(Sat-Sun) Weekend Batch
View Details
ETL Certification Course 07 Oct 2023(Sat-Sun) Weekend Batch
View Details

3 thoughts on “Top ETL Interview Questions and Answers”

  1. Fantastic! Each and every questions (with answers) make brightness of our future. Well done. Keep adding this. Thank you, once again

Leave a Reply

Your email address will not be published. Required fields are marked *