Working methodology: ETL moves data from the source system to the data warehouse, transforming it on the way; ELT, by contrast, leverages the target system to transform the data.
ETL is the process of Extraction, Transformation, and Loading.
In the ETL process, data is extracted from sources such as database servers and is used to generate business rules.
The steps involved are defining the source and the target, creating the mapping, creating the session, and creating the workflow.
Each of the steps involved in ETL has several sub-steps; the transform step has the most.
In ETL, the initial load is the process of populating all data warehouse tables for the very first time. In a full load, all records are loaded at a stretch the first time the data is loaded, regardless of volume; a full load erases all existing contents of the table and reloads fresh data.
An incremental load, by contrast, applies only the changes that have occurred since the last load, over a specific period and on predefined schedules.
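A minimal SQL sketch of the difference, assuming hypothetical staging and warehouse tables stg_orders and dw_orders and a bind variable :last_load_ts tracking the previous load time (all names are illustrative, not from the original):

```sql
-- Full load: wipe the target and reload everything
TRUNCATE TABLE dw_orders;
INSERT INTO dw_orders (order_id, customer_id, amount, updated_at)
SELECT order_id, customer_id, amount, updated_at
FROM stg_orders;

-- Incremental load: apply only rows changed since the last run
INSERT INTO dw_orders (order_id, customer_id, amount, updated_at)
SELECT order_id, customer_id, amount, updated_at
FROM stg_orders
WHERE updated_at > :last_load_ts;
```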
In ETL, the data warehouse is considered a three-tier system.
The middle tier provides end users with data in a usable and secure way. The other two tiers sit on either side of it: the end-user tier and the back-end data storage tier.
The first layer in ETL is the source layer, where the data lands. The second is the integration layer, where the data is stored after transformation. The third is the dimension layer, which is the actual presentation layer.
Snapshots are read-only copies of data stored in a master table.
Snapshots are located on a remote node and refreshed periodically so that changes to the master table are recorded. They are, in effect, replicas of tables.
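In Oracle, for example, a snapshot of a remote master table can be defined roughly as follows; the table name, columns, and database link are invented for illustration:

```sql
-- Hypothetical snapshot of a remote master table, fully refreshed once a day
CREATE SNAPSHOT customers_snap
REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1
AS
SELECT customer_id, name, status
FROM customers@remote_db;
```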
Views are built using the attributes of one or more tables. A view based on a single table can be updated, but a view built over multiple tables cannot.
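A brief SQL illustration of the two cases, using invented table names:

```sql
-- Single-table view: updates generally pass through to the base table
CREATE VIEW active_customers AS
SELECT customer_id, name
FROM customers
WHERE status = 'ACTIVE';

-- Multi-table (join) view: not directly updatable
CREATE VIEW order_details AS
SELECT o.order_id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```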
A materialized view is a pre-computed table holding aggregated or joined data from the fact tables as well as the dimension tables; in other words, it is an aggregate table.
A materialized view log is a table on the master table that records changes to it, which allows the materialized view to be refreshed incrementally.
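An Oracle-style sketch of both objects, with hypothetical sales and products tables:

```sql
-- Change log on the master table; required if the view is to be fast-refreshed
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID;

-- Aggregate table pre-computing a join plus an aggregation
CREATE MATERIALIZED VIEW sales_by_category
REFRESH COMPLETE ON DEMAND
AS
SELECT p.category, SUM(s.amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category;
```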
PowerCenter is designed to process large volumes of data, whereas PowerMart processes low volumes of data.
PowerCenter can connect to ERP sources such as SAP, Oracle Apps, and PeopleSoft.
Session partitioning is used to improve the performance of ETL transactions.
No. PowerMart does not provide connections to any ERP sources, and it does not allow session partitioning.
Partitioning in ETL refers to subdividing transactions in order to improve their performance.
Increasing the number of partitions enables the Informatica Server to create multiple connections to a host of sources.
The types of partitions in ETL are Round-Robin partitioning and Hash partitioning.
In Round-Robin partitioning, the Informatica server distributes the data evenly among all partitions. It is used when the number of rows to be processed in each partition is nearly the same.
In Hash partitioning, the Informatica server applies a hash function to the partitioning keys in order to group data among the partitions. It is used to ensure that rows with the same partitioning key are processed in the same partition.
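Round-robin distribution has no direct SQL equivalent, but the idea behind hash partitioning can be shown with Oracle-style DDL; the table and column names are invented:

```sql
-- Rows with the same customer_id hash to the same partition,
-- so they are always processed together
CREATE TABLE sales_part (
  sale_id     NUMBER,
  customer_id NUMBER,
  amount      NUMBER
)
PARTITION BY HASH (customer_id) PARTITIONS 4;
```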
Mapping refers to the flow of data from the source to the destination.
A session is a set of instructions that describes the movement of data from the source to the destination.
A worklet is a set of tasks in ETL; it can group any set of tasks in the program.
A workflow is a set of instructions that tells the Informatica server how to execute the tasks.
A mapplet in ETL is used to create and configure a group of transformations.
An operational data store (ODS) is a repository that sits between the staging area and the data warehouse. Data stored in an ODS has low granularity.
Aggregated data is loaded into the enterprise data warehouse (EDW) after it has been populated in the ODS. The ODS thus acts as a semi data warehouse, supporting analysis of business data. Data typically persists in an ODS for 30 to 45 days, not more.
Like the data warehouse, the ODS in ETL generates primary keys and takes care of errors and rejects.
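As a rough illustration of the limited persistence window, an ODS purge job might look like this; ods_orders and load_date are assumed names:

```sql
-- Purge ODS rows older than the 45-day retention window (Oracle-style)
DELETE FROM ods_orders
WHERE load_date < SYSDATE - 45;
COMMIT;
```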
The ANALYZE statement allows validation and computation of statistics for an index, a table, or a cluster.
The statistics generated by ANALYZE are used by the cost-based optimizer to calculate the most efficient plan for data retrieval. The ANALYZE statement can also validate the structures of objects and support space management in the system. Its operations include COMPUTE, ESTIMATE, and DELETE.
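For example, in Oracle (the table name is illustrative):

```sql
-- Compute exact statistics for the cost-based optimizer
ANALYZE TABLE orders COMPUTE STATISTICS;

-- Estimate statistics from a sample instead of a full scan
ANALYZE TABLE orders ESTIMATE STATISTICS SAMPLE 10 PERCENT;

-- Validate the structure of the object
ANALYZE TABLE orders VALIDATE STRUCTURE;

-- Remove previously gathered statistics
ANALYZE TABLE orders DELETE STATISTICS;
```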
Steps for fine-tuning a mapping include: filtering the data with a condition in the source qualifier rather than with a Filter transformation (sketched below); utilizing persistent caches in Lookup transformations; using sorted input and grouping by ports in Aggregator transformations; using operators in expressions instead of functions; and increasing the cache size and commit interval.
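The first tip simply pushes the condition into the source query so fewer rows enter the mapping; this is a hypothetical SQL override, with invented table and column names:

```sql
-- SQL override in the source qualifier: only active orders reach the mapping,
-- so no downstream Filter transformation is needed
SELECT order_id, customer_id, amount
FROM orders
WHERE status = 'ACTIVE';
```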
A connected lookup is used in the mapping flow and can return multiple values; it is connected to other transformations and returns a value. An unconnected lookup is used when the lookup is not part of the main flow; it returns only a single output and cannot be connected to other transformations, but it is reusable.