
DataStage Parallel Stage Groups and Designing Jobs in the DataStage Palette

DataStage Stages Palette – ETL Tools

DataStage and QualityStage stages are organized into the following logical groups:

  • General elements
  • Debug and development stages
  • Processing stages
  • File stage types
  • Database stages
  • Real-time stages
  • Restructure stages
  • Data quality (QualityStage) stages
  • Sequence activity stages

The most widely used DataStage and QualityStage stages are discussed below, along with their major features.

General Elements

  • A Link indicates the flow of data. Stream, reference, and reject are the main types of links.
  • A Container can be local or shared; its main role is to visually simplify a complex job design, making the design easier to use and recognize.
  • An Annotation adds floating descriptions and notes to the job canvas. It provides a way to document the ETL process and helps in understanding the given job.

Debug and Development stages


  • Row Generator produces a set of mock data that conforms to the specified metadata (cycled or random values). Useful for development and testing.
  • Column Generator adds one or more columns to the flow and generates test data for those columns.
  • Peek records column values and lets you view them in the Director log. It can have a single input link and multiple output links.
  • Sample operates on an input data set in one of two modes: percent mode and period mode (see the sketch after this list).
  • Head copies the first N rows of each partition of an input data set to an output data set.
  • Tail is the counterpart of the Head stage: from each partition, the last N rows are selected.
  • Write Range Map writes a data set in a form usable by the range partitioning method.
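
To make the Head, Tail, and Sample behavior concrete, here is a minimal Python sketch (not DataStage code) of what these stages do to a single partition of rows. The partition contents, N values, and sampling percentage are illustrative assumptions.

import random

def head(rows, n):
    """Keep the first N rows of a partition (Head stage)."""
    return rows[:n]

def tail(rows, n):
    """Keep the last N rows of a partition (Tail stage)."""
    return rows[-n:]

def sample_percent(rows, percent, seed=42):
    """Randomly keep roughly percent% of the rows (Sample stage, percent mode)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() * 100 < percent]

partition = list(range(1, 21))        # one partition of 20 rows
print(head(partition, 3))             # [1, 2, 3]
print(tail(partition, 3))             # [18, 19, 20]
print(sample_percent(partition, 25))  # roughly a quarter of the rows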

Processing Stages


  • Aggregator groups the incoming data stream and calculates summaries such as min, max, and count for each group. Grouping can be performed using two methods: pre-sort and hash table.
  • Copy copies the input data set to one or more output data sets.
  • FTP stage implements the FTP protocol to transfer data to or from a remote machine.
  • Filter passes through only the records that meet the specified requirements.
  • Funnel combines multiple input streams into a single output stream.
  • Join combines two or more inputs based on the values of one or more key columns.
  • Lookup combines two or more inputs based on the values of key columns. It must have exactly one stream (source) input but can have multiple reference (lookup) inputs.
  • Merge combines one master input with multiple update inputs based on the values of one or more key columns. The inputs need to be sorted, and unmatched update records can be captured using multiple reject links.
  • Modify alters the record schema of a data set. Useful for renaming columns, non-default data type conversions, and null handling.
  • Remove Duplicates takes a single sorted data set as input, removes all duplicate rows, and writes the results to a single output.
  • Slowly Changing Dimension automates the process of updating dimension tables whose data changes over time.
  • Sort sorts the input data set on specified key columns.
  • Transformer handles data validation, transformations of extracted data, and lookups.
  • Change Capture compares the before and after states of two input data sets and produces a single output data set that records the differences (see the sketch after this list).
  • Change Apply applies the change records to the before data set to compute the after data set. It typically takes its change input from the Change Capture stage.
  • Difference performs a record-by-record comparison of two input data sets and outputs records describing the differences between them.
  • Checksum generates a checksum from specified columns in a row and adds it to the stream; it can also be used to determine whether records have changed.
  • Compare performs a column-by-column comparison of pre-sorted records. It can have two input links and one output link.
  • Encode encodes a data set using an operating-system encoding command such as gzip.
  • Decode decodes a data set that was encoded by a previous Encode stage.
  • External Filter allows specifying an operating-system command that filters the data being processed.
  • Generic permits users to invoke an OSH (Orchestrate shell) operator from a DataStage stage, with options.
  • Pivot Enterprise is used for horizontal pivoting: it maps multiple columns in an input row to a single column across multiple output rows.
  • Surrogate Key Generator generates surrogate key columns and manages the key source.
  • Switch assigns each input row to an output link based on the value of a selector field; the concept is similar to the switch statement in most programming languages.
  • Compress compresses a data set using the UNIX gzip or compress utility.
  • Expand converts a previously compressed data set back into a sequence of records.
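
The Change Capture / Change Apply pair is easiest to understand as a round trip: capture the differences between a before and an after data set, then apply those differences to the before set to rebuild the after set. Below is a conceptual Python sketch of that round trip. It is not DataStage code: the keyed-dictionary representation and the change-code labels are illustrative assumptions (the real stages emit change codes in their own format).

def change_capture(before, after):
    """Diff two keyed datasets into a list of change records."""
    changes = []
    for key in after:
        if key not in before:
            changes.append(("insert", key, after[key]))
        elif after[key] != before[key]:
            changes.append(("edit", key, after[key]))
    for key in before:
        if key not in after:
            changes.append(("delete", key, None))
    return changes

def change_apply(before, changes):
    """Apply change records to the before set to rebuild the after set."""
    result = dict(before)
    for code, key, value in changes:
        if code in ("insert", "edit"):
            result[key] = value
        elif code == "delete":
            del result[key]
    return result

before = {1: "alpha", 2: "beta", 3: "gamma"}
after  = {1: "alpha", 2: "BETA", 4: "delta"}

changes = change_capture(before, after)
print(changes)                               # edit key 2, insert key 4, delete key 3
assert change_apply(before, changes) == after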


File stage types


  • Sequential File is used to read data from, or write data to, one or more flat (sequential) files (see the sketch after this list).
  • Data Set allows users to read data from or write data to a data set. A data set consists of a descriptor file (with a .ds extension) and one or more data files.
  • File Set allows users to read data from or write data to a file set. Unlike data sets, file sets preserve formatting and are readable by other applications.
  • Complex Flat File allows reading complex file structures on a mainframe machine, such as MVS data sets and files with header and trailer structures.
  • External Source allows reading data that is output by one or more external source programs.
  • Lookup File Set is similar to the File Set stage but creates a partitioned, hashed file that can be used for lookups.
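
As a rough analogue of what the Sequential File stage does, the following Python sketch reads a delimited flat file, filters the records, and writes another flat file. The file names and the status column are illustrative assumptions.

import csv

# Read a delimited flat file into one dict per record.
# File and column names here are illustrative assumptions.
with open("customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Keep only the records that pass a simple condition.
active = [r for r in rows if r.get("status") == "active"]

# Write the filtered records back out as another flat file.
with open("active_customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(active)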

 

Database Stages


  • Oracle Enterprise permits reading data from and writing data to an Oracle database.
  • ODBC Enterprise permits reading data from and writing data to any database exposed as an ODBC source. It is commonly used to process data from Microsoft Access databases and Excel spreadsheets (see the sketch after this list).
  • DB2/UDB Enterprise allows reading data from and writing data to a DB2 database.
  • Teradata allows reading data from and writing data to a Teradata data warehouse. Three Teradata stages are available: Teradata Connector, Teradata Enterprise, and Teradata Multiload.
  • SQL Server Enterprise allows reading data from and writing data to Microsoft SQL Server 2005 and 2008 databases.
  • Sybase allows reading data from and writing data to Sybase databases.
  • Stored Procedure supports DB2, Oracle, Sybase, Teradata, and Microsoft SQL Server.
  • MS OLEDB is used to retrieve information from any type of information repository, such as an ISAM file, a relational source, or a spreadsheet.
  • Dynamic Relational Stage reads from and writes to different supported relational databases, such as Microsoft SQL Server, DB2, Oracle, Sybase, and Informix.
  • Informix (CLI or Load)
  • DB2 UDB (API or Load)
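
For the ODBC route specifically, here is a minimal sketch of reading from and writing to an ODBC data source using the third-party pyodbc module (pip install pyodbc). The DSN, credentials, and table and column names are illustrative assumptions.

import pyodbc

# DSN, credentials, tables, and columns below are illustrative assumptions.
conn = pyodbc.connect("DSN=warehouse;UID=etl_user;PWD=secret")
cur = conn.cursor()

# Read side: pull rows from a source table.
cur.execute("SELECT id, name FROM staging_customers")
rows = cur.fetchall()

# Write side: load the rows into a target table.
cur.executemany("INSERT INTO dim_customer (id, name) VALUES (?, ?)",
                [(r.id, r.name) for r in rows])
conn.commit()
conn.close()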

 

Real Time Stages


  • XML Input stage makes it possible to convert hierarchical XML data to flat relational data sets.
  • XML Output writes tabular data out to XML structures.
  • XML Transformer converts XML documents using an XSLT stylesheet (see the sketch after this list).
  • WebSphere MQ stages provide a collection of connectivity options for accessing IBM WebSphere MQ enterprise messaging systems. Two MQ stages are available in DataStage and QualityStage: the WebSphere MQ connector and the WebSphere MQ plug-in stage.
  • Java Client stage can be used as a target and for lookups. The package contains three public classes.
  • Java Transformer stage supports three links: input, output, and reject.
  • WISD Input: the Information Services input stage.
  • WISD Output: the Information Services output stage.
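
To show the kind of XSLT-driven conversion the XML Transformer stage performs, here is a small Python sketch using the third-party lxml library (pip install lxml). The XML document and the stylesheet, which flattens each customer element into a delimited text line, are illustrative assumptions.

from lxml import etree

# Illustrative source document.
xml = etree.XML("""
<customers>
  <customer><id>1</id><name>Ada</name></customer>
  <customer><id>2</id><name>Grace</name></customer>
</customers>""")

# Illustrative stylesheet: emit one "id;name" text line per customer.
xslt = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="customer">
    <xsl:value-of select="id"/>;<xsl:value-of select="name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
print(str(transform(xml)))   # 1;Ada then 2;Grace, one per line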

Restructure stages


  • Column Export exports data from a number of columns of different data types into a single column of data type binary or string. It can have one input link, one output link, and one reject link (see the sketch after this list).
  • Column Import is the inverse of the Column Export stage. It is typically used to split data arriving in a single column into multiple columns.
  • Combine Records groups rows that have the same key values into vectors of sub-records.
  • Make Subrecords combines specified input vectors into a vector of sub-records whose columns have the same names and data types as the original vectors.
  • Make Vector combines specified columns of an input record into a vector of columns.
  • Promote Subrecords promotes the columns of an input sub-record to top-level columns.
  • Split Subrecords splits an input sub-record field into a set of top-level vector columns.
  • Split Vector promotes the elements of a fixed-length vector to a set of top-level columns.
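
Column Export and Column Import are easiest to picture as packing several typed columns into one delimited string and then unpacking it again, as in this conceptual Python sketch (not DataStage code). The delimiter and column names are illustrative assumptions.

def column_export(record, keys, delimiter="|"):
    """Pack several typed columns into one string column."""
    return delimiter.join(str(record[k]) for k in keys)

def column_import(packed, keys, delimiter="|"):
    """Split the packed string column back into named columns."""
    return dict(zip(keys, packed.split(delimiter)))

row = {"id": 42, "name": "Ada", "balance": 10.5}
packed = column_export(row, ["id", "name", "balance"])
print(packed)                                    # 42|Ada|10.5
print(column_import(packed, ["id", "name", "balance"]))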

 

Data quality (QualityStage) stages


 

  • Investigate analyzes the data content of specified columns of every record in the source file. It offers word and character investigation methods (see the sketch after this list).
  • Match Frequency takes input from a file, a database, or processing stages and generates a frequency distribution report.
  • MNS: Multinational Address Standardization.
  • WAVES: Worldwide Address Verification and Enhancement System.
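
The idea behind character investigation can be sketched in a few lines of Python: reduce each value to a character pattern mask, then count how often each pattern occurs (much as Match Frequency reports a frequency distribution). The mask characters used here are illustrative assumptions, not QualityStage's actual masks.

from collections import Counter

def pattern(value):
    """Reduce a value to a pattern mask: letters -> 'a', digits -> 'n'.
    These mask characters are illustrative, not QualityStage's own."""
    return "".join("a" if c.isalpha() else "n" if c.isdigit() else c
                   for c in value)

values = ["AB-1234", "CD-5678", "99 Main St", "XY-0001"]
freq = Counter(pattern(v) for v in values)
print(freq.most_common())   # [('aa-nnnn', 3), ('nn aaaa aa', 1)]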

Sequence activity stages


  • Job Activity specifies a DataStage server or parallel job to execute.
  • Notification Activity is used by DataStage to send emails to user-specified recipients.
  • Sequencer synchronizes the control flow of multiple activities in a job sequence.
  • Terminator Activity allows the entire sequence to be shut down when certain conditions are met.
  • Wait For File Activity waits for a specific file to appear or disappear before launching the processing (see the sketch after this list).
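
A Wait For File activity is essentially a polling loop, as in this minimal Python sketch. The file path, polling interval, and timeout are illustrative assumptions.

import os
import time

def wait_for_file(path, appear=True, poll_seconds=5, timeout_seconds=300):
    """Poll until path appears (or disappears, if appear=False).
    Returns True on success, False if the timeout elapses first."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if os.path.exists(path) == appear:
            return True
        time.sleep(poll_seconds)
    return False

# Illustrative trigger path; launch downstream processing once it arrives.
if wait_for_file("/data/incoming/trigger.done"):
    print("File arrived; launch the downstream job here.")
else:
    print("Timed out waiting for the trigger file.")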

 
