Datastage stages palette – ETL Tools Datastage and QualityStage stages are sorted into the below logical sections: General Objects Stages of Data Quality Development and Debug Stages Database connectors Restructure stages Real-time stages Debug and Development stages Sequence activities The widely and popular stages used in DataStage and Quality Stage are discussed below and specific major features are explained below General Elements <p A Link indicates the flow of the data. Reference, stream and lookup are the main types of links. The container can be shared or private; the main role is to visually make it simple so that the complex database job design and letting the design for ease of usage and recognition. Annotation is basically used for adding floating Datastage descriptions and notes on job posts. This provides a way to document ETL process and used in understanding the given job. Debug and Development stages Row generator replicates a set of data which sticks to the appropriate metadata (cycled or randomized). Useful for development and testing. Column generator can add one or more column to the flow and can generate test data for the column. Peek The column values are recorded and could be viewed in the director. It can have multiple output link, but single input link. Sample operates on an input data set and has two modes:- percent mode and period mode. Head occupies first N rows at each partition and copies them to an output data set from an input data set A Tail is same as the head stage. From each of the partitions, last N rows are selected. Write Range Map writes a form where the dataset is usable by the range partitioning method. Processing Stages Aggregator joins data vertically from grouping the incoming data stream and calculating brief about a min, count, max etc.; for each team. The data could be sorted using two methods: pre-sort and hash table. Copy copies the input data to single or more output data flows. FTP stage implements the FTP protocol to transfer data to a remote machine Filter filters records that don’t meet relevant requirements. Funnel converts multiple streams into a single one. Join combines more than one input according to the values of a key column/s. Lookup includes two or more inputs based on values of key column/s. It should have a 1 source, but can have multiple lookup tables Merge includes one master input with multiple updates inputs related to the values of a key column/s. The inputs need to be sorted and unmatched secondary entries can be caught using multiple reject links. Modify stage alters the dataset record. Useful for renaming columns, not default data type conversion and null handling. Remove duplicates wants input to be single sorted data set. Removes all duplicate content and writes into a single output. Slowly changing Dimension computerizes the revised process of dimension tables where data frequently change. Sort sorts the input columns Transformer stage handles validation of data, extracted data and lookups. Change Capture catches the before and the after states of 2 data which are inputted and are converted into a single data set that shows the differences made. Change apply will apply changes to the operation to the previous dataset so as to compute after dataset. It retrieves data from the change capture stage. Difference stage executes a record-by-record comparison of 2 input data and outputs the record which is the difference between records. Checksum produces checksum from the specific columns in a row and gets added to the stream. Also able to differentiate between records. Compare does a searching comparison on column-column of pre-sorted records. It can have one output and two input link. Encode encodes data, such as gzip with encoding command Decode decodes the previously encoded data in the previous stage. External Filter allows specifying an OS command that filters the processed data. Generic stage permits users to provoke OSH operator from DataStage stage having options. Pivot Enterprise is used for horizontal pivoting. It maps or assigns multiple columns in the input row to a single column over multiple output rows. Surrogate Key Generator manages key source by generating to a column, the surrogate key. Switch stage matches each input row to an output link on the basis of the value of a selector field. The concept is similar to switch statement in most programming languages. Compress combines data set using the GZIP utility (or can be done using LINUX/UNIX command) Expand extracts previous compressed data set into raw binary data. File stage types The Sequential file is used to see data from or write into one or more flat (sequential) files. Data Set stage allows users to see data from or write data to a file set. File sets are OS files, each of which has .ds extension and one or more data files. File Set stage allows users to read data from or write data to a file set. Unlike datasets, file sets conserve formatting and are readable by other apps. The complex flat file allows reading on a mainframe machine similar to a header, MVS data sets, and trailer structured files from complex file structures. External Source allows reading data from multiple source programs to output. Lookup File Set is similar to File set Stage. Partitioned hashed file can be used for lookups. Database Stages Oracle Enterprise permits reading data to an Oracle database. ODBC Enterprise permits looking data from and writing it to a database called as ODBC source. It is used in the processing of data in Microsoft Access and Excel spreadsheets. DB2/UDB Enterprise allows writing, reading data to a DB2 Database. Teradata allows writing, reading data to a Teradata data warehouse. Three Teradata stages are present; Teradata connector, Enterprise and Multiload. SQL Server Enterprise allows writing, reading data to Microsoft SQLI Server 2005 and 2008 database. Sybase allows into Sybase Databases, reading and writing data. Stored procedure stage chains DB2, Oracle, Sybase, Teradata and Microsoft SQL Server. MS OLEDB is used to retrieve information from all types of the information repositories, such as an ISAM file, relational Source or a spreadsheet. Dynamic Relational Stage is reading from and writing into a different supported relational DB using interfaces such as Microsoft SQL, DB2, Oracle, Sybase and Informix. Informix (CLI or Load) DB2 UDB (API or Load) Real Time Stages XML Input stage makes it possible to convert hierarchical XML data to flat relational data sets. XML Output writes on to XML structures. XML Transformer converts XML documents by XSLT stylesheet. Websphere MQ stages give a collection of connectivity menu to access IBM WebSphere MQ enterprise messaging systems. Two MQ stages available in Datastage and in QualityStage;they are Websphere MQ connector and plug-in stage. Java Client stage could be useful as a target and lookup too. The package contains 3 public classes. Java Transformer level helps in the following three links: input, output and reject WISD Input: Defined as Information Services Input stage WISD Output: Defined as Information Services Output stage Restructure stages Column export, export data onto a single column of datatype binary or string from numerous columns of different data types. It can have a single output, input and a reject link. Column imports corresponding to the Column Export stage. Characteristically used to divide data inward bound in a single column into numerous columns. Combine records stage group rows which have exact same keys, over vectors of sub-records. Make sub-records merge particular input vectors into a vector of sub-records whose columns have the same names and data types as the original vectors. Make vector combines specified input vectors into a vector of columns Promote sub-records supports input sub-records columns to columns top level. Split sub-records splits an input sub-records field into a set of top level vector columns. Split vector supports the elements of a fixed length vector over the set of top level columns. Data quality QualityStage stages Investigate stage predicts data module of appropriate columns of all records from the source file. Offers word and character investigation methods. Match frequency stage obtains input from a file, a database or processing stages and generates an occurrence distribution report. MNS: Refers to Multinational address standardization WAVES: Refers to worldwide address verification and enhancement system. Sequence types of activity stages Job Activity indicates Datastage server or else similar job to execute. Notification Activity used to move the emails to client described recipients by Datastage. Sequencer used for synchronization of the control flow of numerous actions in a job progression. Terminator Activity allows shutting down the entire progression once certain circumstances persist. Wait for file Activity waits for an exact file to emerge or fade away and launches the dispensation.