Top DataStage Interview Questions And Answers
Top Answers to DataStage Interview Questions
|Support for Big Data Hadoop||Access Big Data on a distributed file system, JSON support & JDBC integrator|
|Ease of use||Improve speed, flexibility, & efficacy for data integration|
|Deployment||On-premise or cloud as the need dictates|
DataStage is an extract, transform and load (ETL) tool that is part of the IBM InfoSphere suite. It is used with large data warehouses and data marts to create and maintain such data repositories.
To fill the source file in DataStage, we can write a SQL query or use the Row Generator extract tool.
Merging is done when two or more tables need to be combined based on their primary key column. This key is the basis for merging in DataStage.
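As a rough illustration of key-based merging, here is a minimal Python sketch. It is an analogy only, not DataStage code: the function name `merge_on_key` and the sample tables are hypothetical, and it assumes each key value appears at most once in the second table.

```python
def merge_on_key(master, update, key):
    """Combine rows from two tables that share the same key value."""
    # Index the second table by its key so each match is a direct lookup.
    update_by_key = {row[key]: row for row in update}
    merged = []
    for row in master:
        match = update_by_key.get(row[key])
        if match is not None:
            # Columns from both rows end up in one output row.
            merged.append({**row, **match})
    return merged


# Hypothetical sample data for illustration.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 40}, {"id": 3, "total": 15}]
merged_rows = merge_on_key(customers, orders, "id")
```

Only rows whose key exists in both tables are kept here; rows without a match are dropped in this sketch.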
As their names indicate, these two files serve different purposes in DataStage. The descriptor file contains all the information or description, while the data file contains only the data.
DataStage and Informatica are both powerful ETL tools, but there are a few differences between them. DataStage has the concepts of parallelism and partitioning for node configuration, whereas Informatica does not support parallelism in node configuration. DataStage is also simpler to use than Informatica.
The DataStage Manager defines a collection of functions within the tool, which is called a Routine. There are basically three types of Routines in DataStage: Job Control Routine, Before/After Subroutine, and Transform Function.
Duplicates can be removed in DataStage using the sort function. When running the sort, set the option that allows duplicates to false.
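The idea of sorting first and then keeping one row per key can be sketched in Python. This is an analogy under stated assumptions, not DataStage code: `sort_unique` and the sample rows are hypothetical, and the sketch keeps the first row seen for each key.

```python
from itertools import groupby
from operator import itemgetter


def sort_unique(rows, key):
    """Sort rows on a key, then keep only the first row of each key group."""
    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=itemgetter(key))
    # After sorting, duplicates are adjacent, so groupby can collapse them.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Hypothetical sample data: id 2 appears twice.
rows = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}, {"id": 2, "v": "c"}]
unique_rows = sort_unique(rows, "id")
```

Sorting matters because it places duplicate keys next to each other, so removing them takes a single pass.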
The fundamental difference between the Join, Merge and Lookup stages is the amount of memory they take. How they treat the input requirements and the various records is also a differentiating factor. In terms of memory usage, the Lookup stage holds its reference data in memory, so it needs the most memory and suits small reference data sets. The Join and Merge stages work on sorted inputs and need comparatively little memory, which makes them the better choice for large data sets.
The QualityStage is used for cleansing data with the DataStage tool. It is client-server software that is provided as part of IBM Information Server.
This tool is used to control a job or to execute multiple jobs in parallel. It is deployed using the Job Control Language within the IBM DataStage tool.
First, select the right configuration files. Then select the right partitioning and buffer memory. Handle the sorting of data and null time values. Where possible, use the Modify, Copy or Filter stage instead of the Transformer. Finally, reduce the propagation of unnecessary metadata between the various stages.
The repository is another name for a data warehouse. It can be centralized or distributed. The repository table is used to answer queries such as ad hoc, historical, analytical or complex queries.
In massively parallel processing (MPP), many computers are present in the same chassis, while in symmetric multiprocessing (SMP) many processors share the same hardware resources. Massively parallel processing is called "shared nothing" because nothing is shared between the various computers. Massively parallel processing is also faster than symmetric multiprocessing.
To kill a DataStage job, you first need to kill its individual process ID; this ensures that the DataStage job is killed.
The Compiled state means that the important stage parameters are mapped correctly, so that an executable job is created. The Validated OK state means that the connections are valid.
To do data conversion in DataStage, you can use the data conversion function. For it to execute successfully, the input to or the output from the operator must be the same, and the record schema must be compatible with the operator.
Whenever an unexpected error occurs while the job sequencer is executing, all the stages after the exception activity are run. This is what makes the exception activity so important in DataStage.
There are different types of Lookups in DataStage: Normal, Sparse, Range and Caseless.
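To make two of these lookup types concrete, here is a small Python sketch of a caseless lookup and a range lookup. It only illustrates the matching logic and is not DataStage code: the function names, the country codes and the salary bands are all hypothetical.

```python
def caseless_lookup(reference, key):
    """Match a key against reference keys while ignoring letter case."""
    table = {k.casefold(): v for k, v in reference.items()}
    return table.get(key.casefold())


def range_lookup(bands, value):
    """Return the label of the first band whose inclusive range holds value."""
    for low, high, label in bands:
        if low <= value <= high:
            return label
    return None


# Hypothetical reference data for illustration.
country_codes = {"India": "IN", "Japan": "JP"}
salary_bands = [(0, 999, "low"), (1000, 4999, "mid"), (5000, 99999, "high")]

code = caseless_lookup(country_codes, "JAPAN")
band = range_lookup(salary_bands, 1000)
```

A normal lookup would correspond to an exact-match dictionary access. A sparse lookup, which queries the database for each row, is not modelled here.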
The choice between a parallel job and a server job depends on the processing need, functionality, time to implement and cost. A server job usually runs on a single node, executes on the DataStage Server Engine and handles small volumes of data. A parallel job runs on multiple nodes, executes on the DataStage Parallel Engine and handles large volumes of data.
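One way to picture how work is spread over multiple nodes is key-based hash partitioning. The Python sketch below is an illustration under assumptions, not DataStage internals: `hash_partition`, the node count and the sample rows are hypothetical, and CRC32 is used only because it gives the same result on every run.

```python
import zlib


def hash_partition(rows, key, node_count):
    """Assign each row to a partition based on a hash of its key value."""
    partitions = [[] for _ in range(node_count)]
    for row in rows:
        # Rows with the same key value always get the same digest,
        # so they always land in the same partition.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % node_count].append(row)
    return partitions


# Hypothetical sample data: ten rows spread over four nodes.
rows = [{"id": i} for i in range(10)]
parts = hash_partition(rows, "id", 4)
```

Each partition could then be processed independently, which is the general idea behind running one job on several nodes.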
If you want to check whether a certain job is part of a sequence, right-click the job in the Manager and choose Usage Analysis.
For counting the number of rows, we should use the @INROWNUM variable.
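The behaviour of a per-row input counter can be imitated in Python as follows. This is only an analogy: `with_row_numbers` is a hypothetical helper, and the sketch assumes the counter starts at 1 for the first input row.

```python
def with_row_numbers(rows):
    """Yield each row together with a counter that starts at 1."""
    for inrownum, row in enumerate(rows, start=1):
        yield inrownum, row


# Hypothetical input rows for illustration.
numbered = list(with_row_numbers(["a", "b", "c"]))
# The counter value on the last row equals the number of rows read.
row_count = numbered[-1][0] if numbered else 0
```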
The Hash file is based on a hash algorithm and can be used with a key value. The Sequential file, on the other hand, does not have any key column value. The Hash file can be used as a reference for a Lookup, while a Sequential file cannot. Because of the hash key, the Hash file is easier to search than a Sequential file.
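The difference in search effort can be sketched in Python by comparing a keyed dictionary with a plain list scan. This is an analogy, not the DataStage file formats: the function names and the generated customer rows are hypothetical.

```python
def sequential_find(rows, key, value):
    """Scan rows one by one until a match is found, as in a flat file."""
    for row in rows:
        if row[key] == value:
            return row
    return None


def build_hash_index(rows, key):
    """Index rows by key so a lookup goes straight to the matching row."""
    return {row[key]: row for row in rows}


# Hypothetical sample data: 1000 customer rows.
rows = [{"id": n, "name": f"cust{n}"} for n in range(1, 1001)]
index = build_hash_index(rows, "id")

by_scan = sequential_find(rows, "id", 750)  # inspects up to 750 rows
by_hash = index.get(750)                    # direct access through the key
```

Both calls return the same row, but the scan has to read every preceding row first, while the keyed access does not.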
To clean a DataStage Repository, open the DataStage Manager, go to Job in the menu bar, and choose Clean Up Resources. To remove the logs as well, go to the respective job and clean up its log files.
The Routines are stored in the Routine branch of the DataStage Repository. This is where you can create, view or edit all the Routines. The Routines in DataStage could be among the following: Job Control Routine, Before-after Sub-Routine, Transform Function.
An Operational Data Store (ODS) can be considered a staging area for real-time analysis and user processing; thus it is a temporary repository. The data warehouse, in contrast, is used for long-term data storage and holds the complete data of the entire business.
NLS stands for National Language Support. It means you can use IBM DataStage in various languages, including multi-byte character languages such as Chinese or Japanese. You can read and write data in any language and process it as required.