CTA
DataStage is a commonly used ETL tool in the current market. In this DataStage Interview Questions blog, we have shared an extremely useful set of questions and answers intended for cracking DataStage interviews. Here, we have provided in-depth solutions to the DataStage interview questions that are beneficial for freshers and experienced professionals alike.
The DataStage interview questions are classified into three levels based on difficulty as mentioned below:
1. Basic
2. Intermediate
3. Advanced
Basic DataStage Interview Questions
1. Mention DataStage characteristics.
Criteria |
Characteristics |
Support for Big Data Hadoop |
Access Big Data on a distributed file system, JSON support, and JDBC integrator |
Ease of use |
Improve speed, flexibility, and efficacy for data integration |
Deployment |
On-premise or cloud as the need dictates |
2. What is IBM DataStage?
DataStage is an extract, transform, and load tool that is part of the IBM Infosphere suite. It is a tool that is used for working with large data warehouses and data marts for creating and maintaining a data repository.
3. How is a DataStage source file filled?
We can develop a SQL query or we can use a row generator extract tool through which we can fill the source file in DataStage.
4. How is merging done in DataStage?
In DataStage, merging is done when two or more tables are expected to be combined based on their primary key column.
5. What are data and descriptor files?
Both these files are serving different purposes in DataStage. A descriptor file contains all the information or description, while a data file is the one that just contains data.
6. How is DataStage different from Informatica?
DataStage and Informatica are both powerful ETL tools, but there are a few differences between the two. DataStage has parallelism and partition concepts for node configuration; whereas in Informatica, there is no support for parallelism in node configuration. Also, DataStage is simpler to use as compared to Informatica.
Interested in Power BI? Here is the Power BI Training provided by Intellipaat.
7. What is a routine in DataStage?
DataStage Manager defines a collection of functions within a routine. There are basically three types of routines in DataStage, namely, job control routine, before/after subroutine, and transform function.
8. What is the process for removing duplicates in DataStage?
Duplicates in DataStage can be removed using the sort function. While running the sort function, we need to specify the option which allows for duplicates by setting it to false.
9. What is the difference between join, merge, and lookup stages?
The fundamental difference between these three stages is the amount of memory they take. Other than that how they treat the input requirement and the various records are also factors that differentiate one another. Based on the memory usage, the lookup stage uses a very less amount of memory. Both lookup and merge stages use a huge amount of memory.
Get 100% Hike!
Master Most in Demand Skills Now!
10.Explain how a source file is populated?
We can populate a source file in several ways like by using a row generator extract tool, or creating a SQL query in Oracle, etc.
11.How to convert a server job to a parallel job in DataStage?
Using a Link collector and an IPC collector we can convert the server job to a parallel job.
12.What is an HBase connector?
An HBase connector in DataStage is a tool used to connect databases and tables present in the HBase database. It is used to perform tasks like:
- Reading data in the parallel mode.
- Read and write data from and to the HBase database.
- Using HBase as a view table
Intermediate DataStage Interview Questions
13. What is the quality state in DataStage?
The quality state is used for cleansing the data with the DataStage tool. It is a client-server software tool that is provided as part of the IBM Information Server.
14. What is job control in DataStage?
This tool is used for controlling a job or executing multiple jobs in a parallel manner. It is deployed using the Job Control Language within the IBM DataStage tool.
15. How to do DataStage jobs performance tuning?
First, we have to select the right configuration files. Then, we need to select the right partition and buffer memory. We have to deal with the sorting of data and handling null-time values. We need to try to use modify, copy, or filter instead of the transformer. Reduce the propagation of unnecessary metadata between various stages.
16. What is a repository table in DataStage?
The term ‘repository’ is another name for a data warehouse. It can be centralized or distributed. The repository table is used for answering ad-hoc, historical, analytical, or complex queries.
17. Compare massive parallel processing with symmetric multiprocessing.
In massive parallel processing, many computers are present in the same chassis. While in the symmetric multiprocessing, there are many processors that share the same hardware resources. Massive parallel processing is called ‘shared nothing’ as there is no aspect between various computers. And it is faster than the symmetric multiprocessing.
18. How can we kill a DataStage job?
To kill a DataStage job, we need to first kill the individual processing ID so that this ensures that the DataStage is killed.
19. How do we compare the Validated OK with the Compiled Process in DataStage?
The Compiled Process ensures that the important stage parameters are mapped and these are correct such that it creates an executable job. Whereas in the Validated OK, we make sure that the connections are valid.
20. Explain the feature of data type conversion in DataStage.
If we want to do data conversion in DataStage, then we can use the data conversion function. For this to be successfully executed, we need to ensure that the input or the output to and from the operator is the same, and the record schema needs to be compatible with the operator.
21. What is the significance of the exception activity in DataStage?
Whenever there is an unfamiliar error happening while executing the job sequencer, all the stages after the exception activity are run. So, this makes the exception activity so important in the DataStage.
22.What is the difference between Datastage 7.5 and 7.0?
Datastage 7.5 is more robust and performs smoothly due to many new stages which are added, such as Procedure Stage, Generate Report, Command Stage, etc.
23.Describe the DataStage architecture briefly.
IBM DataStage preaches a client-server model as its architecture and has different types of architecture for its various versions. The different components of the client-server architecture are :
- Client components
- Servers
- Stages
- Table definitions
- Containers
- Projects
- Jobs
24. What are the main features of the Flow Designer?
The main features of the Flow Designer are:
- There is no need to migrate the jobs to use the flow designer.
- It is very useful to perform jobs with a large number of stages.
- We can use the provided palette to add and remove connectors and operators on the designer canvas using the drag and drop feature.
Learn more about DataStage from this insightful DataStage Tutorial for Beginners blog post!
Advanced DataStage Interview Questions and Answers for Experienced Professionals
25. Name the command line functions to import and export the DS jobs?
The dsimport.exe function is used to import the DS jobs, and to export the DS jobs, dsexport.exe is used.
26. What are the various types of lookups in DataStage?
There are different types of lookups in DataStage. These include normal, sparse, range, and caseless lookups.
27. How can we run a job using the command line in DataStage?
The command for running a job using the command line in DataStage: dsjob -run -jobstatus <projectname> <jobname>
28. When do we use a parallel job and a server job?
Using the parallel job or a server job depends on the processing need, functionality, time to implement, and cost. The server job usually runs on a single node, it executes on a DataStage Server Engine and handles small volumes of data. The parallel job runs on multiple nodes; it executes on a DataStage Parallel Engine and handles large volumes of data.
29. What is Usage Analysis in DataStage?
If we want to check whether a certain job is part of the sequence, then we need to right-click on the Manager on the job and then choose the Usage Analysis.
30. How to find the number of rows in a sequential file?
For counting the number of rows in a sequential file, we should use the @INROWNUM variable.
31. What is the difference between a sequential file and a hash file?
The hash file is based on a hash algorithm, and it can be used with a key value. The sequential file, on the other hand, does not have any key-value column. The hash file can be used as a reference for a lookup, while a sequential file cannot be used for a lookup. Due to the presence fo the hash key, the hash file is easier to search than a sequential file.
32. How do we clean a DataStage repository?
For cleaning a DataStage repository, we have to go to DataStage Manager > Job in the menu bar > Clean Up Resources.
If we want to further remove the logs, then we need to go to the respective jobs and clean up the log files.
33. How do we call a routine in DataStage?
Routines are stored in the Routine branch of the DataStage repository. This is where we can create, view, or edit all the Routines. The Routines in DataStage could be the following: Job Control Routine, Before-after Subroutine, and Transform function.
34. What is the difference between an Operational DataStage and a Data Warehouse?
An Operational DataStage can be considered as a staging area for real-time analysis for user processing; thus it is a temporary repository. Whereas, the data warehouse is used for long-term data storage needs and has the complete data of the entire business.
35. What does NLS mean in DataStage?
NLS means National Language Support. This means we can use this IBM DataStage tool in various languages like multi-byte character languages (Chinese or Japanese). We can read and write in any language and process it as per the requirement.
36. In Datastage, how you can fix the truncated data error?
The truncated data error can be fixed by using ENVIRONMENT VARIABLE ‘ IMPORT_REJECT_STRING_FIELD_OVERRUN’.
37. What is a Hive connector?
A Hive connector is a tool to support partition modes while reading the data. This can be done in two ways:
- modulus partition mode
- minimum-maximum partition mode