
Creating Advanced Transformations and Jobs

Steps to create an advanced Pentaho transformation and a new job

Updating a file with news about examinations by setting a variable with the name of the file:

  • Copy the examination files you used in Chapter 2 to the input files folder defined in your kettle.properties file. If you don’t have them, download them from the Packt website.
  • Open Spoon and create a new transformation.
  • Use a Get System Info step to get the first command-line argument. Name the field as filename.
  • Add a Filter rows step and create a hop from the Get System Info step to this step.
  • From the Flow category drag an Abort step to the canvas, and from the Job category of steps drag a Set Variables step.
  • From the Filter rows step, create two hops—one to the Abort step and the other to the Set Variables step. Double-click the Abort step. As Abort message, put File name is mandatory.
  • Double-click the Set Variables step and click on Get Fields. The window will be filled as shown here:

 

  • Click on OK.
  • Double-click the Filter rows step. Add the following filter: filename IS NOT NULL. In the drop-down list to the right of Send ‘true’ data to step, select the Set Variables step, whereas in the drop-down list to the right of Send ‘false’ data to step, select the Abort step.
  • The final transformation looks like this:

 

  • Save the transformation in the transformations folder under the name getting_filename.ktr.
  • Open the transformation named examinations.ktr that was created in Chapter 2 or download it from the Packt website. Save it in the transformations folder under the name examinations_2.ktr.
  • Delete the Get System Info step.
  • Double-click the Text file input step.
  • In the Accept filenames from previous steps frame, uncheck the Accept filenames from previous step option.
  • Under File/Directory in the Selected files grid, type ${FILENAME}. Save the transformation.
  • Create a new job.
  • From the General category, drag a START entry and a Transformation entry to the canvas and link them.
  • Save the job as examinations.kjb.


  • Double-click the Transformation entry. As Transformation filename, put the name of the first transformation that you created: ${Internal.Job.Filename.Directory}/transformations/getting_filename.ktr.
  • Click on OK.
  • From the Conditions category, drag a File Exists entry to the canvas and create a hop from the Transformation entry to this new one.
  • Double-click the File Exists entry.
  • Write ${FILENAME} in the File name textbox and click on OK.
  • Add a new Transformation entry and create a hop from the File Exists entry to this one.
  • Double-click the entry and, as Transformation filename, put the name of the second transformation you created: ${Internal.Job.Filename.Directory}/transformations/examinations_2.ktr.
  • Add a Write To Log entry and create a hop from the File Exists entry to this new entry. The hop should be red, indicating that it is followed when the File Exists check fails. If it is not, right-click the hop and change the evaluation condition to Follow when result is false.
  • Double-click the entry and fill all the textboxes as shown:

 

  • Add two more entries: an Abort entry and a Success entry. Create hops to these new entries as shown next:

 

  • Save the job.
  • Press F9 to run the job.
  • Set the logging level to Minimal logging and click on Launch.
  • The job fails. The following is what you should see in the Logging tab in the Execution results window:

 

  • Press F9 again. This time set Basic logging as the logging level.
  • In the arguments grid, write the name of a fictitious file—for example, c:/pdi_files/input/nofile.txt.
  • Click on Launch. This is what you see now in the Logging tab window:

 

  • Press F9 for the third time. Now provide a real examination filename such as c:/pdi_files/input/exam1.txt.
  • Click on Launch. This time you see no errors. The examination file is appended to the global file:

 

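Once the job runs correctly from Spoon, you can also launch it from a terminal window with Kitchen, the PDI command-line job runner. The following is only a sketch for a Windows setup: the PDI installation folder and the C:\pdi_labs folder where the job was saved are assumptions, so adjust both to your environment. Anything placed after the options is passed to the job as a command-line argument, which is exactly what the Get System Info step reads.

REM Sketch only: run examinations.kjb with Kitchen, passing the examination
REM file as the first command-line argument (the value read by Get System Info)
REM Both paths are assumptions; adjust them to your installation
cd C:\pdi\data-integration
Kitchen.bat /file:C:\pdi_labs\examinations.kjb /level:Basic c:/pdi_files/input/exam1.txt
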
Generating files with top scores:

  • Create a new transformation and save it in the transformations folder under the name top_scores.ktr.
  • Use a Text file input step to read the global examination file generated in the previous tutorial.
  • After the Text file input step, add the following steps and link them in the same order:

A Select values step to remove the unused fields—file_processed and process_date.
A Split Fields step to split the names of the students in two: name and last name.
A Formula step to convert name and last name to uppercase.
With the same Formula step, change the scale of the scores. Replace each skill field writing, reading, speaking, and listening with the same value divided by 20—for example, [writing]/20.

  • Do a preview on the last step to check that everything is going well. You should see this:

 

  • After the last Formula step, add and link the following steps in this order:

A Sort rows step to order the rows in descending order by the writing field.
A JavaScript step to keep only the first 10 rows. Remember that you learned to do this in the chapter devoted to JavaScript. You do it by typing the following piece of code:

// Assume the current row should be passed on to the next step
trans_Status = CONTINUE_TRANSFORMATION;
// getProcessCount('r') returns the number of rows read so far; skip every row after the tenth
if (getProcessCount('r')>10) trans_Status = SKIP_TRANSFORMATION;

An Add sequence step to add a field named seq_w. Leave the defaults so that the field contains the values 1, 2, 3 …
A Select values step to rename the field seq_w as position and the field writing as score. Specify this change in the Select & Alter tab, and check the option Include unspecified fields, ordered.
A Text file output step to generate a file named writing_top10.txt at the location specified by the ${LABSOUTPUT} variable (defined in your kettle.properties file; see the note at the end of this tutorial). In the Fields tab, put the following fields: position, student_code, student_name, student_lastname, and score.
 

  • Save the transformation, as you’ve added a lot of steps and don’t want to lose your work.
  • Repeat the previous sequence of steps (from the Sort rows step to the Text file output step), but this time sort by the reading field, rename the sequence seq_r as position and the field reading as score, and send the data to the reading_top10.txt file.
  • Repeat the same procedure for the speaking field and the listening field.
  • This is how the transformation looks:

 

  • Save the transformation.
  • Run the transformation. Four files should have been generated. All the files should look similar. Let’s check the writing_top10.txt file (the names and values may vary depending on the examination files that you have appended to the global file):

 

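The Text file output steps above resolve ${LABSOUTPUT} from your kettle.properties file, located in the .kettle folder of your home directory. The following is only a sketch of the relevant entry; the folder shown is an assumption, so use the output folder you chose in the earlier chapters.

# kettle.properties (sketch only; the folder is an assumption)
LABSOUTPUT=c:/pdi_files/output
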
Creating a job as a process flow

With the implementation of a subtransformation, you simplify much of the transformation, but you still have some reworking to do. In the main transformation you basically do two things: first you read the source data from a file and prepare it for further processing; then, once the data is prepared, you generate the files with the top scores. To have a clearer vision of these two tasks, you can split the transformation in two, creating a job as a process flow.

Splitting the generation of top scores by copying and getting rows:

  • Open the transformation from the previous tutorial. Select all steps related to the preparation of data, that is, all steps from the Text file input step up to the Formula step.
  • Copy the steps and paste them in a new transformation.
  • Expand the Job category of steps.
  • Select a Copy rows to result step, drag it to the canvas, and create a hop from the last step to this new one. Your transformation looks like this:

 

  • Save the transformation in the transformations folder with the name top_scores_flow_preparing.ktr.
  • Go back to the original transformation and select the rest of the steps, that is, the Mapping and the Text file output steps.
  • Copy the steps and paste them in a new transformation.
  • From the Job category of steps, select a Get rows from result step, drag it to the canvas, and create a hop from this step to each of the Mapping steps:

 

  • Save the transformation in the transformations folder with the name top_scores_flow_processing.ktr.
  • In the top_scores_flow_preparing transformation, right-click the step Copy rows to result and select Show output fields.
  • The grid with the output dataset shows up.

 

  • Select all rows. Press Ctrl+C to copy the rows.
  • In the top_scores_flow_processing transformation, double-click the step Get rows from result.
  • Press Ctrl+V to paste the values.

 

  • Save the transformation.
  • Create a new Job.
  • Add a START and two transformation entries to the canvas and link them one after the other.
  • Double-click the first transformation. Put ${Internal.Job.Filename.Directory}/transformations/top_scores_flow_preparing.ktr as the name of the transformation.
  • Double-click the second transformation. Put ${Internal.Job.Filename.Directory}/transformations/top_scores_flow_processing.ktr as the name of the transformation.
  • Save the job as top_scores_flow.kjb. Press F9 to run the job and click on Launch. Again, the four files should have been generated, with the very same information.

 

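As with the first job in this chapter, you can run this process flow outside Spoon with Kitchen. This job takes no command-line arguments; both paths below are assumptions, so adjust them to your installation.

REM Sketch only: run the process flow job from a terminal window
cd C:\pdi\data-integration
Kitchen.bat /file:C:\pdi_labs\top_scores_flow.kjb /level:Basic
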
Nesting jobs:

Suppose that every time you append a file with examination results, you want to generate updated files with the top 10 scores. You can do it manually, running one job after the other, or you can nest jobs.

Generating the files with top scores by nesting jobs:

  • Open the examinations job you created in the first tutorial of this chapter.
  • After the last Transformation job entry, add a Job job entry. You will find it under the General category of entries.
  • Double-click the Job job entry.
  • Type ${Internal.Job.Filename.Directory}/top_scores_flow.kjb as Job filename.
  • Click on OK.
  • Save the job.
  • Pick an examination that you have not yet appended to the global file—for example, exam5.txt.
  • Press F9.
  • In the Arguments grid, type the full path of the chosen file: c:/pdi_files/input/exam5.txt.
  • Click on Launch.
  • In the Job metrics tab of the Execution results window, you will see the following:

 

  • The chosen file should also have been added to the global file, and updated files with top scores should have been generated.

Iterating jobs and transformations:

It may happen that you develop a job or a transformation that has to be executed several times, once for each different row of your data. If that is the case, you can do it by iterating the execution.
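
Conceptually, the Execute for every input row? option used below behaves like the following loop, written here in plain JavaScript. This is only an illustration of the idea, not PDI's internals, and the function name in the sketch is hypothetical: each row that the first transformation copies to the result triggers one execution of the second transformation, with that row's fields available to that run.

// Conceptual sketch only (plain JavaScript, not PDI internals).
// runRowTransformation is a hypothetical stand-in for the transformation
// that is executed once per result row.
function runRowTransformation(row) {
  // in PDI, this run would see the row's fields (for example, student_code and name)
}

var resultRows = [];  // the rows produced by a Copy rows to result step
for (var i = 0; i < resultRows.length; i++) {
  runRowTransformation(resultRows[i]);
}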

Generating custom files by executing a transformation for every input row:

  • Create a new transformation.
  • Drag a Text file input, a Filter rows, and a Select values step to the canvas and link them in that order.
  • Use the Text file input step to read the global examination file.
  • Use the Filter rows step to keep only those students with a writing score below 60.
  • With the Select values step, keep just the student_code and name values.
  • After this last step, add a Copy rows to result step.
  • Do a preview on this last step.

 

  • Save the transformation in the transformations folder with the name students_list.ktr.

Let’s create a job that puts everything together:

  • Create a job.
  • Drag a START, a Delete files, and two Transformation entries to the canvas, and link them one after the other:

 

  • Save the job.
  • Double-click the Delete files entry. Fill the Files/Folders: grid with a single row: under File/Folder type ${LABSOUTPUT}, and under Wildcard (RegExp) type hello.*.txt. This regular expression matches all .txt files in the ${LABSOUTPUT} folder whose names start with the string “hello”.
  • Double-click the first transformation entry. As Transformation filename, put ${Internal.Job.Filename.Directory}/transformations/students_list.ktr and click on OK.
  • Double-click the second transformation entry. As Transformation filename, put ${Internal.Job.Filename.Directory}/transformations/hello_each.ktr.
  • Check the option Execute for every input row? and click on OK.
  • Save the job and press F9 to run it.
  • When the execution finishes, explore the folder pointed to by your ${LABSOUTPUT} variable. You should see one file for each student in the list. The files are named hello_<hhmmddss>.txt, where <hhmmddss> is the time on your system at the moment the file was generated. The generated files look like the following:

 


About the Author

Data Analyst & Machine Learning Associate

As a data analyst and machine learning associate, Nishtha combines her analytical skills and machine learning knowledge to interpret complicated datasets. She is also a passionate storyteller who transforms key findings into compelling narratives that influence data-driven decision-making on the business front.