Getting Started With Transformations

Pentaho Data Integration Transformation

Reading data from files:

Despite being the most primitive format used to store data, files are broadly used and they exist in several flavors as fixed width, comma-separated values, spreadsheet, or even free format files. PDI has the ability to read data from all types of files.

  • Create the folder named pdi_files. Inside it, create the input and output subfolders.
  • By using any text editor, type the file shown and save it under the name group1.txt in the folder named input, which you just created. You can also download the file from Packt’s official website.
  • Start Spoon.
  • From the main menu select File | New Transformation.
  • Expand the Input branch of the steps tree.
  • Drag the Text file input icon to the canvas.
  • Double-click the text input file icon and give a name to the step.
  • Click the Browse… button and search the file group1.txt.
  • Select the file. The textbox File or directory will be temporarily populated with the full path of the file—for example, C:pdi_filesinputgroup1.txt.
  • Click the Add button. The full text will be moved from the File or directory textbox to the grid. The configuration window should look as follows:

1

  1. Select the Content tab and fill it like this:

1

12. Select the Fields tab. Click the Get Fields button.

1

13. In the small window that proposes you a number of sample lines, click OK.
14. Close the scan results window.
15. Change the second row. Under the Type column select Date, and under the Format column, type dd/MMM.
16. The result value is text, not a number, so change the fourth row too. Under the Type column select String.
17. Click the Preview rows button, and then the OK button.
18. The previewed data should look like the following

1

19.  Expand the Transform branch of the steps tree.
20. Drag the Select values icon to the canvas.
21. Create a hop from the Text file input step to the Select values step.
22. Double-click the Select values step icon and give a name to the step.
23. Select the Remove tab.
24. Click the Get fields to remove button.
25. Delete every row except the first and the last one by left-clicking them and pressing Delete.
26. The tab window looks like this:

1

27. Click OK.
28. From the Flow branch of the steps tree, drag the Dummy icon to the canvas.
29. Create a hop from the Select values step to the Dummy step. :

1

30. Configure the transformation by pressing Ctrl+T and giving a name and a description to the transformation.
31. Save the transformation by pressing Ctrl+S.
32. Select the Dummy step.
33. Click the Preview button located on the transformation toolbar:

1

34. Click the Quick Launch button.
35. The following window appears, showing the final data:

1

  • Input files:

Files are one of the most used input sources. PDI can take data from several types of files, with very few limitations.

  • Input steps:

There are several steps that allow you to take a file as the input data. All those steps such as Text file input, Fixed file input, Excel Input, and so on are under the Input step category.

  • Name of the step:

It is mandatory and must be different for every step in the transformation.

  • Name and location of the file:

These must be specified of course. At the moment you create the transformation, it’s not mandatory that the file exists. However, if it does, you will find it easier to configure this step.

Learn Pentaho

  • Content type:

This data includes delimiter character, type of encoding, whether a header is present, and so on. The list depends on the kind of file chosen. In every case, Kettle propose default values, so you don’t have to enter too much data.

  • Fields:

Kettle has the facility to get the definitions automatically by clicking the Get Fields button. However, Kettle doesn’t always guess the data types, size, or format as expected. So, after getting the fields you may change what you consider more appropriate, as you did in the tutorial.

  • Filtering:

Some steps allow you to filter the data—skip blank rows, read only the first n rows, and soon.

Reading several files at once:

1.Open the transformation, double-click the input step, and add the other files in the same way you added the first.
2.After Clicking the Preview rows button, you will see this:
1

Text file input step and regular expressions:

1.Open the transformation and edit the configuration windows of the input step.
2.Delete the lines with the names of the files.
3.In the first row of the grid, type C:pdi_filesinput under the File/Directory column, and group[1-4].txt under the Wildcard (Reg.Exp.) column.
4.Click the Show filename(s)… button. You’ll see the list of files that match the expression.

1

  1. Close the tiny window and click Preview rows to confirm that the rows shown belong to the four files that match the expression you typed.
  • Regular expressions:

There are many places inside Kettle where you may or have to provide a regular expression. A regular expression is much more than specifying the known wildcards ? and *.

  • Grids:

Grids are tables used in many Spoon places to enter or display information. You already saw grids in several configuration windows—Text file input, Text file output, and Select values.

Sending data to files:

1.  Create a new transformation.

  1. Drag a Text file input step to the canvas and configure it just as you did in the previous tutorial.
  2. Drag a Select values step to the canvas and create a hop from the Text file input step to the Select values step.
  3. Double-click the Select values step.
  4. Click the Get fields to select button.
  5. Modify the fields as follows:

1

7. Expand the Output branch of the steps tree.
8. Drag the Text file output icon to the canvas.
9. Create a hop from the Select values step to the Text file output step.
10.Double-click the Text file output step and give it a name.
11.In the file name type: C:/pdi_files/output/wcup_first_round.
12.In the Content tab, leave the default values.
13.Select the Fields tab and configure it as follows:

1

14.Click OK.
15.Give a name and description to the transformation.

1

16.Save the transformation.
17.Click Run and then Launch.
18.Once the transformation is finished, check the file generated. It should have been created as C:/pdi_files/output/wcup_first_round.txt and should look like this:

Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1

Some data definitions:

  • Rowset:

Transformations deals with datasets, that is, data presented in a tabular form, where:

  1. Each column represents a field. A f‍ield has a name and a data type. The data type can be any of the common   data types—number (float), string, date, Boolean, integer, or big number.
  2. Each row corresponds to a given member of the dataset. All rows in a dataset have the same structure that is, all rows have the same fields, in the same order. A field in a row may be null, but it has to be present.
  3. The dataset is called rowset. The following is an example of rowset. It is the rowset generated in the World Cup tutorial:

1

  • Streams:

Right-click on the Select values step of the transformation you created. In the contextual menu select Show output fields. You’ll see this:

1

  • Running transformations from a terminal window:
  1. Open a terminal window and go to the directory where Kettle is installed.

On Windows systems type:

C:\pdi-ce>pan.bat /file:c:\pdi_labs\examinations.ktr c:\ pdi_files\input\exam3.txt

On Unix, Linux, and other Unix-based systems type:
/home/yourself/pdi-ce/pan.sh /file:/home/yourself/pdi_labs/ examinations.ktr c:/pdi_files/input/exam3.txt

If your transformation is in another folder, modify the command accordingly.

  1. You will see how the transformation runs, showing you the log in the terminal.

1

3.Check the output file. The contents of exam3.txt should be at the end of the file.

  • XML files:

XML files or documents are not only used to store data, but also to exchange data between heterogeneous systems over the Internet.

  1. If you work under Windows, open the properties file located in the C:/Documents and Settings/yourself/.kettle folder and add the following line:

LABSOUTPUT=c:/pdi_files/output 

On the other hand, if you work under Linux (or similar), open the kettle.properties file located in the /home/yourself/.kettle folder and add the following line:
LABSOUTPUT=/home/yourself/pdi_files/output

  1. Make sure that the directory specified in kettle.properties exists.
  2. Save the file.
  3. Restart Spoon.
  4. Create a new transformation.
  5. Give a name to the transformation and save it in the same directory you have all the other transformations.
  6. From the Packt website, download the resources folder containing a file named countries.xml. Save the folder in your working directory. For example, if your transformations are in pdi_labs, the file will be in pdi_labs/resources/.
  7. Take a look at the file. You can edit it with any text editor, or you can double-click it to see it within an explorer
  8. From the Input steps, drag a Get data from XML step to the canvas.
  9. Open the configuration window for this step by double-clicking it.
  10. In the File or directory textbox, press Ctrl+Space. A drop-down list appears as shown in the next screenshot:

1

  1. Select Internal. Transformation. Filename. Directory. The textbox gets filled with this text.
  2. Complete the text so that you can read ${Internal. Transformation. Filename. Directory}/resources/countries. xml.
  3. Click on the Add button. The full path is moved to the grid.
  4. Select the Content tab and click Get XPath nodes.
  5. In the list that appears, select /world/country/language.
  6. Select the Fields tab and fill the grid as follows:

1

18.Click Preview rows, and you should see something like this:

1

19. Click OK.

  1. From the Output steps, drag an Excel Output step to the canvas.
  2. Create a hop from the Get data from XML step to the Excel Output step.
  3. Open the configuration window for this step by double-clicking it.
  4. In the Filename textbox press Ctrl+Space.
  5. From the drop-down list, select ${LABSOUTPUT}.
  6. By the side of that text type /countries_info. The complete text should be ${LABSOUTPUT}/countries_info.
  7. Select the Fields tab and click the Get Fields button to fill the grid.
  8. Click OK. This is your final transformation.

1

  1. Save the transformation.
  2. Run the transformation.
  3. Check that the countries_info.xls file has been created in the output directory and contains the information you previewed in the input step.

Our Business Intelligence Courses Duration and Fees

Program Name
Start Date
Fees
Cohort starts on 1st Feb 2025
₹17,043
Cohort starts on 25th Jan 2025
₹17,043
Cohort starts on 18th Jan 2025
₹17,043

About the Author

Data Analyst & Machine Learning Associate

As a Data Analyst and machine learning associate, Nishtha combines her analytical skills and machine learning knowledge to interpret complicated datasets. She is also a passionate storyteller who transforms crucial findings into gripping tales that further influence data-driven decision-making in the business frontier.