• Articles
  • Tutorials
  • Interview Questions

Validating Data and Handling Errors

Steps to Validate and Handle Errors in Pentaho

Capturing errors while calculating the age of a film:

  • Get the file with the films. You can take the transformation that denormalized the data and generate the file with a Text file output step, or you can take a sample file from the Packt website.
  • Create a new transformation and read the file with a Text file input step.
  • Do a preview of the data. You will see the following:

1

  • After the Text file input step, add a Get System Info step.
  • Edit the step, add a new field named today, and choose Today 00:00:00 as its value.
  • Add a JavaScript step.
  • Edit the step and type the following piece of code:
var diff;
film_date = str2date('01/01/' + Year, 'dd/MM/yyyy');
diff = dateDiff(film_date,today,”y”);
  • Click on Get variables to add diff as a new field.
  • Add a Number range step, edit it, and fill its window as follows:

1

  • With a Sort rows step, sort the data by diff.
  • Finally, add a Group by step and double-click to edit it.
  • As group field put age_of_film. In the Aggregates grid create a field named number_of_films to hold the number of films with that age. Put film as the Subject and select Number of values (N) as the Type.
  • Add a Dummy step at the end and do a preview. You will be surprised by an error like this:

1

  • Look at the logging window. It looks like this:

1

  • Now drag Write to log step to the canvas from the Utility category.
  • Create a hop from the JavaScript step to this new step.
  • Select the JavaScript step, right-click it to bring up a contextual menu, and select Define error handling….
  • The error handling settings window appears. Fill it like shown:

1

  • Click on OK.
  • Save the transformation and do a new preview on the Dummy step. You will see this:

1

  • The logging window will show you this:
... - Bad rows.0 -
... - Bad rows.0 - ------------> Linenr 1-------------------------
... - Bad rows.0 – null
  • Now do a preview on the Write to log step. This is what you see:

1

Aborting when there are too many errors:

We can handle the errors by detecting and sending bad rows to an extra stream. But when the errors are too many or when the errors are severe, the best option is to cancel the whole transformation.
Let’s see how to force the abortion of a transformation in such a situation.

  • Open the transformation from the previous tutorial and save it under a different name.
  • From the Flow category, drag an Abort step to the canvas.
  • Create a hop from the Write to log step to the Abort step.
  • Double-click the Abort step. Enter 5 as Abort threshold. As Abort message, type Too many errors calculating age of film!.
  • Click OK.
  • Select the Dummy step and do a preview. As a result, a warning window shows up informing you that there were no rows to display. In the Step Metrics tab, the Abort after 5 errors line becomes red to show you that there was an error:

1

  • The log looks like this:
... - Bad rows.0 -
... - Bad rows.0 - ====================

Avoiding unexpected errors by validating data
To avoid unexpected errors that happen or just to meet your requirements is a common practice to validate your data before processing it.

Learn more about Row Set Transformation in Pentaho from this insightful Pentaho Tutorial!

Avoiding unexpected errors by validating data:

To avoid unexpected errors that happen or just to meet your requirements is a common practice to validate your data before processing it. Let’s do some validations.

Learn Pentaho

Validating genres with a Regex Evaluation step:

  • Create a new transformation.
  • Read the modified films file just as you did in the previous tutorial.
  • In the Content tab, check the Rownum in output? option and fill the Rownum fieldname with the text rownum.
  • Do a preview. You should see this:

1

  • After the Text file input step, add a Regex Evaluation step. You will find it under the Scripting category of steps.
  • Under the Step settings box, select Genres as the Field to evaluate, and type genres_ok as the Result Fieldname.
  • In the Regular expression textbox type [A-Za-zs-]*(|[A-Za-zs-]*)* .
  • Add the Filter rows step, an Add constants step, and two Text file output steps and link them as shown next:

1

  • Edit the Add constants step.
  • Add a String constant named err_code with value GEN_INV and a String constant named err_desc with value Invalid list of genres.
  • Configure the Text file output step after the Add constant step to create the ${LABSOUTPUT}/films_err.txt file, with the fields rownum, err_code, and err_desc.
  • Configure the other Text file output step to create the ${LABSOUTPUT}/films_ ok.txt file, with the fields film, Year, Genres, Director, and Actors.
  • Double-click the Filter rows step and add the condition genres_ok = Y, Y being a Boolean value. Send true data to the stream that generates the films_ok.txt file. Send false data to the other stream.
  • Run the transformation.
  • Check the generated files. The films_err.txt file looks like the following:
rownum;err_code;err_desc
12;GEN_INV;Invalid list of genres
18;GEN_INV;Invalid list of genres
20;GEN_INV;Invalid list of genres
21;GEN_INV;Invalid list of genres
22;GEN_INV;Invalid list of genres
33;GEN_INV;Invalid list of genres
34;GEN_INV;Invalid list of genres.

Validating data:

As said, you would validate data mainly for two reasons:
1. To prevent the transformation from aborting because of unexpected errors
2. To check that your data meets some pre-existing requirements.

Checking films file with the Data Validator:

Let’s validate not only the Genres field, but also the Year field.

  • Open the last transformation and save it under a new name.
  • Delete all steps except the Text file input and Text file output steps.
  • In the Fields tab of the Text file input step, change the Type of the Year from Integer to String.
  • From the Validation category add a Data Validator step. Also add a Select values step. Link all steps as follows:

1

  • Double-click the Data Validator step.
  • Check the Report all errors, not only the first option found on at the top of the window. This will enable the Output one row, concatenate errors with separator option. Check this option too, and fill the textbox to the right with a slash /. Click on New validation and type genres as the name of the validation rule.
  • Click on OK.
  • Click on genres. The right half of the window is filled with checkboxes and textboxes where you will define the rule.
  • Fill the header of the rule definition as follows:

1

  • In the Regular expression expected to match textbox, type [A-Za-zs-]*(|[A-Za-zs-]*)*
  • Click on New validation and type year as the name of the validation rule.
  • Click on OK.
  • Click on year and fill the header of the rule definition as follows:

1

  • In the data block, select the Only numeric data expected checkbox option.
  • Click on OK.
  • Right-click the Data Validator step and select Define error handling….
  • Fill the error handling settings window as follows: As Target step, select the step that generates the file with invalid rows. Check the Enable the error handling? checkbox. Type err_desc as Error description field name, err_field as Error fields, and err_code as Error codes. Click on OK.
  • Use the Select values step to change the metadata for the Year from String to Integer.
  • Save the transformation and run it.
  • . Check the generated files. The films_err.txt file now has more detail, as you validated two fields.rownum;err_code;err_desc9;YEAR_NULL;Year invalid or absent 12;GEN_INV;Invalid list of genres18;GEN_INV;Invalid list of genres
    20;GEN_INV;Invalid list of genres
    21;GEN_INV;Invalid list of genres
    22;GEN_INV;Invalid list of genres
    33;GEN_INV;Invalid list of genres
    34;GEN_INV;Invalid list of genres
    47;YEAR_NULL/GEN_INV;Year invalid or absent/Invalid list of genres
    48;YEAR_NULL/GEN_INV;Year invalid or absent/Invalid list of genres
    49;YEAR_NULL;Year invalid or absent
  • The films_ok.txt file should have less rows instead, as the films with year invalid or absent are no longer sent to this file.

About the Author

Data Analyst & Machine Learning Associate

As a Data Analyst and machine learning associate, Nishtha combines her analytical skills and machine learning knowledge to interpret complicated datasets. She is also a passionate storyteller who transforms crucial findings into gripping tales that further influence data-driven decision-making in the business frontier.