
Developing and Implementing a Simple Datamart

Steps to build a Data Mart with Pentaho Data Integration

Loading dimensions for the sales datamart:

  •  Check that the database engine is up and that both the js and the js_dw databases are accessible from PDI.
  • If your time dimension table, lk_time, has data, truncate the table. You may do it
    by using the Truncate table [lk_time] option in the database explorer (or from the command line, as sketched below).
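
If you prefer the command line to the database explorer, the same cleanup can be done with the mysql client. This is only a sketch; the user account is a placeholder, so use whatever MySQL user you created for the js_dw database:

    # Empty the time dimension table before reloading it.
    # "root" is a placeholder account; -p prompts for the password.
    mysql -u root -p js_dw -e "TRUNCATE TABLE lk_time;"
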
  1. Create a new transformation and use it to load the manufacturer dimension. This is a Type I SCD dimension. The data for the dimension comes from the manufacturers table in the js database. The dimension table in js_dw is lk_manufacturer. Use the following screenshot as a guide:


  2. Save the transformation in the lk_transformations folder.
  3. Create a new transformation and use it to load the regions dimension. The region dimension is a Type II SCD dimension. The data for the dimension comes from the city and country tables; the information about regions is in Excel files that you can download from the Packt web site. The dimension table in js_dw is lk_regions_2. Use the following screenshot as a guide:


  4. Save the transformation in the lk_transformations folder.
  5. Create a new transformation and use it to load the time dimension.
  6. Save the transformation in the lk_transformations folder.

Now you will create a job to put it all together:
  7. Create a new job and save it in the same folder where you created the lk_transformations folder.
  8. Drag a START entry and two Transformation job entries to the canvas.
  9. Create a hop from the START entry to each of the transformation entries. You should now have the following:


  10. Use one of the transformation entries to execute the transformation that loads the manufacturer dimension.
  11. Use the other transformation entry to execute the transformation that loads the region dimension.
  12. Add an Evaluate rows number in a table entry to the canvas. You’ll find it under the Conditions category.
  13. Create a hop from the START entry to this new entry.
  14. Double-click the new entry and fill it in as shown:


  15. After this entry, add another transformation entry and use it to execute the transformation that loads the time dimension.
  16. Finally, from the General category, add a Success entry.
  17. Create a hop from the Evaluate rows number in a table entry to the Success entry. The hop should be red, meaning that the Success entry executes when the evaluation fails.
  18. Run the job. The manufacturer and region dimensions should be loaded. You can verify it by exploring the tables from PDI’s database explorer or in MySQL Query Browser (a quick check from the mysql command line is also sketched after these steps).
  19. In the logging window, you’ll see that the evaluation succeeded, so the time dimension was also loaded:

  20. You can check it by exploring the lk_time table.

  21. Run the job again. This time the evaluation fails, and the transformation that loads the time dimension is not executed.

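If you prefer SQL over the graphical explorers, a quick way to confirm the loads mentioned in the last steps is to count the rows in the dimension tables. This is only a sketch; the MySQL account is a placeholder:

    # All three counts should be greater than zero after a successful run.
    mysql -u root -p js_dw -e "
      SELECT 'lk_manufacturer' AS dim_table, COUNT(*) AS row_count FROM lk_manufacturer
      UNION ALL SELECT 'lk_regions_2', COUNT(*) FROM lk_regions_2
      UNION ALL SELECT 'lk_time', COUNT(*) FROM lk_time;"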

Loading the fact table using a range of dates obtained from the command line:

  1. Create a new transformation.
  2. With a Get system info step, get the first two arguments from the command line and name them date_from and date_to.
  3. By using a couple of steps, check that the arguments are not null, have the proper format (yyyy-mm-dd), and are valid dates.
  4. If something is wrong with the arguments, abort.
  5. If the arguments are valid, use a Set variables step to set two variables named DATE_FROM and DATE_TO.
  6. Save the transformation in the same folder you saved the transformation that loads the fact table.
  7. Test the transformation by providing valid and invalid arguments to see that it works as expected.
  8. Create a job and save it in the same folder you saved the job that loads the dimensions.
  9. Drag to the canvas a START and two transformation job entries, and link them one after the other.
  10. Use the first transformation entry to execute the transformation you just created.
  11. Use the second transformation entry to execute the transformation that loads the fact table.
  12. This is how your job should look:


  13. Save the job.
  14. Press F9 to run the job. (A command-line equivalent using Pan and Kitchen is sketched after these steps.)
  15. Fill in the job settings window as follows:


  16. Click on Launch.
  17. When the execution finishes, explore the database to check that the data for the given dates was loaded into the fact table. You will see this:

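You can also run these from a terminal instead of Spoon. The sketch below uses Pan (for transformations) and Kitchen (for jobs) from the PDI installation directory; trailing values on the command line become the arguments that the Get system info step reads. The file names and paths are assumptions, so adjust them to wherever you saved your work (use pan.bat and kitchen.bat on Windows):

    # Test the argument-checking transformation with a valid and an invalid range.
    sh pan.sh -file=/path/to/get_date_range.ktr 2009-09-01 2009-09-30
    sh pan.sh -file=/path/to/get_date_range.ktr 2009-99-01 oops    # should abort

    # Run the job that loads the fact table for a given range of dates.
    sh kitchen.sh -file=/path/to/load_fact_job.kjb -level=Basic 2009-09-01 2009-09-30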

Check out the Pentaho Tutorial to learn more about validating data, error handling in Pentaho, and the various steps involved in it.

Loading the sales star:

  1. Create a new job in the same folder in which you saved those jobs. Name this job load_dm_sales.kjb.
  2. Drag to the canvas a START and two job entries, and link them one after the other.
  3. Use the first job entry to execute the job that loads the dimensions.
  4. Use the second Job entry to execute the job you just created for loading the fact table.

  5. Save the job.

  6. Press F9 to run the job.
  7. As arguments, provide a new range of dates: 2009-09-01, 2009-09-30. Then press Launch. (The equivalent Kitchen command line is sketched after these steps.)
  8. The dimensions will be loaded first, followed by the loading of the fact table.
  9. The Job metrics tab in the Execution results window shows you the whole process running:


  10. By exploring the database, you’ll see once again that the data was updated:

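The same run can be launched without opening Spoon. Below is a rough command-line equivalent using Kitchen; the PDI installation directory and the path to the job are assumptions:

    # Use kitchen.bat instead of kitchen.sh on Windows.
    cd /opt/data-integration
    sh kitchen.sh -file=/path/to/load_dm_sales.kjb -level=Basic 2009-09-01 2009-09-30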

Automating the loading of the sales datamart:

  1. Create a new transformation.
  2. Drag a Get system info step to the canvas and fill it in as shown here:


  3. With a Select values step, change the metadata of both fields: set Type to String and Format to yyyy-MM-dd.
  4. Add a Set variables step and use the two fields to create two variables named START_DATE and END_DATE.
  5. Save the transformation in the same folder where you saved the transformation that loads the fact table.
  6. Modify the job that loads the fact table so that, instead of executing the transformation that takes the range of dates from the command line, it executes this new one. The job can then run unattended, for example from a scheduler, as sketched below.
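
Once the job no longer needs dates from the command line, it can be scheduled. The following is a minimal cron sketch, assuming a Linux machine, PDI installed in /opt/data-integration, and the main job saved as load_dm_sales.kjb; every path, including the log file, is an example:

    # Add with crontab -e: run the load every night at 01:30 and keep a log.
    30 1 * * * cd /opt/data-integration && sh kitchen.sh -file=/path/to/load_dm_sales.kjb -level=Basic >> /tmp/load_dm_sales.log 2>&1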

PDI best practices:

Here are some guidelines that will help you go in the right direction.
Outline your ideas on paper before creating a transformation or a job:
Don’t drop steps randomly on the canvas trying to get things working. You could end up with a transformation or job that is difficult to understand and even useless.
Document your work:
Write at least a simple description in the transformation and job settings windows. Replace the default names of steps and job entries with meaningful ones. Use notes to clarify the purpose of the transformations and jobs. By doing this, your work will be largely self-documented.
Make your jobs and transformations clear to understand:
Arrange the elements in the canvas so that it doesn’t look like a puzzle to solve. Memorize the shortcuts for arrangement and alignment, and use them regularly. You’ll find a full list in Appendix D, Spoon shortcuts.


Organize PDI elements in folders:
Don’t save all the transformations and jobs in the same folder. Organize them according to their purpose.
Make your work flexible and reusable:
Make use of arguments, variables, and named parameters. If you identify tasks that are going to be used in several situations, create subtransformations.
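
For example, a named parameter defined in a job's settings can be overridden at launch time with Kitchen, so the same job works in different environments without editing it; the parameter name below is hypothetical:

    # Override a named parameter from the command line.
    sh kitchen.sh -file=/path/to/load_dm_sales.kjb "-param:FILES_DIR=/data/incoming"
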
Make your work portable (ready for deployment):
This involves making sure that even if you move your work to another machine or folder, or the paths to source or destination files change, or the connection properties of the databases change, everything keeps working with minimal or no changes. To achieve that, don’t use fixed names; use variables instead. If you know the values of the variables beforehand, define them in the kettle.properties file. For the names of transformations and jobs, use relative paths built with the ${Internal.Job.Filename.Directory} and ${Internal.Transformation.Filename.Directory} variables.
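
As a sketch of the kettle.properties approach, the lines below define hypothetical connection variables. The file lives in the .kettle directory under the user's home (or under KETTLE_HOME if that variable is set), and the values become available to every job and transformation as ${DW_HOST}, ${DW_DB}, and so on:

    # Example entries for ~/.kettle/kettle.properties (variable names are examples).
    DW_HOST=localhost
    DW_PORT=3306
    DW_DB=js_dw
    DW_USER=etl_user
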
Avoid overloading your transformations:
A transformation should do one precise task. If it doesn’t, think of splitting it into two or more transformations, or of creating subtransformations. Doing this will make your transformations clearer and, in the case of subtransformations, reusable.
Handle errors:
Try to figure out the kinds of errors that may happen and trap them by validating and handling errors, and by taking appropriate actions such as fixing data, taking alternative paths, sending friendly messages to the log files, and so on.
Keep track of jobs and transformations history:
You can use a versioning system such as Subversion. Doing so, you can recover older versions of your jobs and transformations or examine the history of how they changed.
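
A minimal sketch with Subversion, assuming a repository already exists and your working copy contains the lk_transformations folder; adjust the paths and message to your project:

    # Put the job and transformation files under version control and record a change.
    svn add *.kjb lk_transformations/*.ktr
    svn commit -m "Load the time dimension only when lk_time is empty"
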
For additional help, visit the main PDI forum page.


About the Author

Data Analyst & Machine Learning Associate

As a Data Analyst and Machine Learning Associate, Nishtha combines her analytical skills and machine learning knowledge to interpret complex datasets. She is also a passionate storyteller who turns key findings into compelling narratives that drive data-driven decision-making in the business world.