HBase MapReduce Integration Examples
One of the great features of HBase is its tight integration with Hadoop’s MapReduce framework.
7.1.1 MapReduce Introduction
MapReduce was designed to process terabytes of data, and more, in a scalable way. The goal is a system whose performance increases linearly with the number of physical machines added. MapReduce follows a divide-and-conquer approach: the data, located on a distributed filesystem, is split into chunks so that the available servers (or rather CPUs, or in modern terms, cores) can each read a chunk and process it as fast as they can. The catch is that the partial results have to be consolidated at the end, and MapReduce has that step built in as well.
The figure of the MapReduce process also shows the classes that are involved in the Hadoop implementation of MapReduce.
The first class involved is the InputFormat. It splits the input data, and then it returns a RecordReader instance that defines the classes of the key and value objects and provides a next() method that is used to iterate over each input record.
Next comes the Mapper stage. In this step, each record read by the RecordReader is processed by the map() method.
The Reducer stage and its class hierarchy are very similar to those of the Mapper stage. This time the input is the output of a Mapper class, and it is processed after the data has been shuffled and sorted.
The final stage is the OutputFormat class, and its job is to persist the data in various locations. There are specific implementations that allow output to files, or to HBase tables in the case of the TableOutputFormat class. It uses a TableRecordWriter to write the data into the specific HBase output table.
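To make the flow of the stages above more concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code and uses no Hadoop classes: the class name MiniMapReduce and its methods are hypothetical, and a word count stands in for the real processing. It only illustrates the order of events: records are read, each record goes through a map step, the pairs are shuffled and sorted by key, and a reduce step consolidates the values per key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the MapReduce stages (not Hadoop code).
public class MiniMapReduce {

    // Map stage: turn one input record into zero or more (key, value) pairs.
    public static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word.toLowerCase(), 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group all values by key; the TreeMap keeps keys sorted.
    public static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: consolidate all values that share one key.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Drives the stages in order and returns the final, consolidated result.
    public static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : records) {
            mapped.addAll(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hbase=2, rows=2, scans=1, stores=1}
        System.out.println(run(Arrays.asList("hbase stores rows", "hbase scans rows")));
    }
}
```

In a real job the framework runs many map and reduce tasks in parallel on different machines, and the shuffle moves data across the network. The sketch runs everything in one thread purely to show the data flow.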
7.1.3 Supporting Classes
The MapReduce support comes with the TableMapReduceUtil class that helps in setting up MapReduce jobs over HBase. It has static methods that configure a job so that you can run it with HBase as the source and/or the target.
7.2 MapReduce over HBase
To run a MapReduce job that needs classes from libraries not shipped with Hadoop or the MapReduce framework, you’ll need to make those libraries available before the job is executed. You have two choices: static preparation of all task nodes, or supplying everything needed with the job.
For a library that is used often, it is useful to permanently install its JAR file(s) locally on the task tracker machines, that is, those machines that run the MapReduce tasks. To do so:
- Copy the JAR files into a common location on all nodes.
- Add the full paths of the JAR files to the HADOOP_CLASSPATH variable in the hadoop-env.sh configuration file:
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"
- Restart all task trackers for the changes to take effect.
Obviously this technique is quite static, and every update (e.g., to add new libraries) requires a restart of the task tracker daemons.
In case you need to provide different libraries to each job you want to run, or you want to update the library versions along with your job classes, then using the dynamic provisioning approach is more useful.
7.2.2 Data Source and Sink
The source or target of a MapReduce job can be an HBase table, but it is also possible for a job to use HBase as both input and output. In other words, a third kind of MapReduce template uses a table for both the input and the output type. This involves setting the TableInputFormat and TableOutputFormat classes into the respective fields of the job configuration.
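As an illustration of the table-to-table case, the following sketch copies rows from one table to another. It assumes an HBase 1.x or later client API on the classpath, and the table names source_table and target_table as well as the class names are hypothetical. Rather than setting TableInputFormat and TableOutputFormat by hand, it relies on the static helpers of TableMapReduceUtil, which configure those classes on the job. Treat it as a starting point under these assumptions, not as a tuned implementation.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical job: reads every row of one table and writes it to another.
public class TableCopySketch {

    // Mapper that receives (row key, Result) pairs from the source table
    // and emits a Put holding the same cells for the target table.
    public static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(row.get());
            for (Cell cell : value.rawCells()) {
                put.add(cell);
            }
            context.write(row, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "table copy sketch");
        job.setJarByClass(TableCopySketch.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch more rows per RPC than the default
        scan.setCacheBlocks(false);  // a full scan should not fill the block cache

        // Source: configures the table input side of the job.
        TableMapReduceUtil.initTableMapperJob(
                "source_table", scan, CopyMapper.class,
                ImmutableBytesWritable.class, Put.class, job);

        // Sink: configures the table output side; no reducer class is needed here.
        TableMapReduceUtil.initTableReducerJob("target_table", null, job);
        job.setNumReduceTasks(0);    // map-only job, the Puts go straight to the table

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the mapper already emits Put objects, the sketch skips the reduce phase entirely. A job that aggregates data before writing would instead pass a TableReducer subclass to initTableReducerJob and keep at least one reduce task.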