Performance Tuning Techniques
This section covers various advanced techniques for tuning an HBase cluster and testing it repeatedly to verify its performance.
8.1 Garbage Collection Tuning
One of the lower-level settings you need to adjust concerns the garbage collection parameters of the region server processes. Note that the master is not a problem here, as it does not handle any heavy loads and data does not pass through it. These parameters only need to be added to the region servers.
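A common place to set these parameters is the HBASE_REGIONSERVER_OPTS variable in hbase-env.sh. The following is a minimal sketch using the CMS collector; the heap and new-generation sizes are example values only and must be adapted to your hardware:

export HBASE_REGIONSERVER_OPTS="-Xmx8g -Xms8g -Xmn128m \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

Starting the concurrent collection early (here at 70 percent heap occupancy) helps avoid promotion failures, since the block cache and the memstores together keep a large, fairly stable portion of the heap occupied.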
8.2 Memstore-Local Allocation Buffer
Version 0.90 of HBase introduced an advanced mechanism to mitigate the issue of heap fragmentation due to too much churn on the memstore instances of a region server: the memstore-local allocation buffers, or MSLAB for short.
The MSLABs are buffers of fixed sizes containing KeyValue instances of varying sizes. Whenever a buffer cannot completely fit a newly added KeyValue, it is considered full and a new buffer is created, once again of the given fixed size.
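MSLAB behavior is controlled by a handful of properties in hbase-site.xml. As a sketch, the values below reflect the usual defaults of that era; check your release before relying on them:

<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value> <!-- the fixed buffer size: 2 MB -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.max.allocation</name>
  <value>262144</value> <!-- KeyValues larger than 256 KB bypass the MSLAB -->
</property>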
8.3 Compression
HBase comes with support for a number of compression algorithms that can be enabled at the column family level. Unless you are storing data that is already compressed, such as JPEG images, compression usually will yield overall better performance, because the overhead of the CPU performing the compression and decompression is less than what is required to read more data from disk.
8.3.1 Available Codecs
You can choose from a fixed list of supported compression algorithms. They have different qualities when it comes to compression ratio, as well as CPU and installation requirements.
8.3.2 Verifying Installation
Once you have installed a supported compression algorithm, it is highly recommended that you check if the installation was successful. There are a few mechanisms in HBase to do that.
HBase includes a tool to test if compression is set up properly. To run it, type ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest. This will return information on how to run the tool:
$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest
Usage: CompressionTest <path> none|gz|lzo|snappy
For example:
$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/testfile gz
8.3.3 Enabling Compression
Enabling compression requires the installation of the JNI and native compression libraries (unless you only want to use the pure-Java GZIP compression). Once these are in place, you can set the compression algorithm when creating a table:
hbase(main):001:0> create 'testtable', { NAME => 'colfam1', COMPRESSION => 'GZ' }
0 row(s) in 1.1920 seconds
hbase(main):012:0> describe 'testtable'
DESCRIPTION ENABLED
{NAME => 'testtable', FAMILIES => [{NAME => 'colfam1', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS
=> '3', COMPRESSION => 'GZ', TTL => '2147483647', BLOCKSIZE
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0400 seconds
The describe shell command is used to read back the schema of the newly created table. You can see the compression is set to GZIP (using the shorter GZ value, as required). Another option is to enable, change, or disable the compression algorithm on an existing table using the alter command. Changing the compression format to NONE will disable compression for the given column family.
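For example, the following shell session sketches how compression could be switched off again for the column family created above; in these releases the table must be disabled before it can be altered:

hbase(main):001:0> disable 'testtable'
hbase(main):002:0> alter 'testtable', { NAME => 'colfam1', COMPRESSION => 'NONE' }
hbase(main):003:0> enable 'testtable'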
8.4 Load Balancing
The master has a built-in feature called the balancer. By default, the balancer runs every five minutes, as configured by the hbase.balancer.period property. Once started, it attempts to even out the number of assigned regions per region server so that each server is within one region of the average number per server. The call first determines a new assignment plan, which describes which regions should be moved where. Then it starts the process of moving the regions by calling the unassign() method of the administrative API iteratively.
The balancer has an upper limit on how long it is allowed to run, which is configured using the hbase.balancer.max.balancing property and defaults to half of the balancer period value, or two and a half minutes.
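Besides the periodic runs, you can control the balancer from the HBase shell. As a sketch: balance_switch toggles the balancer on or off (printing the previous state), and balancer triggers a run, indicating whether one was started:

hbase(main):001:0> balance_switch false
hbase(main):002:0> balance_switch true
hbase(main):003:0> balancer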
8.5 Merging Regions
While it is much more common for regions to split automatically over time as you are adding data to the corresponding table, sometimes you may need to merge regions, for example, after you have removed a large amount of data and you want to reduce the number of regions hosted by each server.
HBase ships with a tool that allows you to merge two adjacent regions as long as the cluster is not online. You can use the command-line tool to get the usage details:
$ ./bin/hbase org.apache.hadoop.hbase.util.Merge
Usage: bin/hbase merge <table-name> <region-1> <region-2>
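The two region names are taken from the .META. table; the names below are hypothetical placeholders, shown only to illustrate the call. Remember that the cluster must be offline while the tool runs:

$ ./bin/hbase org.apache.hadoop.hbase.util.Merge testtable \
    testtable,,1309812163891.<encoded-name-1>. \
    testtable,row-500,1309812163891.<encoded-name-2>.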
8.6 Client API: Best Practices
When reading or writing data from a client using the API, there are a handful of optimizations you should consider to gain the best performance. Here is a list of the best-practice options (a combined sketch in Java follows the list):
- When performing a lot of put operations, make sure the auto-flush feature of HTable is set to false, using the setAutoFlush(false) method.
- If HBase is used as an input source for a MapReduce job, for example, make sure the input Scan instance to the MapReduce job has setCaching() set to something greater than the default of 1.
- Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected; restrict the scan to the column families and columns you actually need.
- Close ResultScanner instances as soon as you are done with them. This isn't so much about improving performance, but rather avoiding performance problems.
- Scan instances can be set to use the block cache in the region server via the setCacheBlocks() method. For scans that read each row only once, such as MapReduce jobs, this should be set to false.
- Optimal loading of row keys: when you only need the row keys, combine a FirstKeyOnlyFilter and a KeyOnlyFilter in a FilterList to avoid transferring the actual cell data.
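The following Java sketch ties these options together against the 0.90-era client API. The table, family, qualifier, and row names are placeholders, and the caching value of 500 is illustrative only:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientBestPractices {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");

    // Buffer puts on the client side instead of one RPC per put.
    table.setAutoFlush(false);
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val-1"));
    table.put(put); // held in the client-side write buffer for now

    Scan scan = new Scan();
    scan.setCaching(500);       // transfer 500 rows per RPC, not the default of 1
    scan.setCacheBlocks(false); // a one-time full scan should not churn the block cache
    // Restrict the selected attributes to what is actually processed
    // (optional here, since the filters below drop the cell data anyway).
    scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));

    // Optimal loading of row keys: ship no cell data, only the keys.
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    filters.addFilter(new FirstKeyOnlyFilter());
    filters.addFilter(new KeyOnlyFilter());
    scan.setFilter(filters);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process result.getRow() here
      }
    } finally {
      scanner.close(); // always release the scanner and its server-side resources
    }
    table.flushCommits(); // send the buffered puts
    table.close();
  }
}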
8.7 Configuration
Many configuration properties are available for you to use to fine-tune your cluster setup. The following adjustments are commonly made (see the hbase-site.xml sketch after this list):
- Decrease ZooKeeper timeout (zookeeper.session.timeout) so that crashed region servers are detected, and their regions reassigned, more quickly.
- Increase blocking store files (hbase.hstore.blockingStoreFiles) so that write-heavy workloads are not stalled while compactions catch up.
- Increase block multiplier (hbase.hregion.memstore.block.multiplier) to let the memstores absorb short spikes in writes before updates are blocked.
- Decrease maximum logfiles (hbase.regionserver.maxlogs) to force flushes earlier when many write-ahead logs accumulate, which can help on memory-constrained machines.
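As a sketch, such settings go into hbase-site.xml. The values below are illustrative only and need to be derived from your own workload, hardware, and heap size:

<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>15</value>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>16</value>
</property>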
8.8 Load Tests
After installing your cluster, it is advisable to run performance tests to verify its functionality. These tests give you a baseline to refer back to after making changes to the configuration of the cluster or to the schemas of your tables. Doing a burn-in of your cluster will show you how much it can sustain, but this does not replace a test with the load you expect from your use case.
8.8.1 Performance Evaluation
HBase ships with its own tool to execute a performance evaluation. It is aptly named PerformanceEvaluation (PE), and it prints its usage details when invoked without command-line parameters:
$ ./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
[--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>
To run a single evaluation client:
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1
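Once a sequentialWrite run has created test data, you can, for example, read it back randomly with the matching read command (other commands include sequentialRead, randomWrite, and scan):

$ ./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 1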
8.8.2 YCSB
The Yahoo! Cloud Serving Benchmark (YCSB) is a suite of tools that can be used to run comparable workloads against different storage systems. While primarily built to compare various systems, it is also a reasonable tool for performing an HBase cluster burn-in—or performance test.
YCSB installation
YCSB is available in an online repository only, and you need to compile a binary version yourself. The first thing to do is to clone the repository:
$ git clone http://github.com/brianfrankcooper/YCSB.git
Initialized empty Git repository in /private/tmp/YCSB/.git/
…
Resolving deltas: 100% (475/475), done.
This will create a local YCSB directory in your current path. The next step is to change into the newly created directory, copy the required libraries for HBase, and compile the executable code:
$ cd YCSB/
$ cp $HBASE_HOME/hbase*.jar db/hbase/lib/
$ cp $HBASE_HOME/lib/*.jar db/hbase/lib/
$ ant
Buildfile: /private/tmp/YCSB/build.xml
...
makejar:
[jar] Building jar: /private/tmp/YCSB/build/ycsb.jar
BUILD SUCCESSFUL
Total time: 1 second
$ ant dbcompile-hbase
...
BUILD SUCCESSFUL
Total time: 1 second
This process only takes seconds and leaves you with an executable JAR file in the build directory.
YCSB can hardly emulate the exact workload your production application will generate, but it can still be useful to test a varying set of loads on your cluster. Use the supplied workloads, or create your own, to emulate cases that are bound to read, write, or both kinds of operations.
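As a sketch of how a run could look with the YCSB version of that era: the target table (here usertable with a column family named family, which must be created in the HBase shell beforehand) and the record and operation counts are assumptions for illustration. The -load flag fills the table, -t runs the transaction phase, -P selects a workload file, and -s prints periodic status:

$ java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load \
    -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada \
    -p columnfamily=family -p recordcount=100000 -s > load.log

$ java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t \
    -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada \
    -p columnfamily=family -p operationcount=100000 -s > transactions.log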