Advanced Features of HBase Client API
6.1 Filters
HBase filters are a powerful feature that can greatly enhance your effectiveness when working with data stored in tables.
The two prominent read functions of HBase are get() and scan(), both supporting either direct access to data or the use of a start and end key, respectively. You can limit the data retrieved by progressively adding more limiting selectors to the query. These include column families, column qualifiers, timestamps or ranges, as well as the number of versions.
Filters are configured on the client, serialized over the network, and then applied on the server side when the data is read.
The lowest level in the filter hierarchy is the Filter interface, together with the abstract FilterBase class, which implements an empty shell, or skeleton, that the actual filter classes use to avoid duplicating boilerplate code. Most concrete filter classes are direct descendants of FilterBase, but a few use another, intermediate ancestor class. They all work the same way: you define a new instance of the filter you want to apply and hand it to the Get or Scan instance, using:
setFilter(filter)
When you initialize the filter instance, you often have to supply parameters for whatever the filter is designed to do. There is a special subset of filters, based on CompareFilter, that asks you for at least two specific parameters, since they are used by the base class to perform its task. You will learn about these two parameter types next so that you can use them in context.
CompareFilter-based filters add one more feature to the base FilterBase class, namely the compare() operation. It requires a user-supplied operator type that defines how the result of the comparison is interpreted.
The second type that you need to provide to CompareFilter-related classes is a comparator, which is needed to compare various values and keys in different ways. Comparators are derived from WritableByteArrayComparable, which implements Writable and Comparable.
6.2 Comparison Filters
The first group of supplied filter implementations comprises the comparison filters. They take the comparison operator and the comparator instance described earlier. The constructor of each of them has the same signature, inherited from CompareFilter:
CompareFilter(CompareOp valueCompareOp, WritableByteArrayComparable valueComparator)
You need to supply the comparison operator and the comparator class for the filters to do their work. Next you will see the actual filters, each implementing a specific comparison.
- RowFilter
This filter gives you the ability to filter data based on row keys.
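For example, the following sketch (assuming an existing HTable instance named table; the row key and column names are made up for illustration) returns all rows whose key sorts at or before row-22:
Scan scan = new Scan();
// "colfam1", "col-0", and "row-22" are illustrative names
scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-0"));
// include rows with a key less than or equal to "row-22"
Filter filter = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
  new BinaryComparator(Bytes.toBytes("row-22")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
  System.out.println(res);
}
scanner.close();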
- FamilyFilter
This filter works very similarly to the RowFilter, but applies the comparison to the column families available in a row, as opposed to the row key. Using the available combinations of operators and comparators, you can filter what is included in the retrieved data on a column family level.
- QualifierFilter
This filter allows you to filter specific columns from the table.
- ValueFilter
This filter makes it possible to include only columns that have a specific value. Combined with the RegexStringComparator, for example, it can filter using powerful expression syntax.
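As a brief sketch, assuming an existing Scan instance and made-up cell values, a regular expression can select all cells whose value contains any character followed by a 4:
// the pattern ".4" is an illustrative regular expression
Filter filter = new ValueFilter(CompareFilter.CompareOp.EQUAL,
  new RegexStringComparator(".4"));
scan.setFilter(filter);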
- DependentColumnFilter
This filter uses a dependent column, or reference column, to control how other columns are filtered. It uses the timestamp of the reference column and includes all other columns that have the same timestamp.
6.3 Dedicated Filters
The second type of supplied filters are based directly on FilterBase and implement more specific use cases. Many of these filters are only really applicable when performing scan operations since they filter out entire rows. For get() calls, this is often too restrictive and would result in a very harsh filter approach: include the whole row or nothing at all.
- SingleColumnValueFilter
You can use this filter when you have exactly one column that decides if an entire row should be returned or not. You need to first specify the column you want to track, and then some value to check against.
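A minimal sketch, with made-up family, qualifier, and value names:
// track column "col-5" in family "colfam1" (illustrative names); keep rows
// whose cell in that column does not contain the substring "val-5"
SingleColumnValueFilter filter = new SingleColumnValueFilter(
  Bytes.toBytes("colfam1"), Bytes.toBytes("col-5"),
  CompareFilter.CompareOp.NOT_EQUAL,
  new SubstringComparator("val-5"));
// also drop rows that do not have the reference column at all
filter.setFilterIfMissing(true);
Scan scan = new Scan();
scan.setFilter(filter);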
- SingleColumnValueExcludeFilter
The SingleColumnValueFilter we just discussed is extended in this class to provide slightly different semantics: the reference column, as handed into the constructor, is omitted from the result.
- PrefixFilter
Given a prefix, specified when you instantiate the filter instance, all rows that match this prefix are returned to the client. The constructor is:
public PrefixFilter(byte[] prefix)
- PageFilter
This filter lets you paginate through rows. When you create the instance, you specify a pageSize parameter, which controls how many rows per page are returned.
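Because the filter is applied separately on each region server, the client has to restart the scan for every new page. The following sketch (assuming an existing HTable instance named table) pages through a table 15 rows at a time; the one-byte postfix is appended to the last seen row key so that the next scan starts immediately after it:
Filter filter = new PageFilter(15);
// a zero byte is the smallest possible postfix for a row key
byte[] POSTFIX = new byte[] { 0x00 };
byte[] lastRow = null;
while (true) {
  Scan scan = new Scan();
  scan.setFilter(filter);
  if (lastRow != null) {
    // start the next page right after the last row of the previous page
    scan.setStartRow(Bytes.add(lastRow, POSTFIX));
  }
  ResultScanner scanner = table.getScanner(scan);
  int localRows = 0;
  for (Result result : scanner) {
    localRows++;
    lastRow = result.getRow();
  }
  scanner.close();
  if (localRows == 0) break;  // no rows left, stop paging
}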
- KeyOnlyFilter
If your application only needs the keys of each KeyValue, while omitting the actual data, the KeyOnlyFilter provides this functionality: it uses the filter's ability to modify the processed columns and cells as they pass through, applying the KeyValue.convertToKeyOnly(boolean) call to strip out the data part.
- FirstKeyOnlyFilter
If you need to access the first column in each row (as sorted implicitly by HBase), this filter provides that feature.
- InclusiveStopFilter
The row boundaries of a scan are inclusive for the start row, yet exclusive for the stop row. You can overcome the stop-row semantics using this filter, which includes the specified stop row.
- TimestampsFilter
When you need fine-grained control over which versions are included in the scan result, this filter provides the means. You have to hand in a List of timestamps:
TimestampsFilter(List<Long> timestamps)
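A brief sketch, with arbitrary example timestamps:
// only cells with exactly these timestamps (illustrative values) are returned
List<Long> ts = new ArrayList<Long>();
ts.add(new Long(5));
ts.add(new Long(10));
ts.add(new Long(15));
Filter filter = new TimestampsFilter(ts);
Scan scan = new Scan();
scan.setFilter(filter);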
- ColumnCountGetFilter
You can use this filter to retrieve only a specific maximum number of columns per row. You can set the number using the constructor of the filter:
ColumnCountGetFilter(int n)
- ColumnPaginationFilter
Similar to the PageFilter, this one can be used to page through columns in a row. Its constructor has two parameters:
ColumnPaginationFilter(int limit, int offset)
It skips all columns up to the number given as offset, and then includes limit columns afterward.
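For example, to skip the first 15 columns of every row and return the following 5:
// limit 5, offset 15: columns 16 through 20 of each row are returned
Filter filter = new ColumnPaginationFilter(5, 15);
Scan scan = new Scan();
scan.setFilter(filter);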
- ColumnPrefixFilter
Analogous to the PrefixFilter, which filters on row key prefixes, this filter does the same for columns. You specify a prefix when creating the filter:
ColumnPrefixFilter(byte[] prefix)
All columns that have the given prefix are then included in the result.
- RandomRowFilter
Finally, there is a filter that shows what is also possible using the API: including random rows in the result. The constructor takes a parameter named chance, which represents a value between 0.0 and 1.0:
RandomRowFilter(float chance)
6.4 Decorating Filters
It can be useful to modify, or extend, the behavior of a filter to gain additional control over the returned data. Some of this additional control is not dependent on the filter itself, but can be applied to any of them. This is what the decorating filter group of classes is about.
- SkipFilter
This filter wraps a given filter and extends it to exclude an entire row when the wrapped filter hints for a KeyValue to be skipped. In other words, as soon as a filter indicates that a column in a row is omitted, the entire row is omitted.
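A short sketch, wrapping a ValueFilter with an illustrative value of val-0:
// without the SkipFilter, only the non-matching cells are removed; with it,
// any row containing a "val-0" cell (made-up value) is dropped entirely
Filter filter = new SkipFilter(new ValueFilter(
  CompareFilter.CompareOp.NOT_EQUAL,
  new BinaryComparator(Bytes.toBytes("val-0"))));
Scan scan = new Scan();
scan.setFilter(filter);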
- WhileMatchFilter
This second decorating filter type works somewhat similarly to the previous one, but aborts the entire scan once a piece of information is filtered. This works by checking the wrapped filter and seeing if it skips a row by its key, or a column of a row because of a KeyValue check.
- FilterList
You may want to have more than one filter applied to reduce the data returned to your client application. This is what the FilterList is for.
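A minimal sketch, combining two RowFilter instances (with made-up row keys) to select a range of rows:
List<Filter> filters = new ArrayList<Filter>();
// "row-03" and "row-06" are illustrative boundary keys
filters.add(new RowFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
  new BinaryComparator(Bytes.toBytes("row-03"))));
filters.add(new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
  new BinaryComparator(Bytes.toBytes("row-06"))));
// MUST_PASS_ALL: a row is included only if every filter in the list agrees
FilterList filterList = new FilterList(
  FilterList.Operator.MUST_PASS_ALL, filters);
Scan scan = new Scan();
scan.setFilter(filterList);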
- Custom Filters
Eventually, you may exhaust the list of supplied filter types and need to implement your own. This can be done by either implementing the Filter interface or extending the provided FilterBase class. The latter provides default implementations for all methods that are members of the interface.
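The following is a minimal sketch of a custom filter, assuming the Writable-based filter API of this HBase generation; the class name and the semantics (keep a row only if one of its cells contains a given value) are made up for illustration:
// "CustomFilter" is a hypothetical example class, not a supplied filter
public class CustomFilter extends FilterBase {
  private byte[] value = null;
  private boolean filterRow = true;

  public CustomFilter() {
    super();  // no-argument constructor, needed for deserialization
  }

  public CustomFilter(byte[] value) {
    this.value = value;  // the value to look for in the row's cells
  }

  @Override
  public void reset() {
    this.filterRow = true;  // reset the per-row flag for every new row
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    if (Bytes.compareTo(value, kv.getValue()) == 0) {
      filterRow = false;  // found the value: this row should be included
    }
    return ReturnCode.INCLUDE;  // always include cells; filterRow() decides
  }

  @Override
  public boolean filterRow() {
    return filterRow;  // returning true removes the entire row
  }

  public void write(DataOutput out) throws IOException {
    Bytes.writeByteArray(out, value);  // serialize the filter to the server
  }

  public void readFields(DataInput in) throws IOException {
    value = Bytes.readByteArray(in);  // deserialize on the server side
  }
}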
6.5 Counters
Many applications that collect statistics, such as clicks or views in online advertising, used to collect the data in logfiles that would subsequently be analyzed. Using counters offers the potential of switching to live accounting, foregoing the delayed batch processing step completely. HBase has a mechanism to treat columns as counters; without it, you would have to lock a row, read the value, increment it, write it back, and eventually unlock the row for other writers to be able to access it subsequently.
You should not initialize counters, as they are automatically assumed to be zero when you first use a new counter, that is, a column qualifier that does not yet exist. The first increment call to a new counter will return 1—or the increment value, if you have specified one—as its result.
The first type of increment call is for single counters only: you need to specify the exact column you want to use.
Example – Using the single counter increment methods
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "counters");
// increase the counter by one
long cnt1 = table.incrementColumnValue(Bytes.toBytes("20110101"),
  Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
// increase the counter by one a second time
long cnt2 = table.incrementColumnValue(Bytes.toBytes("20110101"),
  Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
// get the current value of the counter without increasing it
long current = table.incrementColumnValue(Bytes.toBytes("20110101"),
  Bytes.toBytes("daily"), Bytes.toBytes("hits"), 0);
// decrease the counter by one
long cnt3 = table.incrementColumnValue(Bytes.toBytes("20110101"),
  Bytes.toBytes("daily"), Bytes.toBytes("hits"), -1);
The output on the console is:
cnt1: 1, cnt2: 2, current: 2, cnt3: 1
Another way to increment counters is provided by the increment() call of HTable. It works similarly to the CRUD-type operations, using the following method to do the increment:
Result increment(Increment increment) throws IOException
You must create an instance of the Increment class and fill it with the appropriate details—for example, the counter coordinates. The constructors provided by this class are:
Increment() {}
Increment(byte[] row)
Increment(byte[] row, RowLock rowLock)
You must provide a row key when instantiating an Increment, which sets the row containing all the counters that the subsequent call to increment() should modify.
The optional parameter rowLock specifies a custom row lock instance, allowing you to run the entire operation under your exclusive control—for example, when you want to modify the same row a few times while protecting it against updates from other writers.
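A brief sketch, reusing the counters table from the earlier example (the clicks counter is made up for illustration), that increments two counters in the same row in one call:
Increment increment1 = new Increment(Bytes.toBytes("20110101"));
// add the counter coordinates and the amount to increment by;
// "clicks" is a hypothetical second counter column
increment1.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("clicks"), 1);
increment1.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
Result result1 = table.increment(increment1);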
6.6 Coprocessors
With the coprocessor feature in HBase, you can even move part of the computation to where the data lives. A coprocessor enables you to run arbitrary code directly on each region server. More precisely, it executes the code on a per-region basis, giving you trigger-like functionality—similar to stored procedures in the RDBMS world. From the client side, you do not have to take specific actions, as the framework handles the distributed nature transparently.
6.6.1 The Coprocessor Class
All coprocessor classes must be based on the Coprocessor interface. It defines the basic contract of a coprocessor and facilitates the management by the framework itself. The interface provides two enumerations, which are used throughout the framework: Priority and State.
Coprocessors are managed by the framework in their own life cycle. To that effect, the Coprocessor interface offers two calls:
void start(CoprocessorEnvironment env) throws IOException;
void stop(CoprocessorEnvironment env) throws IOException;
These two methods are called when the coprocessor class is started, and eventually when it is decommissioned. The provided CoprocessorEnvironment instance is used to retain the state across the lifespan of the coprocessor instance. A coprocessor instance is always contained in a provided environment.
6.6.2 Coprocessor Loading
Coprocessors are loaded in a variety of ways. You can either configure coprocessors to be loaded in a static way, or load them dynamically while the cluster is running. The static method uses the configuration files and table schemas.
- Loading from the configuration
You can configure globally which coprocessors are loaded when HBase starts. This is done by adding one, or more, of the following to the hbase-site.xml configuration file:
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>coprocessor.RegionObserverExample, coprocessor.AnotherCoprocessor</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>coprocessor.MasterObserverExample</value>
</property>
<property>
  <name>hbase.coprocessor.wal.classes</name>
  <value>coprocessor.WALObserverExample, bar.foo.MyWALObserver</value>
</property>
- Loading from the table descriptor
The other option for defining which coprocessors to load is the table descriptor. As this is per table, the coprocessors defined here are only loaded for regions of that table, and only by the region servers. In other words, you can only use this approach for region-related coprocessors, not for master- or WAL-related ones.
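A hedged sketch, assuming a coprocessor JAR at a made-up HDFS path and the Priority enumeration mentioned earlier; the attribute value follows the path|class|priority format:
// the table name, column family, JAR path, and observer class are
// hypothetical examples
HTableDescriptor htd = new HTableDescriptor("testtable");
htd.addFamily(new HColumnDescriptor("colfam1"));
// register the coprocessor with the table before creating it
htd.setValue("COPROCESSOR$1",
  "hdfs://localhost:8020/user/hbase/coprocessor.jar" +
  "|coprocessor.RegionObserverExample|" + Coprocessor.Priority.USER);
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(htd);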
6.6.3 The RegionObserver Class
The first subclass of Coprocessor we will look into is the one used at the region level: the RegionObserver class.
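As a minimal hedged sketch (the fixed row key and the idea of returning the server time are made up for illustration), a region observer can extend the supplied BaseRegionObserver class and intercept get() calls before they reach the region:
public class RegionObserverExample extends BaseRegionObserver {
  // a hypothetical "magic" row key that triggers the special behavior
  public static final byte[] FIXED_ROW = Bytes.toBytes("@@@GETTIME@@@");

  @Override
  public void preGet(ObserverContext<RegionCoprocessorEnvironment> e,
      Get get, List<KeyValue> results) throws IOException {
    if (Bytes.equals(get.getRow(), FIXED_ROW)) {
      // inject a synthetic cell carrying the current server time
      KeyValue kv = new KeyValue(get.getRow(), FIXED_ROW, FIXED_ROW,
        Bytes.toBytes(System.currentTimeMillis()));
      results.add(kv);
    }
  }
}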
6.6.4 The MasterObserver Class
The second subclass of Coprocessor discussed handles all possible callbacks the master server may initiate.
6.7 HTablePool
Instead of creating an HTable instance for every request from your client application, it makes much more sense to create one initially and subsequently reuse it.
The primary reason for doing so is that creating an HTable instance is a fairly expensive operation that takes a few seconds to complete. In a highly contended environment with thousands of requests per second, you would not be able to use this approach at all—creating the HTable instance would be too slow. You need to create the instance at startup and use it for the duration of your client’s life cycle.
Clients can solve this problem using the HTablePool class. It only serves one purpose, namely to pool client API instances to the HBase cluster.
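A short usage sketch (the table name is illustrative), pooling at most five HTable references:
Configuration conf = HBaseConfiguration.create();
// allow up to 5 pooled HTable instances per table name
HTablePool pool = new HTablePool(conf, 5);
// "testtable" is a made-up table name
HTableInterface table = pool.getTable("testtable");
// ... perform gets, puts, or scans on the table ...
pool.putTable(table);  // return the instance to the pool when done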