Top Answers to HBase Interview Questions
1. Compare HBase & Cassandra
|Criteria||HBase||Cassandra|
|Basis for the cluster||Hadoop||Peer-to-peer|
|Best suited for||Batch jobs||Data writes|
2. What is Apache HBase?
Apache HBase is a column-oriented database used to store sparse data sets. It runs on a Hadoop cluster, on top of the Hadoop Distributed File System (HDFS). Clients can access HBase data through either a native Java API or through a Thrift or REST gateway, making it accessible from any language. Some of the key properties of HBase include:
- NoSQL: HBase is not a traditional relational database (RDBMS). HBase relaxes the ACID (Atomicity, Consistency, Isolation, Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
- Wide-Column: HBase stores data in a table-like format with the ability to store billions of rows with millions of columns. Columns can be grouped together in “column families” which allows physical distribution of row values onto different cluster nodes.
- Distributed and Scalable: HBase groups rows into “regions”, which define how table data is split over multiple nodes in a cluster. If a region gets too large, it is automatically split to share the load across more servers.
- Consistent: HBase is architected to have “strongly-consistent” reads and writes, as opposed to other NoSQL databases that are “eventually consistent”. This means that once a write has been performed, all read requests for that data will return the same value.
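The "Distributed and Scalable" property above can be pictured with a small toy model. The sketch below is not HBase code: the class name RegionMap, its methods, and the server names are hypothetical, and it compares keys as Java strings, whereas HBase compares raw bytes (the two orderings agree for plain ASCII keys). It only illustrates the idea that each region covers a sorted range of row keys and that a split adds a new range boundary.

```java
import java.util.TreeMap;

/** Toy model of how sorted row keys map to regions; not the HBase implementation. */
class RegionMap {
    // Region start key -> name of the server hosting that region (hypothetical names).
    private final TreeMap<String, String> regions = new TreeMap<>();

    RegionMap(String firstServer) {
        // A new table starts with one region that begins at the empty key
        // and therefore covers every possible row key.
        regions.put("", firstServer);
    }

    /** Splits the region that contains splitKey; the new region starts at splitKey. */
    void split(String splitKey, String newServer) {
        regions.put(splitKey, newServer);
    }

    /** Returns the server whose region covers rowKey (greatest start key <= rowKey). */
    String locate(String rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }

    int regionCount() {
        return regions.size();
    }
}
```

For example, after new RegionMap("rs1") followed by split("user400", "rs2"), the key "user100" still resolves to rs1 while "user500" resolves to rs2.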
3. Give the name of the key components of HBase
The key components of HBase are Zookeeper, RegionServer, Region, Catalog Tables and HBase Master.
4. What is S3?
S3 stands for Simple Storage Service (Amazon S3). It is one of the file systems that HBase can use.
5. What is the use of get() method?
The get() method is used to read the data of a specific row from a table.
6. What is the reason of using HBase?
HBase is used because it provides random read and write operations and can perform a large number of operations per second on large data sets.
7. In how many modes can HBase run?
There are two run modes of HBase i.e. standalone and distributed.
8. What is the difference between Hive and HBase?
HBase supports record-level operations, whereas Hive does not.
9. Define column families?
A column family is a collection of columns, whereas a row is a collection of column families.
10. Define standalone mode in HBase?
It is the default mode of HBase. In standalone mode, HBase does not use HDFS (it uses the local filesystem instead), and it runs all HBase daemons and a local ZooKeeper in the same JVM process.
11. What are decorating filters?
Decorating filters wrap another filter to modify, or extend, its behavior and so gain additional control over the returned data. SkipFilter and WhileMatchFilter are examples.
12. What is the full form of YCSB?
YCSB stands for Yahoo! Cloud Serving Benchmark.
13. What is the use of YCSB?
It can be used to run comparable workloads against different storage systems.
14. Which operating system is supported by HBase?
HBase runs on any operating system that supports Java, such as Windows and Linux.
15. What is the most common file system of HBase?
The most common file system of HBase is HDFS i.e. Hadoop Distributed File System.
16. Define Pseudodistributed mode?
A pseudodistributed mode is simply a distributed mode that is run on a single host.
17. What is the regionservers file?
It is a configuration file which lists the known region server names.
18. Define MapReduce.
MapReduce as a process was designed to solve the problem of processing in excess of terabytes of data in a scalable way.
19. What are the operational commands of HBase?
Operational commands of HBase are Get, Delete, Put, Increment, and Scan.
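To make the semantics of these five operations concrete, here is a minimal in-memory sketch. It is a toy model and not the HBase client API: the class ToyCounterTable and its method signatures are hypothetical, each row holds a single long value, and keys are compared as Java strings. What it does mirror is that rows are kept sorted by key, that a scan covers a key range with an inclusive start row and an exclusive stop row, and that increment creates the counter if it does not exist yet.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy single-column table keyed by row; illustrates Put, Get, Delete, Increment, Scan. */
class ToyCounterTable {
    // Rows are kept sorted by key, which is what makes range scans cheap.
    private final TreeMap<String, Long> cells = new TreeMap<>();

    /** Put: insert a row or overwrite its value. */
    void put(String row, long value) {
        cells.put(row, value);
    }

    /** Get: read one row by key; returns null if the row does not exist. */
    Long get(String row) {
        return cells.get(row);
    }

    /** Delete: remove one row by key. */
    void delete(String row) {
        cells.remove(row);
    }

    /** Increment: add delta to a counter, creating it if absent; returns the new value. */
    long increment(String row, long delta) {
        return cells.merge(row, delta, Long::sum);
    }

    /** Scan: copy of all rows with startRow <= key < stopRow, in key order. */
    SortedMap<String, Long> scan(String startRow, String stopRow) {
        return new TreeMap<>(cells.subMap(startRow, stopRow));
    }
}
```

Note that scan throws IllegalArgumentException if startRow sorts after stopRow; this is a property of the underlying TreeMap, not a statement about HBase behavior.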
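20. Which code is used to open the connection in HBase?
The following code is used to open a connection:
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
This is the older client API. From HBase 1.0 onward, a table handle is obtained with ConnectionFactory.createConnection(myConf).getTable(TableName.valueOf("users")).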
21. Which command is used to show the version?
Version command is used to show the version of HBase.
Syntax – hbase> version
22. What is the use of the tools command?
This command is used to list the HBase surgery tools.
23. What is the use of shutdown command?
It is used to shut down the cluster.
24. What is the use of truncate command?
It is used to disable, drop, and recreate the specified table.
25. Which command is used to run HBase Shell?
The command $ ./bin/hbase shell is used to run the HBase shell.
26. Which command is used to show the current HBase user?
The whoami command is used to show the current HBase user.
27. How to delete the table with the shell?
To delete a table, first disable it with the disable command and then remove it with the drop command.
28. What is the use of InputFormat in the MapReduce process?
InputFormat splits the input data and then returns a RecordReader instance that defines the classes of the key and value objects, and provides a next() method that is used to iterate over each input record.
29. What is the full form of MSLAB?
MSLAB stands for Memstore-Local Allocation Buffer.
30. Define LZO?
Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed and is written in ANSI C.
31. What is HBaseFsck?
HBase comes with a tool called hbck which is implemented by the HBaseFsck class. It provides various command-line switches that influence its behavior.
32. What is REST?
REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.
33. Define Thrift?
Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.
34. What are the fundamental key structures of HBase?
The fundamental key structures of HBase are row key and column key.
35. What is JMX?
Java Management Extensions (JMX) is the standard technology for Java applications to export their status.
36. What is Nagios?
Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.
37. What is the syntax of describe Command?
The syntax of describe command is –
hbase> describe tablename
38. What is the use of the exists command?
The exists command is used to check whether the specified table exists.
39. What is the use of MasterServer?
MasterServer assigns regions to region servers and also handles load balancing.
40. What is HBase Shell?
HBase shell is a JRuby-based command-line interface through which we communicate with HBase. It wraps the Java client API.
41. What is the use of ZooKeeper?
ZooKeeper is used to maintain configuration information and to coordinate communication between region servers and clients. It also provides distributed synchronization.
42. Define catalog tables in HBase?
Catalog tables are used to maintain the metadata information.
43. Define cell in HBase?
The cell is the smallest unit of an HBase table. It stores a value and is identified by a {row, column, version} tuple.
44. Define compaction in HBase?
Compaction is a process which merges several HFiles into one file. Once the merged file has been created, the old files are deleted. There are different types of tombstone markers which make cells invisible, and these tombstone markers are deleted during compaction.
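The merge-and-purge idea can be sketched in a few lines. This is a simplified model, not HBase's compaction code: each "file" is just a map from row key to value, the TOMBSTONE constant is a hypothetical stand-in for a delete marker, and versions and timestamps are ignored. It shows newer files overriding older ones and tombstoned keys disappearing from the merged result, as happens in a major compaction.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a major compaction over key/value "files"; not HBase code. */
class CompactionSketch {
    /** Hypothetical marker value standing in for a delete tombstone. */
    static final String TOMBSTONE = "__DELETED__";

    /**
     * Merges the given files, ordered oldest to newest, into one sorted map.
     * Entries in later files replace entries in earlier files, and keys whose
     * final entry is a tombstone are dropped from the result.
     */
    static TreeMap<String, String> majorCompact(List<Map<String, String>> filesOldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> file : filesOldestFirst) {
            // A newer value (or tombstone) overwrites the older entry for the same key.
            merged.putAll(file);
        }
        // Purge deleted keys so neither the data nor the marker survives the merge.
        merged.values().removeIf(TOMBSTONE::equals);
        return merged;
    }
}
```

After the merge, the caller would discard the input files and keep only the returned map, which corresponds to deleting the old HFiles.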
45. What is the use of HColumnDescriptor class?
HColumnDescriptor stores information about a column family, such as compression settings and the number of versions.
46. What is the function of HMaster?
It is a MasterServer which is responsible for monitoring all regionserver instances in a cluster.
47. How many compaction types are in HBase?
There are two types of Compaction i.e. Minor Compaction and Major Compaction.
48. Define HRegionServer in HBase
It is a RegionServer implementation which is responsible for managing and serving regions.
49. Which filter accepts the pagesize as the parameter in HBase?
PageFilter accepts the pagesize as the parameter.
50. Which method is used to access HFile directly without using HBase?
The HFile.main() method is used to access an HFile directly without using HBase.
51. Which types of data can HBase store?
HBase can store any type of data that can be converted into bytes.
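As an illustration of "anything that can be converted into bytes", the sketch below round-trips a string and a long through byte arrays using only the Java standard library. The class ByteCodec and its method names are hypothetical. HBase ships its own helper for this purpose (org.apache.hadoop.hbase.util.Bytes, with methods such as Bytes.toBytes and Bytes.toLong); the standard-library version here is only meant to show the principle.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Converts values to and from byte arrays, the only form a byte-oriented store keeps. */
class ByteCodec {
    /** Encodes a string as UTF-8 bytes. */
    static byte[] fromString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes back into a string. */
    static String decodeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes a long as 8 bytes, most significant byte first (ByteBuffer's default order). */
    static byte[] fromLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Decodes 8 big-endian bytes back into a long. */
    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }
}
```

The store itself never learns the original type; the application must remember which decoder to apply to which column.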
52. What is the use of Apache HBase?
Apache HBase is used when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
53. What are the features of Apache HBase?
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters
- Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options
- Extensible JRuby-based (JIRB) shell
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
54. How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?
In HBase 0.96, the project moved to a modular structure. Adjust your project’s dependencies to rely upon the hbase-client module or another module as appropriate, rather than a single JAR. You can model your Maven dependency after one of the following, depending on your targeted version of HBase. See Section 3.5, “Upgrading from 0.94.x to 0.96.x” or Section 3.3, “Upgrading from 0.96.x to 0.98.x” for more information.
- Maven Dependency for HBase 0.98
- Maven Dependency for HBase 0.96
- Maven Dependency for HBase 0.94
55. How should I design my schema in HBase?
HBase schemas can be created or updated using ‘The Apache HBase Shell’ or by using ‘Admin in the Java API’.
Tables must be disabled when making ColumnFamily modifications, for example:
// Older client API; in HBase 1.0+ an Admin is obtained from Connection.getAdmin() instead.
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";
admin.disableTable(table);
HColumnDescriptor cf1 = …;
admin.addColumn(table, cf1); // adding new ColumnFamily
HColumnDescriptor cf2 = …;
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
admin.enableTable(table);
56. What is the Hierarchy of Tables in Apache HBase?
The hierarchy for tables in HBase is as follows:
- Tables
- Column families
- Rows
- Columns
- Cells
When a table is created, one or more column families are defined as high-level categories for storing data corresponding to an entry in the table. As is suggested by HBase being “column-oriented”, column family data for all table entries, or rows, are stored together. For a given (row, column family) combination, multiple columns can be written at the time the data is written. Therefore, two rows in an HBase table need not necessarily share the same columns, only column families. For each (row, column-family, column) combination HBase can store multiple cells, with each cell associated with a version, or timestamp corresponding to when the data was written. HBase clients can choose to only read the most recent version of a given cell, or read all versions.
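The hierarchy described above maps naturally onto nested sorted maps. The following is a minimal sketch under that assumption, not HBase's storage format: the class VersionedTable and its methods are hypothetical, and values are plain strings rather than byte arrays. It shows that two rows may use different columns within the same family, and that each (row, column family, column) combination keeps several timestamped versions with the newest first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of the HBase data model: row -> family -> column -> timestamp -> value. */
class VersionedTable {
    private final Map<String, Map<String, Map<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    /** Writes one version of a cell; versions are sorted newest first. */
    void put(String row, String family, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the most recent value of a cell, or null if it was never written. */
    String getLatest(String row, String family, String column) {
        NavigableMap<Long, String> versions = versions(row, family, column);
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** Returns every stored value of a cell, newest first. */
    List<String> getAllVersions(String row, String family, String column) {
        return new ArrayList<>(versions(row, family, column).values());
    }

    // Walks the hierarchy one level at a time; any missing level yields an empty result.
    private NavigableMap<Long, String> versions(String row, String family, String column) {
        Map<String, Map<String, NavigableMap<Long, String>>> families = rows.get(row);
        if (families == null) {
            return new TreeMap<Long, String>();
        }
        Map<String, NavigableMap<Long, String>> columns = families.get(family);
        if (columns == null) {
            return new TreeMap<Long, String>();
        }
        NavigableMap<Long, String> cells = columns.get(column);
        return cells == null ? new TreeMap<Long, String>() : cells;
    }
}
```

A client that only wants the current state calls getLatest, while one that needs history calls getAllVersions, matching the two read styles mentioned above.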
57. How can I troubleshoot my HBase cluster?
Always start with the master log. Normally it’s just printing the same lines over and over again. If not, then there’s an issue. Google or search-hadoop.com should return some hits for those exceptions you’re seeing.
An error rarely comes alone in Apache HBase. Usually when something gets screwed up, what follows may be hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they print some metrics when aborting, so grepping for Dump should get you around the start of the problem.
RegionServer suicides are ‘normal’, as this is what they do when something goes wrong. For example, if ulimit and max transfer threads (the two most important initial settings, see [ulimit] and dfs.datanode.max.transfer.threads) aren’t changed, at some point DataNodes will be unable to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database were suddenly unable to access files on your local file system; it’s the same with HBase and HDFS.
Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout. For more information on GC pauses, see the 3 part blog post by Todd Lipcon and Long GC pauses above.
58. Compare HBase with Cassandra?
Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.
Both are designed to manage extremely large data sets. HBase documentation proclaims that an HBase database should have hundreds of millions or — even better — billions of rows. Anything less, and you’re advised to stick with an RDBMS.
Both are distributed databases, not only in how data is stored but also in how the data can be accessed. Clients can connect to any node in the cluster and access any data.
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is, therefore, important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition — such as a column that stores the state field of a customer’s mailing address.
HBase lacks built-in support for secondary indexes but offers a number of mechanisms that provide secondary index functionality. These are described in HBase’s online reference guide and on HBase community.
59. Compare HBase with Hive?
Hive can help the SQL savvy to run MapReduce jobs. Since it’s JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries could take a while since they go over all of the data in the table by default. Nonetheless, the amount of data can be limited via Hive’s partitioning feature. Partitioning allows running a filter query over data that is stored in separate folders, reading only the data which matches the query. It could be used, for example, to only process files created between certain dates, if the files include the date as part of their name.
HBase works by storing data as key/value. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be deleted every now and then to clear space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, but not for columns, and it includes increment/counter functionality.
Hive and HBase are two different Hadoop-based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But hey, why not use them both? Just like Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.
60. What version of Hadoop do I need to run HBase?
Different versions of HBase require different versions of Hadoop. Consult the table below to find which version of Hadoop you will need:
HBase Release Number Hadoop Release Number
0.90.4 (current stable)
Releases of Hadoop can be found here. We recommend using the most recent version of Hadoop possible, as it will contain the most bug fixes. Note that HBase-0.2.x can be made to work on Hadoop-0.18.x. HBase-0.2.x ships with Hadoop-0.17.x, so to use Hadoop-0.18.x you must recompile Hadoop-0.18.x, remove the Hadoop-0.17.x jars from HBase, and replace them with the jars from Hadoop-0.18.x.
Also note that after HBase-0.2.x, the HBase release numbering schema will change to align with the Hadoop release number on which it depends.