HBase: The Hadoop Database
HBase is an open source, horizontally scalable platform. It is a distributed, column-oriented database built on top of the Hadoop file system (HDFS). It is a non-relational (NoSQL) database and an open source implementation modeled on Google’s Bigtable.
Column-oriented databases store data tables as sections of columns of data rather than as rows of data. HBase is a distributed, persistent, strictly consistent storage system with near-optimal write performance in terms of I/O channel saturation and excellent read performance. It makes efficient use of disk space by supporting pluggable compression algorithms that can be chosen based on the nature of the data in particular column families.
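To make the difference between row-oriented and column-oriented layouts concrete, here is a minimal sketch in plain Python. It is a toy illustration only: the sample records and the helper names `to_row_layout` and `to_column_layout` are assumptions made for this example, and HBase itself groups data by column family rather than storing each column as a simple list.

```python
# Toy illustration only; this is not how HBase lays out data on disk.
rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Linus", "city": "Helsinki"},
    {"id": 3, "name": "Grace", "city": "New York"},
]


def to_row_layout(records):
    """Row-oriented: all fields of one record are stored next to each other."""
    return [tuple(record.values()) for record in records]


def to_column_layout(records):
    """Column-oriented: all values of one column are stored next to each other."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

# Reading a single column touches one contiguous list in the column layout,
# whereas the row layout has to visit every record.
cities_from_columns = column_layout["city"]
cities_from_rows = [record[2] for record in row_layout]
```

Keeping the values of one column together is also what makes per-column-family compression effective, since similar values sit next to each other.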
HBase handles load shifting and failures gracefully and transparently to clients. Scalability is built in, and clusters can be grown or shrunk while the system is in production. Changing the cluster does not involve any complicated rebalancing or resharding procedure; it is fully automated.
Why do we need HBase?
An RDBMS has a number of limitations:
- Not well suited to unstructured data.
- Works well only for a limited number of records.
- Is not designed to hold denormalized data.
- Is a schema-oriented database.
Features of HBase
The features of HBase are as follows:
- Easy-to-use Java API for client access.
- Integrates with Hadoop, both as a source and destination.
- It is schema-less: it does not enforce a fixed column schema and defines only column families.
- Good for semi-structured as well as structured data.
- Automatic failover support.
- Provides data replication across clusters.
- It is linearly scalable.
- Provides fast lookups in large tables.
- Provides low-latency access to single rows from a collection of billions of records (random access).
- Provides random access by storing data in indexed files on HDFS, which makes lookups faster.
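The schema-less point above can be sketched with a small in-memory model. This is a minimal sketch, not the HBase client API: the class `ToyTable`, its `put` and `get` methods, and the sample row keys are all hypothetical names invented for illustration. The idea it shows is that column families are fixed when the table is defined, while the columns (qualifiers) inside a family can differ from row to row.

```python
class ToyTable:
    """Hypothetical in-memory stand-in for an HBase table (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is defined.
        self.column_families = set(column_families)
        # row key -> {(family, qualifier): value}
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are not declared up front; any name is accepted.
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key, family=None):
        cells = self.rows.get(row_key, {})
        if family is None:
            return dict(cells)
        return {key: value for key, value in cells.items() if key[0] == family}


table = ToyTable(column_families=["info", "stats"])
table.put("user#1", "info", "name", "Ada")
table.put("user#1", "info", "email", "ada@example.com")
table.put("user#2", "info", "name", "Linus")
table.put("user#2", "stats", "logins", 42)  # a column that user#1 does not have

user1_info = table.get("user#1", family="info")
user2_all = table.get("user#2")
```

Rows only store the cells that were actually written, which is why this model suits sparse, semi-structured data.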
Architecture of HBase Cluster
It contains the following components:
- ZooKeeper – A centralized service used to maintain configuration information for HBase.
- Catalog Tables – Keep track of the locations of regions and region servers.
- Master – Monitors all the region server instances in the cluster.
- Region Servers – Responsible for serving and managing regions.
- Region – A portion of a table that holds a subset of the table’s rows, determined by partitioning.
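The way regions and catalog information fit together can be sketched as a lookup over sorted key ranges. This is a simplified sketch under stated assumptions: the start keys, the server names, and the function `locate_region` are hypothetical, and a real cluster resolves locations through its catalog table rather than through a Python list. It assumes each region covers the row keys from its start key up to the next region's start key.

```python
import bisect

# Hypothetical region boundaries, sorted by start key. Each region covers
# [start_key, next_start_key). The first region starts at the empty key,
# so every row key falls into exactly one region.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1.example", "rs2.example", "rs1.example", "rs3.example"]


def locate_region(row_key):
    """Return (region index, server name) for the region containing row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


apple_location = locate_region("apple")
grape_location = locate_region("grape")
zebra_location = locate_region("zebra")
```

Because the start keys are kept sorted, a client can find the right region with a binary search and then talk directly to the region server that hosts it.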