Installation - HBase Tutorial

Installation Guide and Requirements for HBase

2.1 Requirements
2.1.1 Hardware
It is difficult to specify a particular server type that is recommended for HBase. In fact, the opposite is more appropriate, as HBase runs on many, very different hardware configurations. The usual description is commodity hardware. HBase is written in Java, so need support for a current Java Runtime, and since the majority of the memory needed per region server is for internal structures— for example, the memstores and the block cache—you will have to install a 64-bit operating system to be able to address enough memory, that is, more than 4 GB. We can separate the hardware requirements into two categories: servers and networking.

  • Servers

In HBase and Hadoop there are two types of machines: masters (the HDFS NameNode, the MapReduce JobTracker, and the HBase Master) and slaves (the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers). They do benefit from slightly different hardware specifications when possible. It is also quite common to use exactly the same hardware for both (out of convenience), but the master does not need that much storage, so it makes sense to not add too many disks. And since the masters are also more important than the slaves, you could beef them up with redundant hardware components.

  • CPU
Node type Recommendation
Master Dual quad-core CPUs, 2.0-2.5 GHz
Slave Dual quad-core CPUs, 2.0-2.5 GHz
  • Memory
Node type Recommendation
Master 24 GB
Slave 24 GB (and up)
  • Disks

The data is stored on the slave machines, and therefore it is those servers that need plenty of capacity. Depending on whether you are more read/write- or processing oriented, you need to balance the number of disks with the number of CPU cores available. Typically, you should have at least one core per disk, so in an eight-core server, adding six disks is good, but adding more might not be giving you optimal performance.

Node type Recommendation
Master 4 × 1 TB SATA, RAID 0+1 (2 TB usable)
Slave 6 × 1 TB SATA, JBOD

2.1.2 Networking

In a data center, servers are typically mounted into 19” racks or cabinets with 40U or more in height. Switches often have 24 or 48 ports, and with the aforementioned channel-bonding or two-port cards, you need to size the networking large enough to provide enough bandwidth. Installing 40 1U servers would need 80 network ports; so, in practice, you may need a staggered setup where you use multiple rack switches and then aggregate to a much larger core aggregation switch (CaS). This results in a two-tier architecture, where the distribution is handled by the ToR switch and the aggregation by the CaS.

2.2 Software

This can range from the operating system itself to files ystem choices and configuration of various auxiliary services.

  • Operating system

It is work on different versions of linux that are redhat, Ubuntu, Fedora, Debian etc. and run on those OS which supports Java.

  • Java

We need java for Hbase. Not just any version of Java, but version 6, a.k.a. 1.6, or later. You also should make sure the java binary is executable and can be found on your path. Try entering java -version on the command line and verify that it works and that it prints out the version number indicating it is version 1.6 or later—for example, java version “1.6.0_22”.
If you do not have Java on the command-line path or if HBase fails to start with a warning that it was not able to find it, edit the conf/hbase-env.sh file by commenting out the JAVA_HOME line and changing its value to where your Java is installed.

  • Hadoop

HBase depends on Hadoop, it bundles an instance of the Hadoop JAR under its lib directory. The bundled Hadoop was made from the Apache branch-0.20-append branch at the time of HBase’s release. It is critical that the version of Hadoop that is in use on your cluster matches what is used by HBase. Replace the Hadoop JAR found in the HBase lib directory with the hadoop-xyz.jar you are running on your cluster to avoid version mismatch issues. Make sure you replace the JAR on all servers in your cluster that run HBase. Version mismatch issues have various manifestations, but often the result is the same: HBase does not throw an error, but simply blocks indefinitely.

Certification in Bigdata Analytics

  • SSH

ssh must be installed and sshd must be running if you want to use the supplied scripts to manage remote Hadoop and HBase daemons. A commonly used software package providing these commands is OpenSSH. The supplied shell scripts make use of SSH to send commands to each server in the cluster.

  • Domain Name Service

HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolving should work. You can verify if the setup is correct for forward DNS lookups by running the following command:

$ ping -c 1 $(hostname)

  • Synchronized time

The clocks on cluster nodes should be in basic alignment. Some skew is tolerable, but wild skew can generate odd behaviors. Even differences of only one minute can cause unexplainable behavior. Run NTP on your cluster, or an equivalent application, to synchronize the time on all servers.

  • File handles and process limits

HBase is a database, so it uses a lot of files at the same time. The default ulimit -n of 1024 on most Unix or other Unix-like systems is insufficient. Any significant amount of loading will lead to I/O errors stating the obvious: java.io.IOException: Too many open files.
Example: Setting File Handles on Ubuntu
If you are on Ubuntu, you will need to make the following changes-
In the file /etc/security/limits.conf add this line:

hadoop - nofile 32768

Replace hadoop with whatever user is running Hadoop and HBase. If you have separate users, you will need two entries, one for each user.
In the file /etc/pam.d/common-session add the following as the last line in the file:
session required pam_limits.so

Otherwise, the changes in /etc/security/limits.conf won’t be applied. Don’t forget to log out and back in again for the changes to take effect!
2.2 Filesystems for HBase
The most common filesystem used with HBase is HDFS. But you are not locked into HDFS because the FileSystem used by HBase has a pluggable architecture and can be used to replace HDFS with any other supported system. The primary reason HDFS is so popular is its built-in replication, fault tolerance, and scalability.

the filesystem negotiating transparently where data is stored

  • Local

The local filesystem actually bypasses Hadoop entirely, that is, you do not need to have an HDFS or any other cluster at all. It is handled all in the FileSystem class used by HBase to connect to the filesystem implementation. The supplied ChecksumFileSystem class is loaded by the client and uses local disk paths to store all the data.

  • HDFS

The Hadoop Distributed File System (HDFS) is the default filesystem when deploying a fully distributed cluster. For HBase, HDFS is the filesystem of choice, as it has all the required features. HDFS is built to work with MapReduce, taking full advantage of its parallel, streaming access support. The scalability, fail safety, and automatic replication functionality is ideal for storing files reliably. HBase adds the random access layer missing from HDFS and ideally complements Hadoop. Using MapReduce, you can do bulk imports, creating the storage files at disk-transfer speeds.
The URI to access HDFS uses the following scheme:

hdfs://<namenode>:<port>/<path>

  • S3

Amazon’s Simple Storage Service (S3)# is a storage system that is primarily used in combination with dynamic servers running on Amazon’s complementary service named Elastic Compute Cloud (EC2). S3 can be used directly and without EC2, but the bandwidth used to transfer data in and out of S3 is going to be cost-prohibitive in practice.
2.3 Run modes
HBase has two run modes: standalone and distributed. This is the default mode. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process. ZooKeeper binds to a well-known port so that clients may talk to HBase.
The distributed mode can be further subdivided into pseudodistributed—all daemons run on a single node—and fully distributed—where the daemons are spread across multiple, physical servers in the cluster. Distributed modes require an instance of the Hadoop Distributed File System (HDFS).
2.4 Configuration
Configuring  an HBase setup entails editing a file with environment variables, namedconf/hbase-env.sh, which is used mostly by the shell scripts to start or stop a cluster. You also need to add configuration
properties to an XML file* named conf/hbase-site.xml to, for example, override HBase defaults, tell HBase what filesystem to use, and tell HBase the location of the ZooKeeper ensemble.

  • hbase-site.xml and hbase-default.xml
  • hbase-env.sh
  • regionserver
  • log4j.properties

A basic example hbase-site.xml file for client applications might contain the following properties:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>zk1.foo.com,zk2.foo.com,zk3.foo.com</value>
</property>
</configuration>

2.5 Deployment
After the configuration deployment is performed.

  • Script Based
  • Apache Whirr
  • Puppet and Chef

Become a Big Data Architect

2.6 Operating a cluster

  • Running and Confirming Your Installation

Make sure HDFS is running first. Start and stop the Hadoop HDFS daemons by running bin/start-dfs.sh over in the HADOOP_HOME directory. You can ensure that it started properly by testing the put and get of files into the Hadoop filesystem. HBase does not normally use the MapReduce daemons. you start a fully distributed HBase with the following command:

bin/start-hbase.sh

Run the preceding command from the HBASE_HOME directory. You should now have a running HBase instance.

  • Web-based UI Introduction

HBase also starts a web-based user interface (UI) listing vital attributes. By default, it is deployed on the master host at port 60010 (HBase region servers use 60030 by default). If the master is running on a host named master.foo.com on the default port, to see the master’s home page you can point your browser at http://master.foo.com: 60010.
From this page you can access a variety of status information about your HBase cluster. The page is separated into multiple sections. The top part has the attributes pertaining to the cluster setup. You can see the currently running tasks—if there are any. The catalog and user tables list details about the available tables. For the user table you also see the table schema.
The lower part of the page has the region servers table, giving you access to all the currently registered servers. Finally, the region in transition list informs you about regions that are currently being maintained by the system. After you have started the cluster, you should verify that all the region servers have registered themselves with the master and appear in the appropriate table with the expected hostnames (that a client can connect to). Also verify that you are indeed running the correct version of HBase and Hadoop.

  • Shell Introduction

You can start the shell with the following command:

$ $HBASE_HOME/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.91.0-SNAPSHOT, r1130916, Sat Jul 23 12:44:34 CEST 2011
hbase(main):001:0>

Type help and then press Return to see a listing of shell commands and options. Browse at least the paragraphs at the end of the help text for the gist of how variables and command arguments are entered into the HBase Shell; in particular, note how table names, rows, and columns, must be quoted.

  • Stopping the Cluster

To stop HBase, enter the following command. Once you have started the script, you will see a message stating that the cluster is being stopped, followed by “.” (period) characters printed in regular intervals (just to indicate that the process is still running, not to give you any percentage feedback, or some other hidden meaning):

$ ./bin/stop-hbase.sh
stopping hbase...............

Shutdown can take several minutes to complete. It can take longer if your cluster is composed of many machines. If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.

About the Author

Data Engineer

As a skilled Data Engineer, Sahil excels in SQL, NoSQL databases, Business Intelligence, and database management. He has contributed immensely to projects at companies like Bajaj and Tata. With a strong expertise in data engineering, he has architected numerous solutions for data pipelines, analytics, and software integration, driving insights and innovation.