How to Set Up a Hadoop Multi-Node Cluster on Ubuntu, CentOS, and Windows

Setting Up A Hadoop Multi-Node Cluster

Installing Java

Java must be installed on every node before a multi-node cluster can be set up.

Check the installed Java version with the following command.

$ java -version

If Java is installed, the output looks similar to the following.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Creating User Account

Create a system user account on both the master and the slave systems to run the Hadoop installation.

# useradd hadoop
# passwd hadoop

Mapping the nodes

Edit the /etc/hosts file on every node and list the IP address of each system followed by its host name.

# vi /etc/hosts

Enter the following lines in the /etc/hosts file.

192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2
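
To confirm the mapping on a node, each host name can be looked up in the hosts file. The following is a minimal sketch; lookup_host is a hypothetical helper written for this example and is not part of Hadoop or the operating system.

```shell
# lookup_host FILE NAME: print the IP address mapped to NAME in a hosts-format file.
# lookup_host is a hypothetical helper, not a standard command.
lookup_host() {
    # Skip comment lines, then compare NAME with every field after the IP address.
    awk -v name="$2" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# Example: lookup_host /etc/hosts hadoop-slave-1
```

An empty result means the host name is missing from the file on that node.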

Configuring Key Based Login

Set up SSH on every node so that the nodes can communicate with one another without being prompted for a password.

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
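
Once the keys are copied, key-based login can be checked for all nodes in one pass. This is only a sketch: check_ssh and the SSH_CMD override are hypothetical names introduced for this example. It relies on the OpenSSH BatchMode option, which makes ssh fail instead of asking for a password.

```shell
# check_ssh HOST...: report whether each host accepts key-based login.
# SSH_CMD is a hypothetical override used for dry runs; it defaults to ssh.
check_ssh() {
    for host in "$@"; do
        # BatchMode=yes disables password prompts, so a missing key shows up as a failure.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example: check_ssh hadoop-master hadoop-slave-1 hadoop-slave-2
```

Any host reported as FAILED still asks for a password or cannot be reached.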

Installing Hadoop

Hadoop should be downloaded to the master server.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/

Configuring Hadoop

Configure the Hadoop server by editing the following configuration files.

Edit the core-site.xml file as follows.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000/</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

Edit the hdfs-site.xml file as follows.

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Edit the mapred-site.xml file as follows.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>

Edit JAVA_HOME, HADOOP_OPTS, and HADOOP_CONF_DIR in the hadoop-env.sh file as follows.

export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
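
Hadoop cannot start if JAVA_HOME points to the wrong directory, so it may be worth checking the path before going further. The sketch below uses a hypothetical helper named check_java_home, which only tests that an executable bin/java exists under the given directory.

```shell
# check_java_home DIR: succeed only if DIR contains an executable bin/java.
# check_java_home is a hypothetical helper written for this example.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "JAVA_HOME ok: $1"
    else
        echo "no executable java under $1/bin" >&2
        return 1
    fi
}

# Example: check_java_home /opt/jdk1.7.0_17
```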

Installing Hadoop on Slave Servers

Install Hadoop on all the slave servers by copying it from the master.

# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop

Configuring Hadoop on Master Server

Configure the master server as the hadoop user.

# su hadoop
$ cd /opt/hadoop/hadoop

Master Node Configuration

$ vi conf/masters
hadoop-master

Slave Node Configuration

$ vi conf/slaves
hadoop-slave-1
hadoop-slave-2
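
Because the slaves file is a plain list with one host per line, it can also drive small maintenance loops. The following sketch assumes two hypothetical helpers, list_slaves and print_copy_commands. It only prints the scp commands for review and does not run them.

```shell
# list_slaves FILE: print the host names in a slaves file, skipping blank lines and comments.
list_slaves() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# print_copy_commands FILE: print (not run) one scp command per slave.
# The paths match the /opt/hadoop layout used earlier in this tutorial.
print_copy_commands() {
    list_slaves "$1" | while read -r host; do
        echo "scp -r /opt/hadoop/hadoop $host:/opt/hadoop"
    done
}

# Example: print_copy_commands /opt/hadoop/hadoop/conf/slaves
```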

Formatting the NameNode on the Hadoop Master

# su hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format

11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.109
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_71
************************************************************/
11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
………………………………………………….
………………………………………………….
11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
11/10/14 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG: /************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15
************************************************************/

Hadoop Services

Start the Hadoop services on the hadoop-master node.

$ cd $HADOOP_HOME/bin
$ ./start-all.sh

Addition of a New DataNode in the Hadoop Cluster

Networking

A new node needs a suitable network configuration before it can be added to an existing Hadoop cluster. Assume the following network configuration for the new node.

IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in

Adding a User and SSH Access

Add a User

Add a “hadoop” user on the new node and set its password to any value you like.

useradd hadoop
passwd hadoop

To be executed on master

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

Copy the public key to the hadoop user's $HOME directory on the new slave node.

scp $HOME/.ssh/id_rsa.pub [email protected]:/home/hadoop/

To be executed on slaves

Log in as the hadoop user.

su hadoop or ssh -X [email protected]

Copy the content of the public key into the file “$HOME/.ssh/authorized_keys” and then change the permissions of that file.

cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

Check the ssh login from the master machine. Verify that you can ssh to the new node from the master without being prompted for a password.

ssh [email protected] or hadoop@slave3

Set Hostname of New Node
The hostname is set in the file /etc/sysconfig/network.

On the new slave3 machine:

NETWORKING=yes
HOSTNAME=slave3.in

To make the change effective, either restart the machine or run the hostname command on the new machine with the respective hostname.

On the slave3 node machine:

hostname slave3.in

Update /etc/hosts on all machines of the cluster with the following line.

192.168.1.103 slave3.in slave3

Ping the machine by hostname to check whether the name resolves to its IP address.

ping master.in

Start the DataNode on New Node
Start the DataNode daemon manually using the $HADOOP_HOME/bin/hadoop-daemon.sh script. It will automatically contact the master (NameNode) and join the cluster. The new node should also be added to the conf/slaves file on the master server so that the script-based commands recognize it.

Log in to the new node.

su hadoop or ssh -X [email protected]

Start HDFS on the newly added slave node.

./bin/hadoop-daemon.sh start datanode

Check the output of the jps command on the new node.

$ jps
7141 DataNode
10312 Jps
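
The same check can be scripted by scanning the jps output for the daemon name. This is a small sketch; has_daemon is a hypothetical helper and assumes the two-column "PID Name" layout shown above.

```shell
# has_daemon NAME: read jps output on stdin and succeed if a process called NAME is listed.
# has_daemon is a hypothetical helper written for this example.
has_daemon() {
    # The second column of jps output is the process name.
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Example: jps | has_daemon DataNode && echo "DataNode is running"
```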

Removing a DataNode

A node can be removed from a cluster while it is running, without any data loss. HDFS provides a decommissioning feature, which ensures that removing a node is performed safely.
Step 1
Log in to the master machine as the user under which Hadoop is installed.

$ su hadoop

Step 2
An exclude file must be configured before starting the cluster. Add a key named dfs.hosts.exclude to the $HADOOP_HOME/conf/hdfs-site.xml file.
The value associated with this key is the full path to a file on the NameNode’s local file system that contains the list of machines that are not permitted to connect to HDFS.

<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
  <description>DFS exclude</description>
</property>

Step 3
Determine the hosts to decommission.
Add every machine to be decommissioned to the hdfs_exclude.txt file. This prevents those machines from connecting to the NameNode.

slave2.in
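
Hosts can be appended to the exclude file by hand or with a small script that avoids duplicate entries. The sketch below uses a hypothetical helper named add_exclude. The file path in the usage comment is the one configured in Step 2.

```shell
# add_exclude FILE HOST: append HOST to the exclude file unless it is already listed.
# add_exclude is a hypothetical helper written for this example.
add_exclude() {
    touch "$1"
    # -x matches whole lines, -F treats HOST as a literal string, -q suppresses output.
    grep -qxF -e "$2" "$1" || echo "$2" >> "$1"
}

# Example: add_exclude /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt slave2.in
```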

Step 4
Force a configuration reload.
Run the following command.

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

This forces the NameNode to re-read its configuration, including the newly updated ‘excludes’ file. The nodes are decommissioned over a period of time, which allows each node’s blocks to be replicated onto machines that are scheduled to remain active.
Check the jps command output on slave2.in. The DataNode process will shut down automatically.

Step 5
Shut down the nodes.
After the decommission process has finished, the decommissioned hardware can be safely shut down for maintenance. Run the dfsadmin report command to check the status of the decommission.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report
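
The report can be long, so it may help to reduce it to one line per node. The sketch below rests on an assumption: that the report lists each DataNode with a "Name:" line followed by a "Decommission Status :" line. Check this against the output of your own Hadoop version. decommission_status is a hypothetical helper.

```shell
# decommission_status: read "hadoop dfsadmin -report" output on stdin
# and print one "node status" line per DataNode.
# Assumes "Name: host:port" and "Decommission Status : value" lines in the report.
decommission_status() {
    awk '
        /^Name:/ { node = $2 }
        /^Decommission Status/ { sub(/^[^:]*:[ \t]*/, ""); print node, $0 }
    '
}

# Example: $HADOOP_HOME/bin/hadoop dfsadmin -report | decommission_status
```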

Step 6
Edit the excludes file again. Once the machines have been decommissioned, they can be removed from the ‘excludes’ file. Running “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” again reads the excludes file back into the NameNode, which allows the DataNodes to rejoin the cluster after the maintenance has been completed, or when additional capacity is needed in the cluster again.
To stop or start the TaskTracker, run the following commands.

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
