
Hadoop HDFS Operations and Commands with Examples


The Hadoop File System gives you a tremendous advantage: it stores data in multiple copies, providing reliable storage at an economical price point. HDFS operations are the key that opens these vaults, making your information available from remote locations.

Starting HDFS

First, format the configured HDFS file system. Open the NameNode (HDFS server) host and execute the following command:

$ hadoop namenode -format

Next, start the distributed file system. The command below starts the NameNode as well as the DataNodes in the cluster:

$ start-dfs.sh


Read & Write Operations in HDFS

Almost every operation that can be executed on a local file system can also be executed on the Hadoop Distributed File System. You can perform various read and write operations such as creating a directory, setting permissions, copying files, updating files, and deleting files. You can also manage access rights and browse the file system to get cluster information such as the number of dead nodes, live nodes, and space used.

 

  • HDFS Operations to Read a File

 

To read any file from HDFS, you have to interact with the NameNode, as it stores the metadata about the DataNodes. The NameNode issues the user a token that specifies the address where the data is stored.

You send a read request to the NameNode for a particular block location through the distributed file system. The NameNode then checks your privileges and, if the access is valid, lets you read the block from that address.

$ hadoop fs -cat <file>
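As a concrete illustration, assuming a file /user/input/intellipaat.txt already exists in HDFS (a hypothetical path used only for this example), you could read it as follows:

$ hadoop fs -cat /user/input/intellipaat.txt
$ hadoop fs -cat /user/input/intellipaat.txt | head -n 10

Piping to head is a handy way to peek at the first few lines of a large file without pulling all of it to the terminal.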
  • HDFS Operations to Write a File

Similar to the read operation, the HDFS write operation writes a file to a particular block address obtained through the NameNode. The NameNode provides the address of the slave (DataNode) where the client can write or add data. After the client writes to that block location, the slave replicates the block and copies it to other slave locations according to the replication factor of 3. An acknowledgment is then sent back to the client.

The process of contacting the NameNode is pretty similar to that of a read operation. Below is an HDFS write command:

bin/hdfs dfs -put <local-file> <hdfs-path>
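For instance, assuming a local file named sample.txt (a hypothetical name used only for illustration) and assuming the target directory /user/input already exists (it is created in the "Inserting Data into HDFS" section below), the following sketch writes the file into HDFS and then reads it back to verify the write:

bin/hdfs dfs -put sample.txt /user/input/
bin/hdfs dfs -cat /user/input/sample.txt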


Listing Files in HDFS

You can find the list of files in a directory and the status of a file using the ‘ls’ command in the terminal. ‘ls’ can be passed a directory or a filename as an argument, as follows:

$ $HADOOP_HOME/bin/hadoop fs -ls <args>
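For example, the following sketch lists the HDFS root directory and then the /user directory (the /user directory is an assumption; on a fresh installation you may need to create it first):

$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop fs -ls /user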

Inserting Data into HDFS

Follow the steps below to insert the required file into the Hadoop file system.

Step 1: Create an input directory

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Step 2: Use the Hadoop HDFS put command to transfer and store the data file from the local system to HDFS, using the following command in the terminal.

$ $HADOOP_HOME/bin/hadoop fs -put /home/intellipaat.txt /user/input

Step 3: Verify the file using the Hadoop HDFS ls command

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
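As a side note, put is equivalent to copyFromLocal, so the following alternative would store the same example file:

$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal /home/intellipaat.txt /user/input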

Retrieving Data from HDFS

For instance, suppose you have a file in HDFS called intellipaat. You can retrieve it from the Hadoop file system by carrying out the following steps:

Step 1: View the data using the HDFS cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/intellipaat

Step 2: Get the file from HDFS to the local file system using the get command, as shown below.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
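Similarly, copyToLocal is an equivalent of get, so the following sketch retrieves the same directory:

$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/output/ /home/hadoop_tp/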

Shutting Down the HDFS

Shut down HDFS using the following command:

$ stop-dfs.sh
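As a quick optional sanity check, run jps afterwards; HDFS daemons such as the NameNode and DataNode should no longer appear in the list once the shutdown has finished:

$ jps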

Multi-Node Cluster

Installing Java

Check the Java installation using the java version command:

$ java -version

The following output is presented:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
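If Java is not installed or not on the PATH, a minimal sketch of wiring up an already-extracted JDK would look like the following (the /opt/jdk1.7.0_71 path is an assumption; adjust it to wherever your JDK actually lives):

$ export JAVA_HOME=/opt/jdk1.7.0_71
$ export PATH=$PATH:$JAVA_HOME/bin
$ java -version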


Creating User Account

Create a system user account on both the master and slave systems for the Hadoop installation.

# useradd hadoop
# passwd hadoop

Mapping the Nodes

The hosts file in the /etc/ folder must be edited on every node, and the IP address of each system followed by its hostname must be specified.

# vi /etc/hosts

Enter the following lines in the /etc/hosts file.


192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2
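To confirm the mapping works, a quick check is to ping each hostname from every node; each name should resolve and respond:

# ping -c 1 hadoop-master
# ping -c 1 hadoop-slave-1
# ping -c 1 hadoop-slave-2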

Configuring Key Based Login

SSH must be set up on each node so that the nodes can communicate with one another without being prompted for a password.

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
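To verify that the key-based login works, switch back to the hadoop user on the master and run a remote command; it should complete without prompting for a password (the hadoop@ user names follow the accounts created above):

$ ssh hadoop@hadoop-slave-1 hostname
$ ssh hadoop@hadoop-slave-2 hostname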

Installation of Hadoop

Download and install Hadoop on the master server using the following procedure.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
# tar -xzf hadoop-1.2.1.tar.gz
# mv hadoop-1.2.1 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/

Configuring Hadoop

The Hadoop server is configured in core-site.xml, which should be edited as shown below (adjust the values as required).

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000/</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

 

The hdfs-site.xml file should be edited as follows:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The mapred-site.xml file should be edited as per your requirements; an example is shown below.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>

JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS should be set in hadoop-env.sh as follows:

export JAVA_HOME=/opt/jdk1.7.0_71
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
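After setting these variables, a quick check that the installation and JAVA_HOME are wired together correctly is to print the Hadoop version from the installation directory:

$ cd /opt/hadoop/hadoop
$ bin/hadoop version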

Installing Hadoop on Slave Servers

Hadoop should be installed on all the slave servers

# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop
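A simple way to confirm the copy succeeded (assuming /opt/hadoop already exists on the slaves and is writable by the hadoop user, as it is on the master) is to list the directory remotely:

$ ssh hadoop-slave-1 ls /opt/hadoop/hadoop
$ ssh hadoop-slave-2 ls /opt/hadoop/hadoop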

Configuring Hadoop on Master Server

Configure Hadoop on the master server as follows:

# su hadoop
$ cd /opt/hadoop/hadoop

Master Node Configuration

$ vi conf/masters
hadoop-master

Slave Node Configuration

$ vi conf/slaves
hadoop-slave-1
hadoop-slave-2

Formatting the NameNode on the Hadoop Master

# su hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.109
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Monday May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_71
************************************************************
11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
11/10/14 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG:
************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15
************************************************************

Hadoop Services

Start the Hadoop services on hadoop-master with the following commands:

$ cd $HADOOP_HOME/sbin
$ start-all.sh
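Once the services are up, a quick check with jps should show the Hadoop 1.x daemons (process IDs will differ): on the master, expect entries such as NameNode, SecondaryNameNode, and JobTracker; on the slaves, DataNode and TaskTracker:

$ jps
$ ssh hadoop-slave-1 jps

If jps is not on the non-interactive PATH of the slave, log in to the slave and run jps locally instead.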

The procedure for adding a new DataNode to the Hadoop cluster is as follows.

Networking

Add new nodes to an existing Hadoop cluster with a suitable network configuration. Consider the following network configuration for the new node:

IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in

Adding a User and SSH Access

On the new node, add a “hadoop” user with the required access, and set the Hadoop user’s password to anything you want.

useradd hadoop
passwd hadoop

To be executed on the master:

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

Copy the public key to the new slave node’s hadoop user $HOME directory:

scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/

To be executed on the slave:

su hadoop or ssh -X hadoop@192.168.1.103

The content of the public key must be copied into the file “$HOME/.ssh/authorized_keys” on the new node, and its permissions must then be changed as shown below.

cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

Check the ssh login from the master machine: verify that it is now possible to ssh to the new node without a password.

ssh hadoop@192.168.1.103 or hadoop@slave3

Setting the Hostname for the New Node

The hostname is set in the file /etc/sysconfig/network. On the new slave3 machine:

NETWORKING=yes
HOSTNAME=slave3.in

Restart the machine or run the hostname command on the new machine with the corresponding hostname for the change to take effect.

On slave3 node machine:

hostname slave3.in

/etc/hosts must be updated on all machines of the cluster:

192.168.1.103 slave3.in slave3

Ping the machine by hostname to check whether it resolves to its IP address:

ping master.in

Start the DataNode on New Node

The DataNode daemon should be started manually using the $HADOOP_HOME/bin/hadoop-daemon.sh script. The new DataNode will automatically contact the master (NameNode) and join the cluster. The new node should also be added to the conf/slaves file on the master server so that the script-based start/stop commands recognize it.

Log in to the new node:

su hadoop or ssh -X hadoop@192.168.1.103

Start HDFS on the newly added slave node with the following command:

./bin/hadoop-daemon.sh start datanode

Check the output of the jps command on the new node:

$ jps
7141 DataNode
10312 Jps

Removing a DataNode

A node can be removed from a cluster while it is running, without any risk of data loss. HDFS provides a decommissioning feature which ensures that removing a node is performed safely.

Step 1

Log in to the master machine as the user under which Hadoop is installed:

$ su hadoop

Step 2

Before starting the cluster, an exclude file must be configured: add a key named dfs.hosts.exclude to the $HADOOP_HOME/conf/hdfs-site.xml file.

The value associated with this key is the full path to a file on the NameNode’s local file system that contains the list of machines which are not permitted to connect to HDFS:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
  <description>DFS exclude</description>
</property>

Step 3

Determine the hosts to be decommissioned. Add one domain name per line to hdfs_exclude.txt for each machine you want to decommission; this will prevent them from connecting to the NameNode. For example:

slave2.in
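The hostname can be appended to the exclude file with a simple shell command (the file path is the one configured in the dfs.hosts.exclude property above):

$ echo "slave2.in" >> /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt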

Step 4

Force a configuration reload by running “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes”:

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

This forces the NameNode to reread its configuration, including the newly updated ‘excludes’ file. Nodes will then be decommissioned gradually, allowing time for their blocks to be replicated onto active machines. You can monitor progress in the jps command output on slave2.in; once decommissioning is complete, the DataNode process there will shut down automatically.

Step 5

Shut down the nodes.

After the decommission process has finished, the decommissioned hardware can be safely shut down for maintenance. Run the report command to check the status of the decommissioned nodes:

$ $HADOOP_HOME/bin/hadoop dfsadmin -report
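The report lists every DataNode along with a decommission status field, so a convenient way to narrow the output down is to filter for those lines:

$ $HADOOP_HOME/bin/hadoop dfsadmin -report | grep -i "decommission"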

Step 6

Edit the excludes file again: once the machines have been decommissioned, remove them from the ‘excludes’ file, and run “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” again so that the NameNode reads the excludes file back in.

The DataNodes can then rejoin the cluster after maintenance has been completed, or whenever additional capacity is needed in the cluster.

To stop/start the TaskTracker:

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker


To summarize, add a new node with the following steps:

1) Take a new system and create a new username and password on it.

2) Install SSH and set up SSH connections with the master node.

3) Add the SSH public RSA key (id_rsa.pub) to the authorized_keys file on the new node.

4) Add the new DataNode’s hostname, IP address, and other details to /etc/hosts (for example, 192.168.1.103 slave3.in slave3) and add its hostname to the slaves file on the master.

5) Start the DataNode on the new node.

6) Log in to the new node with a command like su hadoop or ssh -X hadoop@192.168.1.103.

7) Start HDFS on the newly added slave node by using the command ./bin/hadoop-daemon.sh start datanode.

8) Check the output of the jps command on the new node.

Advantages of learning HDFS Operations

Below are the major advantages of learning HDFS operations:

  • Highly scalable: Big Data programs can easily adapt to growing data volumes and user demand.
  • HDFS operations are intuitive and require little code to learn, and Hadoop gives organizations of all sizes an economical storage solution in which they pay only for the resources they use during a given period.
  • Distributed file systems (DFSs) are composed of multiple servers operating in parallel to process large datasets quickly, significantly speeding up data processing.
  • HDFS is an exciting technology that many companies are adopting, so becoming proficient with HDFS could provide a substantial career advantage.
  • Once data is uploaded to a node, it is automatically replicated onto other nodes within the cluster, so multiple copies exist and you are protected against data loss if a node fails.


Summary

The Hadoop Distributed File System is a highly scalable, flexible, fault-tolerant, and reliable system that stores data across multiple nodes on different servers. It follows a master-slave architecture, where the NameNode acts as the master and the DataNodes act as slaves. HDFS operations are used to access the NameNode and interact with the data. Files are broken down into blocks, and after completing the authentication process the client can store, read, write, and perform various other operations on them.
