HDFS Installation: Setting Up the Hadoop Cluster & Shell Commands

Setting up of the Hadoop cluster:

Here you will learn how to install Hadoop and configure clusters ranging from a few nodes to tens of thousands of nodes. Start by installing Hadoop on a single machine. Hadoop requires Java, so install Java first if it is not already on your system.
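As a quick way to check the Java requirement before installing Hadoop, the following sketch reports whether a java executable is on the PATH. The function name java_status is made up for this illustration and is not part of Hadoop.

```shell
#!/bin/sh
# Print "found" if a java executable is on PATH, otherwise "missing".
# java_status is a hypothetical helper name used only in this sketch.
java_status() {
    if command -v java >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

if [ "$(java_status)" = "found" ]; then
    # java -version writes to stderr, so redirect it to stdout
    java -version 2>&1 | head -n 1
else
    echo "Java not found: install a JDK before installing Hadoop"
fi
```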

Getting Hadoop to work across the cluster means installing the required software on every machine in the cluster. Typically, one machine is designated as the NameNode and another as the ResourceManager; these are the master nodes.

Other services, such as the MapReduce Job History Server and the Web App Proxy Server, can run on dedicated machines or on shared resources, depending on the load. Every other node in the cluster acts as both a NodeManager and a DataNode. These are collectively termed the slave nodes.

Getting Hadoop to work in the non-secure mode

Hadoop's Java configuration is driven by two types of configuration files:

Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
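To illustrate how a site-specific file overrides a read-only default, here is a minimal sketch that writes a core-site.xml into a scratch directory. The directory (CONF_DIR) and the NameNode address namenode.example.com:9000 are placeholders chosen for this example; fs.defaultFS is the standard property naming the default file system.

```shell
#!/bin/sh
# Write a minimal site-specific core-site.xml into a scratch directory.
# CONF_DIR and the NameNode address are hypothetical placeholders.
CONF_DIR=$(mktemp -d)

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the value shipped in core-default.xml -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```

On a real installation the file would live under etc/hadoop/ in the Hadoop distribution rather than in a temporary directory.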

You can also control the Hadoop scripts found in the bin/ directory of the distribution by setting site-specific values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.

To configure the Hadoop cluster, you need to set up both the environment in which the Hadoop daemons execute and the configuration parameters for the daemons themselves.

The Hadoop daemons fall into two groups, HDFS daemons and YARN daemons:

HDFS daemons:
NameNode
SecondaryNameNode
DataNode

YARN daemons:
ResourceManager
NodeManager
WebAppProxy
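One common way to see which of these daemons are running on a machine is the JDK's jps tool, which lists local JVM processes. The sketch below filters jps-style output down to the daemon names; the helper filter_hadoop_daemons is invented for this example, and the process name WebAppProxyServer is an assumption about how the proxy appears in jps output.

```shell
#!/bin/sh
# Keep only the jps lines whose process name is a Hadoop daemon.
# Reads jps-style output ("<pid> <name>") on stdin.
# filter_hadoop_daemons is a hypothetical helper, not a Hadoop command.
filter_hadoop_daemons() {
    grep -E '^[0-9]+ (NameNode|SecondaryNameNode|DataNode|ResourceManager|NodeManager|WebAppProxyServer)$'
}

# Run against the local JVMs when jps is available.
if command -v jps >/dev/null 2>&1; then
    jps | filter_hadoop_daemons || echo "no Hadoop daemons running on this machine"
fi
```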

The Hadoop Daemons configuration environment

To apply site-specific customization to the environment of the Hadoop daemons, administrators should use the etc/hadoop/hadoop-env.sh script and, optionally, the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts. At the very least, JAVA_HOME must be specified so that it is correctly defined on every remote node.
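A minimal sketch of the JAVA_HOME setting in etc/hadoop/hadoop-env.sh might look like this. The JDK path is an assumed example; use whatever path actually exists on every node.

```shell
# Fragment for etc/hadoop/hadoop-env.sh
# The JDK path below is an assumed example, not a required location.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```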

Configuration of the individual daemons

The list of daemons along with the relevant environment variable:

NameNode – HADOOP_NAMENODE_OPTS

DataNode – HADOOP_DATANODE_OPTS

Secondary NameNode – HADOOP_SECONDARYNAMENODE_OPTS

ResourceManager – YARN_RESOURCEMANAGER_OPTS

NodeManager – YARN_NODEMANAGER_OPTS

WebAppProxy – YARN_PROXYSERVER_OPTS

Map Reduce Job History Server – HADOOP_JOB_HISTORYSERVER_OPTS
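As an illustration of how one of these variables might be used, the fragment below passes a JVM garbage-collector flag to the NameNode. The flag is only an example, and the variable name follows the list above; some newer Hadoop releases use different names, so check the documentation for the installed version.

```shell
# Fragment for etc/hadoop/hadoop-env.sh
# Prepend a JVM flag to whatever NameNode options are already set.
# -XX:+UseParallelGC is an illustrative choice, not a recommendation.
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS:-}"
```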

Customization of other important configuration parameters:

HADOOP_PID_DIR – the directory where the daemons' process ID files are stored.
HADOOP_LOG_DIR – the directory where the daemons' log files are stored.
HADOOP_HEAPSIZE / YARN_HEAPSIZE – the maximum heap size to use, in MB. For example, if the variable is set to 1000, the heap is set to 1000 MB. The default is 1000.
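These parameters could be set in etc/hadoop/hadoop-env.sh roughly as follows. The directory paths and the heap value are assumed examples, not defaults.

```shell
# Fragment for etc/hadoop/hadoop-env.sh; paths and sizes are assumed examples.
export HADOOP_PID_DIR=/var/run/hadoop   # where daemon PID files go
export HADOOP_LOG_DIR=/var/log/hadoop   # where daemon log files go
export HADOOP_HEAPSIZE=2000             # maximum heap size in MB
```

YARN_HEAPSIZE would be set the same way in etc/hadoop/yarn-env.sh.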

The HDFS Shell Commands

This section covers the most important operations of the Hadoop Distributed File System, using the shell commands that manage files in the cluster.

Directory creation in HDFS for a given path.

hadoop fs -mkdir <paths>

Example:

hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

Listing of the directory contents.

hadoop fs -ls <args>

Example:

hadoop fs -ls /user/saurzcode

HDFS file upload/download.

Upload:

hadoop fs -put:

Copies a single source file, or multiple source files, from the local file system to the Hadoop Distributed File System.

hadoop fs -put <localsrc> ... <HDFS_dest_Path>

Example:

hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Download:

hadoop fs -get:

Copies/downloads files to the local file system.

hadoop fs -get <hdfs_src> <localdst>

Example:

hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

Viewing of file content

Same as the Unix cat command:

hadoop fs -cat <path[filename]>

Example:

hadoop fs -cat /user/saurzcode/dir1/abc.txt

File copying from source to destination

hadoop fs -cp <source> <dest>

Example:

hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

Copying a file from the local file system to HDFS and vice versa

copyFromLocal

Usage:

hadoop fs -copyFromLocal <localsrc> URI

Example:

hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

copyToLocal

Usage:

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

File moving from source to destination.

Note that moving files across file systems is not permitted.

hadoop fs -mv <src> <dest>

Example:

hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

File or directory removal in HDFS.

hadoop fs -rm <arg>

Example:

hadoop fs -rm /user/saurzcode/dir1/abc.txt

Recursive version of delete. (In newer Hadoop releases, -rmr is deprecated in favor of -rm -r.)

hadoop fs -rmr <arg>

Example:

hadoop fs -rmr /user/saurzcode/

Showing the last kilobyte of a file.

hadoop fs -tail <path[filename]>

Example:

hadoop fs -tail /user/saurzcode/dir1/abc.txt

Showing the aggregate length of a file.

hadoop fs -du <path>

Example:

hadoop fs -du /user/saurzcode/dir1/abc.txt
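The commands above can be strung together into a short session. The sketch below is one possible sequence, not a prescribed workflow. It uses a small wrapper, hfs (a name invented for this example), that only prints each command unless RUN=1 is set and hadoop is on the PATH, so it can be tried safely without a cluster. The paths reuse the examples from this tutorial.

```shell
#!/bin/sh
# hfs is a hypothetical dry-run wrapper: it runs "hadoop fs" only when
# RUN=1 is set and hadoop is on PATH; otherwise it prints the command.
hfs() {
    if [ "${RUN:-0}" = "1" ] && command -v hadoop >/dev/null 2>&1; then
        hadoop fs "$@"
    else
        echo "hadoop fs $*"
    fi
}

BASE=/user/saurzcode

hfs -mkdir "$BASE/dir1" "$BASE/dir2"                              # create two directories
hfs -copyFromLocal /home/saurzcode/abc.txt "$BASE/dir1/abc.txt"   # upload a local file
hfs -ls "$BASE/dir1"                                              # list the directory
hfs -cat "$BASE/dir1/abc.txt"                                     # print the file content
hfs -cp "$BASE/dir1/abc.txt" "$BASE/dir2"                         # copy within HDFS
hfs -du "$BASE/dir2/abc.txt"                                      # show the file length
hfs -rm "$BASE/dir1/abc.txt"                                      # remove the original
```

Running the script with RUN=1 on a machine with a configured Hadoop client would execute the commands for real, including the final delete.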
