Setting up the Hadoop cluster:
Here you will learn how to install Hadoop and configure clusters ranging from a couple of nodes to tens of thousands of nodes. The first step is to install Hadoop on a single machine. Hadoop requires Java, so install it first if it is not already on your system.
Getting Hadoop to work on the entire cluster involves installing the required software on every machine in the cluster. Typically, one machine is designated as the NameNode and another as the ResourceManager.
Other services, such as the MapReduce Job History Server and the Web App Proxy Server, can be hosted either on dedicated machines or on shared resources, depending on the load. All the remaining nodes in the cluster act as both NodeManager and DataNode. These are collectively termed the slave nodes.
Getting Hadoop to work in the non-secure mode
The Java configuration of Hadoop is driven by two types of important configuration files:
- Read-only default configuration: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
- Site-specific configuration: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
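As a rough illustration of how a site-specific file overrides a read-only default, the sketch below writes a minimal core-site.xml into a scratch directory. The CONF_DIR variable and the NameNode address are assumptions made for this example only. In a real installation the file lives under etc/hadoop/ of the distribution.

```shell
# Sketch: write a minimal site-specific override into a scratch directory.
# CONF_DIR and the host name below are illustrative assumptions.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the value shipped in the read-only core-default.xml -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
EOF

echo "Wrote $CONF_DIR/core-site.xml"
```

Any property that is not listed in the site file keeps the value from the matching default file.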
In addition, you can control the Hadoop scripts in the bin/ directory of the distribution by setting site-specific values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
To configure the Hadoop cluster, you first need to set up the environment in which the Hadoop daemons execute, and then the configuration parameters for the daemons themselves.
The Hadoop daemons fall into two groups:
- HDFS daemons: NameNode, Secondary NameNode and DataNode.
- YARN daemons: ResourceManager, NodeManager and WebAppProxy.
Configuring the environment of the Hadoop daemons
Administrators should use the etc/hadoop/hadoop-env.sh script, and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts, for site-specific customization of the Hadoop daemons' process environment. At the very least, JAVA_HOME must be specified so that it is correctly defined on every remote node.
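A minimal sketch of the kind of line that would go into etc/hadoop/hadoop-env.sh is shown here. The JDK path is an assumption and differs between systems, so use whichever path is valid on every node of the cluster.

```shell
# Sketch for etc/hadoop/hadoop-env.sh.
# The fallback JDK path is an illustrative assumption, not a required location.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-8-openjdk-amd64}"

# Optional sanity check: warn if no java binary is found under JAVA_HOME.
if [ -x "$JAVA_HOME/bin/java" ]; then
  echo "Using Java from $JAVA_HOME"
else
  echo "Warning: no executable found at $JAVA_HOME/bin/java" >&2
fi
```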
Configuration of the individual daemons
The daemons and the environment variable used to configure each one:
NameNode – HADOOP_NAMENODE_OPTS
DataNode – HADOOP_DATANODE_OPTS
Secondary NameNode – HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager – YARN_RESOURCEMANAGER_OPTS
NodeManager – YARN_NODEMANAGER_OPTS
WebAppProxy – YARN_PROXYSERVER_OPTS
Map Reduce Job History Server – HADOOP_JOB_HISTORYSERVER_OPTS
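For example, JVM options for a single daemon could be passed through its variable roughly as follows. The choice of the parallel garbage collector is only an illustration, not a recommendation from this tutorial.

```shell
# Sketch for etc/hadoop/hadoop-env.sh: JVM options applied to the DataNode only.
# -XX:+UseParallelGC is a standard JVM flag, used here purely as an example.
export HADOOP_DATANODE_OPTS="-XX:+UseParallelGC ${HADOOP_DATANODE_OPTS:-}"

# Sketch for etc/hadoop/yarn-env.sh: the same idea for the NodeManager.
export YARN_NODEMANAGER_OPTS="-XX:+UseParallelGC ${YARN_NODEMANAGER_OPTS:-}"

echo "DataNode options:    $HADOOP_DATANODE_OPTS"
echo "NodeManager options: $YARN_NODEMANAGER_OPTS"
```

Prepending to the existing value, rather than overwriting it, keeps any options that were already set elsewhere.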
Customization of other important configuration parameters:
- HADOOP_PID_DIR – the directory where the daemons' process ID files are stored.
- HADOOP_LOG_DIR – the directory where the daemons' log files are stored.
- HADOOP_HEAPSIZE / YARN_HEAPSIZE – the maximum heap size to use, in MB. For example, if the variable is set to 1000, the heap is set to 1000 MB. The default is 1000.
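Put together, these settings might look like the sketch below. The directory paths and the 2000 MB figure are assumptions chosen for illustration.

```shell
# Sketch for etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
# Paths and sizes are illustrative assumptions; adjust them for your site.
export HADOOP_PID_DIR="${HADOOP_PID_DIR:-/var/run/hadoop}"   # process ID files
export HADOOP_LOG_DIR="${HADOOP_LOG_DIR:-/var/log/hadoop}"   # daemon log files
export HADOOP_HEAPSIZE=2000   # heap size in MB for the Hadoop daemons
export YARN_HEAPSIZE=2000     # heap size in MB for the YARN daemons

echo "PID dir: $HADOOP_PID_DIR, log dir: $HADOOP_LOG_DIR, heap: ${HADOOP_HEAPSIZE} MB"
```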
The HDFS Shell Commands
The shell commands below cover the most important file-management operations of the Hadoop Distributed File System (HDFS) in the cluster.
- Directory creation in HDFS at a given path.
hadoop fs -mkdir <paths>
- Listing of the directory contents.
hadoop fs -ls <args>
- HDFS file Upload/download.
Copies a single source file, or multiple source files, from the local file system to the Hadoop Distributed File System:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Copies (downloads) files to the local file system:
hadoop fs -get <src> <localdst>
- Viewing of file content
Same as the Unix cat command:
hadoop fs -cat <path>
- File copying from source to destination
hadoop fs -cp <source> <dest>
- Copying of a file to HDFS from the local file system and vice versa
hadoop fs -copyFromLocal <localsrc> URI
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
- File moving from source to destination.
hadoop fs -mv <src> <dest>
Remember that you cannot move files across file systems.
- File or directory removal in HDFS.
hadoop fs -rm <path>
Recursive version of delete:
hadoop fs -rm -r <path>
- Showing the file's final few lines.
hadoop fs -tail <path>
- Showing the aggregate length of a file.
hadoop fs -du <path>
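The operations listed above can be strung together into one short session, sketched here. The HDFS directory, the file names and the run helper are hypothetical and exist only for this walkthrough. The helper executes each command when the hadoop CLI is installed and otherwise just prints it, so the sketch can be read or tried without a cluster.

```shell
# Hypothetical paths used only for this walkthrough.
HDFS_DIR="/user/demo/input"
LOCAL_FILE="$(mktemp)"
echo "hello hdfs" > "$LOCAL_FILE"

# Run the command if the hadoop CLI is available, otherwise only print it.
run() {
  if command -v hadoop >/dev/null 2>&1; then
    "$@"
  else
    echo "(dry run) $*"
  fi
}

run hadoop fs -mkdir -p "$HDFS_DIR"                              # create a directory
run hadoop fs -put "$LOCAL_FILE" "$HDFS_DIR/sample.txt"          # upload a local file
run hadoop fs -ls "$HDFS_DIR"                                    # list directory contents
run hadoop fs -cat "$HDFS_DIR/sample.txt"                        # view file content
run hadoop fs -cp "$HDFS_DIR/sample.txt" "$HDFS_DIR/copy.txt"    # copy within HDFS
run hadoop fs -mv "$HDFS_DIR/copy.txt" "$HDFS_DIR/renamed.txt"   # move within HDFS
run hadoop fs -tail "$HDFS_DIR/sample.txt"                       # show the end of the file
run hadoop fs -du "$HDFS_DIR"                                    # show aggregate length
run hadoop fs -get "$HDFS_DIR/sample.txt" "$LOCAL_FILE.download" # download to local disk
run hadoop fs -rm "$HDFS_DIR/renamed.txt"                        # delete a single file
run hadoop fs -rm -r "$HDFS_DIR"                                 # recursive delete
```

The session cleans up after itself in HDFS. The temporary local files are left in place for inspection.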