Hadoop Installation Prerequisites
Hadoop is supported on the Linux platform, so install a Linux OS to set up the Hadoop environment. If you use an operating system other than Linux, you can install a virtual machine and run Linux inside it.
Hadoop is written in Java, so Java must be installed on the machine, and the version should be 1.6 or later.
Hadoop is easy to install and run on a single machine using your own user account. Download a stable release, which is packaged as a gzipped tar file, from http://www.eu.apache.org/dist/hadoop/common/, and then unpack it somewhere on your filesystem:
% tar xzf hadoop-x.y.z.tar.gz
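The unpack step can be wrapped in a small helper that checks the tarball exists and extracts it into a chosen directory. This is only a sketch: the function name unpack_release and the example paths are hypothetical, not part of Hadoop.

```shell
# Unpack a Hadoop release tarball into a target directory.
# Usage: unpack_release <tarball> <target-dir>
unpack_release() {
    tarball=$1
    target=$2
    if [ ! -f "$tarball" ]; then
        echo "no such tarball: $tarball" >&2
        return 1
    fi
    # Create the target directory if needed, then extract into it.
    mkdir -p "$target" &&
        tar xzf "$tarball" -C "$target"
}

# Example (hypothetical paths):
#   unpack_release ~/Downloads/hadoop-x.y.z.tar.gz ~/opt
#   ls ~/opt/hadoop-x.y.z
```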
Before you can run Hadoop, you need to tell it where Java is installed.
If Java is installed, running java -version prints the details of the installed version.
You set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable. The line to change differs by platform: on a Mac and on Ubuntu, JAVA_HOME points to different Java installation directories.
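On a Linux machine, one way to find a candidate value for JAVA_HOME is to resolve the java launcher found on the PATH and strip the trailing bin/java. The sketch below assumes GNU readlink (for the -f option); the function names are hypothetical, and the result should be checked by hand before it is copied into conf/hadoop-env.sh.

```shell
# Derive a JAVA_HOME candidate from the path of a java binary:
#   <home>/bin/java  ->  <home>
java_home_from_bin() {
    dirname "$(dirname "$1")"
}

# Resolve the java launcher on the PATH, following symlinks.
# Returns 1 if java is not installed or cannot be resolved.
detect_java_home() {
    java_bin=$(command -v java) || return 1
    java_bin=$(readlink -f "$java_bin") || return 1
    java_home_from_bin "$java_bin"
}

if home=$(detect_java_home); then
    # This is the line that would go into conf/hadoop-env.sh.
    echo "export JAVA_HOME=$home"
else
    echo "java not found on PATH" >&2
fi
```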
It is convenient to create an environment variable that points to the Hadoop installation directory (named HADOOP_INSTALL here) and to put the Hadoop binary directory on your command-line path. In Hadoop 2.0 and later, you also need to put the sbin directory on the path. For example:
% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
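If these exports live in a shell startup file, the same directories get appended again every time the file is sourced. A possible way around that, sketched here with a hypothetical helper called path_append, is to add a directory only when it is not yet on the path:

```shell
# Append a directory to PATH unless it is already there.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already on the path, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Falls back to the example location used above if HADOOP_INSTALL is unset.
HADOOP_INSTALL=${HADOOP_INSTALL:-/home/tom/hadoop-x.y.z}
path_append "$HADOOP_INSTALL/bin"
path_append "$HADOOP_INSTALL/sbin"
export HADOOP_INSTALL PATH
```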
Check that Hadoop runs by typing the following command:
% hadoop version
Hadoop 1.0.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1214675
Compiled by hortonfo on Thu Dec 15 16:36:35 UTC 2011
Each component in Hadoop is configured using an XML file. Common properties go in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in mapred-site.xml. These files are all located in the conf subdirectory.
In Hadoop 2.0 and later, MapReduce runs on YARN, and there is an additional configuration file called yarn-site.xml. In these versions all the configuration files go in the etc/hadoop subdirectory. Hadoop runs in one of three modes:
- Fully distributed mode – The Hadoop daemons run on a cluster of machines.
- Standalone or local mode – There are no daemons running, and everything runs in a single JVM (Java Virtual Machine). This mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
- Pseudo-distributed mode – The Hadoop daemons run on the local machine, simulating a cluster on a small scale.
To run Hadoop in a particular mode, you need to do two things:
- Set the appropriate properties
- Start the Hadoop daemons
The diagram below shows the minimum set of properties to configure for each mode. In standalone mode, the local filesystem and the local MapReduce job runner are used. In the distributed modes, the HDFS and MapReduce (or YARN) daemons are started.
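As a rough illustration of how the configured properties distinguish the modes, the sketch below guesses the mode from the filesystem URI in core-site.xml. It is a heuristic under stated assumptions, not something Hadoop itself provides: the function name guess_mode is hypothetical, and it looks only for an hdfs:// URI and ignores every other property.

```shell
# Guess the Hadoop mode from core-site.xml in a configuration directory.
# Heuristic only: it inspects the hdfs:// URI and nothing else.
guess_mode() {
    core="$1/core-site.xml"
    if [ ! -f "$core" ] || ! grep -q 'hdfs://' "$core"; then
        echo standalone            # no HDFS URI: local filesystem is used
    elif grep -q 'hdfs://localhost' "$core"; then
        echo pseudo-distributed    # namenode runs on this machine
    else
        echo fully-distributed     # namenode runs on another host
    fi
}

# Example: inspect the conf directory of the installation, if one is set.
guess_mode "${HADOOP_INSTALL:-.}/conf"
```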
Standalone Mode
In standalone mode there is no further action to take, since the default properties are set for standalone mode and there are no daemons to run.
Pseudo distributed Mode
Create the configuration files with the following contents and place them in the conf directory. You can also place the configuration files in any other directory, as long as you start the daemons with the --config option.
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
If you are running YARN, use the yarn-site.xml file:
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>
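Since these files all share the same shape, they can also be generated from the shell. The following is a minimal sketch that writes a one-property file; the helper name write_site_xml is hypothetical, and the temporary directory stands in for the real conf directory so that nothing in an existing installation is overwritten.

```shell
# Write a one-property Hadoop configuration file.
# Usage: write_site_xml <file> <property-name> <value>
write_site_xml() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

conf_dir=$(mktemp -d)    # stand-in for the real conf directory
write_site_xml "$conf_dir/hdfs-site.xml" dfs.replication 1
write_site_xml "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:8021
echo "wrote configuration files to $conf_dir"
```

The daemons could then be pointed at that directory with the --config option.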
In pseudo-distributed mode you have to start daemons, and to do that you need to have SSH installed. Hadoop starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process.
Pseudo-distributed mode is a special case of fully distributed mode in which the single host is localhost, so you need to make sure that you can SSH to localhost and log in without entering a password. First, make sure that SSH is installed and a server is running. On Ubuntu, this is achieved with:
% sudo apt-get install ssh
To enable password-less login, generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful you should not have to type in a password.
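Running the cat command more than once appends the same key to authorized_keys repeatedly. A small sketch of a repeatable version follows; authorize_key is a hypothetical helper name, and the BatchMode check in the comments makes ssh fail rather than prompt when a password would be needed.

```shell
# Append a public key to an authorized_keys file only if it is not
# already listed, so the step can be repeated safely.
authorize_key() {
    pub=$1
    auth=$2
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"
    # -x matches whole lines, -F treats the key as a fixed string.
    if ! grep -qxF -e "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}

# Example, using the key generated above:
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh -o BatchMode=yes localhost true && echo "password-less login works"
```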
Formatting the HDFS Filesystem
Before it can be used, a new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode's persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem's metadata, and datanodes can join or leave the cluster dynamically.
Formatting HDFS is a fast operation. Type the following command:
% hadoop namenode -format
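Because formatting creates an empty filesystem, it can be worth checking whether a namenode storage directory has already been formatted before running the command again. The sketch below relies on the current/VERSION file that a formatted storage directory contains. The helper name is_formatted is hypothetical, and the default directory shown is an assumption that holds only when dfs.name.dir and hadoop.tmp.dir are left at their Hadoop 1 defaults.

```shell
# Return success if a namenode storage directory already holds a
# formatted filesystem (it then contains current/VERSION).
is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Assumed default location; adjust if dfs.name.dir is set explicitly.
name_dir=${NAME_DIR:-/tmp/hadoop-$(id -un)/dfs/name}

if is_formatted "$name_dir"; then
    echo "already formatted: $name_dir (formatting again would erase it)"
else
    echo "not formatted yet; run: hadoop namenode -format"
fi
```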
Start and stop the daemons (MapReduce 1)
To start the HDFS and MapReduce daemons, type the following commands:
% start-dfs.sh
% start-mapred.sh
The following daemons will be started on your local machine: a namenode, a secondary namenode, a datanode, a jobtracker, and a tasktracker. You can check whether the daemons started successfully by looking at the logfiles in the logs directory (in the Hadoop installation directory) or by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at http://localhost:50070/ for the namenode. You can also use Java's jps command to see whether they are running.
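The jps check can be scripted. The sketch below reads jps-style output, one "<pid> <ClassName>" pair per line, and prints each expected daemon that is missing. The function name missing_daemons is hypothetical; the class names in the example are the ones jps reports for the five MapReduce 1 daemons.

```shell
# Print every expected daemon that does not appear in jps-style output.
# Reads "<pid> <ClassName>" lines on standard input.
missing_daemons() {
    running=$(awk '{print $2}')
    for daemon in "$@"; do
        # -x requires the whole line to match the daemon name.
        echo "$running" | grep -qx "$daemon" || echo "$daemon"
    done
}

# Example with a live installation (requires a JDK for jps):
#   jps | missing_daemons NameNode DataNode SecondaryNameNode JobTracker TaskTracker
```

An empty result means that every listed daemon was found.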
To stop the daemons, type the following commands:
% stop-dfs.sh
% stop-mapred.sh
Start and stop the daemons (MapReduce 2)
To start the HDFS and YARN daemons, type the following commands:
% start-dfs.sh
% start-yarn.sh
These commands will start the HDFS daemons and, for YARN, a resource manager and a node manager. The resource manager web UI is at http://localhost:8088/.
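The web UI addresses mentioned in this tutorial can be collected in one place and probed from the command line. This is a sketch under assumptions: ui_url and ui_is_up are hypothetical helper names, the ports are the ones quoted in the text, and the probe needs curl installed and the daemon running.

```shell
# Map a daemon name to the web UI address given in this tutorial.
ui_url() {
    case "$1" in
        namenode)        echo "http://localhost:50070/" ;;
        jobtracker)      echo "http://localhost:50030/" ;;
        resourcemanager) echo "http://localhost:8088/" ;;
        *) echo "unknown daemon: $1" >&2; return 1 ;;
    esac
}

# Succeed if the daemon's web UI answers within five seconds.
ui_is_up() {
    url=$(ui_url "$1") || return 1
    curl -s -f -o /dev/null --max-time 5 "$url"
}

# Example:
#   ui_is_up resourcemanager && echo "resource manager UI is reachable"
```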
To stop the daemons, use the following commands:
% stop-dfs.sh
% stop-yarn.sh