Hadoop is supported on Linux and its flavors, so you need a Linux OS to set up a Hadoop environment. If you use an operating system other than Linux, you can install a virtual machine and run Linux inside it.
Hadoop is written in Java, so Java must be installed on the machine, and the version should be 1.6 or later.
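The version requirement can also be checked from a script. The following is a minimal sketch and not part of Hadoop: java_version_ok is a hypothetical helper name, and it assumes the version string has already been extracted from the output of java -version (for example 1.6.0_45 on older JVMs, or 11.0.2 on newer ones, which dropped the leading "1.").

```shell
# Return 0 if the given Java version string is 1.6 or later, otherwise 1.
# Handles legacy strings ("1.6.0_45") and modern ones ("11.0.2", "17").
java_version_ok() {
    version="$1"
    major="${version%%.*}"        # text before the first dot
    major="${major%%[!0-9]*}"     # keep only the leading digits
    [ -n "$major" ] || return 1   # no usable number: treat as too old
    if [ "$major" -gt 1 ]; then return 0; fi   # 9, 11, 17, ...
    if [ "$major" -lt 1 ]; then return 1; fi
    rest="${version#*.}"          # text after the first dot
    minor="${rest%%[!0-9]*}"      # leading digits of the remainder
    [ -n "$minor" ] && [ "$minor" -ge 6 ]
}

# Usage (assumes java is on the PATH; java -version writes to stderr):
#   v=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
#   java_version_ok "$v" && echo "Java $v is recent enough"
```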
It is easy to run Hadoop on a single machine using your own user account. Download a stable release, which is packaged as a gzipped tar file, from http://www.eu.apache.org/dist/hadoop/common/ and unpack it somewhere on your filesystem:
% tar xzf hadoop-x.y.z.tar.gz
Before running Hadoop, you need to know where Java is installed. If Java has been installed, running java -version displays the version details.
You can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable. For example, on a Mac you would change the JAVA_HOME line in that file to point at the home directory of the installed JDK.
It is convenient to create an environment variable that points to the Hadoop installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on your command-line path. In Hadoop 2.0 you also need to put the sbin directory on the path. For example:
% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
Check that Hadoop runs by typing:
% hadoop version
Hadoop 1.0.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1214675
Compiled by hortonfo on Thu Dec 15 16:36:35 UTC 2011
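If the hadoop command is not found at this point, a likely cause is that the Hadoop bin directory did not end up on the path. The sketch below shows one way to check that; path_has_dir is a hypothetical helper name, not something Hadoop provides, and the optional second argument exists only so that the function can be tried against an arbitrary path string.

```shell
# Return 0 if directory $1 is an entry in a colon-separated path list.
# The list defaults to the current PATH; $2 can supply a different one.
path_has_dir() {
    dir="$1"
    list="${2:-$PATH}"
    # Wrapping both sides in colons makes every entry look like ":entry:",
    # so /opt/hadoop does not falsely match /opt/hadoop/bin.
    case ":$list:" in
        *":$dir:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (assumes HADOOP_INSTALL was exported as described above):
#   path_has_dir "$HADOOP_INSTALL/bin" || echo "bin directory is not on PATH"
```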
Every component in Hadoop is configured using an XML file. Common properties go in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in mapred-site.xml. These files are placed in the conf subdirectory.
In Hadoop 2.0 and later, MapReduce runs on YARN and there is an additional configuration file called yarn-site.xml. In these versions all configuration files go in the etc/hadoop subdirectory. Hadoop can be run in one of three modes:
- Fully distributed mode – The Hadoop daemons run on a cluster of machines.
- Standalone or local mode – There are no daemons running and everything runs in a single JVM. This mode is suitable for running MapReduce programs during development, because it is simple to test and debug them.
- Pseudodistributed mode – The Hadoop daemons run on the local machine, thereby simulating a cluster on a small scale.
To run Hadoop in a particular mode you need to do two things:
- Set the appropriate properties
- Start the Hadoop daemons
The diagram below shows the minimal set of properties needed to configure each mode. In standalone mode the local filesystem and the local MapReduce job runner are used, while in the distributed modes the HDFS and the MapReduce or YARN daemons are started.
In standalone mode there is no additional action to perform: the default properties are set for standalone mode and there are no daemons to run.
For pseudodistributed mode, the configuration files should be created with the following contents and placed in the conf directory (although you can place configuration files in any directory as long as you start the daemons with the --config option):
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
If you are running YARN, use the yarn-site.xml file:
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>
In pseudodistributed mode you have to start daemons, and for this you need to have SSH installed. Hadoop simply starts daemons on the set of hosts in the cluster, which is defined by the slaves file, by SSH-ing to each host and starting a daemon process.
Pseudodistributed mode is a special case of fully distributed mode in which the only host is localhost, so you need to ensure that you can SSH to localhost and log in without entering a password. First, ensure that SSH is installed and a server is running. On Ubuntu this is achieved with:
% sudo apt-get install ssh
Then to enable password-less login generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful you should not have to type in a password.
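The same check can be scripted so that it fails instead of prompting. This is a minimal sketch assuming the OpenSSH client: BatchMode=yes makes ssh fail rather than ask for a password, and ConnectTimeout bounds the wait. The function name and the SSH_BIN override variable are hypothetical and introduced only for illustration.

```shell
# Return 0 if we can log in to the host over SSH without being prompted.
# SSH_BIN is a hypothetical override (defaults to ssh) that lets the
# function be exercised without a real SSH server.
can_ssh_without_password() {
    host="${1:-localhost}"
    # BatchMode=yes: never prompt for a password or passphrase; fail instead.
    # ConnectTimeout=5: give up after five seconds if the server is unreachable.
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1
}

# Usage:
#   can_ssh_without_password localhost || echo "password-less SSH is not set up"
```

Because batch mode cannot answer prompts, the check may also fail if the host key for localhost has not been accepted yet; running ssh localhost once interactively first avoids that.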
Formatting the HDFS Filesystem
The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata and datanodes can join or leave the cluster dynamically.
Formatting HDFS is a fast operation. Just type the following:
% hadoop namenode -format
Starting and stopping the daemons (MapReduce 1)
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh
The following daemons will be started on your local machine: a namenode, a secondary namenode, a datanode, a jobtracker, and a tasktracker. You can check whether the daemons started successfully by looking at the logfiles in the logs directory (in the Hadoop installation directory) or by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at http://localhost:50070/ for the namenode. You can also use Java’s jps command to see whether they are running.
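The jps check can be turned into a small script that reports which daemons are missing. This is a sketch under stated assumptions: missing_daemons is a hypothetical helper, it expects jps-style output in which each line is a process ID followed by the main class name, and the class names used (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker) are the ones these daemons normally appear under.

```shell
# Print the name of every expected daemon that is absent from jps-style output.
# $1 is the captured output; the remaining arguments are the expected class names.
missing_daemons() {
    jps_output="$1"
    shift
    for daemon in "$@"; do
        # Compare the whole second field, so NameNode does not match
        # a line that only contains SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Usage (assumes jps from the JDK is on the PATH):
#   missing_daemons "$(jps)" NameNode SecondaryNameNode DataNode JobTracker TaskTracker
```

An empty result means every expected daemon was found. For a YARN setup the expected names would instead include ResourceManager and NodeManager alongside the HDFS daemons.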
Stopping the daemons is done in the obvious way:
% stop-dfs.sh
% stop-mapred.sh
Starting and stopping the daemons (MapReduce 2)
To start the HDFS and YARN daemons, type:
% start-dfs.sh
% start-yarn.sh
These commands will start the HDFS daemons, and for YARN, a node manager and a resource manager. The resource manager web UI is at http://localhost:8088/.
You can stop the daemons with:
% stop-dfs.sh
% stop-yarn.sh