Hadoop is an open-source framework used to store and process big data in a distributed environment across clusters of computers, using simple programming models. It is designed to scale from a single server to thousands of machines, with each node offering local computation and storage. Provided by Apache and written in Java, it is used for processing and analyzing very large volumes of data. Some of the major companies using Hadoop are:
- Google
- LinkedIn
- Facebook
- Twitter etc.
Its main power lies in the MapReduce algorithm, which Hadoop applications use to process data. In this model a task is divided into smaller parts, and those parts are assigned to many computers (nodes) connected over the network. The data is thus processed and analyzed in parallel on different nodes, which makes the framework capable of performing complete statistical analysis on huge amounts of data with ease.
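To make this concrete, a minimal Hadoop Streaming job (a sketch only; the streaming jar path and the HDFS input/output directories are placeholders you would adapt) spreads the mappers across the nodes and lets a reducer aggregate the results:

```bash
# Each mapper passes its input split through /bin/cat in parallel on a node;
# the reducer runs wc over the combined output to count the records.
# The jar path, /data/sample, and /data/sample-count are assumed placeholders.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input   /data/sample \
  -output  /data/sample-count \
  -mapper  /bin/cat \
  -reducer /usr/bin/wc
```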
| Criteria | Result |
|---|---|
| Hadoop Hardware | Commodity hardware |
| Hadoop Scalability | Excellent |
| Hadoop Economic Value | Very Good |
How to set up a Hadoop multi-node cluster?
As discussed above, Hadoop follows a "divide and rule" policy to deal with big data: tasks are divided among various nodes. But how do you set up a multi-node cluster?
Before learning to set up the framework, you should have fundamental knowledge of the Java programming language, as Java is the main prerequisite for working with Hadoop.
Java installation
If Java is not installed, the first step is to install it. Follow the steps below (a command sketch is given after the list):
- Download Java: the JDK can be downloaded from Oracle's official website.
- Extract the JDK file from the downloaded archive.
- To make Java available to all users, move it to the location /usr/local/.
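As a rough sketch (the archive name, extracted directory name, and JDK version are assumptions; substitute the ones you actually downloaded), the steps above translate to:

```bash
# Extract the downloaded JDK archive (file name is illustrative).
tar -xzf jdk-8u202-linux-x64.tar.gz
# Move it to /usr/local/ so that it is available to all users.
sudo mv jdk1.8.0_202 /usr/local/jdk
# Point JAVA_HOME at the new location and put java on the PATH.
echo 'export JAVA_HOME=/usr/local/jdk'   >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin'  >> ~/.bashrc
source ~/.bashrc
java -version   # verify the installation
```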
Want to learn more? Read this extensive Hadoop Tutorial!
Now follow the steps below:
Hadoop installation – Download a stable version of Hadoop from the Apache mirrors.
Installing the framework typically involves unpacking the software on all the nodes in the cluster. The hardware is divided based on its function.
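A typical download-and-unpack sequence looks roughly like this (the release version, mirror URL, and install path are assumptions; pick a current stable release from the Apache mirrors):

```bash
# Download and unpack a Hadoop release (version and URL are illustrative).
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
# Make the Hadoop commands available in the shell.
echo 'export HADOOP_HOME=/usr/local/hadoop'                  >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin'  >> ~/.bashrc
source ~/.bashrc
```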
Typically, one machine in the cluster is designated as the NameNode and another as the ResourceManager; these are the master nodes. Other services (which may include the Web App Proxy Server and the MapReduce Job History Server) usually run either on dedicated hardware or on shared infrastructure, depending on the volume of data.
The other machines in the cluster act as both DataNode and NodeManager; these are known as the slaves (worker nodes).
Configuring – There are two major types of configuration files:
- Read-only default configuration – core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml.
- Site-specific configuration – etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, and etc/hadoop/mapred-site.xml (a minimal example follows the list).
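A minimal site-specific configuration might look like the sketch below; the hostname master, the port, and the replication factor are assumptions to adapt to your cluster:

```bash
# core-site.xml: tell every node where the NameNode (the default file system) lives.
cat > "$HADOOP_HOME"/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: set the HDFS block replication factor.
cat > "$HADOOP_HOME"/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF
```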
To configure Hadoop, you have to configure both the environment in which the daemons execute and the configuration parameters of the daemons themselves. JAVA_HOME must be defined correctly on each remote node. Individual daemons can be configured using the environment variables shown in the table below:
| Daemon | Environment Variable |
|---|---|
| NameNode | HADOOP_NAMENODE_OPTS |
| DataNode | HADOOP_DATANODE_OPTS |
| ResourceManager | YARN_RESOURCEMANAGER_OPTS |
| NodeManager | YARN_NODEMANAGER_OPTS |
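For the daemons' environment, JAVA_HOME and the per-daemon options from the table are typically set in etc/hadoop/hadoop-env.sh; the paths and heap sizes below are only illustrative:

```bash
# etc/hadoop/hadoop-env.sh -- values are assumptions, adjust to your nodes.
export JAVA_HOME=/usr/local/jdk              # must resolve on every node
export HADOOP_NAMENODE_OPTS="-Xmx4g"         # extra JVM options for the NameNode
export HADOOP_DATANODE_OPTS="-Xmx2g"         # extra JVM options for each DataNode
```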
Go through these Hadoop Interview Questions to know what is expected from big data professionals!
Individual daemons can be configured by the administrator using the configuration options available during cluster setup. For instance, to configure the NameNode to use parallel GC, the following statement would be added to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"
Monitoring the Health of NodeManagers – Administrators can supply a script that helps determine whether a node is healthy.
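The contract for such a script is that it prints a line starting with ERROR when it considers the node unhealthy; the mount point and disk threshold below are assumptions for illustration:

```bash
#!/bin/bash
# Hypothetical NodeManager health-check script: report ERROR when the data
# disk is almost full, so YARN stops scheduling containers on this node.
used=$(df -P /data | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$used" -gt 90 ]; then
  echo "ERROR: /data is ${used}% full"
fi
```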
Slaves File – The helper scripts use the etc/hadoop/workers file to run commands on many hosts at once. To use this functionality, SSH trust must be established for the accounts used to run Hadoop.
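For instance (the hostnames worker1–worker3 are assumptions), the workers file and the SSH trust can be set up like this:

```bash
# List the worker hostnames, one per line (names are illustrative).
cat > "$HADOOP_HOME"/etc/hadoop/workers <<'EOF'
worker1
worker2
worker3
EOF

# Establish passwordless SSH from the machine that runs the helper scripts.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in worker1 worker2 worker3; do
  ssh-copy-id "$host"
done
```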
Rack Awareness – It is highly recommended to configure rack awareness before starting HDFS. The daemons obtain the rack information of the slaves in the cluster by using an administrator-configured module.
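One documented way to supply that rack information is a topology script referenced by the net.topology.script.file.name property in core-site.xml; the subnet-to-rack mapping below is purely illustrative:

```bash
#!/bin/bash
# Hypothetical topology script: print one rack path for every host/IP argument
# that the NameNode passes in.
for node in "$@"; do
  case "$node" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done
```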
Logging – Hadoop uses Apache log4j via the Apache Commons Logging framework for logging. Edit etc/hadoop/log4j.properties to customize the Hadoop daemons' logging configuration (log formats and so on).
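For example (the logger name is an assumption; choose the component you want to trace), raising the HDFS log level takes a single extra line:

```bash
# Append an illustrative override to the daemons' log4j configuration.
echo 'log4j.logger.org.apache.hadoop.hdfs=DEBUG' >> "$HADOOP_HOME"/etc/hadoop/log4j.properties
```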
Operating the Cluster – Once the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all machines; this should be the same directory on every node. In general, it is recommended that HDFS and YARN run as separate users.
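A simple way to keep the configuration identical everywhere is to push it from one machine (the worker hostnames are assumptions):

```bash
# Copy the configuration directory to every node in the cluster.
for host in worker1 worker2 worker3; do
  rsync -a "$HADOOP_HOME"/etc/hadoop/ "$host":"$HADOOP_HOME"/etc/hadoop/
done
```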
You can also learn how to install Hadoop on a single node through the Hadoop Single Node Installation blog.
Hadoop Startup
To start Hadoop, we have to start both the HDFS and the YARN cluster.
The first time we bring up HDFS, it has to be formatted. Format a new distributed file system as the hdfs user.
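A rough startup sequence (the cluster name is an example, and the hdfs/yarn users assume the recommended separate-user setup) looks like this:

```bash
# One-time format of the new file system, run on the NameNode as the hdfs user.
hdfs namenode -format mycluster

# Start the HDFS daemons (NameNode and DataNodes), then the YARN daemons.
start-dfs.sh    # as the hdfs user
start-yarn.sh   # as the yarn user
```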