How to get started with Hadoop?

By Shailesh | Last updated on May 12, 2025 | 90724 Views

The Apache Hadoop software library is a framework for the distributed processing of large data sets stored across clusters of computers, using simple programming models. Many machines work together to analyze these large data sets. Hadoop's design was based on whitepapers released by Google.

Components of Hadoop:

Hadoop has two core components:

HDFS or Hadoop Distributed File System
MapReduce

Hadoop Cluster – A Hadoop cluster is a set of machines that run these two components, HDFS and MapReduce. It consists of one or more nodes, where each node is an individual machine.

Hadoop Structure:

The structure consists of a master node and slave nodes. The master node runs the NameNode and the JobTracker. Each slave node runs a DataNode and a TaskTracker.

Working of Hadoop:

First, the client passes the data to Hadoop. The data is then distributed across the DataNodes, while the NameNode keeps track of where each block is stored. Finally, the program is run on Hadoop and processes the data.

Hadoop Process:

Step 1: The data is first broken into blocks of 64 MB or 128 MB, which are then moved to the nodes.

Step 2: The Hadoop framework then passes the program to the nodes to run.

Step 3: The JobTracker schedules the program on the individual nodes.

Step 4: Once the program has executed, the output is returned.
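
To make Step 1 concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks. It is only an illustration of the arithmetic: `BLOCK_SIZE` and `split_into_blocks` are hypothetical names, not part of any Hadoop API, and the 300 MB file is an assumed example.

```python
# Sketch of Step 1: splitting a file into fixed-size blocks.
# BLOCK_SIZE and split_into_blocks are illustrative names, not Hadoop APIs.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the two sizes mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 300 MB file becomes two full blocks and one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset}, length={length}")
```

Only the last block is smaller than the block size, so a file that is not an exact multiple of 128 MB does not waste a full block.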

HDFS

During a Hadoop job, the data is loaded onto the Hadoop file system, called HDFS (Hadoop Distributed File System). This file system is based on Google's GFS (Google File System). It is responsible for storing data in the cluster. Each block of 64 MB or 128 MB is replicated three times so that data is not lost if a node fails. The replication factor can be configured.
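
The effect of three-way replication can be sketched in a few lines. This is a simplified illustration only: the round-robin placement, the node names, and the `place_replicas` helper are all assumptions made for the example and do not reflect how HDFS actually chooses nodes.

```python
# Sketch: assigning each block to three distinct nodes.
# Round-robin placement is an assumption for illustration; it is not
# HDFS's actual placement policy.
REPLICATION = 3  # the replication factor described above


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Map each block index to `replication` distinct node names."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical DataNodes
placement = place_replicas(3, nodes)
for block, holders in placement.items():
    print(f"block {block} -> {holders}")

# Losing one node still leaves at least two copies of every block.
failed = "node2"
survivors = {b: [n for n in h if n != failed] for b, h in placement.items()}
print(survivors)
```

Because every block lives on three different machines, the failure of a single node never removes the only copy of a block.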

HDFS is best suited to a small number of large files rather than a large number of small files. It is managed by the NameNode and the DataNodes.

Working of HDFS:

The picture above shows the data being replicated across three nodes.

MapReduce

– MapReduce is used to process data in the Hadoop system

– It is the data processing component of Hadoop

– The processing work is accomplished by distributing tasks across the nodes

– Processing happens in two phases:

Map
Reduce

– Between the Map and Reduce phases, there is another phase called Shuffle and Sort

Process of Mapping

In this process, the text of the input file is given to the mapper, which emits key-value pairs as its output. In a word count job, for example, the mapper emits a pair for every word it reads.
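
As a rough sketch of the mapping step, assuming a word count job, the mapper below turns each line of text into (word, 1) pairs. It is plain Python rather than Hadoop code; `map_words` and the two input lines are hypothetical.

```python
# Sketch of a word-count mapper: emits (word, 1) for every word.
# map_words is an illustrative name, not a Hadoop API.
def map_words(line):
    """Yield a (word, 1) pair for each whitespace-separated word."""
    for word in line.lower().split():
        yield (word, 1)


text = ["the cat sat", "the cat ran"]  # hypothetical input lines
mapped = [pair for line in text for pair in map_words(line)]
print(mapped)
```

The mapper does no counting itself. It only labels each word with a 1 and leaves the totals to the later phases.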

Process of Shuffle and Sort

In the Shuffle and Sort process, all of the values emitted by the mappers are brought together, grouped by key, and sorted by key. The output is each key together with the values collected for it, one for every occurrence of that key in the text file.
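
Shuffle and Sort can be approximated in plain Python as grouping pairs by key and then ordering the keys. The `shuffle_and_sort` helper and the sample pairs below are assumptions for illustration; in Hadoop this step is performed by the framework, not by user code.

```python
from collections import defaultdict


# Sketch of Shuffle and Sort: group values by key, then sort by key.
# shuffle_and_sort is an illustrative name, not a Hadoop API.
def shuffle_and_sort(pairs):
    """Return a key-sorted list of (key, [values]) tuples."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


# Hypothetical mapper output for a word count job.
pairs = [("the", 1), ("cat", 1), ("sat", 1), ("the", 1), ("cat", 1), ("ran", 1)]
grouped = shuffle_and_sort(pairs)
print(grouped)
```

Each key now appears once, with one value per occurrence, which is the shape of input a reducer expects.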

Process of Reducer:

The reducer runs according to the program you have written. For example, in a word count job it sums the values for each key and produces the final reducer output as Emit(k, sum).
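
A reducer that produces Emit(k, sum) might look like the sketch below. It assumes a word count job and grouped input of the kind Shuffle and Sort produces; `reduce_counts` and the sample data are hypothetical, and this is plain Python rather than the Hadoop API.

```python
# Sketch of a word-count reducer: the equivalent of Emit(k, sum).
# reduce_counts is an illustrative name, not a Hadoop API.
def reduce_counts(key, values):
    """Return the key with the sum of its values."""
    return (key, sum(values))


# Hypothetical grouped input, one list of values per key.
grouped_input = [("cat", [1, 1]), ("ran", [1]), ("sat", [1]), ("the", [1, 1])]
result = [reduce_counts(key, values) for key, values in grouped_input]
print(result)
```

Swapping `sum` for another aggregation such as `max` or `len` is all it takes to change what the job computes, which is why the reducer is the part you write yourself.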

Finally, the entire Hadoop job process can be understood from the following flowchart.


