
How to get started with Hadoop?


The Apache Hadoop software library is a framework for the distributed processing of large data sets stored across clusters of computers, using simple programming models. Many machines work together to analyze large sets of data. The concept of Hadoop was developed from whitepapers released by Google.

Components of Hadoop:

Hadoop has two core components:

  1. HDFS or Hadoop Distributed File System
  2. MapReduce

Hadoop Cluster – A Hadoop cluster is a set of machines that run these two components, HDFS and MapReduce. It consists of one or more nodes, where each node is an individual machine.

Hadoop Structure:


The structure consists of a master node and slave nodes. The master node runs the NameNode and the Job Tracker. Each slave node runs a DataNode and a Task Tracker.


Working of Hadoop:


First, the client passes the data to Hadoop. The data is then distributed across the DataNodes, while the NameNode keeps track of where each block is stored. Finally, the program is run on Hadoop to process the data.

Hadoop Process:

Step 1: The data is first broken into blocks of 64 MB or 128 MB, which are then moved to the nodes.

Step 2: The Hadoop framework then passes the program to the nodes to run.

Step 3: The Job Tracker schedules the program on the individual nodes.

Step 4: Once the program has executed, the output is returned.
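
Step 1 above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Hadoop itself is implemented: the function name `split_into_blocks` is hypothetical, and the 128 MB default is one of the two block sizes mentioned in the steps.

```python
MB = 1024 * 1024  # bytes in one megabyte


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block that a file is split into.

    Hypothetical helper for illustration only; it models the arithmetic,
    not the Hadoop implementation.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])  # prints [128, 128, 44]
```

A 300 MB file therefore becomes two full blocks and one smaller final block, and each of those blocks can be moved to a different node.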



Before processing, the data is loaded into the Hadoop file system, called HDFS or Hadoop Distributed File System. This file system is based on Google's GFS (Google File System) and is responsible for storing data in the cluster. Each block of 64 MB or 128 MB is replicated three times so that data is not lost if a node fails. The number of replicas can be configured.
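
To make the replication idea concrete, here is a minimal Python sketch that assigns each block to three distinct nodes. It rests on loud assumptions: the function `place_replicas`, the node names, and the round-robin strategy are all hypothetical and chosen for simplicity; they are not the placement policy HDFS actually uses. The `replication` parameter stands in for the configurable replica count (in a real cluster this is typically set through the `dfs.replication` property).

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin.

    Illustrative only: real HDFS placement is more involved than this.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive node indexes (wrapping around) are always distinct
        # as long as replication <= len(nodes).
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
for block_id, holders in place_replicas(3, nodes).items():
    print(block_id, holders)
# prints:
# 0 ['node1', 'node2', 'node3']
# 1 ['node2', 'node3', 'node4']
# 2 ['node3', 'node4', 'node1']
```

With three copies of every block on different machines, losing any single node still leaves two copies of each block available.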

HDFS is best suited to a small number of large files rather than a large number of small files. It is managed by the NameNode and the DataNodes.

Working of HDFS:


The picture above shows the data being replicated across three nodes.



– MapReduce is the data processing component of Hadoop

– It processes data by distributing tasks across the nodes of the cluster

– Processing happens in two phases:

  • Map
  • Reduce

– Between the Map and Reduce phases there is an intermediate phase called Shuffle and Sort


Process of Mapping


In this process, the text of the input file is given to the mapper as input, and the mapper emits its output as key-value pairs.
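
As a rough illustration of mapping, the sketch below assumes a word count job, which the text does not name explicitly. The function `map_words` and the sample lines are hypothetical, and this is plain Python rather than the Hadoop API.

```python
def map_words(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


# Hypothetical contents of the input text file, one string per line.
lines = ["Hadoop stores data", "Hadoop processes data"]

mapped = [pair for line in lines for pair in map_words(line)]
print(mapped)
# prints [('hadoop', 1), ('stores', 1), ('data', 1),
#         ('hadoop', 1), ('processes', 1), ('data', 1)]  (on one line)
```

Each line can be mapped independently of the others, which is what allows the work to be spread across nodes.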


Process of Shuffle and Sort


In the process of Shuffle and Sort, all of the values for the same key are brought together, and the keys are sorted. The output is each key together with its instances from the text file.
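
The grouping step can be sketched in Python as follows. The function `shuffle_and_sort` and the sample pairs are hypothetical, and a word count job is assumed; in a real cluster this step is carried out by the framework across nodes rather than in a single function.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group values by key, then return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Keys are unique, so sorting the items orders the groups by key.
    return sorted(grouped.items())


# Hypothetical mapper output for a word count job.
mapped = [("hadoop", 1), ("stores", 1), ("data", 1),
          ("hadoop", 1), ("processes", 1), ("data", 1)]

print(shuffle_and_sort(mapped))
# prints [('data', [1, 1]), ('hadoop', [1, 1]), ('processes', [1]), ('stores', [1])]
```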

Process of Reducer:


The reducer runs the logic that you write in your program. For example, it can add up the values for each key and produce the final reducer output as Emit(k, sum).
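
A minimal Python sketch of such a reducer is shown below, assuming a word count job. The function `reduce_counts` and the grouped input are hypothetical; returning the tuple plays the role of Emit(k, sum).

```python
def reduce_counts(key, values):
    """Sum all values for one key and return the (key, sum) pair."""
    return (key, sum(values))


# Hypothetical input: keys with their grouped values, already sorted by key.
grouped = [("data", [1, 1]), ("hadoop", [1, 1]), ("processes", [1]), ("stores", [1])]

for key, values in grouped:
    print(reduce_counts(key, values))
# prints:
# ('data', 2)
# ('hadoop', 2)
# ('processes', 1)
# ('stores', 1)
```

Because the reducer is ordinary user code, the same structure works for other aggregations, such as taking a maximum instead of a sum.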

Finally, the entire job process of Hadoop can be understood by the following flowchart.





About the Author

Senior Content Manager - Content and Content Marketing

Mohammad Waseem is a Senior Content Manager with a passion for crafting profound narratives. With a background in content and content marketing, he blends creativity with strategy to captivate audiences. His experience in content creation resonates across various platforms and gets lakhs of views, showcasing his expertise in content strategizing.