
Introduction to Hadoop



Hadoop is a Java-based framework designed to scale from a single server up to thousands of machines, each offering local computation and storage. It supports huge data sets in a distributed computing environment.

The Apache Hadoop software library is a framework that allows for the distributed processing of huge data sets across clusters of computers using simple programming models.

(Figure: the Apache Hadoop ecosystem)



Architecture of Hadoop

(Figure: the architecture of Hadoop)


  • Hadoop MapReduce (Processing/Computation layer) – MapReduce is a parallel programming model, devised at Google, for writing distributed applications that efficiently process large amounts of data on large clusters.
  • Hadoop HDFS (Storage layer) – The Hadoop Distributed File System, or HDFS, is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It is highly fault-tolerant, is designed to be deployed on low-cost hardware, gives high-throughput access to application data, and is suitable for applications with large datasets.
  • Hadoop Common – Includes the Java libraries and utilities that the other Hadoop modules need to start and run.
  • Task Tracker – A node that accepts tasks (map, reduce, and shuffle operations) from the JobTracker.
  • Job Tracker – The service that schedules and runs MapReduce jobs on the cluster.
  • Name Node – The node where Hadoop stores all file location (metadata) information for the Hadoop Distributed File System.
  • Data Node – The node that stores the actual data in the Hadoop Distributed File System.

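The MapReduce model described above can be sketched in plain Python. This is a simulation of the programming model only, not the Hadoop Java API: the map phase emits (key, value) pairs, the framework's shuffle groups them by key, and the reduce phase aggregates each group.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in an input split."""
    for word in split.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: sort and group intermediate pairs by key, as the framework does."""
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, [count for _, count in group])

def reduce_phase(key, counts):
    """Reducer: sum the counts emitted for each word."""
    return (key, sum(counts))

# Two input splits, as if they lived on two different data nodes.
splits = ["hadoop stores big data", "hadoop processes big data"]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = dict(reduce_phase(k, v) for k, v in shuffle_phase(intermediate))
print(result)  # {'big': 2, 'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In real Hadoop, each mapper runs on the node that holds its input block, and the shuffle moves intermediate pairs across the network so that all values for one key reach the same reducer.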

How does Hadoop Work?

To execute large-scale processing, one can tie multiple commodity computers together into a single functional distributed system, have the clustered machines read the dataset in parallel to produce intermediate results, and then combine those results into the desired output.

Hadoop runs code across a cluster of computers and performs the following tasks:

  • Data is first organized into files and directories. Files are divided into uniformly sized blocks of 128 MB or 64 MB (the defaults in Hadoop 2.x and 1.x, respectively).
  • These blocks are then distributed across the various cluster nodes for further processing.
  • The JobTracker then starts scheduling programs on the individual nodes.
  • Once all the nodes have finished, the output is returned.
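The steps above can be sketched as a toy pipeline (illustrative only: the block size is shrunk from 64/128 MB to 16 bytes, and threads stand in for cluster nodes): split the data into fixed-size blocks, process each block independently, then combine the partial results once all workers finish.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # bytes here; HDFS defaults are 64 MB (1.x) / 128 MB (2.x)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Mimic HDFS: divide the input into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def process_block(block):
    """Per-node work: count the non-whitespace characters in one block."""
    return sum(1 for ch in block if not ch.isspace())

data = "hadoop distributes blocks across data nodes"
blocks = split_into_blocks(data)

# The "job tracker" schedules each block on a worker; output is
# combined only after every node is done.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_block, blocks))

total = sum(partials)
print(len(blocks), total)  # 3 blocks, 38 non-whitespace characters
```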


Advantages of Hadoop

  • It permits users to rapidly write and test distributed systems; it automatically distributes the data and work across the machines and, in turn, exploits the underlying parallelism of the CPU cores.
  • The Hadoop library has been designed to detect and handle failures at the application layer.
  • Servers can be added to or removed from the cluster dynamically.
  • It is open source and, being Java-based, compatible with all platforms.


Features of Hadoop

Features of Hadoop are as follows:

  • Scalable
  • Resilient to failure
  • Flexible
  • Cost Effective
  • Fast



History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine that was itself a part of the Lucene project.


"0 Responses on Introduction to Hadoop"

Leave a Message

100% Secure Payments. All major credit & debit cards accepted Or Pay by Paypal.

Sales Offer

  • To avail this offer, enroll before 21st January 2017.
  • This offer cannot be combined with any other offer.
  • This offer is valid on selected courses only.
  • Please use coupon codes mentioned below to avail the offer

Sign Up or Login to view the Free Introduction to Hadoop.