Introduction to Hadoop
Hadoop is a Java-based framework designed to scale up from a single server to thousands of machines, each offering local computation and storage. It supports processing of huge data sets in a distributed computing environment.
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
Architecture of Hadoop
- Hadoop MapReduce (Processing/Computation layer) – MapReduce is a parallel programming model, devised at Google, for writing distributed applications that efficiently process large amounts of data on large clusters.
- Hadoop HDFS (Storage layer) – The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It is highly fault-tolerant and designed to be deployed on low-cost hardware. It gives high-throughput access to application data and is suitable for applications with large datasets.
- Hadoop YARN – Hadoop YARN is a framework which is used for job scheduling and cluster resource management.
- Hadoop Common – The Java libraries and utilities required by the other Hadoop modules; these provide the files needed to start Hadoop.
- Task Tracker – A node that accepts tasks (map, reduce, and shuffle operations) from the JobTracker.
- Job Tracker – A service that schedules and runs MapReduce jobs on the cluster, assigning tasks to TaskTrackers.
- Name Node – The node that stores the metadata of the Hadoop Distributed File System, including the location information of all files.
- Data Node – A node that stores the actual data blocks in the Hadoop Distributed File System.
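The MapReduce layer described above can be illustrated with a small, self-contained simulation. This is plain Python, not the Hadoop API: a map phase emits key/value pairs from each input split, a shuffle phase groups the pairs by key, and a reduce phase aggregates each group (here, a word count):

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in split.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here, sum the counts)
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits" standing in for blocks of a file stored on different nodes
splits = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts["hadoop"])  # 2
print(word_counts["data"])    # 2
```

In a real cluster, each map task runs on the node holding its input split, and the shuffle moves intermediate pairs across the network to the reducers.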
How does Hadoop Work?
For large-scale processing, multiple commodity computers can be tied together as a single functional distributed system: the clustered machines read the dataset in parallel, produce intermediate results, and combine them into the desired output.
Hadoop runs code across a cluster of computers and performs the following tasks:
- Data is first organized into files and directories. Files are divided into uniformly sized blocks, typically 128 MB or 64 MB.
- These blocks are then distributed across various cluster nodes for further processing.
- Job tracker then starts scheduling programs on individual nodes.
- Once all the nodes have finished processing, the output is returned to the client.
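The first step above, splitting a file into fixed-size blocks, can be sketched as follows. This is an illustrative Python snippet, not HDFS code, and the block size is scaled down from the 128 MB default to 128 bytes so the example runs instantly:

```python
def split_into_blocks(data: bytes, block_size: int):
    # Split a byte sequence into fixed-size blocks; the final block may be
    # smaller, just as the last HDFS block of a file occupies only the bytes
    # it actually needs.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 300-byte "file" split with a toy block size of 128 bytes
file_bytes = b"x" * 300
blocks = split_into_blocks(file_bytes, block_size=128)
print([len(block) for block in blocks])  # [128, 128, 44]
```

HDFS additionally replicates each block (three copies by default) onto different DataNodes, which is what makes the parallel, fault-tolerant reads in the steps above possible.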
Advantages of Hadoop
- It allows the user to rapidly write and test distributed systems. Hadoop automatically distributes the data and the work across the machines, utilizing the underlying parallelism of the CPU cores.
- The Hadoop library is designed to detect and handle failures at the application layer rather than relying on the hardware.
- Servers can be added or removed from the cluster dynamically.
- It is open source and compatible across all platforms, since it is Java-based.
Features of Hadoop
Features of Hadoop are as follows:
- Resilient to failure
- Cost Effective
History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library. Hadoop has its origins in Apache Nutch, an open-source web search engine that was itself part of the Lucene project.