Big Data:
Big data refers to collections of datasets so large and complex that they are difficult to store and process with traditional databases and data processing tools.
Hadoop:
Hadoop is an open source software framework for writing and running applications that process huge amounts of data. It enables distributed processing of large datasets and is designed to run on clusters of commodity hardware.
It includes:
- HDFS – Hadoop Distributed File System
- MapReduce – offline (batch) computing engine
Figure 1: Hadoop Components
Description of Hadoop components:
- HDFS (Hadoop Distributed File System):
HDFS is the storage system for a Hadoop cluster. As data arrives in the cluster, HDFS splits it into parts and distributes those parts among the servers participating in the cluster, so each server stores only a small fragment of the complete dataset. To keep the data safe in case of hardware failure, each part is also replicated on more than one server.
- MapReduce (Distributed data processing framework):
As described above, Hadoop divides a dataset into parts and distributes them over various servers. The jobs that refine and analyze the data are distributed in the same way: they run in parallel, so all subsets of the data are processed simultaneously. Each server processes its share of the data and reports back the result. These jobs are MapReduce jobs.
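The storage and processing ideas above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not how Hadoop is implemented: the helper names (split_into_blocks, place_blocks, count_vowels), the tiny block size, the round-robin placement, and the replication factor of 2 are all made up for illustration, and threads stand in for servers. Real HDFS uses much larger blocks and a smarter placement policy.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(data: str, block_size: int) -> list[str]:
    """Cut the incoming data into fixed-size parts (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, servers: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct servers, round-robin (an assumption)."""
    if replication > len(servers):
        raise ValueError("replication factor exceeds the number of servers")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def count_vowels(block: str) -> int:
    """The 'job' that runs on one block and reports back a partial result."""
    return sum(1 for ch in block if ch in "aeiou")


data = "hadoop stores data in blocks across the cluster"
servers = ["node1", "node2", "node3", "node4"]

# Storage: split the data, then spread replicated blocks over the servers.
blocks = split_into_blocks(data, block_size=16)
placement = place_blocks(len(blocks), servers, replication=2)
for block_id, holders in placement.items():
    print(f"block {block_id} -> {holders}")

# Processing: one job per block, all running in parallel.
with ThreadPoolExecutor(max_workers=len(servers)) as pool:
    partial_counts = list(pool.map(count_vowels, blocks))

# Combine the partial results that the "servers" reported back.
total = sum(partial_counts)
print("partial results:", partial_counts, "total:", total)
```

Because every block lives on two different servers in this sketch, losing any single server still leaves one readable copy of each block.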
Hadoop ecosystem and analytics:
Let us try to understand the Hadoop ecosystem, which consists of various modules.
Data dump in HDFS:
The tools below help bring data from outside the Hadoop cluster into HDFS.
- Chukwa: An open source data collection system, built on top of HDFS and the MapReduce framework, for monitoring large distributed systems. Chukwa can also display, monitor, and analyze the results of the collected data.
- Kafka: A distributed messaging service built around a partitioned commit log. In simple terms, a producer sends messages to the Kafka cluster, which then delivers them to consumers. Kafka keeps its log on the brokers' own disks rather than in HDFS; a consumer can then load the messages into HDFS for storage and processing.
- ZooKeeper: An open source project that provides a distributed configuration service, a synchronization service, and a naming registry for large distributed systems.
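The producer, cluster, and consumer flow described for Kafka above can be sketched with a toy in-memory log. This is not the Kafka client API: the ToyTopic class, its send and read methods, and the CRC32-based partition choice are hypothetical stand-ins that only show what "partitioned commit log" means, namely append-only partitions that consumers read by offset.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one topic of a partitioned commit log."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key: str, value: str) -> tuple[int, int]:
        """Append a message; the same key always lands in the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition: int, offset: int) -> list[tuple[str, str]]:
        """Return every message in a partition from `offset` onward."""
        return self.partitions[partition][offset:]


# Producer side: messages are appended, never modified in place.
topic = ToyTopic(num_partitions=2)
for key, value in [("web-1", "GET /"), ("web-2", "GET /about"), ("web-1", "POST /login")]:
    topic.send(key, value)

# Consumer side: the consumer remembers how far it has read in each partition.
offsets = defaultdict(int)
consumed = []
for partition in range(2):
    batch = topic.read(partition, offsets[partition])
    consumed.extend(batch)
    offsets[partition] += len(batch)

print(consumed)
```

Messages with the same key stay in order because they share a partition, and a second read from the stored offsets returns nothing new until the producer sends more.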
Compute Framework:
- MapReduce: A software framework for writing applications that process enormous volumes of data in parallel on large clusters of commodity hardware. A MapReduce job splits the input into small parts, which the map tasks process in parallel. The framework then sorts the output of the maps, which is further processed by the reduce tasks.
- YARN: Often described as next-generation MapReduce, or its successor, and also known as MapReduce version 2. It separates resource management from job scheduling and monitoring. The framework consists of a resource manager and node managers: the resource manager allocates resources among the different applications, and each node manager monitors resource usage on its node. Applications written for the earlier MapReduce also run on YARN after a recompile.
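The map, sort, and reduce steps described for MapReduce above can be sketched as a word count in plain Python. This is a single-process illustration, not the Hadoop API: map_phase, reduce_phase, and the two sample input splits are names and data invented for this example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(split: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: combine all counts that were emitted for one word."""
    return word, sum(counts)


# The input is already divided into small parts (splits).
splits = ["big data big cluster", "data node data"]

# Map every split; in a real cluster these calls would run in parallel.
mapped = [pair for split in splits for pair in map_phase(split)]

# Sort the map output by key so equal words sit next to each other.
mapped.sort(key=itemgetter(0))

# Reduce each group of identical keys to a single result.
reduced = [
    reduce_phase(word, [count for _, count in group])
    for word, group in groupby(mapped, key=itemgetter(0))
]
print(reduced)
```

The sort step matters: groupby only merges adjacent items, which mirrors why the framework sorts map output before handing it to the reduce tasks.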
Querying data in HDFS:
- Hive: A data warehouse built on top of Hadoop (HDFS). It lets us retrieve the desired data by writing SQL-like queries in HQL (Hive Query Language) rather than writing complex Java code.
- Pig: A platform for analyzing large datasets. It was initially developed at Yahoo so that users of Apache Hadoop could focus on analyzing data rather than spending time writing complex code. As the name suggests, it can handle any kind of data.
- Avro: An open source data serialization system. In Avro, the data definition (schema) is written in JSON. An Avro data file stores the schema together with the data, which makes the file easy to understand and process.
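To make the Avro description above concrete, here is a minimal sketch of a JSON schema travelling in the same file as the data. It deliberately avoids the real Avro library and its binary encoding: the User schema, the write_container and read_container helpers, and the one-JSON-document-per-line layout are assumptions chosen only to show why a self-describing file is easy to process.

```python
import json

# A record schema written in JSON, in the style Avro uses (hypothetical "User" type).
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}


def write_container(schema: dict, records: list[dict]) -> str:
    """Put the schema on the first line and one record on every later line."""
    lines = [json.dumps(schema)] + [json.dumps(record) for record in records]
    return "\n".join(lines)


def read_container(text: str) -> tuple[dict, list[dict]]:
    """Recover the schema from the file itself, then check each record against it."""
    header, *body = text.split("\n")
    file_schema = json.loads(header)
    field_names = sorted(field["name"] for field in file_schema["fields"])
    records = [json.loads(line) for line in body]
    for record in records:
        if sorted(record) != field_names:
            raise ValueError(f"record does not match schema: {record}")
    return file_schema, records


users = [{"name": "Asha", "age": 31}, {"name": "Ben", "age": 27}]
container = write_container(schema, users)

# The reader needs no outside knowledge: the layout comes from the file.
read_schema, read_users = read_container(container)
print(read_schema["name"], read_users)
```

A reader that has never seen the User type can still interpret the file, which is the practical benefit of keeping schema and data together.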