In this article we are going to understand the following concepts:
-What is big data?
-What is Hadoop?
-What is HDFS and how does it work?
-What is MapReduce?
-What is the Hadoop ecosystem?
-A deeper dive into HDFS & MapReduce.
Big data is data whose volume is beyond the storage and processing capabilities of a single physical machine. Big data is commonly described along three dimensions:
First dimension – Volume
Second dimension – Variety
Third dimension – Velocity
Characteristics of Big data
1. Volume: For data to be referred to as big data, the volume of data generally has to be massive.
2. Variety: For data to be referred to as big data, the data is generally semi-structured & may originate from a variety of sources & in a variety of formats.
3. Velocity: For data to be referred to as big data, the rate at which data comes into the system is typically very high.
Visualizing big data
What does big data look like?
There is no defined form that big data takes; it can look like almost anything.
What is the value in taking the pain of analyzing this data?
It is in this huge heap of data that the golden nuggets of vital information lie. Analyzing this data can give a business an edge over its competition & help it serve users in a more personalized manner.
Amazon, for example, is an e-commerce website that analyzes your browsing history. If you search for an Xbox, it will show you recommendations related to that choice.
Some use cases of analyzing big data include:
– Identifying the customers who are most valuable.
– Identifying the best time to perform maintenance based on usage patterns.
– Analyzing your brand's reputation by analyzing social media posts, & so on.
How can such huge data be analyzed?
As the volume is beyond the processing & storage ability of a single physical machine, some distributed approach has to be used.
Distributed systems like MPI have been around for more than a decade. So do we even need to read any further?
Does the world need another distributed system?
The answer is yes, the world needs another distributed system.
Typical distributed system
In a typical distributed system:
1. Programs run on each application server.
2. All the data resides on a SAN (storage area network).
3. Before execution, each server fetches its data from the SAN.
4. After execution, each server writes its output back to the SAN.
Problems with typical distributed system
1. Huge dependency on the network & huge bandwidth demands.
2. Scaling up & down is not a smooth process.
3. Partial failures are difficult to handle.
4. A lot of processing power is spent on transporting data.
5. Data synchronization is required during exchange.
This is where Hadoop comes in.
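The data-transport problem above can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares the two models; the job size, node count, bandwidth & locality figures are illustrative assumptions, not measurements:

```python
# Rough comparison of "move data to compute" (SAN) vs "move compute
# to data" (Hadoop). All numbers are made-up assumptions, not benchmarks.

JOB_INPUT_GB = 1000   # total data the job must read
NODES = 100           # servers in the cluster
LINK_GBPS = 1         # usable network bandwidth per server, gigabits/s

def san_transfer_seconds():
    """SAN model: every byte crosses the network before processing."""
    gigabits = JOB_INPUT_GB * 8
    per_node = gigabits / NODES   # each server pulls its share in parallel
    return per_node / LINK_GBPS

def data_local_transfer_seconds(local_fraction=0.9):
    """Hadoop model: tasks run where the data lives, so only the
    non-local fraction crosses the network (0.9 locality is an assumption)."""
    gigabits = JOB_INPUT_GB * 8 * (1 - local_fraction)
    per_node = gigabits / NODES
    return per_node / LINK_GBPS

print(f"SAN model:        {san_transfer_seconds():.0f} s of network transfer")
print(f"Data-local model: {data_local_transfer_seconds():.0f} s of network transfer")
```

Under these toy numbers the data-local model spends an order of magnitude less time moving data over the network, which is exactly the dependency problems 1 & 4 above describe.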
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
In simpler terms, Hadoop is a framework that lets several machines work together to achieve the goal of analyzing large sets of data.
Google created its own distributed computing framework & published papers about it. Hadoop was developed on the basis of the papers released by Google.
Core Hadoop consists of 2 core components:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
A set of machines running HDFS & MapReduce is known as a Hadoop cluster.
1. Individual machines are known as nodes.
2. A cluster can have anywhere from one node to several thousand nodes.
HDFS is a file system that is different from the Linux file system, but it sits on top of it.
Hadoop & Hadoop ecosystem
HDFS: Hadoop Distributed File System
For Hadoop to be able to process files, the files have to be in HDFS, i.e. Hadoop's own file system. HDFS is responsible for storing data on the cluster of machines. Data is normally split into blocks of 64 MB to 128 MB & spread across the cluster. By default each block is replicated 3 times; the replication factor can be lowered or raised through configuration settings. To ensure that data is not lost, replicas of the same block are always placed on different nodes.
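The block-splitting & replication just described can be sketched in a few lines of Python. The block size, node names & placement policy here are simplified assumptions; real HDFS placement is rack-aware & considers free space:

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024             # 128 MB, a common HDFS block size
REPLICATION = 3                            # default replication factor
NODES = [f"node{i}" for i in range(1, 7)]  # a toy 6-node cluster

def split_into_blocks(file_size):
    """Return the size of each block a file of file_size bytes splits into."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

def place_replicas(blocks):
    """Assign each block to REPLICATION distinct nodes (simplified:
    random choice among nodes; real HDFS is rack-aware)."""
    return [random.sample(NODES, REPLICATION) for _ in blocks]

file_size = 300 * 1024 * 1024              # a 300 MB file
blocks = split_into_blocks(file_size)
for i, (size, nodes) in enumerate(zip(blocks, place_replicas(blocks))):
    print(f"block {i}: {size / (1024 * 1024):.0f} MB on {nodes}")
```

A 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each block lands on 3 different nodes, so losing any single node loses no data.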
HDFS is a file system based on Google's GFS. It operates on top of the native Unix file system & provides replicated storage for data using cheap commodity hardware. HDFS files are write-once; random writes are not allowed. It performs best with small numbers of large files. The name node & data node processes take care of HDFS; for HDFS to be accessible, the name node has to be up all the time.
How HDFS works
To read a file, a client asks the name node for the file's block locations & then reads the blocks directly from the data nodes that hold them.
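This read path can be sketched as a toy simulation. The in-memory dictionaries standing in for the name node's metadata & the data nodes' storage are simplifications to show the flow, not real HDFS code, and the file name & block contents are made up:

```python
# Toy model of an HDFS read: the client asks the "name node" for block
# locations, then fetches each block from a "data node" holding a replica.

# Name node: file path -> ordered list of (block_id, [replica nodes])
namenode_metadata = {
    "/logs/2024.txt": [("blk_1", ["node1", "node3"]),
                       ("blk_2", ["node2", "node3"])],
}

# Data nodes: node name -> {block_id: block contents}
datanodes = {
    "node1": {"blk_1": b"first 128MB of the file..."},
    "node2": {"blk_2": b"rest of the file..."},
    "node3": {"blk_1": b"first 128MB of the file...",
              "blk_2": b"rest of the file..."},
}

def read_file(path):
    """Simulate an HDFS client read: metadata lookup, then block fetches."""
    data = b""
    for block_id, replicas in namenode_metadata[path]:
        node = replicas[0]   # real clients prefer the closest replica
        data += datanodes[node][block_id]
    return data

print(read_file("/logs/2024.txt"))
```

Notice that the file's bytes never pass through the name node; it serves only metadata, which is why it must stay up for HDFS to be accessible.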
MapReduce
MapReduce is the data processing component of Hadoop. It accomplishes data processing by distributing tasks across the nodes of the cluster; the task on each node processes the data that is present locally.
MapReduce consists of 2 phases:
1. Map
2. Reduce
In between map & reduce there is a small phase called shuffle & sort.
Shuffle & sort
The shuffle & sort phase takes the output of the map phase, sorts it by key & groups it, so that all values for a given key arrive together at the same reducer.
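The three phases can be made concrete with a word count, the canonical MapReduce example, simulated here on a single machine in plain Python (a real job would run the same logic spread across the cluster's nodes):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle & sort: sort all pairs by key so identical keys are
    adjacent, then group them; each group feeds one reduce call."""
    pairs = sorted(pairs, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
results = [reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped)]
print(results)
```

Each input line could be mapped on a different node, & each key group reduced on a different node; the shuffle & sort step is what moves & regroups the intermediate pairs in between.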