Introduction to HDFS and MapReduce

By now, regular Intellipaat blog readers have a good idea of what Hadoop is, which Hadoop skills are in demand, the diverse job opportunities Hadoop offers, and so on. Naturally, it is time to take a deep dive into the two most important components of a Hadoop cluster: Apache MapReduce and Apache HDFS. First, let's look at how MapReduce and HDFS trace their origins to the search giant Google.

  • HDFS owes its existence to the Google File System developed by the search giant.
  • MapReduce is directly derived from Google's MapReduce, which was created to parse web pages.

These two, HDFS and MapReduce, form the major pillars of the Apache Hadoop framework. HDFS is the infrastructural (storage) component, whereas MapReduce is the computational framework.

Deep Dive into Hadoop Distributed File System

As the name suggests, HDFS is a storage system for very large volumes of files. Its scalability and distributed nature are distinct advantages that make it well suited to working with Big Data. HDFS stores data on commodity hardware, can run on huge clusters, and can stream data for immediate processing.

Though HDFS is similar to various other distributed file systems, it has some very distinctive features and advantages that make it so widely deployed in Big Data Hadoop projects. Here are some of the most important:

Working on very large sets of data

HDFS is built for scale, and this is one of its greatest strengths. Where other distributed file systems struggle, HDFS holds up: it can store petabytes of data and retrieve them on demand, making it an excellent fit for Big Data applications. It stores such huge amounts of data by spreading it across hundreds or even thousands of commodity machines that are cheap and readily available. The aggregate data bandwidth of HDFS is unmatched by competing file systems.

Storing on cheap commodity hardware

HDFS is built from the ground up for Big Data applications. One of the biggest constraints when working with Big Data is cost overrun: if not managed properly, hardware and infrastructure can run into the millions. This is where HDFS comes as a blessing, since it runs successfully on cheap commodity hardware. Hadoop can even be installed on ordinary personal computers, and HDFS works just fine in such an environment. All of this drastically reduces costs and gives you the power to scale at will.

Ability to write once and read many times

Files in HDFS are written once and can be read as many times as needed. The basic premise is that once a file is written it is not overwritten, so it can be accessed repeatedly without a hitch. This model directly contributes to HDFS's high throughput and also resolves the issue of data coherency.
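As a minimal sketch of this model (the cluster configuration is picked up from the classpath, and the path and file contents are hypothetical), the following uses the standard org.apache.hadoop.fs.FileSystem API to write a file exactly once and then re-open it for reading:

```java
// Sketch: write an HDFS file once, then re-open it for reading.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/events.log");   // hypothetical path

        // Write the file once; with overwrite=false, create() fails if the file already exists.
        try (FSDataOutputStream out = fs.create(file, false)) {
            out.write("first and only write\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the same open() call can be repeated as many times as needed.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```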

Providing access to streaming data

When working on Big Data applications, it is extremely important to be able to fetch data in a streaming manner, and HDFS does this effortlessly through its streaming access to data. The emphasis is on providing high throughput for large amounts of data rather than low latency when accessing a single file; great importance is given to streaming the data at high speed, and much less to how that data is stored.
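The sketch below (again with a hypothetical path) streams a large file front to back in big chunks with FSDataInputStream, which is the access pattern HDFS is tuned for:

```java
// Sketch: stream a large HDFS file sequentially in big chunks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/clickstream.log");  // hypothetical path
        byte[] buffer = new byte[128 * 1024];
        long total = 0;
        // Scan the file front to back; HDFS favors this kind of high-throughput
        // sequential read over low-latency random lookups.
        try (FSDataInputStream in = fs.open(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        System.out.println("Streamed " + total + " bytes");
    }
}
```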

Extreme throughput

High throughput is one of the hallmarks of Big Data applications, and HDFS achieves it with some distinct features and capabilities. A job is divided into multiple smaller tasks that are shared across various systems, so the components work independently and in parallel on the given task. Because the data is read in parallel, the total time drops drastically, and high throughput is achieved regardless of the size of the data files.
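As an illustrative sketch of parallel reading (the path and the two-region split are arbitrary assumptions), the following opens the same HDFS file twice and reads each half in its own thread; MapReduce does the same thing at far larger scale across machines:

```java
// Sketch: read two halves of the same HDFS file in parallel threads.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelRegionRead {

    // Read up to `length` bytes starting at `offset` (stops early at end of file).
    static void readRegion(FileSystem fs, Path file, long offset, long length) {
        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(offset);
            long remaining = length;
            int read;
            while (remaining > 0
                    && (read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining))) != -1) {
                remaining -= read;
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-dataset.csv");   // hypothetical path
        long half = fs.getFileStatus(file).getLen() / 2;

        // Each thread scans its own half; more readers (or more machines, as in
        // MapReduce) spread the load over more blocks and DataNodes.
        Thread first = new Thread(() -> readRegion(fs, file, 0, half));
        Thread second = new Thread(() -> readRegion(fs, file, half, Long.MAX_VALUE));
        first.start();
        second.start();
        first.join();
        second.join();
    }
}
```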

Moving computation rather than data

This is a distinct feature of the Hadoop Distributed File System: it lets you move the processing of data to the source of the data rather than moving the data around the network. When you are dealing with huge amounts of data, moving it is particularly cumbersome, overwhelming the network and slowing down processing. HDFS overcomes this problem by providing interfaces that let applications run close to where the data is stored, for faster computation.
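The sketch below (hypothetical path) prints the block-location information HDFS exposes; this is the same information the MapReduce scheduler consults when it places a task on, or near, a node that already holds the data:

```java
// Sketch: list which DataNodes hold each block of a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-dataset.csv")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the hosts holding a replica, so computation can be
            // scheduled on (or near) one of them instead of pulling the block over the network.
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(", ", block.getHosts()));
        }
    }
}
```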

Fault tolerance and data replication

Since the data is stored on cheap commodity hardware, there has to be a trade-off somewhere, and it shows up as frequent failures of nodes. HDFS gets around this problem by storing each block on at least three nodes by default: two replicas sit on the same rack while the third sits on a different rack, so the data survives node and even rack failures. This placement gives the whole system a solid fault recovery mechanism, easy data replication, enhanced scalability, and high data accessibility.
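A small sketch of working with the replication factor follows; the path and the replication value of four are illustrative assumptions, while dfs.replication is the standard cluster-wide default (normally three):

```java
// Sketch: inspect the cluster default and raise the replication factor of one file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication is the cluster-wide default; HDFS ships with 3.
        System.out.println("Default replication: " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/critical-data.parquet");  // hypothetical path
        // Request an extra replica for this file; the NameNode creates the copy in the background.
        fs.setReplication(file, (short) 4);
        System.out.println("File replication: " + fs.getFileStatus(file).getReplication());
    }
}
```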

Deep Dive into Apache MapReduce

MapReduce is the data processing engine of Hadoop clusters deployed for Big Data applications. The basic framework of a MapReduce program consists of two functions, the Mapper and the Reducer. These two pillars of MapReduce let any developer create programs to process data stored in a distributed file environment like HDFS.

The Mapper's job is to organize the input data into intermediate key-value pairs and pass them along to the Reducer. The Reducer then aggregates the Mapper output per key, removing redundancy; in the classic WordCount example, the Reducer keeps a count of how many times each word is received from the Mapper. A short sketch of such a Mapper/Reducer pair follows, and after it we list some of the most important advantages of MapReduce.
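Here is a minimal WordCount-style pair written against the org.apache.hadoop.mapreduce API; the class names are illustrative:

```java
// Sketch: the classic WordCount Mapper and Reducer.
// The Mapper emits (word, 1) pairs; the Reducer sums the counts for each word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }
}
```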

Highly economical

Because of its open source nature, anybody can use MapReduce to process huge amounts of data, and it does not need any specialized hardware. Its Map and Reduce functions can process raw data on a very large scale, making it extremely economical to work with humongous volumes of data. Unlike with some other database systems, you do not need to delete raw data for lack of processing power. With MapReduce, any amount of data can be crunched at breakneck speed through massively parallel computing, saving enterprises huge amounts of money.

Flexible for multitudinous data

This is a highly beneficial feature of MapReduce. Various data processing tools are very good at working on one specific type of data but are unable to crunch another. Big Data, however, comes in many forms: structured, unstructured, and semi-structured. MapReduce can take all of this data and extract meaningful insights from it. This extreme versatility means there is no need to switch database management systems for each variety of data.

Extremely fast processing

MapReduce is a distributed processing system, so it can run on multiple servers to deliver extreme speed and efficiency. This parallel processing ensures that tasks are divided into multiple chunks and processed independently. MapReduce also works hand in hand with HDFS on the same servers, reducing the need to move data around while making sense of it. Compared with other processing systems, MapReduce is extremely fast and delivers output in record time for Big Data applications.

Extreme Scalability

MapReduce was developed at Google to make sense of large volumes of data while parsing web pages, and the same quality serves Big Data applications extremely well. Unlike many other data processing systems, MapReduce can scale to run on thousands of nodes if the need arises without compromising performance. Scaling is simply a matter of adding an extra server, and MapReduce is accommodating enough to fold the extra machine into the cluster.

Heightened resilience

MapReduce runs on hardware that is bound to fail from time to time, since clusters are mostly built from commodity machines. But MapReduce is highly resilient in this respect: as soon as one node in the cluster goes down, it immediately shifts the processing to another node, so the job continues without any disruption to performance or overall throughput.

Highly secure system

When large volumes of data are spread over hundreds or even thousands of nodes of commodity hardware, it becomes very important to ensure that the data does not fall into the wrong hands. MapReduce works in coordination with HDFS to ensure that authentication of users running Hadoop jobs is foolproof and that there is no unauthorized access to the data.
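As a small sketch of the file-level side of this (the path, user, and group are hypothetical, and changing ownership normally requires superuser privileges), HDFS permissions can be tightened from Java like so:

```java
// Sketch: restrict who can read a sensitive HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class LockDownFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path sensitive = new Path("/data/finance/salaries.csv");  // hypothetical path
        // Owner: read/write, group: read, others: no access (rw-r-----).
        fs.setPermission(sensitive,
                new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.NONE));
        // Hand the file to a dedicated service user and group (requires superuser rights).
        fs.setOwner(sensitive, "finance-svc", "finance");
    }
}
```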

Programming simplicity

Unlike various proprietary processing frameworks, MapReduce is open source, so it is easy to program and get running in record time. MapReduce jobs can also be written easily in Java, one of the most widely adopted programming languages. It is therefore not hard to build a MapReduce program that perfectly suits your enterprise needs with basic skills in a language like Java, saving huge amounts of money in the process.
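To show how little wiring is involved, here is a minimal driver sketch that runs the Mapper and Reducer from the earlier WordCount sketch; the class names and the argument-based input/output paths are assumptions:

```java
// Sketch: driver that wires the WordCount Mapper and Reducer into a runnable job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountFunctions.TokenizerMapper.class);
        job.setCombinerClass(WordCountFunctions.IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountFunctions.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```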

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.