Hadoop Distributed File System (HDFS) - Architecture, Working and Benefits

HDFS in Hadoop

So, what is HDFS? HDFS, or the Hadoop Distributed File System, is written almost entirely in the Java programming language and is based on the Google File System (GFS). Google only published a paper describing GFS, without releasing an implementation. Interestingly, a large part of the GFS architecture has been carried over into HDFS.

HDFS was originally developed as the storage infrastructure for the Apache Nutch web search engine project, and hence it was initially known as the Nutch Distributed File System (NDFS). Later on, it was split out of Nutch and developed as a general-purpose distributed file system under the Hadoop project.

HDFS is highly fault-tolerant and can hold very large datasets while providing easy access to them. The files in HDFS are split up and stored across multiple machines, with copies of the data kept on more than one machine. This guards against data loss when a machine crashes, and it makes the data available to applications for parallel processing. The file system is designed for storing very large files with streaming data access.

Before going further in this ‘What is HDFS in Hadoop?’ tutorial, let’s see what we shall be learning in this section:

HDFS in Hadoop
Why do you need another file system?
Why does HDFS work very well with Big Data?
HDFS Architecture
Benefits of HDFS

To know ‘What is HDFS in Hadoop?’ in detail, let’s first see what a file system is. Well, a file system is one of the fundamental parts of every operating system. It basically manages how data is stored on the hard disk.

Why do you need another file system?

Now, you know ‘What is HDFS in Hadoop?’ It is basically a file system. But, the question here is, why do you need another file system?

Have you ever used a file system before?

The answer would be yes!

When you’re using some portable device or laptop every single day for checking WhatsApp or Instagram or for reading this tutorial, you are actually using a file system, unknowingly.

Let’s say, a person has a book and another has a pile of unordered papers from the same book and both of them need to open Chapter 3 of the book. Who do you think would get to Chapter 3 faster?

The one with the book, right? That person can simply go to the index, look for Chapter 3, check the page number, and go to that page. Meanwhile, the one with the pile of papers has to go through the entire pile and, with some luck, might eventually find Chapter 3.

Just like a well-organized book, a file system helps navigate data that is stored in your storage.

Without a file system, the information stored on your hard disk would be one large body of data, with no way to tell where one piece of information stops and the next begins.

The file system manages how data is saved and retrieved. So, whenever files are read from or written to your hard disk, the request goes through the file system. The file system also keeps metadata about your files, such as the filename, size, owner, creation time, modification time, etc.
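To make the idea of file metadata concrete, here is a minimal Python sketch that asks the local file system (not HDFS) for a few of the fields mentioned above. The helper name describe_file and the throwaway sample file are assumptions made for illustration only.

```python
import os
import tempfile
import time


def describe_file(path):
    """Return a few metadata fields the local file system keeps for a file."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "owner_uid": info.st_uid,
        "modified": time.ctime(info.st_mtime),
    }


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello file system")
    sample_path = handle.name

metadata = describe_file(sample_path)
print(metadata)

# Clean up the throwaway file.
os.remove(sample_path)
```

None of these values are stored inside the file itself; the file system maintains them alongside the file's contents.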

When you want to write a file to a hard disk, the file system figures out where on the disk the file should be written and how to do so efficiently. How do you think the file system manages to do that? Since it keeps track of the layout of the hard disk, including the free space available on it, it can write the file directly to a suitable location.

Now, let’s talk about HDFS by working with Example.txt, which is a 514 MB file.

When you upload a file into HDFS, it is automatically split into fixed-size blocks of 128 MB (in older versions of Hadoop, the default block size was 64 MB). Example.txt would therefore be stored as four 128 MB blocks plus a final 2 MB block that holds the remainder. HDFS then replicates each block three times by default and takes care of placing the three copies on different DataNodes.
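The arithmetic for the 514 MB Example.txt file can be sketched in a few lines of Python. This is only an illustration: the function names, the DataNode names (dn1 to dn4), and the round-robin placement are assumptions, and a real HDFS cluster chooses DataNodes with its own placement policy rather than this toy rule.

```python
BLOCK_SIZE_MB = 128       # default block size quoted in this tutorial
REPLICATION_FACTOR = 3    # each block is stored three times


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_replicas(block_count, datanodes, replication=REPLICATION_FACTOR):
    """Toy round-robin placement (an assumption, not the real HDFS policy)."""
    placement = {}
    for block_id in range(block_count):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(514)
print(blocks)  # [128, 128, 128, 128, 2]

placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in placement.items():
    print(block_id, nodes)
```

With five blocks and three copies of each, the cluster ends up storing 15 block replicas for this one file.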

Now that you have understood why you need HDFS, next in this section on ‘What is HDFS?’ let’s see why it is a good match for big data.

Why does HDFS work very well with Big Data?

HDFS is well suited to working with big data. The following points explain why.

HDFS works hand in hand with MapReduce: the computation is moved to the nodes that hold the data, which makes processing large datasets fast.
HDFS follows a simple data coherency model in which files are written once and read many times. This is simple to implement and makes the system robust and scalable.
HDFS runs on commodity hardware and, being written in Java, is portable across operating systems and hardware platforms.
As data is saved in multiple locations, it is protected against the failure of any single machine.
It can be browsed conveniently from a web browser, which makes it easy to use.

HDFS Architecture

The following image shows the most important components of the HDFS architecture. HDFS has a master-slave architecture that is made up of several components.

Let’s start with the two basic nodes in the HDFS architecture, i.e., the DataNode and the NameNode.

DataNode

Nodes on which the blocks are physically stored are known as DataNodes. Since these nodes hold the actual data of the cluster, they are termed DataNodes. Every DataNode knows the blocks it is responsible for, but it is missing one major piece of information: which file each block belongs to.

Although a DataNode knows about the blocks it is responsible for, it knows nothing about the other blocks or the other DataNodes. This is a problem for you as a user because you don’t know anything about the blocks; you know only the file name, and you should be able to work with just the file name in the Hadoop cluster.

So the question here is: if the DataNodes do not know which block belongs to which file, then who has the key information? The key information is maintained by a node called the NameNode.

NameNode

A NameNode keeps track of all the files or datasets in HDFS. It knows the list of blocks that make up each file in HDFS, and not only the list of blocks but also their locations.

Why is a NameNode so important? Imagine that the NameNode is down in your Hadoop cluster. In this scenario, there would be no way to look up the files in the cluster because you wouldn’t be able to figure out the list of blocks that make up each file. Also, you wouldn’t be able to figure out the locations of the blocks. Apart from the block locations, a NameNode also holds the metadata of the files and folders in HDFS, which includes information such as the size, replication factor, created by, created on, last modified by, last modified on, etc.

Due to the significance of the NameNode, it is also called the master node, and the DataNodes are called slave nodes, and hence the master–slave architecture.

The NameNode persists all the metadata about the files and folders to the hard disk, except for the block locations.

Since the NameNode and the DataNodes are in constant communication with each other, when a NameNode starts up, the DataNodes connect to it and report the list of blocks that each of them is responsible for. The NameNode holds the block locations in memory and never persists this information to the hard disk. This is because, in a busy cluster, HDFS is constantly changing as new data files come in, and if the NameNode had to persist every change to the block locations by writing to a hard disk, it would become a bottleneck. Hence, for performance reasons, the NameNode holds the block locations in memory so that it can respond to clients faster. Therefore, the NameNode is usually the most powerful node in the cluster in terms of memory capacity. A NameNode failure is clearly not an option.
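The split between persisted metadata and in-memory block locations can be sketched as follows. This is a simplified model, not Hadoop code: the ToyNameNode class, its method names, and the block and DataNode names are all hypothetical.

```python
from collections import defaultdict


class ToyNameNode:
    """Simplified model of how a NameNode learns block locations after startup."""

    def __init__(self, namespace):
        # Persisted part: file name -> ordered block ids (reloaded from disk).
        self.namespace = namespace
        # In-memory part: block id -> DataNodes holding it; empty after a restart.
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        """Record which blocks a DataNode says it stores."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locations(self, path):
        """Return the known DataNodes for each block of a file, in file order."""
        return [sorted(self.block_locations[b]) for b in self.namespace[path]]


namenode = ToyNameNode({"/data/Example.txt": ["blk_1", "blk_2"]})
print(namenode.locations("/data/Example.txt"))  # [[], []] before any report

namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1"])
namenode.receive_block_report("dn3", ["blk_2"])
print(namenode.locations("/data/Example.txt"))  # [['dn1', 'dn2'], ['dn1', 'dn3']]
```

The file-to-block mapping survives a restart because it is persisted, while the block-to-DataNode mapping is rebuilt from the reports the DataNodes send.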

Secondary NameNode

Other than the NameNode and the DataNodes, there is another component called the secondary NameNode. It works alongside the primary NameNode as a helper. Despite its name, however, the secondary NameNode is not a backup NameNode.
The functions of a secondary NameNode are listed below:

The secondary NameNode periodically reads the file system metadata from the NameNode and writes it to the hard disk.
The secondary NameNode is also responsible for combining the EditLogs with the fsImage present in the NameNode.
At regular intervals, the EditLogs are downloaded from the NameNode and applied to the fsImage by the secondary NameNode.
The secondary NameNode performs periodic checkpoints in HDFS, and hence it is also called the checkpoint node.
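The idea of applying EditLogs to fsImage can be illustrated with a small Python sketch. The data layout here is an assumption made for readability: a dictionary stands in for fsImage and a list of tuples stands in for the EditLogs, whereas the real on-disk formats are different.

```python
def apply_edit(image, edit):
    """Apply one logged namespace change to an in-memory image (a plain dict)."""
    operation, path, *args = edit
    if operation == "create":
        image[path] = {"replication": args[0]}
    elif operation == "set_replication":
        image[path]["replication"] = args[0]
    elif operation == "delete":
        image.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {operation}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of fsimage; return (new_fsimage, empty_log)."""
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/data/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/data/b.txt", 3),
    ("set_replication", "/data/a.txt", 2),
    ("delete", "/data/b.txt"),
    ("create", "/data/c.txt", 3),
]

new_fsimage, edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage)
```

After a checkpoint, the merged image already reflects every logged change, so the log can start again from empty and stays short.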

Blocks

The data in HDFS is stored in the form of files. Each file is divided into one or more segments, which are stored on individual DataNodes. These file segments are known as blocks. The default block size is 128 MB in Apache Hadoop 2.x and 64 MB in Apache Hadoop 1.x, and it can be changed as required in the HDFS configuration.

HDFS blocks are huge compared to disk blocks, and they are designed this way to minimize the cost of seeks.

By making a block large enough, the time taken to transfer its data from the disk becomes much longer than the time taken to seek to the start of the block. Therefore, the time needed to transfer a huge file made up of multiple blocks is governed mainly by the disk transfer rate.
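One way to reason about large blocks is to compare the time spent locating a block on the disk (the seek) with the time spent actually transferring it. The sketch below does this calculation; the 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not measurements of any particular disk.

```python
def seek_overhead(block_size_mb, seek_time_ms=10.0, transfer_rate_mb_s=100.0):
    """Fraction of the total read time spent seeking when reading one block."""
    transfer_time_ms = block_size_mb / transfer_rate_mb_s * 1000.0
    return seek_time_ms / (seek_time_ms + transfer_time_ms)


# Compare a 4 KB disk-sized block with progressively larger blocks.
for size_mb in (0.004, 1, 64, 128):
    print(f"{size_mb:>8} MB block: {seek_overhead(size_mb):.1%} of time spent seeking")
```

Under these assumptions, a tiny block spends almost all of its read time on the seek, while a 128 MB block spends under one percent of it there.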

In the next part of this ‘What is HDFS?’ tutorial, let’s look at the benefits of HDFS.

Benefits of HDFS

HDFS supports the concept of blocks: When uploading a file into HDFS, the file is divided into fixed-size blocks to support distributed computation. HDFS keeps track of all the blocks in the cluster.
HDFS maintains data integrity: Data failures or data corruption are inevitable in any big data environment. So, HDFS maintains data integrity and helps recover from data loss by replicating the blocks on more than one node.
HDFS supports scaling: If you would like to expand your cluster by adding more nodes, it’s very easy to do with HDFS.
No particular hardware required: There is no need for any specialized hardware to run or operate HDFS. It is built to work with commodity computers.
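The data integrity benefit above can be sketched as a small simulation of what happens when one node fails. This is a toy model under stated assumptions: the function names, the node names, and the rule of picking replacement nodes in alphabetical order are hypothetical, and real HDFS uses its own placement policy.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return {block_id: missing_replica_count}, counting only live DataNodes."""
    missing = {}
    for block_id, nodes in block_locations.items():
        live_replicas = [node for node in nodes if node in live_nodes]
        if len(live_replicas) < target:
            missing[block_id] = target - len(live_replicas)
    return missing


def re_replicate(block_locations, live_nodes, target=3):
    """Return a new block map with copies added on live nodes up to the target."""
    repaired = {}
    for block_id, nodes in block_locations.items():
        replicas = [node for node in nodes if node in live_nodes]
        candidates = [node for node in sorted(live_nodes) if node not in replicas]
        while len(replicas) < target and candidates:
            replicas.append(candidates.pop(0))
        repaired[block_id] = replicas
    return repaired


block_locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
live_nodes = {"dn1", "dn3", "dn4", "dn5"}  # dn2 has failed

print(under_replicated(block_locations, live_nodes))  # {'blk_1': 1, 'blk_2': 1}
print(re_replicate(block_locations, live_nodes))
```

Because every block still has two live copies after the failure, no data is lost, and the missing copies can be recreated from the surviving ones.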

Now, we come to the end of this section on ‘What is HDFS?’ of the Hadoop tutorial. We learned ‘What is HDFS?’, the need for HDFS, and its architecture. In the next section of this tutorial, we shall be learning about HDFS Commands.