Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem.

Hadoop is not available because all the nodes depend on the name node. If the name node fails, the cluster goes down.

But given that an HDFS cluster has a secondary name node, why can't we call Hadoop available? If the name node is down, couldn't the secondary name node be used for the writes?

What is the major difference between the name node and the secondary name node that makes Hadoop unavailable?

1 Answer

The Namenode holds the metadata for HDFS, such as namespace information and block information. While the Namenode is running, all of this information is held in main memory and also written to disk for persistent storage.

On disk, the Namenode stores this information in two different files:

  • Edit logs - These hold the sequence of changes made to the filesystem since the Namenode started.

  • fsimage - This is the snapshot of the filesystem taken when the Namenode starts.

Edit logs are applied to the fsimage to produce the latest snapshot of the filesystem only when the Namenode is restarted. But Namenode restarts are rare in production clusters, so the edit logs can grow very large when a Namenode runs for a long time. In that situation we run into the following problems:

  • The edit log becomes very large, which makes it difficult to manage.

  • Restarting the Namenode takes a long time because a large number of changes have to be merged.

  • If the Namenode crashes, we risk losing a large amount of metadata, since the fsimage is very old by the time of the crash and recovery depends entirely on the edit log.
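
As a rough illustration of the restart behaviour described above, here is a minimal Python sketch. It is a toy model, not HDFS code: the namespace is reduced to a set of paths, the edit log to a list of (operation, path) pairs, and the names `apply_edit` and `replay` are hypothetical. It shows that the work done at restart grows with the number of edits logged since the fsimage was written.

```python
# Toy model only: these names are hypothetical and are not HDFS APIs.

def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(fsimage, edit_log):
    """Rebuild the current namespace from a snapshot plus every logged edit."""
    namespace = set(fsimage)  # work on a copy; the snapshot itself is not modified
    for edit in edit_log:     # one step per edit, so a long log means a slow restart
        apply_edit(namespace, edit)
    return namespace


fsimage = {"/data/a"}
edit_log = [("create", "/data/b"), ("delete", "/data/a"), ("create", "/data/c")]
print(sorted(replay(fsimage, edit_log)))  # ['/data/b', '/data/c']
```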

So, to overcome these issues, we need a mechanism that keeps the edit log small and the fsimage up to date, which also reduces the load on the Namenode.

The Secondary Namenode addresses these issues by taking over from the Namenode the responsibility of merging the edit logs with the fsimage.


Working of Secondary Namenode:

  • The Secondary Namenode fetches the edit logs from the primary Namenode at regular intervals and applies them to the fsimage.

  • Once it has the updated fsimage, it copies the fsimage back to the Namenode.

  • So whenever the Namenode restarts, it uses this fsimage and the startup time is reduced accordingly.
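
The steps above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual HDFS implementation: `ToyNamenode` and `checkpoint` are hypothetical names, the namespace is just a set of paths, and the merged edits are simply dropped from the log (real HDFS manages its edit log files differently).

```python
# Toy model only: ToyNamenode and checkpoint are hypothetical, not HDFS classes.

class ToyNamenode:
    def __init__(self):
        self.fsimage = set()   # last snapshot of the namespace
        self.edit_log = []     # changes recorded since that snapshot

    def log_edit(self, op, path):
        self.edit_log.append((op, path))


def checkpoint(namenode):
    """Play the Secondary Namenode's role: merge the edits into a new fsimage."""
    edits = list(namenode.edit_log)   # step 1: fetch the edit log
    merged = set(namenode.fsimage)    # start from a copy of the current fsimage
    for op, path in edits:            # apply each edit to that copy
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
    namenode.fsimage = merged                 # step 2: copy the new fsimage back
    del namenode.edit_log[:len(edits)]        # merged edits no longer need replaying
    return len(edits)


nn = ToyNamenode()
nn.log_edit("create", "/a")
nn.log_edit("create", "/b")
nn.log_edit("delete", "/a")
merged_count = checkpoint(nn)
print(merged_count, sorted(nn.fsimage), nn.edit_log)  # 3 ['/b'] []
```

After the checkpoint, a restart of the toy Namenode would have no edits left to replay, which is the startup-time saving the last step describes.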

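The actual purpose of the Secondary Namenode is just to create checkpoints in HDFS, which is why it is often described as a checkpoint node. It only helps the Namenode function better. It is not a standby: it does not serve client requests and cannot take over writes if the Namenode fails, so its presence alone does not make Hadoop available.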
