The Namenode holds the metadata for HDFS, such as namespace information and block information. While the Namenode is running, this metadata is kept in main memory and is also written to disk for persistent storage.
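On disk, the Namenode stores this information in two different files:
fsimage: a snapshot of the file system metadata taken when the Namenode starts.
Edit logs: the sequence of changes made to the file system after the Namenode has started.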
The edit logs are applied to the fsimage only when the Namenode is restarted, which produces the latest snapshot of the file system. Because Namenode restarts are rare in production clusters, the edit logs can grow very large when a Namenode runs for a long time. This leads to the following problems:
The edit log becomes very large and is hard to manage.
Restarting the Namenode takes a long time because a large number of changes has to be merged.
If the Namenode crashes, a large amount of metadata can be lost because the fsimage is very old at the time of the crash.
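The restart-time merge described above can be sketched with a toy model. This is only an illustration: the real fsimage and edit log are binary files managed by HDFS, and the dictionary, the operation names ("create", "delete") and the replay function are hypothetical stand-ins, not part of any Hadoop API.

```python
# Toy model only: the namespace is a dict and each edit is an (operation, path) pair.

def replay(fsimage, edit_log):
    """Return the namespace obtained by applying every logged edit to a snapshot."""
    namespace = dict(fsimage)  # work on a copy so the old snapshot is untouched
    for op, path in edit_log:
        if op == "create":
            namespace[path] = "file"
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


fsimage = {"/data/a.txt": "file"}  # snapshot taken at the last start
edit_log = [                       # changes recorded since then
    ("create", "/data/b.txt"),
    ("delete", "/data/a.txt"),
    ("create", "/data/c.txt"),
]

latest = replay(fsimage, edit_log)
print(sorted(latest))  # ['/data/b.txt', '/data/c.txt']
```

Every entry in the edit log has to be replayed in order, so in this model the work done at restart grows with the number of edits accumulated since the last snapshot.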
So, to overcome these issues, we need a mechanism that keeps the edit log small and the fsimage up to date, which also reduces the load on the Namenode.
The Secondary Namenode addresses these issues by taking over from the Namenode the responsibility of merging the edit logs with the fsimage.
Working of Secondary Namenode:
The Secondary Namenode fetches the edit logs from the primary Namenode at regular intervals and applies them to its copy of the fsimage.
Once it has the updated fsimage, it copies it back to the Namenode.
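The two steps above can be sketched as one checkpoint cycle. This is a simplified sketch under stated assumptions: the NameNode class, the merge and checkpoint functions and the operation names are hypothetical, and the network transfer between the two nodes is reduced to plain assignments.

```python
class NameNode:
    """Hypothetical stand-in for the primary Namenode's on-disk state."""

    def __init__(self):
        self.fsimage = {}   # last checkpointed snapshot
        self.edit_log = []  # edits recorded since that snapshot

    def record(self, op, path):
        self.edit_log.append((op, path))


def merge(fsimage, edits):
    """Apply the edits to a copy of the snapshot and return the new snapshot."""
    merged = dict(fsimage)
    for op, path in edits:
        if op == "create":
            merged[path] = "file"
        elif op == "delete":
            merged.pop(path, None)
    return merged


def checkpoint(namenode):
    """One Secondary Namenode cycle: fetch the edits, merge them, copy the result back."""
    edits = list(namenode.edit_log)               # take the edit logs
    new_image = merge(namenode.fsimage, edits)    # update the fsimage
    namenode.fsimage = new_image                  # copy the fsimage back
    del namenode.edit_log[:len(edits)]            # merged edits are now covered by the fsimage
    return len(edits)


nn = NameNode()
nn.record("create", "/logs/1")
nn.record("create", "/logs/2")
nn.record("delete", "/logs/1")

merged_count = checkpoint(nn)
print(merged_count)        # 3
print(sorted(nn.fsimage))  # ['/logs/2']
print(nn.edit_log)         # []
```

After the cycle the edit log is empty and the fsimage is current, so a restart in this model has nothing left to replay. In Hadoop 2 and later, the checkpoint interval is typically controlled by the dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns properties; check the documentation for the version you run.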
The actual purpose of the Secondary Namenode is just to create checkpoints in HDFS, which is why it is also referred to as a checkpoint node. It simply helps the Namenode function better.