Special Features of the All-New Hadoop 3.0

By Abhijit | Last updated on November 26, 2024

Releases of Hadoop's third version have followed one another in a very short span of time. Soon after Hadoop 3.0.0-alpha1 was released on September 3rd, 2016, Hadoop 3.0.0-alpha2 followed on January 25th, 2017. Hadoop is a flagship project of the Apache Software Foundation. It took the world by storm because it is open source and can be deployed on commodity hardware.

Over the years, Apache Hadoop has shipped several versions, each with notable new features. Just as the second version introduced YARN, the third version brings advanced features of its own. Let us look at how it differs from its predecessor:

Basis of Difference	Hadoop 2.0.0	Hadoop 3.0.0
Fault Tolerance	Through replication	Through replication or erasure coding
Storage Overhead in HDFS	200% (with replication)	About 50% (with erasure coding)
Scalability	Limited; up to 10,000 nodes in a cluster	Improved; over 10,000 nodes in a cluster
File System	HDFS, FTP and Amazon S3	All of these plus Microsoft Azure Data Lake File System
Manual Intervention	Not needed	Not needed
Cluster Resource Management	Handled by YARN	Handled by YARN
Data Balancing	Uses the HDFS balancer	Adds the intra-DataNode balancer

Additional features in Hadoop 3.0.0

The table above shows that Hadoop 3.0.0 resembles its predecessors in several respects. However, it introduces additional features to address the shortcomings of earlier versions. Let us explore these features.

HDFS Erasure Coding – In previous versions, replication imposed a storage overhead of up to 200%. Erasure coding reduces this overhead to about 50% while keeping a comparable or better level of durability. Erasure coding was traditionally used for less frequently accessed data, and it now offers HDFS a far more space-efficient alternative to replication.
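The 200% and 50% figures can be reproduced with a little arithmetic. The sketch below assumes 3-way replication and a Reed-Solomon layout with 6 data units and 3 parity units; both parameter choices are assumptions made for illustration, and the function names are hypothetical rather than part of any Hadoop API.

```python
def replication_overhead(replicas: int) -> float:
    """Extra storage, as a fraction of the original data, for N-way replication."""
    # One copy is the data itself; every additional copy is overhead.
    return float(replicas - 1)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage, as a fraction of the original data, for a (data, parity) layout."""
    # Only the parity units are overhead.
    return parity_units / data_units


if __name__ == "__main__":
    # Assumed parameters: 3 replicas, and 6 data units + 3 parity units.
    print(f"3x replication overhead: {replication_overhead(3):.0%}")
    print(f"RS(6,3) erasure coding overhead: {erasure_coding_overhead(6, 3):.0%}")
```

Under these assumptions, replication can lose two of its three copies and still recover the data, while the erasure-coded layout can lose any three of its nine units, which is how the lower overhead is achieved without giving up durability.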

Shell Script Rewrite – The shell scripts in earlier versions suffered from bugs, documentation errors, and similar problems. Hadoop 3.0.0 resolves these by rewriting the shell scripts.

YARN Timeline Service Version 2 – Hadoop 3.0.0 comes with YARN Timeline Service v.2, which improves the scalability and reliability of the Timeline Service and enhances its usability by introducing flows and aggregation. The service stores metrics, application-specific information, container events, and so on.

Support for Java 8 – Another roadblock to better Hadoop performance was its reliance on Java 7, which Oracle no longer supports with public updates. Hadoop 3.0.0 addresses this by moving to Java 8, since many libraries have dropped Java 7 support and work better with Java 8.

Improved Fault-tolerance with Quorum Journal Manager – The fault tolerance of a big data cluster is improved with the help of the QuorumJournalManager, which is composed of a minimum of three nodes and can recover the system even if one node fails. The degree of fault tolerance can be increased further by adding nodes to the quorum. In addition, Hadoop 3.0.0 can run multiple standby NameNodes, whereas earlier versions allowed only one, which makes HDFS more resilient.
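How many failures a quorum survives follows from majority arithmetic: the group keeps working as long as more than half of its nodes are alive. The sketch below illustrates that rule only; the function names are hypothetical and it is not taken from the Hadoop code base.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a majority is still alive."""
    return (nodes - 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"{n} nodes: quorum of {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Three nodes tolerate one failure and five tolerate two. Four nodes still tolerate only one, which is why quorums are usually sized with an odd number of nodes.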

Intra-DataNode Balancing – The intra-DataNode balancer corrects the imbalance that arises when disks are added to or removed from a DataNode. During normal write operations the disks of a DataNode fill evenly, but adding or replacing a disk can leave the data skewed across the disks within that DataNode. The HDFS balancer did not handle this case, because it balances data between DataNodes rather than between the disks of a single DataNode. The intra-DataNode balancer fills this gap.
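To make the idea of skew inside a single DataNode concrete, the sketch below computes how much data each disk would need to shed or absorb for all disks to reach the same utilization. It is a simplified illustration, not Hadoop's actual balancing algorithm; the function name and the disk figures are hypothetical.

```python
from typing import Dict, Tuple


def bytes_to_move(disks: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Map each disk to the amount of data it should shed (positive)
    or absorb (negative) so that every disk ends at the same utilization.

    `disks` maps a disk name to a (used, capacity) pair in the same unit.
    """
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(capacity for _, capacity in disks.values())
    plan = {}
    for name, (used, capacity) in disks.items():
        # Data this disk would hold if the node-wide utilization applied to it.
        ideal_used = total_used * capacity // total_capacity
        plan[name] = used - ideal_used
    return plan


if __name__ == "__main__":
    # Hypothetical DataNode: two nearly full disks and one newly added empty disk.
    node = {"disk1": (900, 1000), "disk2": (900, 1000), "disk3": (0, 1000)}
    print(bytes_to_move(node))
```

In this example the two older disks each hold 300 units more than their even share, and the new disk has room for 600, which is the kind of skew that balancing between DataNodes alone would never correct.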

Hadoop 3.0.0-alpha1 was followed by Hadoop 3.0.0-alpha2 on January 25th, 2017, and then by Hadoop 2.8.0, the latest release, on March 22nd, 2017. With the Apache Software Foundation still actively improving the technology, there is clearly a lot more to come.

About the Author

Abhijit

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.