Hadoop’s recent releases have come in quick succession: soon after Hadoop 3.0.0-alpha1 was released on September 3rd, 2016, Hadoop 3.0.0-alpha2 followed on January 25th, 2017. Hadoop has long been a flagship project under the wide umbrella of the Apache Software Foundation. The technology took the world by storm with its unparalleled features: it is open source and can be deployed on commodity hardware.
Over the years, Apache Hadoop has released several versions, each introducing notable features. Just as the second version added YARN, the third version brings a number of advanced features of its own. Let us take a look at how it differs from earlier versions:
| Basis of Difference | Hadoop 2.0.0 | Hadoop 3.0.0 |
| --- | --- | --- |
| Fault-tolerance handling | Through replication | Through erasure coding |
| Storage overhead | Consumes 200% in HDFS | Consumes just 50% |
| Scalability | Limited; up to 10,000 nodes in a cluster | Improved; over 10,000 nodes in a cluster |
| File system support | DFS, FTP and Amazon S3 | All of these plus Microsoft Azure Data Lake File System |
| Manual intervention | Not needed | Not needed |
| Cluster resource management | Handled by YARN | Handled by YARN |
| Data balancing | Uses the HDFS balancer | Uses the intra-DataNode balancer |
Additional features in Hadoop 3.0.0
It is clear from the table above that Hadoop 3.0.0 resembles its predecessors in many respects; however, several new features were introduced in this version to close the gaps left by earlier releases. Let us explore these special features of Hadoop 3.0.0.
HDFS Erasure Coding – In previous versions, storage overhead was always high, reaching 200% with the default three-way replication. The erasure coding feature reduces this overhead to roughly 50% while maintaining a comparable level of durability and scalability. Replication consumes a great deal of storage space; erasure coding, a technique traditionally applied to less frequently accessed data, cuts that consumption drastically.
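The overhead figures follow from simple arithmetic: three-way replication stores two redundant copies for every original block (2/1 = 200% overhead), while a Reed-Solomon RS(6,3) layout, a common erasure-coding policy in Hadoop 3, adds 3 parity cells for every 6 data cells (3/6 = 50%). A minimal shell sketch of the calculation:

```shell
# Storage overhead = redundant bytes stored / original bytes, as a percentage.

# Three-way replication: 3 copies in total, 2 of them redundant.
replicas=3
rep_overhead=$(( (replicas - 1) * 100 ))   # (3 - 1) / 1 = 200%

# RS(6,3) erasure coding: 6 data cells plus 3 parity cells.
data=6
parity=3
ec_overhead=$(( parity * 100 / data ))     # 3 / 6 = 50%

echo "replication overhead: ${rep_overhead}%"
echo "erasure coding overhead: ${ec_overhead}%"
```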
Rewritten Shell Scripts – The Hadoop shell scripts had long suffered from bugs, documentation errors, and inconsistencies; this version resolves those issues by rewriting the scripts.
YARN Timeline Service v2 – Hadoop 3.0.0 ships with YARN Timeline Service v2, which records application and cluster history in a more robust way. The new service improves scalability, reliability, and usability by aggregating data into flows. It stores metrics, application-specific information, container events, and more.
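As a sketch, Timeline Service v2 is enabled through properties in yarn-site.xml; the snippet below shows the commonly documented settings, but exact property names and any back-end (e.g., HBase) configuration should be verified against the documentation for your release:

```xml
<!-- yarn-site.xml: opt in to Timeline Service v2 (assumed minimal setup) -->
<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
```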
Java 8 as the Minimum Runtime – Another roadblock to better Hadoop performance was the reliance on Java 7, which Oracle no longer supports. Hadoop 3.0.0 addresses this by moving to Java 8, since many of Hadoop’s library dependencies no longer support Java 7 and work better with Java 8.
Improved Fault Tolerance with Quorum Journal Manager – The fault tolerance of a big data cluster is improved with the Quorum Journal Manager (QJM), which runs a quorum of at least three JournalNodes so that the system can recover even if a node fails. The degree of fault tolerance can be raised further by adding nodes to the quorum. In addition, Hadoop 3 can run multiple standby NameNodes, unlike earlier versions with their single standby, which further increases the availability of HDFS.
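The quorum arithmetic behind this is straightforward: a quorum of n JournalNodes (n odd) keeps working as long as a majority stays up, so it tolerates (n - 1) / 2 failures. A small shell sketch:

```shell
# A quorum of n JournalNodes (n odd) tolerates (n - 1) / 2 failures,
# because a majority of the nodes must remain available.
for n in 3 5 7; do
  tolerated=$(( (n - 1) / 2 ))
  echo "$n JournalNodes tolerate $tolerated failure(s)"
done
```

This is why three JournalNodes is the minimum useful quorum: it is the smallest odd count that survives a single failure.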
Intra-DataNode Balancing – Writes normally fill the disks of a DataNode evenly, but adding or replacing disks can leave significant skew within a single DataNode, a situation the existing HDFS balancer does not handle because it only balances data across DataNodes. The new intra-DataNode balancer corrects this skew by redistributing data among the disks of one DataNode.
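The intra-DataNode balancer is driven by the hdfs diskbalancer command-line tool. A sketch of the plan/execute/query workflow is below; the hostname and plan path are placeholders, and a running Hadoop 3 cluster is assumed:

```shell
# Generate a rebalancing plan for one DataNode (hostname is a placeholder).
hdfs diskbalancer -plan dn1.example.com

# Execute the plan file emitted by the previous step (path is a placeholder).
hdfs diskbalancer -execute /system/diskbalancer/dn1.example.com.plan.json

# Check the progress of the data moves on that DataNode.
hdfs diskbalancer -query dn1.example.com
```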
Hadoop 3.0.0-alpha1 was followed by Hadoop 3.0.0-alpha2 on January 25th, 2017, and the latest Hadoop 2.8.0 arrived on March 22nd, 2017. Clearly there is a lot more to see in this technology, as it is still being actively improved by the Apache Software Foundation.