The Hadoop Distributed File System (HDFS) was developed using a distributed file system design. It is highly fault tolerant, holds very large data sets, and provides ease of access. Files are stored across multiple machines in a systematic way, and they are stored redundantly to guard against data loss in case of failure; this also makes the data available to applications for parallel processing. The file system is designed for storing very large files with streaming data access.
HDFS is based on the Google File System (GFS) and is written entirely in the Java programming language. Google provided only a white paper, without any implementation; roughly 90 percent of the GFS architecture has been implemented in the form of HDFS.
HDFS was originally built as storage infrastructure for the Apache Nutch web search engine project. It was initially known as the Nutch Distributed File System (NDFS).
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Therefore, quick detection of faults and automatic recovery from them is a core architectural goal of HDFS.
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by end users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications targeted at HDFS, so POSIX semantics in a few key areas have been traded to increase data throughput rates.
Applications that run on HDFS have very large data sets. A typical file in HDFS ranges from gigabytes to terabytes in size. HDFS is therefore tuned to support large files: it provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and should support tens of millions of files in a single instance.
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed later. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits this model perfectly. There is a plan to support appending writes to files in the future.
A computation requested by an application is much more efficient when it is executed near the data it operates on. This is especially true when the data set is huge: it minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
HDFS has been designed to be easily portable from one platform to another. This facilitates the adoption of HDFS as a platform of choice for a large set of applications.
HDFS follows a master-slave architecture and contains the following elements:
The NameNode runs on commodity hardware that contains the GNU/Linux operating system and the NameNode software. The system containing the NameNode acts as the master server: it manages the file system namespace, regulates clients' access to files, and executes file system operations such as opening, closing, and renaming files and directories.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
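For illustration, an administrator can ask the NameNode for its view of the cluster's DataNodes and capacity with a standard admin command (this typically requires HDFS superuser privileges):

hdfs dfsadmin -report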
Data in HDFS is stored in files. Each file is split into one or more segments, which are stored on individual DataNodes. These file segments are called blocks. The default block size is 64 MB, and it can be changed as required in the HDFS configuration.
HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. By making a block large enough, the time consumed to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus, the time to transfer a large file made of multiple blocks is governed by the disk transfer rate.
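As a minimal sketch (the file name and target path are illustrative), the configured block size can be inspected and a different size requested for a single file at write time; the property is dfs.blocksize on Hadoop 2 and later (dfs.block.size on older releases):

hdfs getconf -confKey dfs.blocksize                                  # print the configured block size in bytes
hadoop fs -D dfs.blocksize=134217728 -put bigfile.log /user/data/    # write this one file with 128 MB blocks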
The above goals are achieved through the implementations described below:
HDFS has a permissions model for accessing files and directories.
There are three types of permission: read (r), write (w), and execute (x).
The mode specifies the permissions for the user who is the owner of the file, the permissions for users who are members of the file's group, and the permissions for users who are neither the owner nor members of the group.
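For example, permissions can be inspected and changed with the usual file system shell commands (the paths, user, and group names are illustrative):

hadoop fs -ls /user/saurzcode                                  # the first column shows the mode, e.g. -rw-r--r--
hadoop fs -chmod 640 /user/saurzcode/abc.txt                   # owner: read/write, group: read, others: none
hadoop fs -chown saurzcode:analysts /user/saurzcode/abc.txt    # change the owner and group (superuser only)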
Setting up the Hadoop cluster:
Hadoop can be installed and configured on clusters ranging from a couple of nodes to tens of thousands of nodes in very large clusters. To do this, first install Hadoop on a single machine; this requires Java to be installed if it is not already present on your system.
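A minimal sketch of a single-machine installation (the Hadoop version and install path are illustrative):

java -version                              # Java must already be installed
tar -xzf hadoop-3.3.6.tar.gz -C /opt       # unpack the Hadoop distribution
export HADOOP_HOME=/opt/hadoop-3.3.6
export PATH=$PATH:$HADOOP_HOME/bin
hadoop version                             # verify the installation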
Getting Hadoop to work on the entire cluster involves installing the required software on all the machines that belong to the cluster. Typically, one of the machines is designated as the NameNode and another as the ResourceManager. Other services, such as the MapReduce Job History Server and the Web App Proxy Server, are usually hosted on dedicated machines or on shared resources, depending on the workload. All the remaining nodes in the cluster act as both NodeManager and DataNode; these are collectively termed the slave nodes.
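The slave hostnames are typically listed one per line in etc/hadoop/slaves (renamed etc/hadoop/workers in Hadoop 3.x), which the cluster start-up scripts read; the hostnames below are illustrative:

slave01.example.com
slave02.example.com
slave03.example.com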
Configuring Hadoop in non-secure mode
Hadoop's Java configuration is driven by two important types of configuration files: the read-only default configuration (core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml) and the site-specific configuration (etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, and etc/hadoop/mapred-site.xml).
In addition, you can control the Hadoop scripts found in the bin/ directory of the distribution by setting site-specific values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
To configure the Hadoop cluster, you first set up the environment in which the Hadoop daemons execute, as well as the configuration parameters for the daemons.
The various Hadoop daemons are listed below. The HDFS daemons are the NameNode, SecondaryNameNode, and DataNode; the YARN daemons are the ResourceManager, NodeManager, and WebAppProxy; and, if MapReduce is in use, the MapReduce Job History Server runs as well.
Configuring the environment of the Hadoop daemons
Administrators should use the etc/hadoop/hadoop-env.sh script and, optionally, the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific customization of the Hadoop daemons' process environment, as sketched after the list below. At the very least, JAVA_HOME must be specified so that it is defined correctly on every remote node.
The list of daemons with their relevant environment variables:
DataNode – HADOOP_DATANODE_OPTS
Secondary NameNode – HADOOP_SECONDARYNAMENODE_OPTS
Resource Manager – YARN_RESOURCEMANAGER_OPTS
Node Manager – YARN_NODEMANAGER_OPTS
WebAppProxy – YARN_PROXYSERVER_OPTS
Map Reduce Job History Server – HADOOP_JOB_HISTORYSERVER_OPTS
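A minimal sketch of such site-specific customization in etc/hadoop/hadoop-env.sh (the path and JVM options are illustrative):

export JAVA_HOME=/usr/java/latest                          # must resolve correctly on every node
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"    # extra JVM options for the NameNode
export HADOOP_DATANODE_OPTS="-Xmx1g"                       # extra JVM options for each DataNode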
Other useful configuration parameters that you can customize include HADOOP_PID_DIR (the directory where the daemons' process id files are stored), HADOOP_LOG_DIR (the directory where the daemons' log files are stored), and HADOOP_HEAPSIZE / YARN_HEAPSIZE (the maximum heap size to use, in MB).
The important operations of the Hadoop Distributed File System can be performed with the shell commands used for file management in the cluster.
hadoop fs -put: Copies a single source file, or multiple source files, from the local file system to the Hadoop distributed file system. Usage: hadoop fs -put <localsrc> … <HDFS_dest_Path>
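For example, to copy two local files into an HDFS directory (the names are illustrative):

hadoop fs -put localfile1.txt localfile2.txt /user/hadoop/hadoopdir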
hadoop fs -get: Copies or downloads files from HDFS to the local file system. Usage: hadoop fs -get <HDFS_src> <localdst>
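For example, to download a file from HDFS into a local home directory (the paths are illustrative):

hadoop fs -get /user/saurzcode/abc.txt /home/saurzcode/abc.txt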
hadoop fs -cat: Same as the Unix cat command; displays the contents of a file on standard output. Usage: hadoop fs -cat <HDFS_path>
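For example, to print a file stored in HDFS:

hadoop fs -cat /user/saurzcode/abc.txt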
hadoop fs -copyFromLocal: Copies a file from the local file system to HDFS (similar to -put, except that the source must be a local file):
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

hadoop fs -copyToLocal: Copies a file from HDFS to the local file system (similar to -get). Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
hadoop fs -mv: Moves files from a source to a destination within HDFS. But remember, you cannot move files across file systems.
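For example, to move a file into another HDFS directory (both paths are within HDFS and are illustrative):

hadoop fs -mv /user/saurzcode/abc.txt /user/saurzcode/archive/abc.txt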
hadoop fs -rm -r (or -rmr on older releases): Recursive version of delete; removes a directory and all of its contents. Usage: hadoop fs -rm -r <HDFS_path>
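For example, to remove an HDFS directory and everything under it (the path is illustrative):

hadoop fs -rm -r /user/saurzcode/tmpdir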