Hadoop File System was developed using distributed file system design. It is highly fault tolerant and holds large amount of data and provides ease of access. The files are stored across multiple machines. These files are stored to eliminate possible data losses in case of failure and helps make applications available to parallel processing. This file System is designed for storing very big files with streaming data access.
Features of HDFS
- Used for distributed storage and processing.
- It is optimized for throughput over latency.
- Efficient read request for large files but poor at seek requests for many small ones.
- Provides a command interface to interact with HDFS.
- The built-in servers of data node and name node helps users to check the cluster‘s status.
- Streams access to data of file system
- Provides file permissions and authentication
- Uses replication rather handling disk failures. Each of the blocks comprising a file is stored on several nodes inside the cluster and the HDFS NameNode continuously monitors the reports which are sent by every DataNode to ensure that failures have not dropped any block below the desired replication factor. If this does happen then it schedules the addition of another copy within the cluster.
It uses master slave architecture and contains the following elements:
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The system with namenode acts as the master server and carries out following tasks:
- Manage file system namespace.
- Regulate client’s access to files.
- Executes file system operations like as rename, open and close files and directories.
It is a commodity hardware having the GNU/Linux operating system and datanode software. Datanodes stores and retrieve blocks when they are told to (by clients or the namenode) and they report back to the namenode periodically with lists of blocks that they are storing. There will be a datanode for every node in the cluster.
These nodes handle storage of data for their system. It performs the following tasks:
- According to the client request it performs read and write functions on the file systems.
- Operations such as block creation, deletion, and replication according to the instructions of the Namenode, are also carried out.
Data is stored in HDFS’s file. These files are separated into one or more segments and stored in individual data node. These file segments are known as block. The default block is 64MB which can be changed according to the HDFS configuration.
HDFS blocks are big compared to disk blocks and the reason is to reduce the cost of seeks. By making a block big enough the time to transfer the data from the disk can be made to be considerably larger than the time to seek to the start of the block. Thus the time to transfer a big file made of multiple blocks operates at the disk transfer rate.
This blog will help you get a better understanding of Skills For Hadoop Professionals!
- Fault detection and recovery: It must provide methods for fast and automatic fault detection and recovery.
- Huge datasets: It should have multiple nodes per cluster so as to manage the applications with large datasets.
- Hardware at data: HDFS reduces the network traffic and increases the throughput.