How does Hadoop process records split across block boundaries?

Question

asked Jun 18, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

According to Hadoop: The Definitive Guide:

The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program—lines are not missed or broken, for example—but it’s worth knowing about, as it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.

Consider a scenario where a record (a line) is spread across two blocks, B1 and B2.

The mapper processing the first block (B1) finds no EOL separator at the end of the last line, so it fetches the remainder of the line from the next block (B2).

How does the mapper processing B2 know that the first record in the block is incomplete, and that it should start processing from the second record in B2?

1 Answer

Amit Rawat · Answer 1 · 2019-06-18T05:46:29+0000

First, MapReduce works on logical input splits, not on the physical blocks of a file. Every file written to HDFS is divided into fixed-size blocks; the default block size is 128 MB (in Hadoop 2.x and later).

HDFS replicates each block and distributes the replicas across different nodes in the Hadoop cluster.

Blocks are cut purely by byte count, with no regard for record boundaries, so a record can start in one block and end in the next.

To resolve the problem of accessing a record that is stored across different blocks:

Hadoop uses a logical representation known as input splits. An input split is a byte range of the file (a start offset and a length) rather than a physical block; the record reader that consumes it works out where the first record of the split starts and where its last record ends, even if that end lies in another block.
Whenever the last record of a split is incomplete in its block, the reader keeps reading past the end of the split into the next block (a remote read if that block lives on another node) until the record is complete.

Suppose I load a 150 MB file into HDFS. It is stored as 2 blocks: 128 MB and 22 MB. Say the file contains a paragraph, and the 128 MB boundary falls in the middle of line 16 of it. Everything from line 1 up to the first half of line 16 (up to "It make") is stored in the first block, and the remaining half of that sentence ("people laugh") through the end of the file is stored in the second block (22 MB). Two map tasks will process this file, one per split.
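The block arithmetic in this example can be checked with a small sketch. This is plain Java, not Hadoop API code: the class BlockLayoutDemo, its methods, and the 100-byte record straddling the boundary are all hypothetical names and values chosen for illustration. It only models how a file is cut into blocks by byte count.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block layout; hypothetical helper, not part of Hadoop.
class BlockLayoutDemo {
    static final long MB = 1024L * 1024L;

    // Returns {startOffset, length} in bytes for each block of a file.
    static List<long[]> blocks(long fileSize, long blockSize) {
        List<long[]> out = new ArrayList<>();
        for (long off = 0; off < fileSize; off += blockSize) {
            out.add(new long[] {off, Math.min(blockSize, fileSize - off)});
        }
        return out;
    }

    // Index of the block that holds the given byte offset.
    static long blockIndex(long offset, long blockSize) {
        return offset / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        for (long[] b : blocks(150 * MB, blockSize)) {
            System.out.println("start=" + b[0] + " length=" + (b[1] / MB) + " MB");
        }
        // Assumed record: 100 bytes that begin 40 bytes before the boundary.
        long recStart = blockSize - 40;
        long recEnd = blockSize + 59;
        System.out.println("record spans blocks "
                + blockIndex(recStart, blockSize) + " to "
                + blockIndex(recEnd, blockSize));
    }
}
```

A record whose first and last bytes land in different blocks is exactly the case the question asks about: the map task for the first block has to read a few bytes that physically belong to the second.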

Each input split is initialized with a start parameter, the byte offset of the split within the input file. When the LineRecordReader is initialized, it instantiates a LineReader, which starts reading lines from that offset.

If a CompressionCodec is defined, the reader wraps the stream in the codec and sets end to Long.MAX_VALUE, so it reads to the end of the stream and the skip logic does not apply. Otherwise, if the start of the InputSplit is not 0, the reader backtracks one byte and then skips the first line (terminated by \n or \r\n). The backtrack ensures that a valid line is not skipped when the split happens to start exactly at the beginning of a line.

Check the code below:

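boolean skipFirstLine = false;
if (codec != null) {
  // Compressed input: read the decompressed stream to its end.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one byte so that a line starting
    // exactly at "start" is not discarded by the skip below.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) { // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
this.pos = start;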

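So in the above example, the mapper for the first block (B1) reads from offset 0 through the end of the line that straddles the boundary (the one beginning with "It make"), and the mapper for B2 skips the rest of that line and reads from the "To see a lamb at school" line to the end of the file.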
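To see why no line is lost or read twice, here is a toy simulation of that rule on an in-memory byte array. It is a sketch under assumptions, not Hadoop's LineRecordReader: the class SplitLineDemo and its methods are hypothetical, lines are assumed to end with \n only, and the sample text and the boundary at byte 20 are made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of the split-reading rule; hypothetical, not Hadoop code.
class SplitLineDemo {

    // Returns the index just past the next '\n' at or after pos,
    // or data.length if there is no further newline.
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Returns the lines owned by the split covering bytes [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Back up one byte, then discard everything through the next
            // newline. If the byte before "start" is itself a newline, only
            // that byte is discarded and the line at "start" is kept.
            pos = endOfLine(data, start - 1);
        }
        // Read every line that STARTS before the split end, even if it
        // finishes beyond it (that tail lives in the next block).
        while (pos < end && pos < data.length) {
            int next = endOfLine(data, pos);
            int len = next - pos;
            if (len > 0 && data[next - 1] == '\n') {
                len--; // drop the line terminator
            }
            lines.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "line one\nIt make people laugh\nTo see a lamb at school\n"
                .getBytes(StandardCharsets.UTF_8);
        int boundary = 20; // assumed boundary, falls inside the second line
        System.out.println("split 1: " + readSplit(data, 0, boundary));
        System.out.println("split 2: " + readSplit(data, boundary, data.length));
    }
}
```

With the boundary inside the second line, the first split returns that whole line and the second split starts at the line after it. The two halves of the rule are complementary: a reader finishes any line it has started, and a reader that does not begin at offset 0 gives up its first partial line to its predecessor.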

