First, the MapReduce framework works on logical input splits, not directly on the physical blocks of a file. Each file written to HDFS is divided into blocks of a default size, 128 MB each. HDFS replicates these blocks and distributes them across different nodes of the Hadoop cluster.
A file is not stored along record boundaries: a record can start in one block and end in another.
To resolve your problem of accessing the data of a record that is stored across different blocks:
Hadoop uses a logical representation of the data, known as input splits. An input split knows where the first whole record in a block begins and where the last record ends, which may be in another block. Whenever the last record in a block is incomplete, the input split carries the location information of the next block and the byte offset of the data needed to complete the record.
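To see why a record can force the reader into a second block, here is a small sketch (blocksFor is a hypothetical helper for illustration, not a Hadoop API) that computes which blocks a record's byte range touches:

```java
public class RecordBlocks {
    // Hypothetical helper: which HDFS block indexes does a record occupy,
    // given its byte range [recordStart, recordEnd) and the block size?
    static long[] blocksFor(long recordStart, long recordEnd, long blockSize) {
        long first = recordStart / blockSize;        // block holding the first byte
        long last = (recordEnd - 1) / blockSize;     // block holding the last byte
        long[] blocks = new long[(int) (last - first + 1)];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = first + i;
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A record whose last bytes cross the 128 MB block boundary
        // occupies two blocks, so the split must point into the second
        // block to complete it.
        long[] blocks = blocksFor(127 * mb, 129 * mb, 128 * mb);
        System.out.println(blocks.length); // spans 2 blocks: block 0 and block 1
    }
}
```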
Suppose I load a 150 MB file into HDFS. The file will be split into two blocks, of 128 MB and 22 MB. Say the file contains a paragraph of text: lines 1 through 16, which amount to 128 MB, will be stored as the first block (up to "It make"), and the remaining half of the sentence, from line 16 ("people laugh") to the end of the file, will be stored as the second block (22 MB). Two map tasks will process this file.
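The block arithmetic above can be sketched as follows (blockSizes is a hypothetical helper, not part of Hadoop; sizes are given in MB-sized units for readability):

```java
public class SplitSizes {
    // Hypothetical helper: compute the block sizes for a file of the
    // given length, assuming a fixed block size (128 MB by default).
    static long[] blockSizes(long fileLength, long blockSize) {
        int n = (int) ((fileLength + blockSize - 1) / blockSize); // ceiling division
        long[] sizes = new long[n];
        for (int i = 0; i < n; i++) {
            long blockStart = i * blockSize;
            // The last block holds whatever bytes remain.
            sizes[i] = Math.min(blockSize, fileLength - blockStart);
        }
        return sizes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] sizes = blockSizes(150 * mb, 128 * mb);
        System.out.println(sizes.length);   // 2 blocks
        System.out.println(sizes[0] / mb);  // 128
        System.out.println(sizes[1] / mb);  // 22
    }
}
```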
Also, each input split is initialized with a start parameter corresponding to its byte offset in the input. So, when the LineRecordReader is initialized, it instantiates a LineReader, which starts reading lines.
If a CompressionCodec is defined, it takes care of the boundaries. Otherwise, if the start of the InputSplit is not 0, the reader backtracks one character and then skips the first line (terminated by \n or \r\n). Backtracking ensures that no valid line is skipped.
Check the code below:
boolean skipFirstLine = false;
if (codec != null) {
  // Compressed input cannot be split, so read to the end of the stream.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one character so that a split
    // starting exactly on a line boundary still keeps its first line.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) {
  // Skip the (possibly partial) first line and re-establish "start".
  start += in.readLine(new Text(), 0,
      (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
this.pos = start;
So in the above example, the reader for the first block B1 will read the data from offset 0 up to the "" line, and the reader for block B2 will read the data from the "To see a lamb at school" line to the offset of the last line.
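The backtrack-and-skip behaviour can be illustrated on a plain byte array (a sketch mimicking LineRecordReader's logic, not the real Hadoop classes):

```java
public class SkipFirstLineDemo {
    // Return the first complete line at or after byte offset "start",
    // mimicking LineRecordReader's backtrack-and-skip logic on a byte
    // array instead of an HDFS stream (illustrative only).
    static String firstFullLine(byte[] data, int start) {
        int pos = start;
        if (start != 0) {
            --pos; // backtrack one character
            // Skip the rest of the (possibly partial) first line.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // move past the '\n'
        }
        StringBuilder out = new StringBuilder();
        while (pos < data.length && data[pos] != '\n') {
            out.append((char) data[pos++]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        System.out.println(firstFullLine(data, 0)); // alpha   (split starts at 0)
        System.out.println(firstFullLine(data, 6)); // bravo   (start exactly on a line boundary)
        System.out.println(firstFullLine(data, 8)); // charlie (start mid-record: "bravo" belongs to the previous split)
    }
}
```

Note that when the split starts exactly on a line boundary (offset 6), backtracking one character lands on the previous '\n', so only that terminator is consumed and "bravo" is correctly kept by this split.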