
I get multiple small files in my input directory, which I want to merge into a single file without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?

1 Answer


To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer.

Try this:

hadoop jar /usr/hdp/<path-to-hadoop-streaming.jar> \
                   -Dmapred.reduce.tasks=1 \
                   -input "<input-path-directory>" \
                   -output "<output-path-directory>" \
                   -mapper cat \
                   -reducer cat

Make sure you point to the Hadoop Streaming jar that matches your Hadoop installation; the exact path under /usr/hdp/ varies by distribution and version. (Note that on newer Hadoop versions, mapred.reduce.tasks is deprecated in favor of mapreduce.job.reduces.)
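If you are not sure where the streaming jar lives on your cluster, one way to locate it is a simple search (a sketch; the /usr/hdp/ base directory is an HDP-layout assumption and may differ on your system):

find /usr/hdp/ -name 'hadoop-streaming*.jar'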

Now, provide the input path and make sure the output directory does not already exist, as this job will merge the files and create the output directory for you.
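If the output directory is left over from an earlier run, the job will fail at submission. Assuming its contents can safely be discarded, you can remove it first:

hdfs dfs -rm -r <output-path-directory>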

Here is what I tried:

#hdfs dfs -ls /user/amit/fold2/
Found 2 items
-rw-r--r--   3 hdfs hdfs        150 2017-09-26 17:55 /user/amit/fold2/part1.txt
-rw-r--r--   3 hdfs hdfs         20 2017-09-27 09:07 /user/amit/fold2/part1_sed.txt

#hadoop jar /usr/hdp/<path-to-hadoop-streaming.jar> \
>                    -Dmapred.reduce.tasks=1 \
>                    -input "/user/amit/fold2/" \
>                    -output "/user/amit/fold1/" \
>                    -mapper cat \
>                    -reducer cat

fold2 had 2 files; after running the above command, the merged output is stored in the fold1 directory, and the 2 files were merged into 1 file, as you can see below.

#hdfs dfs -ls /user/amit/fold1/
Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2017-10-09 16:00 /user/amit/fold1/_SUCCESS
-rw-r--r--   3 hdfs hdfs        174 2017-10-09 16:00 /user/amit/fold1/part-00000
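To verify that part-00000 really contains the combined input, you can print it (using the same paths as in the transcript above):

#hdfs dfs -cat /user/amit/fold1/part-00000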
