
I get multiple small files in my input directory, which I want to merge into a single file without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?

1 Answer


To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer.

Try this:

hadoop jar /usr/hdp/<path-to-hadoop-streaming.jar> \
                   -Dmapred.reduce.tasks=1 \
                   -input "<input-path-directory>" \
                   -output "<output-path-directory>" \
                   -mapper cat \
                   -reducer cat

Make sure you point to the Hadoop Streaming jar that matches your Hadoop installation; the exact path under /usr/hdp/ varies by distribution and version. (Note that on newer Hadoop versions, mapred.reduce.tasks is deprecated in favor of mapreduce.job.reduces.)
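If you are not sure where the streaming jar lives on your cluster, one way to locate it is a simple search (a sketch; the /usr/hdp/ base directory is an HDP-layout assumption and may differ on your system):

find /usr/hdp/ -name 'hadoop-streaming*.jar'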

Now, provide the input path and make sure the output directory does not already exist, as this job will merge the files and create the output directory for you.
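If the output directory is left over from an earlier run, the job will fail at submission. Assuming its contents can safely be discarded, you can remove it first:

hdfs dfs -rm -r <output-path-directory>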

Here is what I tried:

#hdfs dfs -ls /user/amit/fold2/
Found 2 items
-rw-r--r--   3 hdfs hdfs        150 2017-09-26 17:55 /user/amit/fold2/part1.txt
-rw-r--r--   3 hdfs hdfs         20 2017-09-27 09:07 /user/amit/fold2/part1_sed.txt

#hadoop jar /usr/hdp/<path-to-hadoop-streaming.jar> \
>                    -Dmapred.reduce.tasks=1 \
>                    -input "/user/amit/fold2/" \
>                    -output "/user/amit/fold1/" \
>                    -mapper cat \
>                    -reducer cat

fold2 had 2 files; after running the above command, the merged output is stored in the fold1 directory, and the 2 files were merged into 1 file, as you can see below.

#hdfs dfs -ls /user/amit/fold1/
Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2017-10-09 16:00 /user/amit/fold1/_SUCCESS
-rw-r--r--   3 hdfs hdfs        174 2017-10-09 16:00 /user/amit/fold1/part-00000
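To verify that part-00000 really contains the combined input, you can print it (using the same paths as in the transcript above):

#hdfs dfs -cat /user/amit/fold1/part-00000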
