in Big Data Hadoop & Spark by (11.4k points)

Does anyone know how to speed up S3 write times from Spark running in EMR?

My Spark job takes over 4 hours to complete; however, the cluster is only under load during the first 1.5 hours.

I was curious about what Spark was doing all this time. I looked at the logs and found many s3 mv commands, one for each file. Taking a look directly at S3, I saw that all my files were sitting in a _temporary directory.

Secondly, I'm concerned about my cluster cost. It appears this specific task needs only about 2 hours of compute, yet I end up buying up to 5 hours. I'm curious whether EMR AutoScaling can help with cost in this situation.

Some articles discuss changing the file output committer algorithm, but I've had little success with that:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Writing to the local HDFS is quick. Would issuing a hadoop command to copy the data from HDFS to S3 be faster?

1 Answer

by (32.3k points)

I think what you are encountering is a problem with the output committer and S3. On commitJob, the default FileOutputCommitter applies fs.rename to the _temporary folder, and since rename is not supported natively by S3, each rename is emulated as a copy followed by a delete. The driver therefore copies and deletes every file from _temporary to its final destination, one at a time.
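
To make the mechanism concrete, here is a minimal sketch of the kind of rename the version 1 committer issues during commitJob; the bucket name and task directory below are placeholders, and on S3 each call like this turns into a copy plus a delete:

import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder output path; "my-bucket" is hypothetical.
val out = new Path("s3://my-bucket/output")
val fs: FileSystem = out.getFileSystem(sc.hadoopConfiguration)

// Conceptually, commitJob with algorithm version 1 issues one rename per file:
fs.rename(
  new Path(out, "_temporary/0/task_201801010000_0000_m_000000/part-00000"),
  new Path(out, "part-00000"))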

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2") only works with Hadoop 2.7 and later. With algorithm version 2, each task moves its own files out of _temporary during commitTask rather than leaving all of them for commitJob, so the work is distributed across the executors and finishes much faster.
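
A minimal sketch of setting this before the context is even created, using Spark's documented spark.hadoop.* prefix (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Keys prefixed with "spark.hadoop." are copied into the Hadoop Configuration,
// so the committer algorithm is in place before any job runs.
val conf = new SparkConf()
  .setAppName("s3-write-job")  // placeholder name
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
val sc = new SparkContext(conf)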

If you use an older version of Hadoop, I would suggest using Spark 1.6 with it and setting:

sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

You need to note two important things here:

  • It does not work with speculation turned on or when writing in append mode (see the sketch after this list)

  • In Spark 2.0 it is deprecated (replaced by algorithm.version=2)
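
A hedged sketch of respecting both caveats; df, the bucket, and the path are placeholders, and speculation itself must be disabled before the context starts, e.g. with --conf spark.speculation=false at submit time:

import org.apache.spark.sql.SaveMode

// Direct writes to S3 are only safe without speculative task attempts.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// Use Overwrite (or ErrorIfExists), never Append, with the direct committer.
df.write.mode(SaveMode.Overwrite).parquet("s3://my-bucket/output")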

By the way, I personally write with Spark to HDFS and use DistCp jobs (specifically s3-dist-cp) in production to copy the files to S3, but this is done for several other reasons (consistency, fault tolerance), so it is not strictly necessary. Just try what I suggested above and you should be able to write to S3 pretty fast.
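
For completeness, a rough sketch of that two-step pattern; the paths and bucket name are placeholders:

// Step 1 (Scala): write to HDFS, where renames during commit are cheap.
df.write.parquet("hdfs:///tmp/job-output")

// Step 2 (run as an EMR step or from the master node, not Scala):
//   s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/job-output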
