
Related to my other question, but distinct:


If I save an RDD to HDFS, how can I tell Spark to compress the output with gzip? In Hadoop, it is possible to set

mapred.output.compress = true

and choose the compression algorithm with

mapred.output.compression.codec = <<classname of compression codec>>

How would I do this in Spark? Will the Hadoop settings above work as well?

edit: using spark-0.7.2

1 Answer


The saveAsTextFile method takes an optional second parameter: the class of the compression codec to use. So, in your case, to use gzip it should be something like this:

import org.apache.hadoop.io.compress.GzipCodec

someMap.saveAsTextFile("hdfs://HOST:PORT/out", classOf[GzipCodec])

Since you're using 0.7.2, you may also be able to enable compression through Hadoop configuration options set at startup. I'm not sure this will work exactly as-is, but you would need to go from this:

conf.set("mapred.output.compress", "true")
conf.set("mapred.output.compression.codec", c.getCanonicalName)  // c is the compression codec class
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)

to something like this:

System.setProperty("spark.hadoop.mapred.output.compress", "true")
System.setProperty("spark.hadoop.mapred.output.compression.codec", classOf[GzipCodec].getCanonicalName)
System.setProperty("spark.hadoop.mapred.output.compression.type", "BLOCK")
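
Putting the pieces together, here is a minimal sketch of what a driver program might look like. This assumes Spark 0.7.x (where SparkContext lived in the `spark` package rather than `org.apache.spark`) and the Hadoop GzipCodec on the classpath; "HOST:PORT" is a placeholder for your NameNode, and the sample data is illustrative:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import spark.SparkContext  // in 0.7.x the package was `spark`, not `org.apache.spark`

object GzipOutputExample {
  def main(args: Array[String]) {
    // Set the Hadoop compression properties before creating the context,
    // so they are picked up by the jobs Spark submits.
    System.setProperty("spark.hadoop.mapred.output.compress", "true")
    System.setProperty("spark.hadoop.mapred.output.compression.codec",
                       classOf[GzipCodec].getCanonicalName)
    System.setProperty("spark.hadoop.mapred.output.compression.type", "BLOCK")

    val sc = new SparkContext("local", "gzip-output-example")

    // Illustrative data: format each pair as a tab-separated line
    val someMap = sc.parallelize(Seq("a" -> 1, "b" -> 2))
                    .map { case (k, v) => k + "\t" + v }

    // Either rely on the properties set above, or pass the codec explicitly:
    someMap.saveAsTextFile("hdfs://HOST:PORT/out", classOf[GzipCodec])
  }
}
```

The output directory should then contain part files with a .gz extension instead of plain text.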
