
When I use Scala in Spark and write results out with saveAsTextFile, the output is split into multiple part files. I'm only passing it a single parameter (the path).

val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")

I know I can combine the parts afterwards using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.

1 Answer


Since the computation is distributed, each partition is written by a separate task, so saveAsTextFile produces one part file per partition. If the result is small enough to fit in the driver's memory, you can call collect at the end of your program to bring it back as a local array:

val array = year.collect()
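
Note that collect only returns the data to the driver; it does not write a file. A minimal sketch of writing the collected records to one local file follows. The object SingleFile, the helper writeLines, and the file name year.txt are hypothetical names used for illustration, not part of Spark, and the sketch assumes the result fits comfortably in driver memory.

```scala
import java.io.PrintWriter

object SingleFile {
  // Hypothetical helper (not a Spark API): writes one record per line to a
  // single local file, using each record's toString, which is also how
  // saveAsTextFile renders records.
  def writeLines(path: String, records: Iterable[Any]): Unit = {
    val out = new PrintWriter(path, "UTF-8")
    try records.foreach(r => out.println(r))
    finally out.close()
  }
}

// Usage with the RDD from the question (needs a running SparkContext):
// SingleFile.writeLines("year.txt", year.collect().toSeq)
```

The file ends up on the driver's local disk rather than on HDFS, which is usually what you want for a small summary result.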

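You can also reduce the RDD to a single partition with coalesce before calling saveAsTextFile:

year.coalesce(1, true).saveAsTextFile("year")

This means: do the computation, then coalesce the result into one partition. The second argument (shuffle = true) adds a shuffle so that the upstream stages still run in parallel; repartition(1) is equivalent. The output path is still a directory, but it now contains a single part-00000 file. This is a bad approach if you have a huge pile of data, because all of it has to pass through a single task. Separate files per partition are generated, just as in Hadoop, so that separate mappers and reducers can write to different files. Keeping a single output file is a good idea only when you have very little data.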
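
If you would rather keep the job fully parallel and combine the part files afterwards (the bash approach mentioned in the question), the same thing can be sketched in Scala. The object MergeParts and its merge method are hypothetical names, and the sketch assumes the output directory is on the local filesystem, not HDFS.

```scala
import java.io.File
import java.nio.file.{Files, Paths}

object MergeParts {
  // Hypothetical helper: concatenates every part-* file in outputDir, in
  // file-name order, into one target file. Marker files such as _SUCCESS
  // and hidden .crc files are skipped because they do not start with "part-".
  def merge(outputDir: String, target: String): Unit = {
    val listed = new File(outputDir).listFiles()
    require(listed != null, s"not a readable directory: $outputDir")
    val parts = listed
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
    val out = Files.newOutputStream(Paths.get(target))
    try parts.foreach(p => Files.copy(p.toPath, out))
    finally out.close()
  }
}

// Usage after year.saveAsTextFile("year"):
// MergeParts.merge("year", "year.txt")
```

Sorting by file name keeps the records in partition order, since the part files are numbered part-00000, part-00001, and so on.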
