Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm trying to implement a Hadoop Map/Reduce job that worked fine before in Spark. The Spark app definition is the following:

val data = spark.textFile(file, 2).cache()
val result = data
  .map(//some pre-processing)
  .map(docWeightPar => (docWeightPar(0),docWeightPar(1))))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey( _ + _)

Where MyFunctions.combine is

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

The combine function produces lots of map keys if the list used for input is big and this is where the exceptions is thrown.

In the Hadoop Map Reduce setting I didn't have problems because this is the point where the combine function yields was the point Hadoop wrote the map pairs to disk. Spark seems to keep all in memory until it explodes with a java.lang.OutOfMemoryError: GC overhead limit exceeded.

I am probably doing something really basic wrong but I couldn't find any pointers on how to come forward from this, I would like to know how I can avoid this.

1 Answer

0 votes
by (32.3k points)

The GC Overhead Limit Exceeded error is an indication of a resource exhaustion i.e. memory.


The JVM throws this error if the Java process spends more than 98% of its time doing GC and only less than 2% of the heap is recovered in each execution. In other words, this means that our application has exhausted nearly all the available memory and the Garbage Collector has spent too much time trying to clean it and failed repeatedly.

One approach will be:

Alter your JVM launch configuration(increase the heap size) and add just one parameter in your startup scripts:

java -Xmx1024m com.yourcompany.YourClass

Learn Spark with this Spark Certification Course by Intellipaat.

Browse Categories