
I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame:

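val data = sqlContext.read   // the reader call is missing from the post; `sqlContext` is an assumption
  .format("csv")             // assumed format; on Spark 1.x this would be "com.databricks.spark.csv"
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvPath)             // `csvPath` is a placeholder; the original path is not given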

After that I find all the string columns

val featuresToIndex = data.schema
  .filter(_.dataType == StringType)
  .map(field => field.name)

and want to index them. For that I create an indexer for each string column

val stringIndexers = featuresToIndex.map(colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Indexed"))


and create a pipeline

val pipeline = new Pipeline().setStages(stringIndexers.toArray)

But when I try to transform my initial DataFrame with this pipeline

val indexedDf = pipeline.fit(data).transform(data)

I get a StackOverflowError:

16/07/05 16:55:12 INFO DAGScheduler: Job 4 finished: countByValue at StringIndexer.scala:86, took 7.882774 s
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Set$Set1.contains(Set.scala:84)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:86)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:81)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:280)
at scala.collection.Iterator$$anon$

What am I doing wrong?

1 Answer


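The most likely cause is that the thread stack is too small to hold all the stack frames. The trace shows the overflow on the "main" thread, inside Catalyst's recursive TreeNode traversal, which suggests the query plan built from that many chained StringIndexer stages is very deep. I ran into something similar when training a RandomForestModel. A workaround that worked for me was to run my driver application (a web service, in my case) with additional JVM parameters that raise the thread stack size (the value is in kilobytes, so 81920 means 80 MB):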

-XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions='-XX:ThreadStackSize=81920'
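
If changing the JVM flags is not an option, a different technique (not the one above) is to run the deep call on a dedicated thread that requests a larger stack. The following is only a minimal sketch in plain Scala without Spark: `BigStackDemo`, `withStack`, and `depth` are hypothetical names made up for illustration, and the JVM treats the requested stack size as a hint that it may round or ignore.

```scala
object BigStackDemo {
  // Deliberately not tail-recursive, so each level consumes one stack frame.
  def depth(n: Int): Int = if (n == 0) 0 else 1 + depth(n - 1)

  // Runs `body` on a new thread that requests `stackBytes` of stack,
  // waits for it to finish, and returns its result or rethrows its error.
  def withStack[A](stackBytes: Long)(body: => A): A = {
    var result: Either[Throwable, A] =
      Left(new IllegalStateException("body did not run"))

    val runnable = new Runnable {
      def run(): Unit = {
        // Catch Throwable so a StackOverflowError is reported to the caller too.
        result = try Right(body) catch { case e: Throwable => Left(e) }
      }
    }

    // Thread(ThreadGroup, Runnable, String, long): the last argument is the
    // requested stack size in bytes.
    val thread = new Thread(null, runnable, "big-stack", stackBytes)
    thread.start()
    thread.join() // join() makes the write to `result` visible to this thread

    result match {
      case Right(value) => value
      case Left(error)  => throw error
    }
  }

  def main(args: Array[String]): Unit = {
    // Request 256 MB of stack for a recursion 200000 levels deep.
    println(withStack(256L * 1024 * 1024)(depth(200000)))
  }
}
```

In the question's code, the call that would go inside `withStack` is `pipeline.fit(data).transform(data)`, since the trace reports the overflow on the driver's main thread. This only affects the driver; executor stacks would still need `spark.executor.extraJavaOptions`.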
