When training a machine learning model with ALS in Spark's MLlib, I kept getting a StackOverflowError. Here's a small sample of the stack trace:

Traceback (most recent call last):
  File "/Users/user/Spark/", line 31, in <module>
    model = ALS.train(rdd, rank, numIterations)
  File "/usr/local/Cellar/apache-spark/1.3.1_1/libexec/python/pyspark/mllib/", line 140, in train
    lambda_, blocks, nonnegative, seed)
  File "/usr/local/Cellar/apache-spark/1.3.1_1/libexec/python/pyspark/mllib/", line 120, in callMLlibFunc
    return callJavaFunc(sc, api, *args)
  File "/usr/local/Cellar/apache-spark/1.3.1_1/libexec/python/pyspark/mllib/", line 113, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/usr/local/Cellar/apache-spark/1.3.1_1/libexec/python/lib/", line 538, in __call__
  File "/usr/local/Cellar/apache-spark/1.3.1_1/libexec/python/lib/", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 40.0 failed 1 times, most recent failure: Lost task 0.0 in stage 40.0 (TID 35, localhost): java.lang.StackOverflowError


The same error also appeared when calling .mean() to compute the mean squared error. It occurred in both version 1.3.1_1 and version 1.4.1 of Spark. I was using PySpark, and increasing the available memory did not help.

One fix is to enable checkpointing. A checkpoint saves an RDD to disk and cuts off its lineage, so the long chain of dependencies that builds up over the ALS iterations no longer has to be walked recursively, which appears to be what overflows the stack. First, create a new directory to store the checkpoints. Then tell your SparkContext to use that directory by calling its setCheckpointDir method with the directory path.
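A minimal sketch of those two steps in Python follows. The helper name enable_checkpointing, the directory name "checkpoint/", and the app name are placeholders of mine; the only Spark call involved is SparkContext.setCheckpointDir.

```python
import os


def enable_checkpointing(sc, checkpoint_dir):
    """Create checkpoint_dir if it is missing, then register it with sc.

    sc is expected to be a pyspark.SparkContext; the only method used
    here is sc.setCheckpointDir. The helper itself is hypothetical.
    """
    if not os.path.isdir(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    sc.setCheckpointDir(checkpoint_dir)
    return checkpoint_dir


# Typical use in a PySpark script, before calling ALS.train:
#
#   from pyspark import SparkContext
#   sc = SparkContext("local", "als-example")   # app name is a placeholder
#   enable_checkpointing(sc, "checkpoint/")     # directory name is a placeholder
#   model = ALS.train(rdd, rank, numIterations)
```

A local directory like this should be fine when running in local mode, as in the trace above. On a cluster, the checkpoint directory generally needs to be a path on shared storage such as HDFS.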


You may also need to set a checkpoint interval on ALS itself, but I haven't been able to determine whether that makes a difference. If you want to try it anyway (probably not necessary), you can simply do:

ALS.checkpointInterval = 2

