PySpark serialization EOFError

Question

asked Jul 18, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically reducing the size of the DataFrame didn't prevent the EOF error.

Toy code and error below.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier
# set Spark context
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# read in 500 MB CSV as a DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true').load('myfile.csv')
# get DataFrame into machine learning format
r_formula = RFormula(formula="outcome ~ .")
mldf = r_formula.fit(df).transform(df)
# fit random forest model
rf = RandomForestClassifier(numTrees=3, maxDepth=2)
model = rf.fit(mldf)
result = model.transform(mldf).head()

Running the above code with spark-submit on a single node repeatedly throws the following error, even if the size of the DataFrame is reduced before fitting the model (e.g. tinydf = df.sample(False, 0.00001)):

Traceback (most recent call last):
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError

1 Answer

Amit Rawat · Answer 1 · 2019-07-18T14:38:16+0000

You have not mentioned where in your code the EOFError arises. My guess is that it comes from the line that defines df, since that is the only place in your code where the file is read:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load('myfile.csv')

Everything after this line works with the variable df rather than with the file directly, so the error is quite likely generated here (bear in mind that Spark evaluates lazily, so later actions on df can also trigger reads of the file). A simple way to test this is to comment out the rest of your code and/or place a line like this right after the line above:

print(df.count())
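
The same idea of isolating the failing step can be sketched as a small helper that runs each stage in turn and reports the first one that raises. This is only an illustration: first_failing_step and spark_steps are hypothetical names that are not part of PySpark, and the sketch assumes the failure surfaces as an exception on the driver, which may not hold for a traceback printed by a Python worker process.

```python
import traceback


def first_failing_step(steps):
    """Run (name, function) pairs in order, feeding each result to the next.

    Returns the name of the first step that raises, or None if all succeed.
    """
    value = None
    for name, step in steps:
        try:
            value = step(value)
        except Exception:
            traceback.print_exc()
            return name
        print("ok: " + name)
    return None


def spark_steps(sqlContext):
    """Hypothetical mapping of the question's pipeline onto named steps."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import RFormula

    def load(_):
        return sqlContext.read.format('com.databricks.spark.csv').options(
            header='true', inferschema='true').load('myfile.csv')

    def count(df):
        df.count()  # an action, so Spark has to read the data here
        return df

    def to_ml_format(df):
        return RFormula(formula="outcome ~ .").fit(df).transform(df)

    def fit_and_predict(mldf):
        model = RandomForestClassifier(numTrees=3, maxDepth=2).fit(mldf)
        return model.transform(mldf).head()

    return [("load", load), ("count", count),
            ("to_ml_format", to_ml_format), ("fit_and_predict", fit_and_predict)]
```

With the sqlContext from the question, first_failing_step(spark_steps(sqlContext)) would print a traceback for the first stage that fails and return its name, or return None if every stage completes.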
