0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What should I do?

1 Answer

0 votes
by (32.3k points)

The best way is definitely just to increase the ulimit if possible; Spark more or less assumes that cluster machines will be able to raise this limit.
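On a Linux worker node, checking and raising the open-file limit looks roughly like this (a sketch: the limit value `65536` and the user name `spark` are illustrative, and the persistent config path varies by distribution):

```shell
# Check the current soft limit on open file descriptors
ulimit -n

# Raise it for the current shell session (cannot exceed the hard limit,
# shown by `ulimit -Hn`; raising the hard limit itself needs root)
ulimit -n 65536 2>/dev/null || echo "cannot raise above the hard limit without root"

# To make it persistent, add lines like these to /etc/security/limits.conf,
# assuming the Spark executors run as a user named "spark":
#   spark  soft  nofile  65536
#   spark  hard  nofile  65536
```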

You might be able to hack around this by decreasing the number of reducers, but this could have performance implications for your job.

In general, if a node in your cluster has C cores assigned to Spark and you run a job with X reducers, then Spark will open C * X files on that node in parallel and start writing. Shuffle consolidation will help decrease the total number of files created, but the number of file handles open at any one time doesn't change, so it won't help with the ulimit problem.
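To make the C * X arithmetic concrete, here is a small back-of-the-envelope check (the core count, reducer count, and ulimit below are made-up example values, not numbers from the question):

```python
# Per-node shuffle file handles: each of the C cores runs a map task
# that writes one file per reducer, so C * X handles can be open at once.
cores_per_node = 16    # C: cores assigned to Spark on one node (example value)
num_reducers = 2000    # X: reduce-side partitions (example value)

open_handles = cores_per_node * num_reducers
print(open_handles)    # 32000 handles open in parallel on that node

# A common default ulimit of 1024 is far below that, which is exactly
# when the "too many open files" error appears:
default_ulimit = 1024
print(open_handles > default_ulimit)  # True
```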

This means you'll have to use fewer reducers (e.g. pass reduceByKey an explicit number of partitions) or use fewer cores on each machine.
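The "fewer cores per machine" knob can be set at submission time; a sketch, where the core count and the application file name are illustrative:

```shell
# Cap the cores each executor uses: fewer cores means fewer concurrent
# map tasks, and therefore fewer shuffle files open at once per node.
spark-submit \
  --executor-cores 4 \
  my_job.py
```

The "fewer reducers" knob lives inside the job itself: `reduceByKey` accepts the partition count as a second argument, e.g. `pairs.reduceByKey(add, 200)` in PySpark, where `200` is an illustrative number of reduce-side partitions.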
