0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm having a problem using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all the dependencies, since the nodes on the cluster do not have any common mountpoint or filesystem besides HDFS. Therefore I am stuck with using spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with that option (as suggested in Easiest way to install Python dependencies on Spark executor nodes?). However, the nodes on the cluster still do not seem to see the modules inside, and they throw an ImportError such as this when importing numpy:

File "/path/anonymized/", line 6, in <module>
    import numpy
File "/tmp/pip-build-4fjFLQ/numpy/numpy/", line 180, in <module>  
File "/tmp/pip-build-4fjFLQ/numpy/numpy/", line 13, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/", line 8, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/", line 11, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/", line 14, in <module>
ImportError: cannot import name multiarray

1 Answer

0 votes
by (32.3k points)

Let’s assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, I would suggest you run the following at the command line:

pip install -t dependencies -r requirements.txt

cd dependencies

zip -r ../ .

To ensure that the modules are at the top level of the zip file, the cd dependencies step above is crucial.
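To see why zipping from inside the dependencies directory matters, here is a minimal local sketch (with hypothetical paths and a dummy module standing in for a pip-installed package): Python's zip import machinery only finds a module if its files sit at the top level of the archive, not nested under a parent folder.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# Stand-in for a pip-installed package placed under "dependencies/".
pkg_dir = os.path.join(workdir, "dependencies", "mymod")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# Archive entries relative to "dependencies/" (the effect of running
# zip from inside that directory), so the package lands at the top
# level of the zip as "mymod/__init__.py".
dep_root = os.path.join(workdir, "dependencies")
good_zip = os.path.join(workdir, "dependencies.zip")
with zipfile.ZipFile(good_zip, "w") as zf:
    for root, _, files in os.walk(dep_root):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, dep_root))

with zipfile.ZipFile(good_zip) as zf:
    print(zf.namelist())  # ['mymod/__init__.py'] -- importable layout
```

Had the zip been built from one directory up, every entry would be prefixed with dependencies/, and the executors would not be able to import the modules.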

Next, submit the job via:

spark-submit --py-files

Now, the --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH. So, to add the dependencies to the PYTHONPATH and fix the ImportError, you must add the following line to the Spark job,
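The call in question would typically be along the lines of sc.addPyFile("dependencies.zip") on the SparkContext (the archive name here is assumed from the packaging step above). The reason this fixes the ImportError is that Python can import directly from a zip archive on sys.path, which is what addPyFile arranges on every executor. A minimal local sketch of that mechanism, with no Spark required and a hypothetical module name:

```python
import os
import sys
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "dependencies.zip")

# Build a tiny archive with one top-level module.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("depmod.py", "def answer():\n    return 42\n")

# Adding the zip to sys.path makes its top-level modules importable --
# the same effect that --py-files plus addPyFile produces on executors.
sys.path.insert(0, archive)
import depmod

print(depmod.answer())  # 42
```

If the modules had been nested under an extra directory inside the archive, this import would fail with the same kind of ImportError shown in the question.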

