0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm having a problem using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all the dependencies, since the nodes on the cluster do not have any common mountpoint or filesystem besides HDFS. Therefore I am stuck with using spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with that option (as suggested in Easiest way to install Python dependencies on Spark executor nodes?). However, the nodes on the cluster still do not seem to see the modules inside, and they throw an ImportError such as this when importing numpy:

File "/path/anonymized/", line 6, in <module>
    import numpy
File "/tmp/pip-build-4fjFLQ/numpy/numpy/", line 180, in <module>  
File "/tmp/pip-build-4fjFLQ/numpy/numpy/", line 13, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/", line 8, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/", line 11, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/", line 14, in <module>
ImportError: cannot import name multiarray

1 Answer

0 votes
by (32.3k points)

Let’s assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, I would suggest you run the following at the command line:

pip install -t dependencies -r requirements.txt

cd dependencies

zip -r ../ .

To ensure that the modules are at the top level of the zip file, the cd dependencies step above is crucial.
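To see why zipping from inside the dependencies directory matters, here is a minimal local sketch (with hypothetical paths and a dummy module standing in for a pip-installed package): Python's zip import machinery only finds a module if its files sit at the top level of the archive, not nested under a parent folder.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# Stand-in for a pip-installed package placed under "dependencies/".
pkg_dir = os.path.join(workdir, "dependencies", "mymod")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# Archive entries relative to "dependencies/" (the effect of running
# zip from inside that directory), so the package lands at the top
# level of the zip as "mymod/__init__.py".
dep_root = os.path.join(workdir, "dependencies")
good_zip = os.path.join(workdir, "dependencies.zip")
with zipfile.ZipFile(good_zip, "w") as zf:
    for root, _, files in os.walk(dep_root):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, dep_root))

with zipfile.ZipFile(good_zip) as zf:
    print(zf.namelist())  # ['mymod/__init__.py'] -- importable layout
```

Had the zip been built from one directory up, every entry would be prefixed with dependencies/, and the executors would not be able to import the modules.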

Next, submit the job via:

spark-submit --py-files

Now, the --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH. So, to add the dependencies to the PYTHONPATH and fix the ImportError, you must add the following line to the Spark job,
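The call in question would typically be along the lines of sc.addPyFile("dependencies.zip") on the SparkContext (the archive name here is assumed from the packaging step above). The reason this fixes the ImportError is that Python can import directly from a zip archive on sys.path, which is what addPyFile arranges on every executor. A minimal local sketch of that mechanism, with no Spark required and a hypothetical module name:

```python
import os
import sys
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "dependencies.zip")

# Build a tiny archive with one top-level module.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("depmod.py", "def answer():\n    return 42\n")

# Adding the zip to sys.path makes its top-level modules importable --
# the same effect that --py-files plus addPyFile produces on executors.
sys.path.insert(0, archive)
import depmod

print(depmod.answer())  # 42
```

If the modules had been nested under an extra directory inside the archive, this import would fail with the same kind of ImportError shown in the question.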

