Asked in Big Data Hadoop & Spark by (11.4k points)

I built a Python module and I want to import it in my PySpark application.

My package directory structure is:

wesam/
|-- data.py
`-- __init__.py


A simple import wesam at the top of my PySpark script leads to:

ImportError: No module named wesam

I also tried to zip it and ship it with my code with --py-files, but still no luck.

I also added the file programmatically, but I got the same ImportError: No module named wesam error:

sc.addPyFile("wesam.zip")
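
For reference, the zip-and-ship approach looks roughly like this (a sketch; the application file name and the SparkContext setup are illustrative, not taken from my actual code):

# Build the archive from the parent directory of the package, so that the zip
# contains wesam/__init__.py and wesam/data.py at its top level:
#     zip -r wesam.zip wesam/
#
# Option 1: pass the archive on the command line
#     spark-submit --py-files wesam.zip my_app.py    # my_app.py is a placeholder name
#
# Option 2: add the archive from inside the application
from pyspark import SparkContext

sc = SparkContext(appName="wesam-example")  # illustrative app name
sc.addPyFile("wesam.zip")                   # distributes the package to the executors

import wesam                                # import after addPyFile
from wesam import data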

1 Answer

Answered by (32.3k points)

I had the same problem. It turned out that because I was submitting my application in client mode, the machine I ran the spark-submit command from was running the driver program, and it needed to be able to access the module files.


I added my module to the PYTHONPATH environment variable on the node I submit my job from, by adding the following line to my .bashrc file (or executing it before submitting the job):

export PYTHONPATH=$PYTHONPATH:/home/welshamy/modules

And that solved the problem. Since the path is on the driver node, you don't have to zip and ship the module with --py-files or use sc.addPyFile().
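
With that in place, a plain import at the top of the driver script resolves locally in client mode; a minimal sketch (the script name and app name are placeholders):

# driver.py (placeholder name), submitted in client mode with: spark-submit driver.py
# /home/welshamy/modules is on PYTHONPATH of the submitting machine, so the
# driver resolves the import locally; nothing has to be shipped to the workers
# as long as only driver-side code uses the module.
from pyspark import SparkContext
import wesam
from wesam import data

sc = SparkContext(appName="wesam-driver")
# ... driver-only use of the module ...
sc.stop()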

The key to solving any PySpark module import error is understanding whether the driver node, the worker nodes, or both need the module files.
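
As a rough illustration of that distinction (wesam.data.process is a hypothetical function, used only to show where each piece of code runs): if the module is referenced inside a function that executes on the executors, such as an rdd.map or a UDF, the workers need it too and you must ship it with --py-files or sc.addPyFile; if it is only referenced by driver-side code, PYTHONPATH on the submitting machine is enough.

from pyspark import SparkContext
from wesam import data            # resolved on the driver via PYTHONPATH

sc = SparkContext(appName="driver-vs-worker")

# Driver-only usage: this runs on the submitting machine, so PYTHONPATH alone is enough.
summary = data.process(0)         # data.process is hypothetical

# Worker usage: the mapped function runs on the executors, so they must also be
# able to import wesam -- ship the package before triggering the job.
sc.addPyFile("wesam.zip")
result = sc.parallelize([1, 2, 3]).map(data.process).collect()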
