in Big Data Hadoop & Spark by (11.5k points)

I'm trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quickstart examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.

I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.

1 Answer

0 votes
by (31.4k points)

From Spark 2.2.0 onwards, you can simply install PySpark with pip:

pip install pyspark
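After installing, you can sanity-check that pyspark is importable without starting a SparkContext. This is just a sketch using the standard library's importlib:

```python
# Quick check: is the pyspark package visible to this Python interpreter?
# find_spec locates the module without importing (or starting) anything.
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("pyspark not found - install it with pip or add it to PYTHONPATH")
else:
    print("pyspark found at", spec.origin)
```

If this prints the "not found" message even after pip install, make sure pip and the python you are running belong to the same environment.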

For older versions, refer to the following steps.

 Add the PySpark library to your Python path in your .bashrc:

      export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

Also, don't forget to set SPARK_HOME. PySpark depends on the py4j Python package, so install it as follows:

     pip install py4j
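If you'd rather not edit your .bashrc, the same manual sys.path approach from the question can be done at the top of a standalone script. This is a minimal sketch; the /opt/spark fallback and the helper name are illustrative, and SPARK_HOME is assumed to point at your Spark installation:

```python
# Sketch: for Spark versions before 2.2.0, prepend SPARK_HOME/python to
# sys.path so that `import pyspark` resolves in a standalone script.
import os
import sys

def add_pyspark_to_path(spark_home):
    """Prepend SPARK_HOME/python (and any bundled py4j zip) to sys.path."""
    python_dir = os.path.join(spark_home, "python")
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    # Older Spark distributions bundle py4j as a zip under python/lib
    lib_dir = os.path.join(python_dir, "lib")
    if os.path.isdir(lib_dir):
        for name in sorted(os.listdir(lib_dir)):
            if name.startswith("py4j") and name.endswith(".zip"):
                sys.path.insert(0, os.path.join(lib_dir, name))
    return python_dir

spark_home = os.environ.get("SPARK_HOME", "/opt/spark")  # /opt/spark is a placeholder
added = add_pyspark_to_path(spark_home)
# After this, `import pyspark` should work (assuming SPARK_HOME is correct).
```

This keeps the path setup inside the script itself, which is handy when you can't change the shell environment (for example, under cron or a service manager).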
