How to link PyCharm with PySpark?

Question

asked Jul 10, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I'm new to Apache Spark, and I installed apache-spark with Homebrew on my MacBook.

I would like to start experimenting with it to learn more about MLlib. However, I use PyCharm to write my Python scripts, and when I try to import pyspark there, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:

Then from a blog I tried this:

import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"
# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

I still cannot use PySpark from PyCharm. Any idea how to "link" PyCharm with PySpark?

2 Answers

Amit Rawat · Answer 1 · 2019-07-10T04:23:06+0000

Prerequisites:

1. PyCharm

2. Python

3. Spark

First, install PySpark from the PyCharm interface:

Go to File -> Settings -> Project Interpreter

Click the install button and search for PySpark
Click the Install Package button
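
One way to confirm that the package is visible to the selected interpreter is to ask Python's import machinery directly. The sketch below uses only the standard library; the helper name is_importable is an illustrative choice, not part of PySpark or PyCharm.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can locate a top-level module."""
    # find_spec returns None when a top-level module is not on sys.path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # py4j is the bridge library that pyspark itself depends on.
    for name in ("pyspark", "py4j"):
        status = "found" if is_importable(name) else "NOT found"
        print(name, status)
```

Running this with the same interpreter that the PyCharm project uses shows whether the install step took effect for that interpreter.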

Manually, with a user-provided Spark installation

Create a Run configuration:

Go to Run -> Edit configurations
Add new Python configuration
Set Script path so it points to the script you want to execute
Edit the Environment variables field so it contains at least:

SPARK_HOME - it should point to the directory with the Spark installation. That directory should contain subdirectories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)

PYTHONPATH - it should contain $SPARK_HOME/python and, if Py4J is not otherwise available, $SPARK_HOME/python/lib/py4j-some-version-src.zip. Here some-version must match the Py4J version used by the given Spark installation (Spark 1.5: 0.8.2.1, Spark 1.6: 0.9, Spark 2.0: 0.10.3, Spark 2.1: 0.10.4, Spark 2.2: 0.10.4, Spark 2.3: 0.10.6)

Apply the settings
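
As a rough illustration of what those two variables end up containing, the sketch below computes them from a Spark directory. The function names spark_python_paths and build_env are hypothetical helpers, and the wildcard is an assumption made so that the Py4J version does not have to be hard-coded.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the PYTHONPATH entries for a manually installed Spark."""
    python_dir = os.path.join(spark_home, "python")
    paths = [python_dir]
    # The bundled Py4J archive has a version in its file name, so it is
    # matched with a wildcard rather than spelled out.
    pattern = os.path.join(python_dir, "lib", "py4j-*.zip")
    paths.extend(sorted(glob.glob(pattern)))
    return paths


def build_env(spark_home, base_env=None):
    """Return a copy of the environment with SPARK_HOME and PYTHONPATH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    entries = spark_python_paths(spark_home)
    existing = env.get("PYTHONPATH")
    if existing:
        # Keep whatever was already on PYTHONPATH after the Spark entries.
        entries.append(existing)
    env["PYTHONPATH"] = os.pathsep.join(entries)
    return env
```

The values printed from build_env("/path/to/spark") can be pasted into the Environment variables field; note that the question's snippet appended .../python/pyspark, whereas the entry that makes "import pyspark" work is the parent .../python directory.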

Add PySpark library to the interpreter path (required for code completion):

Go to File -> Settings -> Project Interpreter

Open settings for an interpreter you want to use with Spark

Edit the interpreter paths so they contain the path to $SPARK_HOME/python (and Py4J if required)

Save the settings

Finally

Use newly created configuration to run your script.
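
If the script still fails to import pyspark, it can help to check that SPARK_HOME really points at a Spark directory. The following is a minimal sketch, assuming the layout described above (bin and conf, plus the python directory that PYTHONPATH refers to); the name missing_spark_dirs is made up for this example.

```python
import os

# Subdirectories the steps above expect to find under SPARK_HOME.
REQUIRED_DIRS = ("bin", "conf", "python")


def missing_spark_dirs(spark_home):
    """Return the expected subdirectories that are absent from spark_home."""
    return [
        name
        for name in REQUIRED_DIRS
        if not os.path.isdir(os.path.join(spark_home, name))
    ]


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    if not home:
        print("SPARK_HOME is not set in this run configuration")
    else:
        missing = missing_spark_dirs(home)
        print("missing:", missing if missing else "nothing")
```

Running it through the same Run configuration shows the environment exactly as the real script would see it.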


Asha · Answer 2 · 2024-11-08T12:07:32+0000

Requirements

PyCharm

Python

Spark

In the PyCharm interface:

Install PySpark as follows:

Go to File -> Settings -> Project Interpreter

Click the install button, type pyspark in the search box, then click Install Package.

Manually, with a user-provided installation of Spark

Run configuration

Go to Run -> Edit Configurations. Add a new configuration from the left-hand panel and choose Python as its type. Set the Script path field so that it points to the script you want to run, then edit the Environment variables field so that it contains at least:

SPARK_HOME: it must point to the Spark installation directory. That directory should contain subdirectories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.).

PYTHONPATH: it should include $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/py4j-some-version-src.zip (in case Py4J is not available elsewhere). The version has to match the Py4J version used by the specific Spark installation. Examples, given as Py4J version - Spark version:

0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3

Apply the settings.
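
The version pairs listed above can be written down as a small lookup table. This sketch only transcribes the pairs from this answer; the names PY4J_BY_SPARK and py4j_version_for are invented for illustration, and the lib directory of the actual installation remains the authoritative source, since individual patch releases may bundle a different Py4J.

```python
# Spark major.minor version -> Py4J version, as listed in the answer.
PY4J_BY_SPARK = {
    "1.5": "0.8.2.1",
    "1.6": "0.9",
    "2.0": "0.10.3",
    "2.1": "0.10.4",
    "2.2": "0.10.4",
    "2.3": "0.10.6",
}


def py4j_version_for(spark_version):
    """Map a Spark version such as '2.3.0' to the listed Py4J version.

    Returns None when the Spark version is not in the table.
    """
    major_minor = ".".join(spark_version.split(".")[:2])
    return PY4J_BY_SPARK.get(major_minor)


if __name__ == "__main__":
    print(py4j_version_for("1.5.2"))
```

A result of None simply means the table has no entry for that release, in which case the archive name under $SPARK_HOME/python/lib shows the right version.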

Next, add the PySpark library to the interpreter path:

File -> Settings -> Project Interpreter

Open the settings for the interpreter you would like to use with Spark

Edit the interpreter paths so they include the path to $SPARK_HOME/python (and Py4J if needed)

Save the configuration

Finally

Use your newly created configuration to run your script.
