PySpark – Apache Spark with Python

Being able to analyse huge datasets is one of the most valuable technical skills these days, and this tutorial will bring you up to speed on one of the most used technologies for the job, Apache Spark, combined with one of the most popular programming languages, Python. In this 'What is PySpark' tutorial, we will also answer some of the most frequently asked questions about Spark with Python, for example:

  • Which programming language is more beneficial when used with Spark?
  • How to integrate Python with Spark?
  • What are the basic operations and building blocks of Spark that we can use in Python through PySpark?
  • Examples of these operations

Watch this PySpark video for beginners – What is PySpark?


This tutorial walks you through everything from an overview of Apache Spark and the installation and configuration of PySpark to SparkConf, SparkContext, SparkFiles, RDD operations, and machine learning with MLlib, so feel free to jump right to the section you need.

Study Apache Spark with Cloudera Spark Training and master the skills of an Apache Spark specialist.
In this PySpark tutorial, we will use the Fortune 500 dataset and implement the code examples on it. This dataset contains information about the top five companies in the Fortune 500 ranking for the year 2017. It includes attributes such as Rank, Title, Website, Employees, and Sector. The dataset looks as follows:

Rank | Title | Website | Employees | Sector
1 | Walmart | http://www.walmart.com | 2300000 | Retailing
2 | Berkshire | http://www.berkshirehathaway.com | 367700 | Financials
3 | Apple | http://www.apple.com | 116000 | Technology
4 | Exxon Mobil | http://www.exxonmobil.com | 72700 | Energy
5 | McKesson | http://www.mckesson.com | 68000 | Wholesalers


Let’s start off by understanding what Apache Spark is.

Overview of Apache Spark

Apache Spark, as you might have heard, is a general-purpose engine for big data analysis, processing, and computation. It provides several advantages over MapReduce: it is faster, easier to use, simpler, and runs virtually everywhere. It has built-in tools for SQL, machine learning, and streaming, which make it one of the most popular and most in-demand tools in the IT industry. Spark is written in the Scala programming language. Apache Spark has APIs for Python, Scala, Java, and R, though the most widely used with Spark are the first two. In this tutorial, we will learn how to use the Python API with Apache Spark.

Check out this insightful Spark Tutorial for Beginners video:


This video will help you understand Spark better, along with its various components, versions, and frameworks.

Want to gain detailed knowledge of Hadoop? Read this great Spark Tutorial!

What is PySpark?

PySpark is a Python API for Spark released by the Apache Spark community to support Python with Spark. Using PySpark, one can easily integrate and work with RDDs in the Python programming language as well. There are numerous features that make PySpark such an amazing framework when it comes to working with huge datasets: whether it is to perform computations on large datasets or simply to analyze them, data engineers are turning to this tool. Some of these features are listed below.

Key features of PySpark

  • Real-time computations: Because of in-memory processing in the PySpark framework, it shows low latency.
  • Polyglot: The PySpark framework is compatible with various languages such as Scala, Java, Python, and R, which makes it one of the most preferred frameworks for processing huge datasets.
  • Caching and disk persistence: The PySpark framework provides powerful caching and very good disk persistence.
  • Fast processing: The PySpark framework is way faster than other traditional frameworks for big data processing.
  • Works well with RDDs: Python is dynamically typed, which helps when working with RDDs (see the small sketch after this list).
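As a tiny illustrative sketch of the last two points (assuming a running PySpark shell, where sc is already available), an RDD of mixed Python types can be cached and processed without any type declarations:

>>> mixed = sc.parallelize([1, "two", 3.0, ("four", 4)]).cache()  # cache() keeps the RDD in memory for reuse
>>> mixed.map(str).collect()  # dynamic typing: no type declarations needed
['1', 'two', '3.0', "('four', 4)"]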

We will learn a lot more about RDDs and Python together further along in this tutorial.

Why PySpark?

Need of PySpark

The more solutions there are to deal with big data, the better. But if we have to switch tools to perform different types of operations on big data, then having a lot of tools to perform a lot of different tasks does not sound very appealing anymore, does it?
It just sounds like a lot of hassle to go through to deal with huge datasets. Then came some scalable and flexible tools to crack big data and gain benefits from it. One of those amazing tools is Apache Spark. Now, it is no secret that Python is one of the most widely used programming languages among data scientists, data analysts, and many other IT experts. Whether that is because of its simple and interactive interface, because it is easy to learn, or because it is a general-purpose language is a secondary matter; what matters is that data science folks trust it to perform data analysis, machine learning, and many other tasks on big data. So, it is pretty obvious that combining Spark and Python would rock the world of big data, isn't it?
That is exactly what the Apache Spark community did when they came up with PySpark, which is basically a Python API for Apache Spark.
If you have any problems related to Spark and Hadoop, kindly refer to our Big Data Hadoop & Spark Community.

Spark with Python vs Spark with Scala

As we have already discussed, Python is not the only programming language that can be used with Apache Spark. Being one of the most popular frameworks for big data analysis, Spark has gained so much popularity that we would not be shocked if it became the de facto framework for evaluating and dealing with large datasets and machine learning in the coming years.
Data science folks already prefer Spark because of the several benefits it has over other big data tools, but choosing which language to use with Spark is a dilemma they face whenever they pick this framework.
The most used programming languages with Spark are Python and Scala. If you are going to learn PySpark, that is, Spark with Python, then it is important that you know why and when to use Spark with Python instead of Spark with Scala. In this section, we will go over the basic criteria one should keep in mind while choosing between Python and Scala for working with Apache Spark.
Now, let's compare Python and Scala in detail against these criteria:

Criteria | Python with Spark | Scala with Spark

Performance speed | Python is comparatively slower than Scala when used with Spark, but programmers can do much more with Python because of the easy interface it provides. | Spark is written in Scala, so it integrates well with Scala and is faster than Python.
Learning curve | Python is known for its easy syntax, and being a high-level language makes it easier to learn; it is also highly productive despite its simple syntax. | Scala has an arcane syntax, which makes it hard to learn, but once you get a hold of it you will see that it has its own benefits.
Data science libraries | In the Python API, you don't have to worry about visualizations or data science libraries; you can easily port the core parts of R to Python as well. | Scala lacks proper data science libraries and tools and does not have proper local tools and visualizations.
Readability of code | Readability, maintenance, and familiarity of code are better in the Python API. | In the Scala API, it is easy to make internal changes since Spark itself is written in Scala.
Complexity | The Python API has an easy, simple, and comprehensive interface. | Scala's syntax and the verbose output it produces are why it is considered a complex language.
Machine learning libraries | Python is preferred for implementing machine learning algorithms. | Scala is preferred when you have to implement data engineering technologies rather than machine learning.

After choosing between Python and Scala for use with Apache Spark, the next step is installation. Let's start with the installation and configuration of PySpark.

Watch this Apache Spark for Beginners video by Intellipaat

Installation and configuration

Before installing Apache Spark, you need to make sure that you have Java and Scala installed on your system. If you do not have Java and Scala installed already, don't worry; we won't skip any part and will walk you through the whole installation right from the basics.
To install Java and Scala on your system, all you have to do is go through the 'Installation of Java' and 'Installation of Scala' tutorials by Intellipaat. These tutorials provide a step-by-step guide to installing and getting started with Java and Scala.
Prepare yourself for the Top Hadoop Interview Questions And Answers Now!

Setting up PySpark Environment

Installation on Linux:

Step 1:
Download the latest version of Apache Spark from the official Apache Spark website and locate the file in the Downloads folder of your system.
Step 2:
Using the following command, extract the Spark tar file
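Assuming the downloaded archive is spark-2.4.0-bin-hadoop2.7.tgz (the version used in the later steps; adjust the file name to match your download), the extraction command is:

$ tar xvf spark-2.4.0-bin-hadoop2.7.tgz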
Step 3:
After extracting the files, use the following commands to move the Spark folder to the directory of your choice (here, /usr/local/spark), since by default it will be in your Downloads folder:

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark
# exit

Step 4:
Set the path for PySpark by adding the following line to your ~/.bashrc file:

export PATH=$PATH:/usr/local/spark/bin

Step 5:
To set up the environment for PySpark, reload your ~/.bashrc with the following command:

$ source ~/.bashrc

Step 6:
Verify the Spark installation using the following command

$ spark-shell

You will get the following output if the installation was successful

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
>>>

Step 7:
Invoke the PySpark shell by running the following command in the Spark directory:

# ./bin/pyspark

Installation on Windows:

Step 1: Download the latest version of Spark from the official Spark website.
Step 2: Extract the downloaded file into a new directory
Step 3: Set the variables as follows:
User Variables:

  • Variable: SPARK_HOME
  • Value: C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7

System variables:

  • Variable: PATH
  • Value: C:\Windows\System32;C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7\bin

Step 4: Download the Windows utilities by clicking here and move them to C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7\bin
When you click on the link provided above to download the Windows utilities, it will take you to a GitHub page where you can download them.
Step 5: Now you can start the Spark shell by typing the following command in the Command Prompt:

spark-shell

Step 6: To start the PySpark shell, type in the following command:

pyspark

After completing the above steps, the PySpark shell starts up and leaves you at the >>> prompt.
Now that we have our PySpark shell up and running, we will learn how to use it and how to perform various operations on files and applications in PySpark. Before we start with that, there are some configuration settings that need to be taken care of. Moving forward in the tutorial, let's understand how to do that.

SparkConf

What is SparkConf

Before running any Spark application on a local cluster or a dataset, we need to set some configurations and parameters. This is done with the help of SparkConf, which, as the name suggests, offers the configuration for any Spark application.

Features of SparkConf and their uses

We have listed below some of the most commonly used methods of SparkConf while working with PySpark:

  • set(key, value):

This method is used to set a configuration property.

  • setMaster(value):

This method is used to set the master URL.

  • setAppName(value):

This method is used to set the application name.

  • get(key, defaultValue=None):

This method is used to get the configuration value of a key.

  • setSparkHome(value):

This method is used to set the Spark installation path.

Following is a code block where I have used some of these methods of SparkConf:

>>> from pyspark.conf import SparkConf
>>> from pyspark.context import SparkContext
>>> conf = SparkConf().setAppName("PySpark App").setMaster("local[2]")
>>> conf.get("spark.master")
>>> conf.get("spark.app.name")

Note: The very first thing any Spark program does is create a SparkContext object, which tells the application how to access a cluster. For that to happen, you first need to implement SparkConf so that the SparkContext object has the configuration information about the application.
We have already seen how to use SparkConf to set the configurations, now let’s understand what exactly is SparkContext in detail.

SparkContext

What is PySpark SparkContext

SparkContext is the entry point for any Spark functionality. It is the first and foremost thing that gets initiated when we run any Spark application. In PySpark, SparkContext is available as sc by default, so creating a new SparkContext inside the shell will give an error.

Parameters

SparkContext has some parameters that we have listed down below:

  • Master

The URL of the cluster it connects to.

  • appName

The name of your job.

  • SparkHome

SparkHome is a Spark installation directory.

  • pyFiles

.zip or .py files to send to the cluster and add to PYTHONPATH.

  • Environment

Environment variables for the worker nodes.

  • BatchSize

BatchSize is the number of Python objects represented as a single Java object. Set it to 1 to disable batching, to 0 to automatically choose the batch size based on object sizes, or to -1 to use an unlimited batch size.

  • Serializer

This parameter specifies the RDD serializer to use.

  • Conf

An object of L{SparkConf} to set all the Spark properties.

  • profiler_cls

A class of custom profiler used to do profiling; pyspark.profiler.BasicProfiler is the default one.
The most widely used parameters among the above are Master and AppName. The initial lines of code for any PySpark application using these two parameters are as follows:

from pyspark import SparkContext
sc = SparkContext("local", "First App")
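As a small follow-up sketch of how a few of the other parameters listed above fit in (helpers.py and the DATA_DIR variable are hypothetical placeholders, and this assumes a standalone script rather than the interactive shell, where sc already exists):

from pyspark import SparkConf, SparkContext

# Build the configuration first, then hand it to the context
conf = SparkConf().setAppName("First App").setMaster("local[2]")

# pyFiles ships extra Python modules to the cluster; environment sets
# environment variables on the worker nodes (placeholder values here)
sc = SparkContext(conf=conf,
                  pyFiles=["helpers.py"],
                  environment={"DATA_DIR": "/tmp/data"})

print(sc.appName)  # prints: First App
sc.stop()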

After getting done with the configuration settings and initiating a SparkContext object (which the PySpark shell does for you by default), let's turn to the files in the application that we want to run on PySpark and understand how we can use a feature called SparkFiles, provided by Spark, to upload those files.

SparkFiles and class methods

What is SparkFiles?

SparkFiles is what we use when we want to upload our files to Apache Spark using SparkContext.addFile().
Note: I have created a variable named path that points to my dataset using os.path.join("path", "filename"), and I have used this variable while demonstrating the class methods of SparkFiles.

Classmethods and how to use them

SparkFiles contains the following two class methods:

  • get(filename)

This class method is used to get the path of the file that we added using SparkContext.addFile() or sc.addFile().
Input:

>>> import os
>>> from pyspark import SparkFiles
>>> from pyspark import SparkContext
>>> path = os.path.join("/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7", "Fortune5002017.csv")
>>> sc.addFile(path)
>>> SparkFiles.get(path)
Output:
  • getRootDirectory()

It is used to get the path to the root directory that contains the files added through SparkContext.addFile() or sc.addFile().
Input:

>>> import os
>>> from pyspark import SparkFiles
>>> from pyspark import SparkContext
>>> path = os.path.join("/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7", "Fortune5002017.csv")
>>> sc.addFile(path)
>>> SparkFiles.getRootDirectory()

Output:
Now that we are acquainted with SparkFiles and have understood the basics of what we can do with files in Spark, it seems natural to discuss the datasets in Spark next, doesn't it?
Let’s move forward in the tutorial to do just that.

What is RDD

Introduction to RDD and features of RDD

When we talk about Spark, no matter which programming language we use, the first thing that comes to mind is RDD.
RDD is one of the key features of Spark. It stands for Resilient Distributed Dataset. It is a set of elements divided across multiple nodes in a cluster so that they can be processed in parallel, and it can automatically recover from failures. We can create an RDD, but we cannot make changes to it; instead, we create a new RDD from the existing one with the required changes, or we perform different kinds of operations on the RDD.
Following are some features of RDDs:
Immutability: An RDD, once created, cannot be changed or modified; however, you can create a new RDD from the existing one if you wish to make any changes.
Distributed: The data in an RDD can exist across a cluster and be operated on in parallel.
Partitioned: With more partitions, the work gets distributed among more nodes of the cluster, but it also creates more scheduling overhead.

Operations of RDD

Note: To demonstrate the RDD operations, I have created an RDD using
RDDName = sc.textFile("path of the file to be uploaded"). The file that I have used is the dataset of the top five Fortune 500 companies in the year 2017.
There are certain operations in Spark that can be performed on RDDs. Operations are basically methods applied to an RDD to perform certain tasks. RDDs support two types of operations, namely Actions and Transformations. Let's understand them individually with examples.

What are Action Operations?

While Transformation operations create new RDDs from one another, Action operations are applied directly on the datasets to perform certain computations. Following are examples of some Action operations:

  • take(n)

This is one of the most used operations on RDDs. It takes a number as an argument and returns that many elements from the specified RDD.
You can refer to the following example to see how this operation is used.
Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.take(5)

Output:
After running take(5), the first five lines of the dataset are returned as a list, and every line is treated as one element.
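As a rough illustration, assuming the CSV stores its columns comma-separated in the same format used by the filter example later in this tutorial, the returned list would look something like:

['Rank, Title, Website, Employees, Sector',
 '1, Walmart, http://www.walmart.com, 2300000, Retailing',
 '2, Berkshire, http://www.berkshirehathaway.com, 367700, Financials',
 '3, Apple, http://www.apple.com, 116000, Technology',
 '4, Exxon Mobil, http://www.exxonmobil.com, 72700, Energy']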

  • count()

This operation, as the name suggests, returns the number of elements in an RDD, as shown in the following example.
Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.take(5)
>>> rdd.count()
Output:

  • top(n)

This operation also takes a number as an argument and returns the top n elements of the RDD.
Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.top(2)
Output:

What are Transformation Operations?

Transformations are the set of operations that create a new RDD, either by applying an operation to an existing RDD or by creating an entirely new one.
Following are the examples of some Transformation operations:

  • A map Transformation

We use this operation when we need to transform each element of an RDD by applying the same function to every element.
For example, if I have to convert all the words in my dataset to upper case, I can use the map transformation. Let's see how.

Input:

>>> def Func(lines):
...     lines = lines.upper()
...     lines = lines.split()
...     return lines
>>> rdd1 = rdd.map(Func)
>>> rdd1.take(5)
Output:
As the output shows, all the words in the original RDD have been converted to upper case with the help of the map transformation.

  • Filter Transformation

This transformation operation is used when you want to remove some elements from your dataset. The elements to be removed are often called stop_words, and we define our own set of them.
For example, I will remove some elements from my dataset. You can refer to the following example to see how.

Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.top(6)
>>> stop_words = ['Rank, Title, Website, Employees, Sector', '1, Walmart, http://www.walmart.com, 2300000, Retailing']
>>> rdd1 = rdd.filter(lambda x: x not in stop_words)
>>> rdd1.take(4)

Output:
After learning about RDDs and the operations you can perform on datasets with them, the next question that comes to mind is: what else can we do with datasets in Spark?
As discussed above, Spark is a great tool for real-time data processing and computation, but that is not all it is widely known for. Spark is popular for machine learning as well. Analyzing datasets and predicting results with machine learning algorithms is also something you can do in the Spark framework. Let's learn more about machine learning in Spark with Python, that is, PySpark.
Intellipaat provides the most comprehensive Cloudera Spark course to accelerate your career!

Machine Learning (MLlib) in PySpark

What is MLlib

PySpark has a machine learning API called MLlib that supports various kinds of algorithms. Some of them are listed below.

Algorithms in PySpark MLlib

  • mllib.classification:

The spark.mllib package offers support for various methods to perform binary classification, regression analysis, and multiclass classification. Some of the most used classification algorithms are Naive Bayes, Decision Tree, etc.

  • mllib.clustering:

In clustering, we group subsets of entities on the basis of some similarity among the elements or entities (see the k-means sketch after this list).

  • mllib.linalg:

This module offers MLlib utilities to support linear algebra.

  • mllib.recommendation:

This module is used to build recommender systems, which fill in the missing entries in any dataset.

  • spark.mllib:

spark.mllib supports collaborative filtering, in which Spark uses ALS (Alternating Least Squares) to learn latent descriptions of users and products and uses them to predict missing entries.
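As a minimal sketch of how one of these modules is used from PySpark, k-means clustering from mllib.clustering can be trained on an RDD of points. The numeric points below are a made-up illustrative sample loosely based on the employee counts and ranks from the Fortune 500 table above:

>>> from numpy import array
>>> from pyspark.mllib.clustering import KMeans
>>> # Each point is (employees, rank); the values are illustrative only
>>> points = sc.parallelize([
...     array([2300000.0, 1.0]), array([367700.0, 2.0]),
...     array([116000.0, 3.0]), array([72700.0, 4.0]), array([68000.0, 5.0])])
>>> # Train a k-means model that groups the points into two clusters
>>> model = KMeans.train(points, 2, maxIterations=10)
>>> model.predict(array([100000.0, 3.0]))  # returns the index of the closest cluster

Note that mllib works on RDDs, which is why the points are wrapped in sc.parallelize(); the newer DataFrame-based machine learning API lives in the pyspark.ml package.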

Use cases of ‘Spark with Python’ in Industries

Apache Spark is one of the most used tools across various industries. Its use is not limited to the IT industry, though that is where it is used the most. Even the biggest names in IT, for example Oracle, Yahoo, Cisco, and Netflix, use Apache Spark to deal with Big Data.

Use cases of Spark in other industries

  • Finance:

PySpark is used in this sector as it helps gain insights from call recordings, emails, and social media profiles.

  • e-commerce:

Apache Spark with Python can be used in this sector to gain insights into real-time transactions. It can also be used to enhance recommendations to users based on new trends.

  • Healthcare:

Apache Spark is used to analyze patients' medical records, along with their past medical history, and then predict the health issues they are most likely to face in the future.

  • Media industry:

Yahoo is one example: Spark is used at Yahoo to design its news webpages for targeted audiences using the machine learning features provided by Spark.

Recommended audience

The target audience of this tutorial is as follows:

  • Professionals who are experienced in Python and want to learn how to use it for big data
  • Professionals who are interested in making a career in big data
  • Big data analysts, data scientists, and data engineers

Prerequisite

  • Learning Prerequisites:

    • Sound programming knowledge and experience in any programming language, preferably Python.
    • Basic knowledge of Apache Spark, Hadoop, and Scala.
    • Familiarity with what big data is will also help you understand this tutorial better.
  • Software prerequisites:

    • Java and Scala installed
    • Python installed
    • Apache Spark

Conclusion

Apache Spark has so many use cases in various sectors that it was only a matter of time before the Apache Spark community came up with an API to support Spark in one of the most popular, high-level, general-purpose programming languages: Python. Not only is Python easy to learn and use with its English-like syntax, it also has a huge community of users and supporters. So, being able to get the benefit of all the key features of Python in the Spark framework, while also using the building blocks and operations of Spark from Python through Apache Spark's Python API, is truly a gift from the Apache Spark community.
For more in-depth knowledge of Apache Spark with Python, check out the Spark, Scala, and Python certification training by Intellipaat. Enroll today and kick-start your career by learning one of the most popular frameworks for dealing with Big Data.
That is all for this tutorial; we hope you learned something new.
