PySpark: Apache Spark with Python

The ability to analyze huge datasets is one of the most valuable technical skills today. This tutorial introduces one of the most widely used technologies for that purpose, Apache Spark, combined with one of the most popular programming languages, Python. Here are some of the most frequently asked questions about Spark with Python:

  • Which programming language works best with Spark?
  • How do you integrate Python with Spark?
  • What are the basic operations and building blocks of Spark that are available in PySpark?

In this ‘What is PySpark?’ tutorial, you will find detailed answers to all these questions with examples.

In this PySpark tutorial, we will use a Fortune 500 dataset and run the code examples on it. The dataset contains information about the top 5 companies of the Fortune 500 list for the year 2017. It includes the attributes Rank, Title, Website, Employees, and Sector, and looks like this:

Rank | Title                | Website                          | Employees | Sector
1    | Walmart              | http://www.walmart.com           | 2,300,000 | Retail
2    | Berkshire Hathaway   | http://www.berkshirehathaway.com | 367,700   | Finance
3    | Apple                | http://www.apple.com             | 116,000   | Technology
4    | ExxonMobil           | http://www.exxonmobil.com        | 72,700    | Energy
5    | McKesson Corporation | http://www.mckesson.com          | 68,000    | Wholesale
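
To follow along without the original file, the five rows above can be written to a small CSV file with plain Python. This is only a sketch: the file name Fortune5002017.csv matches the one used later in this tutorial, but the exact layout of the original file (a header line, employee counts without thousands separators) is an assumption.

```python
import csv
import os
import tempfile

# The five rows from the table above. Writing the employee counts without
# thousands separators is an assumption about the original file layout.
HEADER = ["Rank", "Title", "Website", "Employees", "Sector"]
ROWS = [
    [1, "Walmart", "http://www.walmart.com", 2300000, "Retail"],
    [2, "Berkshire Hathaway", "http://www.berkshirehathaway.com", 367700, "Finance"],
    [3, "Apple", "http://www.apple.com", 116000, "Technology"],
    [4, "ExxonMobil", "http://www.exxonmobil.com", 72700, "Energy"],
    [5, "McKesson Corporation", "http://www.mckesson.com", 68000, "Wholesale"],
]


def write_sample_csv(directory):
    """Write the sample rows to Fortune5002017.csv and return the file path."""
    path = os.path.join(directory, "Fortune5002017.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return path


csv_path = write_sample_csv(tempfile.mkdtemp())

# Read the file back to confirm its contents.
with open(csv_path, newline="") as handle:
    records = list(csv.DictReader(handle))

print(len(records))           # 5
print(records[0]["Title"])    # Walmart
```

The resulting path in csv_path can then be passed to sc.textFile() in the RDD examples later on.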


Let’s start off by understanding what Apache Spark is.

Overview of Apache Spark

Apache Spark is a general-purpose engine for Big Data analysis, processing, and computation. It provides several advantages over MapReduce: it is faster, easier to use, and runs virtually everywhere. It has built-in tools for SQL, Machine Learning, and streaming, which make it one of the most popular and most in-demand tools in the IT industry. Spark itself is written in Scala, and it has APIs for Python, Scala, Java, and R, of which Python and Scala are the most used. In this tutorial, you will learn how to use the Python API with Apache Spark.

What is PySpark?

PySpark is the Python API for Spark, released by the Apache Spark community to support Python with Spark. Using PySpark, you can work with RDDs in the Python programming language as well. Numerous features make PySpark a strong framework for working with huge datasets. Whether the goal is to perform computations on large datasets or just to analyze them, Data Engineers are switching to this tool.

Key Features of PySpark

  • Real-time computations: Because of in-memory processing, the PySpark framework shows low latency.
  • Polyglot: Spark offers APIs for several languages, namely Scala, Java, Python, and R, which makes it one of the most preferred frameworks for processing huge datasets. PySpark is the Python API among them.
  • Caching and disk persistence: The framework provides powerful caching and disk persistence.
  • Fast processing: The PySpark framework is considerably faster than traditional frameworks for Big Data processing.
  • Works well with RDDs: The Python programming language is dynamically typed, which helps when working with RDDs.

You will learn a lot more about RDDs and Python further in this tutorial.

Why PySpark? The Need for PySpark

The more solutions there are for dealing with big data, the better. However, if you have to switch between tools to perform different types of operations on big data, having a separate tool for every task does not sound very appealing, does it?

It just sounds like a lot of hassle to go through in order to deal with huge datasets. This is where scalable and flexible tools come in that can handle big data and extract value from it. One of these tools is Apache Spark.

Now, it’s no secret that Python is one of the most widely used programming languages among Data Scientists, Data Analysts, and many other IT experts. The reason could be that it is simple, has an interactive interface, and is a general-purpose language. Therefore, Data Science folks trust it for data analysis, Machine Learning, and many more tasks on big data. So, it’s pretty obvious that combining Spark and Python would rock the world of big data, isn’t it?

That is exactly what the Apache Spark community did when they came up with a tool called PySpark, which is basically a Python API for Apache Spark.

Spark with Python vs Spark with Scala

As already discussed, Python is not the only programming language that can be used with Apache Spark. Data Scientists already prefer Spark because of its several benefits over other Big Data tools, but choosing which language to use with Spark is a dilemma they face.

Python is one of the most popular languages for Big Data Analytics, and it has gained so much popularity that it would not be surprising if it became the de facto language for evaluating and dealing with large datasets and Machine Learning in the coming years.

The most used programming languages with Spark are Python and Scala. If you are going to learn PySpark (Spark with Python), it is important to know why and when to use Spark with Python instead of Spark with Scala. This section explains the basic criteria to keep in mind when choosing between Python and Scala for working with Apache Spark.

Now, see the comparison between Python and Scala in detail:

Criteria | Python with Spark | Scala with Spark
Performance Speed | Python is comparatively slower than Scala when used with Spark, but programmers can do much more with Python than with Scala as Python provides an easier interface | Spark is written in Scala, so it integrates well with Scala. It is faster than Python
Learning Curve | Python is known for its easy syntax and is a high-level language easier to learn. It is also highly productive even with its simple syntax | Scala has an arcane syntax making it hard to learn, but once you get a hold of it you will see that it has its own benefits
Data Science Libraries | In Python API, you don’t have to worry about the visualizations or Data Science libraries. You can easily port the core parts of R to Python as well | Scala lacks proper Data Science libraries and tools, and it does not have proper tools for visualization
Readability of Code | Readability, maintenance, and familiarity of code are better in Python API | In Scala API, it is easy to make internal changes since Spark is written in Scala
Complexity | Python API has an easy, simple and comprehensive interface | Scala, in fact, produces verbose output, and hence it is considered a complex language
Machine Learning Libraries | Python is preferred for implementing Machine Learning algorithms | Scala is preferred when you have to implement Data Engineer technologies rather than Machine Learning

Once you have chosen between Python and Scala for use with Apache Spark, the next step is installation. The following sections cover the installation and configuration of PySpark.

Installation and Configuration

Before installing Apache Spark, make sure that you have Java and Scala installed on your system. If you don’t, don’t worry: this tutorial walks you through the whole installation right from the basics.

To install Java and Scala on your system, go through the ‘Installation of Java’ and ‘Installation of Scala’ tutorials by Intellipaat. These tutorials provide step-by-step guides to install and get started with Java and Scala.

Setting up PySpark Environment

Installation on Linux

Step 1: Download the latest version of Apache Spark from the official Apache Spark website. After downloading, locate the file in the Downloads folder of your system

Step 2: Extract the Spark tar file using the following command (the archive name assumes the spark-2.4.0-bin-hadoop2.7 build used in the next step)

$ tar xvf spark-2.4.0-bin-hadoop2.7.tgz

Step 3: The extracted Spark folder is in your Downloads folder by default. Use the following commands to move it to a folder of your choice, here /usr/local/spark

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark
# exit

Step 4: Set the path for PySpark by adding the following line to your ~/.bashrc file (note that there must be no spaces around the = sign):

export PATH=$PATH:/usr/local/spark/bin

Step 5: Reload the environment so that the new path takes effect, using the following command:

$ source ~/.bashrc

Step 6: Verify the Spark installation using the following command:

$ spark-shell

You will get the following output if the installation was successful:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

Step 7: Invoke PySpark shell by running the following command in the Spark directory:

# ./bin/pyspark
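
If one of the commands above is not found, it usually means that the PATH setting from Step 4 did not take effect. The following plain-Python sketch checks whether the relevant commands are visible on the PATH. The function name check_spark_environment is hypothetical and used only for illustration; it is not part of Spark.

```python
import os
import shutil


def check_spark_environment(commands=("java", "spark-shell", "pyspark")):
    """Map each command name to its full path, or to None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


report = check_spark_environment()
for name, location in report.items():
    print(f"{name:12} {location if location else 'NOT FOUND'}")

# SPARK_HOME is optional on Linux when the bin folder is already on PATH.
print(f"{'SPARK_HOME':12} {os.environ.get('SPARK_HOME', 'not set')}")
```

Every entry reported as NOT FOUND points to a command whose folder still has to be added to PATH.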

Installation on Windows

In this section, you will come to know how to install PySpark on Windows systems step by step.

Step 1: Download the latest version of Spark from the official Spark website
Step 2: Extract the downloaded file into a new directory
Step 3: Set variables as follows:

  • User Variables:
    • Variable: SPARK_HOME
    • Value: C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7
  • System variables:
    • Variable: PATH
    • Value: C:\Windows\System32;C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7\bin

Step 4: Download the Windows utilities (winutils.exe) from their GitHub page and move the file to C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7\bin

Step 5: Now, you can start the spark-shell by typing in the following command in the cmd:

spark-shell

Step 6: To start Pyspark shell, type in the following command:

pyspark

After completing the above steps, the PySpark shell starts, prints the Spark banner, and shows the >>> prompt.

Now that you have got your PySpark shell up and running, check out how to use PySpark shell and perform various operations on files and applications in PySpark.

But before starting to use the PySpark shell, there are some configuration settings that you need to take care of. Moving forward in the tutorial, learn about SparkConf.

SparkConf

What is SparkConf?

Before running any Spark application on a local cluster or on a dataset, you need to set some configurations and parameters. This is done with the help of SparkConf. As the name suggests, it offers configurations for any Spark application.

Features of SparkConf and Their Uses

Here is a list of some of the most commonly used methods of SparkConf while working with PySpark:

  • set(key, value): This method is used to set a configuration property.
  • setMaster(value): This method is used to set the master URL.
  • setAppName(value): This method is used to set an application name.
  • get(key, defaultValue=None): This method is used to get the configuration value of a key.
  • setSparkHome(value): This method is used to set the Spark installation path.

Following is the code block where some methods of SparkConf are used:

>>> from pyspark.conf import SparkConf
>>> from pyspark.context import SparkContext
>>> conf = SparkConf().setAppName("PySpark App").setMaster("local[2]")
>>> conf.get("spark.master")
>>> conf.get("spark.app.name")

Note: The very first thing any Spark program does is create a SparkContext object that tells the application how to access a cluster. For that, you first need to set up a SparkConf so that the SparkContext object has the configuration information about the application.

You have already seen how to use SparkConf to set the configurations. Now, move ahead to understand what exactly SparkContext is in detail.

SparkContext

What is PySpark SparkContext?

SparkContext is the entry point for any Spark application or functionality. It is the first thing that gets initiated when you run a Spark application. In the PySpark shell, a SparkContext is already available as sc by default, so creating a new SparkContext there will throw an error.

Parameters

SparkContext has some parameters that are listed below:

  • master: The URL of the cluster SparkContext connects to
  • appName: The name of your job
  • sparkHome: The Spark installation directory
  • pyFiles: The .zip or .py files that are sent to the cluster and then added to PYTHONPATH
  • environment: Worker node environment variables
  • batchSize: The number of Python objects represented as a single Java object. To disable batching, set the value to 1; to automatically choose the batch size based on the object size, set it to 0; and to use an unlimited batch size, set it to −1
  • serializer: The RDD serializer
  • conf: A SparkConf object to set all Spark properties
  • profiler_cls: A class of custom profilers used to do profiling; pyspark.profiler.BasicProfiler is the default one

The most widely used parameters among these are Master and AppName. The initial code lines for any PySpark application using the above parameters are as follows:

from pyspark import SparkContext
sc = SparkContext("local", "First App")

After getting done with the configuration settings and initiating a SparkContext object (which the PySpark shell does by default), the next step is to look at the files that your application needs and at SparkFiles, the feature Spark provides for uploading them.

SparkFiles and Class Methods

What is a SparkFile?

SparkFiles is what you use to work with files that you upload to Apache Spark using SparkContext.addFile().

Note: In the examples below, a variable named path holds the location of the dataset. It is built using os.path.join("path", "filename").

Class Methods: How to Use Them?

SparkFiles provides the following two class methods:

  • get(filename): This class method returns the path of a file that you added using SparkContext.addFile() or sc.addFile(). It expects the name of the file, not the path it was added from.
    • Input:
>>> import os
>>> from pyspark import SparkFiles
>>> from pyspark import SparkContext
>>> path = os.path.join("/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7", "Fortune5002017.csv")
>>> sc.addFile(path)
>>> SparkFiles.get("Fortune5002017.csv")
    • Output: the absolute path of Fortune5002017.csv inside the directory where Spark stores added files
  • getRootDirectory(): This class method returns the path of the root directory that contains the files you added using SparkContext.addFile() or sc.addFile().
    • Input:
>>> import os
>>> from pyspark import SparkFiles
>>> from pyspark import SparkContext
>>> path = os.path.join("/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7", "Fortune5002017.csv")
>>> sc.addFile(path)
>>> SparkFiles.getRootDirectory()
    • Output: the path of the root directory in which Spark stores files added with sc.addFile()

Now, you are acquainted with SparkFiles and have understood the basics of what you can do with them. It is time to understand the datasets in Spark. Let’s go on!

What is an RDD?

Introduction to RDDs and the Features of RDDs

When talking about Spark, regardless of the programming language used, the first thing that comes to mind is an RDD.

An RDD is one of the key features of Spark. It stands for Resilient Distributed Dataset. It is a set of elements that are divided across multiple nodes in a cluster to run parallel processing. An RDD can automatically recover from failures.

You cannot make changes to an RDD. However, you can create a new RDD from an existing one with the required changes, or you can perform different types of operations on an RDD.

Following are some features of RDDs:

  • Immutability: An RDD, once created, can’t be changed or modified; however, you can create a new RDD from the existing one if you wish to make any changes.
  • Distributed: The data in an RDD can exist on a cluster and be operated on in parallel.
  • Partitioned: With more partitions, the work gets distributed among more nodes of the cluster, but it also creates overhead in scheduling.

Operations of RDDs

There are certain operations in Spark that can be performed on RDDs. These operations are basically methods. RDDs support two types of operations, namely, Actions and Transformations. Let’s understand them individually with examples.

Note: To implement different operations of RDDs, an RDD is created here using:

RDDName = sc.textFile("path of the file to be uploaded")

The file used in this example is the dataset of the top 5 companies among the Fortune 500 list in the year 2017.

What are Action Operations?

Action operations are applied directly to a dataset: they run a computation on the RDD and return the result to your program. Following are the examples of some Action operations.

  • take(n): This is one of the most used operations on RDDs. It takes a number n as an argument and returns the first n elements of the specified RDD. The following example shows how to use this operation.
    • Input:
>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.take(5)
    • Output: take(5) returns the first five lines of the file as a list, and every line is treated as one element.

  • count(): This operation, as the name suggests, returns the number of elements in an RDD as shown below.
    • Input:
>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.count()
    • Output: the number of lines in the file, since every line is one element
  • top(n): This operation also takes a number n as an argument and returns the n largest elements in descending order. For lines of text, the order is determined by string comparison.
    • Input:
>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.top(2)
    • Output: the two lines that rank highest when the lines are compared as strings

What are Transformation Operations?

Transformation operations are the set of operations that create a new RDD by applying an operation to an existing RDD. They are evaluated lazily: Spark runs them only when an Action needs the result. Following are the examples of some Transformation operations:

  • Map Transformation: Use this operation when you need to transform each element of an RDD by applying a function to every element. For example, if you have to uppercase all the words in the dataset, then you can use the map transformation.
    • Input:
>>> def Func(lines):
...     lines = lines.upper()
...     lines = lines.split()
...     return lines
...
>>> rdd1 = rdd.map(Func)
>>> rdd1.take(5)
    • Output: every line of the original RDD is uppercased and split on whitespace into a list of words
  • Filter Transformation: This transformation operation can be used when you want to remove some elements from your dataset. In this example, the elements to remove are collected in a list named stop_words, and you can define your own set of stop_words. A line is removed only if it matches one of these strings exactly. For example, you can remove some elements from your dataset as shown below.
    • Input:
>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.top(6)
>>> stop_words = ['Rank, Title, Website, Employees, Sector', '1, Walmart, http://www.walmart.com, 2300000, Retailing']
>>> rdd1 = rdd.filter(lambda x: x not in stop_words)
>>> rdd1.take(4)
    • Output: the first four elements that remain after the lines listed in stop_words are removed

After learning about RDDs and understanding the operations that you can perform on RDDs, the next question is what else you can do using the datasets in Spark.

As discussed earlier, Spark is a great tool for real-time data processing and computation, but that is not the only thing Spark is widely known for. Spark is popular for Machine Learning as well. Analyzing the provided datasets and predicting the end results using Machine Learning algorithms are also things that you can do on the Spark framework.

Further, learn about Machine Learning in Spark with Python.

Machine Learning (MLlib) in PySpark

What is MLlib?

PySpark has a Machine Learning API, MLlib, which supports various kinds of algorithms. Some of these algorithms are listed below:

Algorithms in PySpark MLlib

  • mllib.classification: The spark.mllib package supports various methods for binary classification, multiclass classification, and regression analysis. Some of the most used classification algorithms are Naive Bayes, decision trees, etc.
  • mllib.clustering: Clustering groups subsets of entities on the basis of similarities among the elements or entities.
  • mllib.linalg: This package offers MLlib utilities to support linear algebra.
  • mllib.recommendation: This package is used for recommender systems to fill in the missing entries in a dataset.
  • spark.mllib: This supports collaborative filtering, where Spark uses ALS (Alternating Least Squares) to predict the missing entries in the sets of descriptions of users and products.

Use Cases of ‘Spark with Python’ in Industries

Apache Spark is one of the most used tools in various industries. Its use is not limited to the IT industry, although that is where it is used the most. Even the big names of the IT industry, e.g., Oracle, Yahoo, Cisco, and Netflix, use Apache Spark for dealing with Big Data.

Use cases of Spark in other industries:

  • Finance: PySpark is used in this sector as it helps gain insights from call recordings, emails, and social media profiles.
  • E-commerce: Apache Spark with Python can be used in this sector for gaining insights into real-time transactions. It can also be used to enhance recommendations to users based on new trends.
  • Healthcare: Apache Spark is being used to analyze patients’ medical records, along with the past medical history, and then make predictions on the most likely health issues those patients might face in the future.
  • Media: An example of this is Yahoo. Spark is being used at Yahoo to design its news pages for the targeted audience using Machine Learning features provided by Spark.

You have almost come to the end of this tutorial on ‘What is PySpark?’ Now, just check out the recommended audience at whom this tutorial is targeted.

Recommended Audience

The target audience of this tutorial includes:

  • Professionals who are experienced in Python and want to learn how to use it for Big Data
  • Professionals who are interested in making a career in Big Data
  • Big Data Analysts, Data Scientists, and Data Engineers

Prerequisites

  • Learning prerequisites:
    • Sound programming knowledge and experience in any programming language, preferably in Python
    • Basic knowledge of Apache Spark, Hadoop, and Scala
    • Being familiar with what big data is will be helpful to understand this tutorial better
  • Software prerequisites:
    • Java and Scala installed
    • Python installed
    • Apache Spark

Conclusion

Apache Spark has so many use cases in various sectors that it was only a matter of time until the Apache Spark community came up with an API to support one of the most popular, high-level, and general-purpose programming languages, Python. Not only is Python easy to learn and use, with its English-like syntax, it also has a huge community of users and supporters. So, being able to use all the key features of Python in the Spark framework, while also using the building blocks and operations of Spark through Apache Spark’s Python API, is truly a gift from the Apache Spark community.
