PySpark – Apache Spark with Python

Being able to analyse huge datasets is one of the most valuable technical skills these days, and this tutorial will bring you up to speed on one of the most used technologies for the job, Apache Spark, combined with one of the most popular programming languages, Python. In this 'What is PySpark' tutorial, we will also answer some of the most frequently asked questions about Spark with Python, for example:

  • Which programming language is more beneficial when used with Spark?
  • How to integrate Python with Spark?
  • What are the basic operations and building blocks of Spark that we can use in Python through PySpark?
  • Examples of these operations

Watch this PySpark video for beginners – What is PySpark?


This tutorial walks you through everything from an overview of Apache Spark and the installation and configuration of PySpark to SparkConf, SparkContext, SparkFiles, RDD operations, and machine learning with MLlib, so feel free to jump right to the section you need.

Study Apache Spark with Cloudera Spark Training and master the skills of an Apache Spark specialist.
In this PySpark tutorial, we will use the Fortune 500 dataset and implement the code examples on it. This dataset contains information about the top five companies in the Fortune 500 ranking for the year 2017. It includes attributes such as Rank, Title, Website, Employees, and Sector. The dataset looks as follows:

Rank | Title | Website | Employees | Sector
1 | Walmart | http://www.walmart.com | 2300000 | Retailing
2 | Berkshire | http://www.berkshirehathaway.com | 367700 | Financials
3 | Apple | http://www.apple.com | 116000 | Technology
4 | Exxon Mobil | http://www.exxonmobil.com | 72700 | Energy
5 | McKesson | http://www.mckesson.com | 68000 | Wholesalers


Let’s start off by understanding what Apache Spark is.

Overview of Apache Spark

Apache Spark, as you might have heard, is a general-purpose engine for big data analysis, processing, and computation. It provides several advantages over MapReduce: it is faster, easier to use, simpler, and runs virtually everywhere. It has built-in tools for SQL, machine learning, and streaming, which make it one of the most popular and most in-demand tools in the IT industry. Spark is written in the Scala programming language. Apache Spark has APIs for Python, Scala, Java, and R, though the most widely used with Spark are the first two. In this tutorial, we will learn how to use the Python API with Apache Spark.

Check out this insightful Spark Tutorial for Beginners video:


This video will help you understand Spark better, along with its various components, versions, and frameworks.

Want to gain detailed knowledge of Hadoop? Read this great Spark Tutorial!

What is PySpark?

PySpark is a Python API for Spark released by the Apache Spark community to support Python with Spark. Using PySpark, one can easily integrate and work with RDDs in the Python programming language as well. There are numerous features that make PySpark such an amazing framework when it comes to working with huge datasets: whether it is to perform computations on large datasets or simply to analyze them, data engineers are turning to this tool. Some of these features are listed below.

Key features of PySpark

  • Real-time computations: Because of in-memory processing in the PySpark framework, it shows low latency.
  • Polyglot: The PySpark framework is compatible with various languages such as Scala, Java, Python, and R, which makes it one of the most preferred frameworks for processing huge datasets.
  • Caching and disk persistence: The PySpark framework provides powerful caching and very good disk persistence.
  • Fast processing: The PySpark framework is way faster than other traditional frameworks for big data processing.
  • Works well with RDDs: Python is dynamically typed, which helps when working with RDDs (see the small sketch after this list).
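As a tiny illustrative sketch of the last two points (assuming a running PySpark shell, where sc is already available), an RDD of mixed Python types can be cached and processed without any type declarations:

>>> mixed = sc.parallelize([1, "two", 3.0, ("four", 4)]).cache()  # cache() keeps the RDD in memory for reuse
>>> mixed.map(str).collect()  # dynamic typing: no type declarations needed
['1', 'two', '3.0', "('four', 4)"]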

We will learn a lot more about RDDs and Python together further along in this tutorial.

Why PySpark?

Need of PySpark

The more solutions there are to deal with big data, the better. But if we have to switch tools to perform different types of operations on big data, then having a lot of tools to perform a lot of different tasks does not sound very appealing anymore, does it?
It just sounds like a lot of hassle to go through to deal with huge datasets. Then came some scalable and flexible tools to crack big data and gain benefits from it. One of those amazing tools is Apache Spark. Now, it is no secret that Python is one of the most widely used programming languages among data scientists, data analysts, and many other IT experts. Whether that is because of its simple and interactive interface, because it is easy to learn, or because it is a general-purpose language is a secondary matter; what matters is that data science folks trust it to perform data analysis, machine learning, and many other tasks on big data. So, it is pretty obvious that combining Spark and Python would rock the world of big data, isn't it?
That is exactly what the Apache Spark community did when they came up with PySpark, which is basically a Python API for Apache Spark.
If you have any problems related to Spark and Hadoop, kindly refer to our Big Data Hadoop & Spark Community.

Spark with Python vs Spark with Scala

As we have already discussed, Python is not the only programming language that can be used with Apache Spark. Being one of the most popular frameworks for big data analysis, Spark has gained so much popularity that we would not be shocked if it became the de facto framework for evaluating and dealing with large datasets and machine learning in the coming years.
Data science folks already prefer Spark because of the several benefits it has over other big data tools, but choosing which language to use with Spark is a dilemma they face whenever they pick this framework.
The most used programming languages with Spark are Python and Scala. If you are going to learn PySpark, that is, Spark with Python, then it is important that you know why and when to use Spark with Python instead of Spark with Scala. In this section, we will go over the basic criteria one should keep in mind while choosing between Python and Scala for working with Apache Spark.
Now, let's compare Python and Scala in detail against these criteria:

Criteria | Python with Spark | Scala with Spark

Performance speed | Python is comparatively slower than Scala when used with Spark, but programmers can do much more with Python because of the easy interface it provides. | Spark is written in Scala, so it integrates well with Scala and is faster than Python.
Learning curve | Python is known for its easy syntax, and being a high-level language makes it easier to learn; it is also highly productive despite its simple syntax. | Scala has an arcane syntax, which makes it hard to learn, but once you get a hold of it you will see that it has its own benefits.
Data science libraries | In the Python API, you don't have to worry about visualizations or data science libraries; you can easily port the core parts of R to Python as well. | Scala lacks proper data science libraries and tools and does not have proper local tools and visualizations.
Readability of code | Readability, maintenance, and familiarity of code are better in the Python API. | In the Scala API, it is easy to make internal changes since Spark itself is written in Scala.
Complexity | The Python API has an easy, simple, and comprehensive interface. | Scala's syntax and the verbose output it produces are why it is considered a complex language.
Machine learning libraries | Python is preferred for implementing machine learning algorithms. | Scala is preferred when you have to implement data engineering technologies rather than machine learning.

After choosing between Python and Scala for use with Apache Spark, the next step is installation. Let's start with the installation and configuration of PySpark.

Watch this Apache Spark for Beginners video by Intellipaat

Installation and configuration

Before installing Apache Spark, you need to make sure that you have Java and Scala installed on your system. If you do not have Java and Scala installed already, don't worry; we won't skip any part and will walk you through the whole installation right from the basics.
To install Java and Scala on your system, all you have to do is go through the 'Installation of Java' and 'Installation of Scala' tutorials by Intellipaat. These tutorials provide a step-by-step guide to installing and getting started with Java and Scala.
Prepare yourself for the Top Hadoop Interview Questions And Answers Now!

Setting up PySpark Environment

Installation on Linux:

Step 1:
Download the latest version of Apache Spark from the official Apache Spark website and locate the file in the Downloads folder of your system.
Step 2:
Using the following command, extract the Spark tar file
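Assuming the downloaded archive is spark-2.4.0-bin-hadoop2.7.tgz (the version used in the later steps; adjust the file name to match your download), the extraction command is:

$ tar xvf spark-2.4.0-bin-hadoop2.7.tgz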
Step 3:
After extracting the files, use the following commands to move the Spark folder to the directory of your choice (here, /usr/local/spark), since by default it will be in your Downloads folder:

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark
# exit

Step 4:
Set the path for PySpark by adding the following line to your ~/.bashrc file:

export PATH=$PATH:/usr/local/spark/bin

Step 5:
To set up the environment for PySpark, reload your ~/.bashrc with the following command:

$ source ~/.bashrc

Step 6:
Verify the Spark installation using the following command

$ spark-shell

You will get the following output if the installation was successful

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
>>>

Step 7:
Invoke the PySpark shell by running the following command in the Spark directory:

# ./bin/pyspark

Installation on Windows:

Step 1: Download the latest version of Spark from the official Spark website.
Step 2: Extract the downloaded file into a new directory
Step 3: Set the variables as follows:
User Variables:

  • Variable: SPARK_HOME
  • Value: C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7

System variables:

  • Variable: PATH
  • Value: C:\Windows\System32;C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7\bin

Step 4: Download the Windows utilities by clicking here and move them to C:\Program Files (x86)\spark-2.4.0-bin-hadoop2.7\bin
When you click on the link provided above to download the Windows utilities, it will take you to a GitHub page where you can download them.
Step 5: Now you can start the Spark shell by typing the following command in the Command Prompt:

spark-shell

Step 6: To start the PySpark shell, type in the following command:

pyspark

After completing the above steps, the PySpark shell starts up and leaves you at the >>> prompt.
Now that we have our PySpark shell up and running, we will learn how to use it and how to perform various operations on files and applications in PySpark. Before we start with that, there are some configuration settings that need to be taken care of. Moving forward in the tutorial, let's understand how to do that.

SparkConf

What is SparkConf

Before running any Spark application on a local cluster or a dataset, we need to set some configurations and parameters. This is done with the help of SparkConf, which, as the name suggests, offers the configuration for any Spark application.

Features of SparkConf and their uses

We have listed below some of the most commonly used methods of SparkConf while working with PySpark:

  • set(key, value):

This method is used to set a configuration property.

  • setMaster(value):

This method is used to set the master URL.

  • setAppName(value):

This method is used to set the application name.

  • get(key, defaultValue=None):

This method is used to get the configuration value of a key.

  • setSparkHome(value):

This method is used to set the Spark installation path.

Following is a code block where I have used some of these methods of SparkConf:

>>> from pyspark.conf import SparkConf
>>> from pyspark.context import SparkContext
>>> conf = SparkConf().setAppName("PySpark App").setMaster("local[2]")
>>> conf.get("spark.master")
>>> conf.get("spark.app.name")

Note: The very first thing any Spark program does is create a SparkContext object, which tells the application how to access a cluster. For that to happen, you first need to implement SparkConf so that the SparkContext object has the configuration information about the application.
We have already seen how to use SparkConf to set the configurations, now let’s understand what exactly is SparkContext in detail.

SparkContext

What is PySpark SparkContext

SparkContext is the entry point for any Spark functionality. It is the first and foremost thing that gets initiated when we run any Spark application. In PySpark, SparkContext is available as sc by default, so creating a new SparkContext inside the shell will give an error.

Parameters

SparkContext has some parameters that we have listed down below:

  • Master

The URL of the cluster it connects to.

  • appName

The name of your job.

  • SparkHome

SparkHome is a Spark installation directory.

  • pyFiles

.zip or .py files to send to the cluster and add to PYTHONPATH.

  • Environment

Environment variables for the worker nodes.

  • BatchSize

BatchSize is the number of Python objects represented as a single Java object. Set it to 1 to disable batching, to 0 to automatically choose the batch size based on object sizes, or to -1 to use an unlimited batch size.

  • Serializer

This parameter specifies the RDD serializer to use.

  • Conf

An object of L{SparkConf} to set all the Spark properties.

  • profiler_cls

A class of custom profiler used to do profiling; pyspark.profiler.BasicProfiler is the default one.
The most widely used parameters among the above are Master and AppName. The initial lines of code for any PySpark application using these two parameters are as follows:

from pyspark import SparkContext
sc = SparkContext("local", "First App")
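As a small follow-up sketch of how a few of the other parameters listed above fit in (helpers.py and the DATA_DIR variable are hypothetical placeholders, and this assumes a standalone script rather than the interactive shell, where sc already exists):

from pyspark import SparkConf, SparkContext

# Build the configuration first, then hand it to the context
conf = SparkConf().setAppName("First App").setMaster("local[2]")

# pyFiles ships extra Python modules to the cluster; environment sets
# environment variables on the worker nodes (placeholder values here)
sc = SparkContext(conf=conf,
                  pyFiles=["helpers.py"],
                  environment={"DATA_DIR": "/tmp/data"})

print(sc.appName)  # prints: First App
sc.stop()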

After getting done with the configuration settings and initiating a SparkContext object (which the PySpark shell does for you by default), let's turn to the files in the application that we want to run on PySpark and understand how we can use a feature called SparkFiles, provided by Spark, to upload those files.

SparkFiles and class methods

What is SparkFiles?

SparkFiles is what we use when we want to upload our files to Apache Spark using SparkContext.addFile().
Note: I have created a variable named path that points to my dataset using os.path.join("path", "filename"), and I have used this variable while demonstrating the class methods of SparkFiles.

Classmethods and how to use them

SparkFiles contains the following two class methods:

  • get(filename)

This class method is used to get the path of the file that we added using SparkContext.addFile() or sc.addFile().
Input:

>>> import os
>>> from pyspark import SparkFiles
>>> from pyspark import SparkContext
>>> path = os.path.join("/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7", "Fortune5002017.csv")
>>> sc.addFile(path)
>>> SparkFiles.get(path)
Output:
  • getRootDirectory()

It is used to get the path to the root directory that contains the files added through SparkContext.addFile() or sc.addFile().
Input:

>>> import os
>>> from pyspark import SparkFiles
>>> from pyspark import SparkContext
>>> path = os.path.join("/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7", "Fortune5002017.csv")
>>> sc.addFile(path)
>>> SparkFiles.getRootDirectory()

Output:
Now that we are acquainted with SparkFiles and have understood the basics of what we can do with files in Spark, it seems natural to discuss the datasets in Spark next, doesn't it?
Let’s move forward in the tutorial to do just that.

What is RDD

Introduction to RDD and features of RDD

When we talk about Spark, no matter which programming language we use, the first thing that comes to mind is RDD.
RDD is one of the key features of Spark. It stands for Resilient Distributed Dataset. It is a set of elements divided across multiple nodes in a cluster so that they can be processed in parallel, and it can automatically recover from failures. We can create an RDD, but we cannot make changes to it; instead, we create a new RDD from the existing one with the required changes, or we perform different kinds of operations on the RDD.
Following are some features of RDDs:
Immutability: An RDD, once created, cannot be changed or modified; however, you can create a new RDD from the existing one if you wish to make any changes.
Distributed: The data in an RDD can exist across a cluster and be operated on in parallel.
Partitioned: With more partitions, the work gets distributed among more nodes of the cluster, but it also creates more scheduling overhead.

Operations of RDD

Note: To demonstrate the RDD operations, I have created an RDD using
RDDName = sc.textFile("path of the file to be uploaded"). The file that I have used is the dataset of the top five Fortune 500 companies in the year 2017.
There are certain operations in Spark that can be performed on RDDs. Operations are basically methods applied to an RDD to perform certain tasks. RDDs support two types of operations, namely Actions and Transformations. Let's understand them individually with examples.

What are Action Operations?

While Transformation operations create new RDDs from one another, Action operations are applied directly on the datasets to perform certain computations. Following are examples of some Action operations:

  • take(n)

This is one of the most used operations on RDDs. It takes a number as an argument and returns that many elements from the specified RDD.
You can refer to the following example to see how this operation is used.
Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.take(5)

Output:
After running take(5), the first five lines of the dataset are returned as a list, and every line is treated as one element.
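As a rough illustration, assuming the CSV stores its columns comma-separated in the same format used by the filter example later in this tutorial, the returned list would look something like:

['Rank, Title, Website, Employees, Sector',
 '1, Walmart, http://www.walmart.com, 2300000, Retailing',
 '2, Berkshire, http://www.berkshirehathaway.com, 367700, Financials',
 '3, Apple, http://www.apple.com, 116000, Technology',
 '4, Exxon Mobil, http://www.exxonmobil.com, 72700, Energy']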

  • count()

This operation, as the name suggests, returns the number of elements in an RDD, as shown in the following example.
Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.take(5)
>>> rdd.count()
Output:

  • top(n)

This operation also takes a number as an argument and returns the top n elements of the RDD.
Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.top(2)
Output:

What are Transformation Operations?

Transformations are the set of operations that create a new RDD, either by applying an operation to an existing RDD or by creating an entirely new one.
Following are the examples of some Transformation operations:

  • A map Transformation

We use this operation when we need to transform each element of an RDD by applying the same function to every element.
For example, if I have to convert all the words in my dataset to upper case, I can use the map transformation. Let's see how.

Input:

>>> def Func(lines):
...     lines = lines.upper()
...     lines = lines.split()
...     return lines
>>> rdd1 = rdd.map(Func)
>>> rdd1.take(5)
Output:
As the output shows, all the words in the original RDD have been converted to upper case with the help of the map transformation.

  • Filter Transformation

This transformation operation is used when you want to remove some elements from your dataset. The elements to be removed are often called stop_words, and we define our own set of them.
For example, I will remove some elements from my dataset. You can refer to the following example to see how.

Input:

>>> from pyspark import SparkContext
>>> rdd = sc.textFile("C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv")
>>> rdd.top(6)
>>> stop_words = ['Rank, Title, Website, Employees, Sector', '1, Walmart, http://www.walmart.com, 2300000, Retailing']
>>> rdd1 = rdd.filter(lambda x: x not in stop_words)
>>> rdd1.take(4)

Output:
After learning about RDDs and the operations you can perform on datasets with them, the next question that comes to mind is: what else can we do with datasets in Spark?
As discussed above, Spark is a great tool for real-time data processing and computation, but that is not all it is widely known for. Spark is popular for machine learning as well. Analyzing datasets and predicting results with machine learning algorithms is also something you can do in the Spark framework. Let's learn more about machine learning in Spark with Python, that is, PySpark.
Intellipaat provides the most comprehensive Cloudera Spark course to accelerate your career!

Machine Learning (MLlib) in PySpark

What is MLlib

PySpark has a machine learning API called MLlib that supports various kinds of algorithms. Some of them are listed below.

Algorithms in PySpark MLlib

  • mllib.classification:

The spark.mllib package offers support for various methods to perform binary classification, regression analysis, and multiclass classification. Some of the most used classification algorithms are Naive Bayes, Decision Tree, etc.

  • mllib.clustering:

In clustering, we group subsets of entities on the basis of some similarity among the elements or entities (see the k-means sketch after this list).

  • mllib.linalg:

This module offers MLlib utilities to support linear algebra.

  • mllib.recommendation:

This module is used to build recommender systems, which fill in the missing entries in any dataset.

  • spark.mllib:

spark.mllib supports collaborative filtering, in which Spark uses ALS (Alternating Least Squares) to learn latent descriptions of users and products and uses them to predict missing entries.
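As a minimal sketch of how one of these modules is used from PySpark, k-means clustering from mllib.clustering can be trained on an RDD of points. The numeric points below are a made-up illustrative sample loosely based on the employee counts and ranks from the Fortune 500 table above:

>>> from numpy import array
>>> from pyspark.mllib.clustering import KMeans
>>> # Each point is (employees, rank); the values are illustrative only
>>> points = sc.parallelize([
...     array([2300000.0, 1.0]), array([367700.0, 2.0]),
...     array([116000.0, 3.0]), array([72700.0, 4.0]), array([68000.0, 5.0])])
>>> # Train a k-means model that groups the points into two clusters
>>> model = KMeans.train(points, 2, maxIterations=10)
>>> model.predict(array([100000.0, 3.0]))  # returns the index of the closest cluster

Note that mllib works on RDDs, which is why the points are wrapped in sc.parallelize(); the newer DataFrame-based machine learning API lives in the pyspark.ml package.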

Use cases of ‘Spark with Python’ in Industries

Apache Spark is one of the most used tools across various industries. Its use is not limited to the IT industry, though that is where it is used the most. Even the biggest names in IT, for example Oracle, Yahoo, Cisco, and Netflix, use Apache Spark to deal with Big Data.

Use cases of Spark in other industries

  • Finance:

PySpark is used in this sector as it helps gain insights from call recordings, emails, and social media profiles.

  • e-commerce:

Apache Spark with Python can be used in this sector to gain insights into real-time transactions. It can also be used to enhance recommendations to users based on new trends.

  • Healthcare:

Apache Spark is used to analyze patients' medical records, along with their past medical history, and then predict the health issues they are most likely to face in the future.

  • Media industry:

Yahoo is one example: Spark is used at Yahoo to design its news webpages for targeted audiences using the machine learning features provided by Spark.

Recommended audience

The target audience of this tutorial is as follows:

  • Professionals who are experienced in Python and want to learn how to use it for big data
  • Professionals who are interested in making a career in big data
  • Big data analysts, data scientists, and data engineers

Prerequisite

  • Learning Prerequisites:

    • Sound programming knowledge and experience in any programming language, preferably Python.
    • Basic knowledge of Apache Spark, Hadoop, and Scala.
    • Familiarity with what big data is will also help you understand this tutorial better.
  • Software prerequisites:

    • Java and Scala installed
    • Python installed
    • Apache Spark

Conclusion

Apache Spark has so many use cases in various sectors that it was only a matter of time before the Apache Spark community came up with an API to support Spark in one of the most popular, high-level, general-purpose programming languages: Python. Not only is Python easy to learn and use with its English-like syntax, it also has a huge community of users and supporters. So, being able to get the benefit of all the key features of Python in the Spark framework, while also using the building blocks and operations of Spark from Python through Apache Spark's Python API, is truly a gift from the Apache Spark community.
For more in-depth knowledge of Apache Spark with Python, check out the Spark, Scala, and Python certification training by Intellipaat. Enroll today and kick-start your career by learning one of the most popular frameworks for dealing with Big Data.
That is all for this tutorial; we hope you learned something new.
