
PySpark SQL Cheat Sheet

PySpark SQL User Handbook

Are you a programmer looking for a powerful tool to work on Spark? If yes, then you must take PySpark SQL into consideration. This PySpark SQL cheat sheet is designed for those who have already started learning about and using Spark and PySpark SQL. If you are one among them, then this sheet will be a handy reference for you.

However, don’t worry if you are a beginner and have no idea about how PySpark SQL works. This cheat sheet will give you a quick reference to all keywords, variables, syntax, and all the basics that you must know.

Download the printable PDF of this cheat sheet

Initializing SparkSession

Use the following commands to initialize a SparkSession:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
        .appName("PySpark SQL") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
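As a quick sanity check, the short sketch below prints a few standard attributes of the newly created session; it only assumes the `spark` object created above.

# Confirm that the session is up and configured as expected
print(spark.version)                # Spark version string
print(spark.sparkContext.appName)   # "PySpark SQL"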

Creating DataFrames

To create DataFrames and to infer or specify their schemas, follow these code examples:

# import the PySpark Row class from the sql module
>>> from pyspark.sql import Row

Sample data with two columns, col1 and col2 (e.g., the contents of Filename.txt):

row1,3
row2,4
row3,5
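If the sample values are already available in the driver program, a DataFrame can also be built directly from a list of Row objects without going through an RDD. This is a minimal sketch using the values above; the variable name df_direct is chosen only for illustration.

from pyspark.sql import Row

# Build a DataFrame directly from Row objects
rows = [Row(col1="row1", col2=3),
        Row(col1="row2", col2=4),
        Row(col1="row3", col2=5)]
df_direct = spark.createDataFrame(rows)
df_direct.show()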

Inferring Schema

>>> sc = spark.sparkContext
>>> A = sc.textFile("Filename.txt")
>>> B = A.map(lambda x: x.split(","))
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
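With the sample file above, the inferred-schema DataFrame can be inspected as follows; this is a minimal sketch, and the exact output depends on your data.

# Inspect the DataFrame created with an inferred schema
C_df.printSchema()   # col2 is inferred as a numeric type from the Python int values
C_df.show()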

Specifying Schema

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> C = B.map(lambda a: Row(col1=a[0], col2=a[1].strip()))
>>> schemaString = "col1 col2"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()
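Instead of deriving field names from a string, the schema can also be written out with an explicit type per column. This is a sketch assuming the same col1/col2 layout and the B RDD from the inferring-schema example above.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit per-column types instead of all-string fields
explicit_schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", IntegerType(), True),
])
typed_rows = B.map(lambda a: (a[0], int(a[1].strip())))
spark.createDataFrame(typed_rows, explicit_schema).show()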

From Spark Data Sources

JSON

>>> df = spark.read.json("table.json")
>>> df.show()
>>> df2 = spark.read.load("table2.json", format="json")

Parquet Files

>>> df3 = spark.read.load("newFile.parquet")

Inspecting Data

You can inspect the data in a DataFrame and compute basic statistics with the following commands:

>>> df.dtypes                    -- Returns df column names and data types
>>> df.show()                    -- Displays the content of df
>>> df.head(n)                   -- Returns the first n rows
>>> df.first()                   -- Returns the first row
>>> df.schema                    -- Returns the schema of df
>>> df.describe().show()         -- Computes the summary statistics
>>> df.columns                   -- Returns the columns of df
>>> df.count()                   -- Counts the number of rows in df
>>> df.distinct().count()        -- Counts the number of distinct rows in df
>>> df.printSchema()             -- Prints the schema of df
>>> df.explain()                 -- Prints the logical and physical plans

Column Operations

These are the basic commands you need for performing operations on DataFrame columns.

Add:

>>> from pyspark.sql.functions import explode
>>> df = df.withColumn('col1', df.table.col1) \
           .withColumn('col2', df.table.col2) \
           .withColumn('col3', df.table.col3) \
           .withColumn('col4', df.table.col4) \
           .withColumn('col5', explode(df.table.col5))
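New columns do not have to come from existing nested fields; a derived column can also be added with the functions module. This is a minimal sketch, and the column names source and col2_doubled are assumptions for illustration.

from pyspark.sql import functions as F

# Add a constant column and a computed column (col2 doubled)
df = df.withColumn("source", F.lit("cheat_sheet")) \
       .withColumn("col2_doubled", F.col("col2") * 2)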

Update:

>>> df = df.withColumnRenamed('col1', 'column1')

Remove:

>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)

Actions

GroupBy:

>>> df.groupBy("col1") .count()  .show()

Filter:

>>> df.filter(df["col2"]>4).show()

Sort:

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("col1", ascending=False).collect()
>>> df.orderBy(["col1", "col3"], ascending=[0, 1]).collect()

Replacing Missing Values:

>>> df.na.fill(20).show()
>>> df.na.drop().show()
>>> df.na.replace(10, 20).show()
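Missing values can also be filled per column by passing a dictionary, or rows can be dropped only when specific columns are null. A short sketch using the assumed column names:

# Per-column fill values and targeted null dropping
df.na.fill({"col2": 0, "col1": "unknown"}).show()
df.na.drop(subset=["col2"]).show()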

Repartitioning:

>>> df.repartition(10).rdd.getNumPartitions()   -- df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()       -- df with 1 partition

SQL Queries

>>> from pyspark.sql import functions as f

Select:

>>> df.select("col1").show()
>>> df.select("col2","col3") \ .show()

When:

>>> df.select("col1", f.when(df.col2> 30, 1) \ .otherwise(0)) \ .show()
>>> df[df.col1.isin("A","B")] .collect()
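Column expressions also support string matching and range checks. A brief sketch; the patterns used here are assumptions for illustration:

df.select("col1", df.col1.like("row%")).show()               # SQL LIKE pattern
df.filter(df.col2.between(3, 5)).show()                      # inclusive range check
df.select(df.col1.startswith("row").alias("is_row")).show()  # boolean prefix test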

Running SQL Queries Programmatically

Registering DataFrames as Views:

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Querying Views

>>> df_one = spark.sql("SELECT * FROM customer")
>>> df_one.show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people")
>>> df_new.show()
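SQL results are ordinary DataFrames, so they can be chained with the DataFrame API. A minimal sketch; the customer view and the col1/col2 columns are assumed from the examples above:

# Mix SQL and the DataFrame API on the same result
result = spark.sql("SELECT col1, col2 FROM customer WHERE col2 > 3")
result.filter(result.col2 < 100).show()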

Output Operations

Data Structures:

rdd_1 = df.rdd           # Converts df into an RDD of Row objects
df.toJSON().first()      # Converts df into an RDD of JSON strings and returns the first record
df.toPandas()            # Collects df to the driver as a pandas DataFrame

Writing and Saving to Files:

df.select("Col1", "Col2")\ .write \ .save("newFile.parquet")
df.select("col3", "col5") \ .write \ .save("table_new.json",format="json")

Stopping SparkSession:

spark.stop()

Download a Printable PDF of this Cheat Sheet

This PySpark SQL cheat sheet covers almost all of the important concepts. If you are looking to learn PySpark SQL in depth, check out the Apache Spark and Scala training and certification provided by Intellipaat. In this course, you will work on real-life projects and assignments, preparing you to become a certified PySpark SQL professional. On top of that, you will have 24/7 technical support from our experts while you take the course at Intellipaat.

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has over four years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.