PySpark SQL User Handbook

Are you a programmer looking for a powerful tool to work with Spark? If so, you should consider PySpark SQL. This PySpark SQL cheat sheet is designed for those who have already started learning about and using Spark and PySpark SQL. If you are one of them, this sheet will be a handy reference for you.

However, don’t worry if you are a beginner and have no idea how PySpark SQL works. This cheat sheet gives you a quick reference to the keywords, variables, syntax, and basics that you must know.


 

Initializing SparkSession

Use the following commands to initialize a SparkSession:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .appName("PySpark SQL") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()

 

Creating DataFrames

To create DataFrames and to infer or specify schemas, follow the code examples below:

# Import the Row class from the pyspark.sql module
>>> from pyspark.sql import Row

Sample data (the contents of Filename.txt used in the examples below):

col1   col2
row1   3
row2   4
row3   5
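
If you prefer not to read from a file at all, a DataFrame matching the sample table above can also be built directly from Row objects. This is a minimal sketch, assuming the SparkSession created earlier; the name sample_df and the values are illustrative, mirroring the sample data:

>>> from pyspark.sql import Row
>>> rows = [Row(col1="row1", col2=3), Row(col1="row2", col2=4), Row(col1="row3", col2=5)]
>>> sample_df = spark.createDataFrame(rows)
>>> sample_df.show()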

 

Inferring Schema

>>> sc = spark.sparkContext
>>> A = sc.textFile("Filename.txt")
>>> B = A.map(lambda x: x.split(","))
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
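
Here Filename.txt is assumed to contain comma-separated lines such as row1,3. To confirm what Spark inferred, you can print the resulting schema; this is just a quick sanity check on the C_df created above:

>>> C_df.printSchema()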


Specifying Schema

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> C = B.map(lambda a: Row(col1=a[0], col2=a[1].strip()))
>>> schemaString = "col1 col2"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()
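
Note that the schema above declares every field as StringType. If you want col2 to be numeric, you can spell out typed fields instead. The sketch below is illustrative (the names typed_schema and typed_rows are not part of the original example) and assumes the RDD B from the previous step:

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> typed_schema = StructType([
...     StructField("col1", StringType(), True),
...     StructField("col2", IntegerType(), True)])
>>> typed_rows = B.map(lambda a: (a[0], int(a[1].strip())))
>>> spark.createDataFrame(typed_rows, typed_schema).printSchema()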


 

From Spark Data Sources

JSON

>>> df = spark.read.json("table.json")
>>> df.show()
>>> df2 = spark.read.load("table2.json", format="json")
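
The JSON reader expects one JSON object per line by default. If your file holds a single JSON document spread over several lines, pass multiLine=True; the file name below is only an example:

>>> df_ml = spark.read.json("table_multiline.json", multiLine=True)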

 

Parquet Files

>>> df3 = spark.read.load("newFile.parquet")
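
Because Parquet is Spark's default data source, the shorthand reader works as well; this is equivalent to the load() call above:

>>> df3 = spark.read.parquet("newFile.parquet")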


 

Inspecting Data

You can inspect and perform operations on the data with the following commands:

>>> df.dtypes                    -- Returns the column names and data types of df
>>> df.show()                    -- Displays the contents of df
>>> df.head(n)                   -- Returns the first n rows
>>> df.first()                   -- Returns the first row
>>> df.schema                    -- Returns the schema of df
>>> df.describe().show()         -- Computes summary statistics
>>> df.columns                   -- Returns the columns of df
>>> df.count()                   -- Counts the number of rows in df
>>> df.distinct().count()        -- Counts the number of distinct rows in df
>>> df.printSchema()             -- Prints the schema of df
>>> df.explain()                 -- Prints the (logical and physical) plans
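
describe() can also be restricted to particular columns if you only care about a few; a small example, assuming the col2 column used throughout this sheet:

>>> df.describe("col2").show()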


 

Column Operations

These are the basic commands you need for performing operations on columns.

Add:

>>> from pyspark.sql.functions import explode
>>> df = df.withColumn('col1', df.table.col1) \
...        .withColumn('col2', df.table.col2) \
...        .withColumn('col3', df.table.col3) \
...        .withColumn('col4', df.table.col4) \
...        .withColumn('col5', explode(df.table.col5))
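
New columns do not have to come from an existing struct; they can also be constant values or expressions built with the functions module. The column names below (col6, col7) are purely illustrative:

>>> from pyspark.sql.functions import lit, col
>>> df = df.withColumn('col6', lit(0)) \
...        .withColumn('col7', col('col2') * 2)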

Update:

>>> df = df.withColumnRenamed('col1', 'column1')

Remove:

>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)

DataFrame Operations

GroupBy:

>>> df.groupBy("col1").count().show()
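
Beyond count(), grouped data can be aggregated with agg(). A minimal sketch, assuming the col1 and col2 columns from the sample data:

>>> from pyspark.sql import functions as f
>>> df.groupBy("col1").agg(f.avg("col2"), f.max("col2")).show()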

Filter:

>>> df.filter(df["col2"]>4).show()

Sort:

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("col1", ascending=False).collect()
>>> df.orderBy(["col1", "col3"], ascending=[0, 1]).collect()

Replacing Missing Values:

>>> df.na.fill(20).show()
>>> df.na.drop().show()
>>> df.na.replace(10, 20).show()
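
fill() also accepts a dictionary when different columns need different replacement values; the replacements shown here are only an example:

>>> df.na.fill({"col1": "unknown", "col2": 0}).show()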

Repartitioning:

>>> df.repartition(10).rdd.getNumPartitions()    -- Repartitions df into 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()        -- Coalesces df into 1 partition

SQL Queries

>>> from pyspark.sql import functions as f

Select:

>>> df.select("col1").show()
>>> df.select("col2","col3") \ .show()

When:

>>> df.select("col1", f.when(df.col2> 30, 1) \ .otherwise(0)) \ .show()
>>> df[df.col1.isin("A","B")] .collect()
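
when() calls can be chained before otherwise() to express more than two outcomes; the thresholds and the bucket alias below are illustrative:

>>> df.select("col1", f.when(df.col2 > 30, "high")
...                    .when(df.col2 > 10, "medium")
...                    .otherwise("low")
...                    .alias("bucket")).show()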

Running SQL Queries Programmatically

Registering DataFrames as Views:

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Querying Views

>>> df_one = spark.sql("SELECT * FROM customer").show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people").show()
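
spark.sql() returns an ordinary DataFrame, so any SQL result can be filtered, joined, or chained further. A small example, assuming the customer view registered above with the col1 and col2 columns:

>>> spark.sql("SELECT col1, col2 FROM customer WHERE col2 > 3").show()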


 

Output Operations

Data Structures:

>>> rdd_1 = df.rdd                -- Converts df into an RDD
>>> df.toJSON().first()           -- Converts df into an RDD of JSON strings
>>> df.toPandas()                 -- Returns the contents of df as a pandas DataFrame
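
The conversion also works in the other direction: a pandas DataFrame can be turned back into a Spark DataFrame with createDataFrame() (assuming pandas is installed; the data below is illustrative):

>>> import pandas as pd
>>> pdf = pd.DataFrame({"col1": ["row1", "row2"], "col2": [3, 4]})
>>> spark.createDataFrame(pdf).show()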

Writing and Saving to Files:

df.select("Col1", "Col2")\ .write \ .save("newFile.parquet")
df.select("col3", "col5") \ .write \ .save("table_new.json",format="json")

Stopping SparkSession:

>>> spark.stop()

 


This PySpark SQL cheat sheet covers almost all the important concepts. If you are looking to learn PySpark SQL in depth, you should check out the Apache Spark and Scala training certification provided by Intellipaat. In this course, you will work on real-life projects and assignments, preparing yourself to become a certified PySpark SQL professional. On top of that, you will have 24/7 technical support from our experts while you take the course at Intellipaat.

 
