PySpark SQL User Handbook
Are you a programmer looking for a powerful tool to work on Spark? If yes, then you must take PySpark SQL into consideration. This PySpark SQL cheat sheet is designed for those who have already started learning about and using Spark and PySpark SQL. If you are one of them, this sheet will be a handy reference for you.
However, don’t worry if you are a beginner and have no idea about how PySpark SQL works. This cheat sheet will give you a quick reference to all keywords, variables, syntax, and all the basics that you must know.
Initializing SparkSession
Use the below command lines to initialize the SparkSession:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
        .appName("PySpark SQL") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
Creating DataFrames
To create DataFrames, and to infer and specify schemas, you need to follow these code guidelines:
# Import the Row class from the pyspark.sql module
>>> from pyspark.sql import Row
col1 | col2
---- | ----
row1 | 3
row2 | 4
row3 | 5
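As a quick illustration (a minimal sketch that assumes the SparkSession and Row import above; the rows variable name is ours and the values simply mirror the sample table), this data can be created directly as a DataFrame from Row objects:
>>> rows = [Row(col1="row1", col2=3), Row(col1="row2", col2=4), Row(col1="row3", col2=5)]
>>> df = spark.createDataFrame(rows)
>>> df.show()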
Inferring Schema
>>> sc = spark.sparkContext
>>> A = sc.textFile("Filename.txt")
>>> B = A.map(lambda x: x.split(","))
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
Specifying Schema
>>> from pyspark.sql.types import StructField, StringType, StructType
>>> C = B.map(lambda a: Row(col1=a[0], col2=a[1].strip()))  # keep col2 as a string to match the StringType schema below
>>> schemaString = "col1 col2"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()
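Alternatively (a small sketch, assuming Spark 2.3 or later), the schema can be supplied as a DDL-formatted string instead of building a StructType by hand; the sample data below is purely illustrative:
>>> df_ddl = spark.createDataFrame([("row1", "3"), ("row2", "4")], "col1 string, col2 string")
>>> df_ddl.printSchema()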
From Spark Data Sources
JSON
>>> df = spark.read.json("table.json")
>>> df.show()
>>> df2 = spark.read.load("table2.json", format="json")
Parquet Files
>>> df3 = spark.read.load("newFile.parquet")
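Equivalently, the format-specific reader can be used instead of the generic load (a brief sketch reusing the file name from the example above):
>>> df3 = spark.read.parquet("newFile.parquet")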
Inspecting Data
You can inspect and perform operations on the entered data with the following command sets:
>>> df.dtypes                # Returns df column names and data types
>>> df.show()                # Displays the content of df
>>> df.head(n)               # Returns the first n rows (df.head() returns the first row)
>>> df.first()               # Returns the first row
>>> df.schema                # Returns the schema of df
>>> df.describe().show()     # Computes the summary statistics
>>> df.columns               # Returns the columns of df
>>> df.count()               # Counts the number of rows in df
>>> df.distinct().count()    # Counts the number of distinct rows in df
>>> df.printSchema()         # Prints the schema of df
>>> df.explain()             # Prints the (logical and physical) plans
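Note that head() returns Row objects rather than a DataFrame; for example (illustrative values, assuming the sample data above):
>>> df.head(2)    # [Row(col1='row1', col2=3), Row(col1='row2', col2=4)]
>>> df.head()     # Row(col1='row1', col2=3)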
Column Operations
These are the basic command sets that you need for performing operations on columns.
Add:
>>> from pyspark.sql.functions import explode
>>> df = df.withColumn('col1', df.table.col1) \
        .withColumn('col2', df.table.col2) \
        .withColumn('col3', df.table.col3) \
        .withColumn('col4', df.table.col4) \
        .withColumn('col5', explode(df.table.col5))
Update:
>>> df = df.withColumnRenamed('col1', 'column1')
Remove:
>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)
Actions
GroupBy:
>>> df.groupBy("col1").count().show()
Filter:
>>> df.filter(df["col2"]>4).show()
Sort:
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("col1", ascending=False).collect()
>>> df.orderBy(["col1", "col3"], ascending=[0, 1]).collect()
Replacing Missing Values:
>>> df.na.fill(20).show()
>>> df.na.drop().show()
>>> df.na.replace(10, 20).show()
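na.fill() also accepts a dict keyed by column name when different columns need different replacement values (a small sketch with illustrative values):
>>> df.na.fill({"col1": "unknown", "col2": 0}).show()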
Repartitioning:
>>> df.repartition(10).rdd.getNumPartitions()   # df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()
SQL Queries
>>> from pyspark.sql import functions as f
Select:
>>> df.select("col1").show()
>>> df.select("col2","col3") \ .show()
When:
>>> df.select("col1", f.when(df.col2> 30, 1) \ .otherwise(0)) \ .show()
>>> df[df.col1.isin("A","B")] .collect()
Running SQL Queries Programmatically
Registering DataFrames as Views:
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
Querying Views
>>> df_one = spark.sql("SELECT * FROM customer")
>>> df_one.show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people")
>>> df_new.show()
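When a view is no longer needed, it can be dropped through the catalog (a brief sketch; the view names follow the examples above):
>>> spark.catalog.dropTempView("customer")
>>> spark.catalog.dropGlobalTempView("people")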
Output Operations
Data Structures:
>>> rdd_1 = df.rdd
>>> df.toJSON().first()
>>> df.toPandas()
Writing and Saving to Files:
df.select("Col1", "Col2")\ .write \ .save("newFile.parquet")
df.select("col3", "col5") \ .write \ .save("table_new.json",format="json")
Stopping SparkSession:
>>> spark.stop()
This PySpark SQL cheat sheet covers almost all of the important concepts. If you are looking to learn PySpark SQL in depth, you should check out the Apache Spark and Scala training certification provided by Intellipaat. In this course, you will work on real-life projects and assignments and thus prepare yourself to become a certified PySpark SQL professional. On top of that, you will have 24/7 technical support from our experts while you do the course at Intellipaat.