PySpark SQL User Handbook

Are you a programmer looking for a powerful tool to work with Spark? If yes, then you must take PySpark SQL into consideration. This PySpark SQL cheat sheet is designed for those who have already started learning about and using Spark and PySpark SQL. If you are one of them, then this sheet will be a handy reference.

However, don’t worry if you are a beginner and have no idea how PySpark SQL works. This cheat sheet will give you a quick reference to the keywords, variables, syntax, and basics that you must know.

Download the printable PDF of this cheat sheet

Learn Apache Spark from Intellipaat’s Cloudera Spark Training and be an Apache Spark Specialist!


Initializing SparkSession

Use the commands below to initialize a SparkSession:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .appName("PySpark SQL") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()

Creating DataFrames

To create DataFrames and to infer or specify schemas, follow the code guidelines below. The examples use a small sample table:

# import the pyspark Row class from the sql module
>>> from pyspark.sql import *

col1    col2
row1    3
row2    4
row3    5
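With the Row class imported, one way to build a DataFrame that matches the sample table above is to pass a list of Row objects to spark.createDataFrame. This is a minimal sketch, not part of the original example; the rows variable and its values simply mirror the table:

# one Row per line of the sample table above
>>> from pyspark.sql import Row
>>> rows = [Row(col1="row1", col2=3),
...         Row(col1="row2", col2=4),
...         Row(col1="row3", col2=5)]
>>> df = spark.createDataFrame(rows)
>>> df.show()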

Inferring Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile("Filename.txt")
>>> parts = lines.map(lambda x: x.split(","))
>>> C = parts.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
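If the file is a plain comma-separated text file, an alternative is to let the built-in CSV reader infer the schema directly. This is a sketch, assuming the same hypothetical Filename.txt with two comma-separated columns and no header row:

# let the CSV reader infer column types instead of mapping Rows by hand
>>> df_inferred = spark.read.csv("Filename.txt", inferSchema=True, header=False)
>>> df_inferred.printSchema()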

Specifying Schema

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> C = parts.map(lambda a: (a[0], a[1].strip()))
>>> schemaString = "col1 col2"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()
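If you also want non-string column types, you can build the StructType by hand instead of generating it from a string of field names. A minimal sketch, assuming the two-column layout used above; typed_schema and typed_df are illustrative names:

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> typed_schema = StructType([
...     StructField("col1", StringType(), True),    # nullable string column
...     StructField("col2", IntegerType(), True)])  # nullable integer column
>>> typed_df = spark.createDataFrame([("row1", 3), ("row2", 4)], typed_schema)
>>> typed_df.printSchema()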

Want to grasp detailed knowledge of Hadoop? Read this extensive Spark Tutorial!

From Spark Data Sources

JSON

>>> df = spark.read.json("table.json")
>>> df.show()
>>> df2 = spark.read.load("table2.json", format="json")

Parquet Files

>>> df3 = spark.read.load("newFile.parquet")
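The generic load method also accepts reader options, which is handy for delimited text files. A sketch, assuming a hypothetical people.csv with a header row:

# read a CSV file through the generic loader, inferring types from the data
>>> df4 = spark.read.load("people.csv", format="csv", sep=",", header=True, inferSchema=True)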

If you have queries related to Spark and Hadoop, kindly refer to our Big Data Hadoop and Spark Community!

Inspecting Data

You can inspect and perform operations on the loaded data with the following commands:

>>> df.dtypes                    # Returns df column names and data types
>>> df.show()                    # Displays the content of df
>>> df.head(n)                   # Returns the first n rows
>>> df.first()                   # Returns the first row
>>> df.schema                    # Returns the schema of df
>>> df.describe().show()         # Computes summary statistics
>>> df.columns                   # Returns the columns of df
>>> df.count()                   # Counts the number of rows in df
>>> df.distinct().count()        # Counts the number of distinct rows in df
>>> df.printSchema()             # Prints the schema of df
>>> df.explain()                 # Prints the (logical and physical) plans

Column Operations

These are the basic commands you need for performing operations on columns.

  • Add:
>>> from pyspark.sql.functions import explode
>>> df = df.withColumn('col1', df.table.col1) \
...        .withColumn('col2', df.table.col2) \
...        .withColumn('col3', df.table.col3) \
...        .withColumn('col4', df.table.col4) \
...        .withColumn('col5', explode(df.table.col5))
  • Update:
>>> df = df.withColumnRenamed('col1', 'column1')
  • Remove:
>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)
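New columns do not have to come from an existing nested field; they can also be computed with the built-in column functions. A minimal sketch, assuming df has a numeric col2 as in the sample data; col_double and col_constant are illustrative names:

# col_double is computed from an existing column; col_constant holds a literal value
>>> from pyspark.sql.functions import lit, col
>>> df = df.withColumn('col_double', col('col2') * 2) \
...        .withColumn('col_constant', lit('x'))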

Actions

  • GroupBy:
>>> df.groupBy("col1").count().show()
  • Filter:
>>> df.filter(df["col2"]>4).show()
  • Sort:
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("col1", ascending=False).collect()
>>> df.orderBy(["col1", "col3"], ascending=[0, 1]).collect()
  • Replacing Missing Values:
>>> df.na.fill(20).show()
>>> df.na.drop().show()
>>> df.na.replace(10, 20).show()
  • Repartitioning:
>>> df.repartition(10).rdd.getNumPartitions()   # df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()
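groupBy can also feed named aggregations instead of a plain count. A minimal sketch using the functions module, assuming df has col1 and a numeric col2 as in the earlier examples:

>>> from pyspark.sql import functions as F
>>> df.groupBy("col1") \
...   .agg(F.count("col2").alias("cnt"),
...        F.avg("col2").alias("avg_col2")) \
...   .show()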

SQL Queries

>>> from pyspark.sql import functions as f
  • Select:
>>> df.select("col1").show()
>>> df.select("col2","col3") \ .show()
  • When:
>>> df.select("col1", f.when(df.col2> 30, 1) \ .otherwise(0)) \ .show()
>>> df[df.col1.isin("A","B")] .collect()

Running SQL Queries Programmatically

  • Registering DataFrames as Views:
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
  • Querying Views
>>> df_one = spark.sql("SELECT * FROM customer")
>>> df_one.show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people")
>>> df_new.show()
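Once a DataFrame is registered as a view, any SQL statement can be run against it, not only SELECT *. A minimal sketch, assuming df has the col1 and col2 columns used throughout this sheet:

>>> df.createOrReplaceTempView("customer")
>>> spark.sql("""
...     SELECT col1, COUNT(*) AS cnt
...     FROM customer
...     WHERE col2 > 3
...     GROUP BY col1
... """).show()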


Prepare yourself by going through the Top Hadoop Interview Questions and Answers now!

Output Operations

  • Data Structures:
>>> rdd_1 = df.rdd
>>> df.toJSON().first()
>>> df.toPandas()
  • Writing and Saving to Files (a save-mode sketch follows this list):
>>> df.select("col1", "col2").write.save("newFile.parquet")
>>> df.select("col3", "col5").write.save("table_new.json", format="json")
  • Stopping SparkSession:
>>> spark.stop()
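As referenced in the writing bullet above, the DataFrameWriter also lets you control what happens when the target path already exists. A minimal sketch with illustrative output paths (run it before spark.stop()):

>>> df.write.mode("overwrite").parquet("out_parquet")   # replace any existing output
>>> df.write.mode("append").json("out_json")            # add to existing output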

Download a Printable PDF of this Cheat Sheet

This PySpark SQL cheat sheet covers almost all of the important concepts. If you are looking to learn PySpark SQL in depth, you should check out the Spark, Scala, and Python training certification provided by Intellipaat. In this course, you will work on real-life projects and assignments, which will prepare you to become a certified PySpark SQL professional. On top of that, you will have 24/7 technical support from our experts while you take the course at Intellipaat.

Intellipaat provides the most comprehensive Cloudera Spark Course to fast-track your career!
 
