Spark SQL: Functions Explained with Example

Spark SQL is a module of Apache Spark used for processing structured and semi-structured data. It allows querying data using SQL or Hive Query Language, making it easy for those familiar with RDBMS to adopt. Structured data has a defined schema with known fields, while semi-structured data combines schema and data without clear separation. Spark SQL makes it simple to locate tables and metadata and provides powerful capabilities for handling diverse data formats. It’s widely used for analytics and data transformation tasks in big data environments due to its flexibility and ease of use.

Table of contents:

Spark SQL Functions
Hive Limitations
Architecture of Spark SQL
Components of Spark SQL
Features of Spark SQL
Spark SQL Example
Spark SQL Join
Use Cases of Spark SQL
Advantages of Spark SQL
Disadvantages of Spark SQL
Conclusion

Spark SQL Functions

The following are the various functions offered by Spark :

String Functions
Date & Time Functions
Collection Functions
Math Functions
Aggregate Functions
Window Functions

Let us now cover each of the above-mentioned Spark functions in detail:

1. Spark SQL String Functions

String functions are used to perform operations on String values such as computing numeric values, calculations and formatting etc. The String functions are grouped as “ string_funcs” in spark SQL. The following given are some of the String functions in Spark:

concat_ws(sep: String, exprs: Column*): Column
encode(value: Column, charset: String): Column
length(e: Column): Column
instr(str: Column, substring: String): Column
initcap(e: Column): Column
decode(value: Column, charset: String): Column

2. Date and Time Functions

The Date and Time Functions in Spark help in performing operations like returning the current date as a date column, returning the number of days from beginning till end or converting a column into a ‘DateType’ with a specific date format. Some of the Date and Time functions used in Spark are as follows:

current_date () : Column
to_date(e: Column): Column
to_date(e: Column, fmt: String): Column
add_months(startDate: Column, numMonths: Int): Column
date_add(start: Column, days: Int): Column
date_sub(start: Column, days: Int): Column

3. Collection Functions

Collection Functions in Spark SQL are basically used to perform operations on groups or arrays. Some of the important Collection functions in Spark SQL are:

array_contains(column: Column, value: Any)
array_except(col1: Column, col2: Column)
array_join(column: Column, delimiter: String, nullReplacement: String)
array_join(column: Column, delimiter: String)
array_remove(column: Column, element: Any)
array_repeat(left: Column, right: Column)
arrays_zip(e: Column*)

4. Math Functions

Math Functions are used to perform calculations such as Trigonometry (Sin, Cos, Tan), Hyperbolic statements, etc. Following mentioned are some of the Math Functions used in Spark SQL:

sin ( e : Column ) : Column
sin(columnName: String): Column
cosh(e: Column): Column
cosh(columnName: String): Column

5. Aggregate Functions

Aggregate functions are used to perform aggregate operations on DataFrame columns. The working of aggregate functions is on the basis of the groups and rows. Following are some of the aggregate functions in Spark SQL:

approx_count_distinct(e: Column)
approx_count_distinct(e: Column, rsd: Double)
avg(e: Column)
collect_set(e: Column)
countDistinct(expr: Column, exprs: Column*)

6. Window Functions

The use of Window functions in Spark is to perform operations like calculating the rank and row number etc. on large sets of input rows. These Window functions are available by importing ‘org.apache.spark.sql.’ functions. Let us now have a look at some of the important Window functions available in Spark SQL :

row_number(): Column
rank(): Column
dense_rank(): Column
cume_dist(): Column
ntile(n: Int): Column

Hive Limitations

Apache Hive was originally designed to run on top of Apache Spark. But, it had considerable limitations:

1) For running the ad-hoc queries, Hive internally launches MapReduce jobs. In the processing of medium-sized datasets, MapReduce lags in performance.

2) If during the execution of a workflow, the processing suddenly fails, then Hive can’t resume from the point where it failed as the system returns back to normal.

3) If trash is enabled, it leads to an execution error when encrypted databases are dropped in a cascade.

Spark SQL was incepted to overcome these inefficiencies.

Architecture of Spark SQL

It consists of three main layers:

Language API: Spark is compatible with and even supported by the languages like Python, HiveQL, Scala, and Java.
SchemaRDD: RDD (resilient distributed dataset) is a special data structure with which the Spark core is designed. As Spark SQL works on schema, tables, and records, you can use SchemaRDD or data frame as a temporary table.
Data Sources: For Spark core, the data source is usually a text file, Avro file, etc. Data sources for Spark SQL are different like JSON documents, Parquet files, HIVE tables, and Cassandra databases.

Get 100% Hike!

Master Most in Demand Skills Now!

Components of Spark SQL

Spark SQL DataFrames: There were some shortcomings on part of RDDs which the Spark DataFrame overcame in version 1.3 of Spark. First of all, there was no provision to handle structured data and there was no optimization engine to work with it. On the basis of attributes, developers had to optimize each RDD. Spark DataFrame is a distributed collection of data ordered into named columns. You might be knowing what a table is in a relational database. Spark DataFrame is quite similar to that.
Spark SQL Datasets: In version 1.6 of Spark, the Spark dataset was the interface that was added. The catch with this interface is that it provides the benefits of RDDs along with the benefits of the optimized execution engine of Apache Spark SQL. To achieve conversion between JVM objects and tabular representation, the concept of the encoder is used. Using JVM objects, a dataset can be incepted, and functional transformations like map, filter, etc., have to be used to modify them. The dataset API is available both in Scala and Java, but it is not supported in Python.
Spark Catalyst Optimizer: Catalyst optimizer is the optimizer used in Spark SQL and all queries written by Spark SQL and DataFrame DSL is optimized by this tool. This optimizer is better than the RDD, and hence, the performance of the system is increased.

Features of Spark SQL

Let’s take a look at the aspects which make Spark SQL so popular in data processing.

Integrated: One can mix SQL queries with Spark programs easily. Structured data can be queried inside Spark programs using either Spark SQL or a DataFrame API. Running SQL queries, alongside analytic algorithms, is easy because of this tight integration.
Hive compatibility: Hive queries can be run as they are as Spark SQL supports HiveQL, along with UDFs (user-defined functions) and Hive SerDes. This allows one to access the existing Hive warehouses.
Unified data access: Loading and querying data from a variety of sources is possible. One only needs a single interface to work with structured data which the schema-RDDs provide.
Standard connectivity: It includes a server mode with high-grade connectivity to JDBC or ODBC.
Performance and scalability: To make queries agile, alongside computing hundreds of nodes using the Spark engine, Spark SQL incorporates a code generator, cost-based optimizer, and columnar storage. This provides complete mid-query fault tolerance. Note that, as is mentioned in the Hive limitations section, this kind of tolerance was lacking in Hive. Spark has ample information regarding the structure of data, as well as the type of computation being performed which is provided by the interfaces of Spark SQL. This leads to extra optimization from Spark SQL, internally. Faster execution of Hive queries is possible as Spark SQL can directly read from multiple sources like HDFS, Hive, existing RDDs, etc.

Spark SQL Example

Consider a scenario where you wish to create and load two tables along with selecting rows from the tables. Let us use Spark SQL to implement this.

As the first step, copy the Hue sample_07.csv and sample_08.csv files to your object store in a location that can be easily accessed by the Spark cluster.
Next, launch the spark shell.
Once the launch is initiated and completed, create Hive tables sample_07 and sample_08. Refer to the below code:

scala> spark.sql("CREATE EXTERNAL TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TextFile LOCATION 's3a://<bucket_name>/s07/'")

scala> spark.sql("CREATE EXTERNAL TABLE sample_08 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TextFile LOCATION 's3a://<bucket_name>/s08/'")

Now launch Beeline and show the Hive tables.
Also, load the CSV files into tables and create DataFrames from the content of Sample 7 and 8
Now save the DataFrame to show the tables.

Spark SQL Join

SQL Joins refers to a clause in SQL that is used to combine rows of two or more tables by using common values. It takes into consideration the records from two or more tables in a database and combines them. SQL Join is typically used in situations where you want to access one or more tables through a select statement.

SQL Join Syntax

SQL Join ( also referred to as INNER JOIN ) is the most commonly used method of joining data from multiple tables. The syntax for SQL Join is given below:

Consider the example below, the syntax used to join data from these tables will be:

SELECT Name, City, Marks, Phone_no 
FROM Student_detail, Personal_detail WHERE Student_details.Name=Personal_detail.Name;

Use Cases of Spark SQL

There is a lot to learn about Spark SQL as how it is applied in the industry scenario, but the below three use cases can give an apt idea:

Twitter sentiment analysis: Initially, you used to get all data from Spark streaming. Later, Spark SQL came into the picture to analyze everything about a topic, say, Narendra Modi. Every tweet regarding him is gathered, and then Spark SQL does its magic by classifying tweets as neutral tweets, positive tweets, negative tweets, very positive tweets, and very negative tweets. This is just one of the ways sentiment analysis is done. This is useful in target marketing, crisis management, and service adjusting.
Stock market analysis: As you are streaming data in real time, you can also do the processing in real time. Stock movements and market movements generate so much data and traders need an edge, an analytics framework, which will calculate all the data in real time and provide the most rewarding stock or contract, all within the stipulated time limit. As said earlier, if there is a need for a real-time analytics framework, then Spark, along with its components, is the technology to be considered.
Banking: Real-time processing is required in credit card fraud detection. Assume that a transaction happens in Bangalore where a purchase worth 4,000 rupees has been done by swiping a credit card. Within 5 minutes, there is another purchase of 10,000 rupees in Kolkata by swiping the same credit card. Banks can make use of real-time analytics provided by Spark SQL in detecting fraud in such cases.

Advantages of Spark SQL

The following are the various advantages of using Spark SQL:

It helps in easy data querying. The SQL queries are mixed with Spark programs for querying structured data as a distributed dataset (RDD). Also, the SQL queries are run with analytic algorithms using Spark SQL’s integration property.
Another important advantage of Spark SQL is that the loading and querying can be done for data from different sources. Hence, the data access is unified.
It offers standard connectivity as Spark SQL can be connected through JDBC or ODBC.
It can be used for faster processing of Hive tables.
Another important offering of Spark SQL is that it can run unmodified Hive queries on existing warehouses as it allows easy compatibility with existing Hive data and queries.

Disadvantages of Spark SQL

The following are the disadvantages of Spark SQL:

Creating or reading tables containing union fields is not possible with Spark SQL.
It does not convey if there is any error in situations where the varchar is oversized.
It does not support Hive transactions.
It also does not support the Char type (fixed-length strings). Hence, reading or creating a table with such fields is not possible.

Conclusion

The Apache Software Foundation offers Spark, a powerful component for real-time analytics. As Hadoop falls short in real-time processing, migrating to Spark becomes the clear choice. Similarly, as Hive’s limitations surface, users naturally shift to Spark SQL. Tasks that take 10 minutes in Hive can be done in under a minute with Spark SQL. Migration is also easy, as Spark SQL supports Hive. This creates a great opportunity for learners, as Spark professionals are still in short supply. Gaining hands-on experience in Spark SQL and DataFrames will place individuals in high demand as adoption continues to grow.