Why DataFrames over RDDs in Apache Spark?

Have you ever wondered what makes Apache Spark up to a hundred times faster than Apache Hadoop for some workloads? Apache Spark is widely known for its speed and agility, largely thanks to in-memory computation, which Apache Hadoop lacked. The core concept that made this possible is the RDD (Resilient Distributed Dataset), which has existed since Spark's inception. RDDs have the following key properties:

Immutability: Once an RDD is created, it cannot be changed; applying a transformation produces a new RDD instead.

Partitioned: RDDs are a collection of records which are partitioned and stored across distributed nodes.

Fault-tolerance: RDDs are created through transformations, and Spark records those transformations (the lineage) rather than the data itself, so a lost partition can be recomputed from its lineage.

Lazy evaluation: Transformations are not executed when they are declared; they run only when an action triggers the evaluation.

In-memory: Data can be kept in memory for as long as possible.
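The properties above can be illustrated with a small toy model in plain Python. This is only a sketch: ToyRDD is a hypothetical class invented for illustration, not part of Spark, and it ignores partitioning and distribution entirely. It shows that transformations return new objects, that only the lineage is recorded, and that nothing runs until an action is called.

```python
from typing import Callable, Iterable, List, Tuple


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source: Iterable, lineage: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = tuple(source)  # source records, never modified
        self._lineage = lineage       # recorded transformations, not their results

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, also returns a new object.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self) -> List:
        # Action: replay the recorded lineage over the source data.
        records = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records


calls = []  # records when the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD([1, 2, 3, 4])
doubled = numbers.map(double)
big = doubled.filter(lambda x: x > 4)

calls_before_action = len(calls)  # no transformation has run so far
result = big.collect()            # the action triggers evaluation
original = numbers.collect()      # the original ToyRDD is unchanged
```

Because the result is rebuilt from the source plus the lineage, calling collect() again recomputes the same records, which is the idea behind recovering a lost partition.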

However, Spark 1.3 introduced a new API, the DataFrame, which opened distributed data processing to a wider audience beyond Big Data engineers. What is different about DataFrames?

Basis of difference	Spark RDD	Spark DataFrame
What is it?	Low-level API	High-level abstraction
Execution	Lazy evaluation	Lazy evaluation
Data types	Structured and unstructured (no schema imposed)	Structured and semi-structured (schema required)
Benefit	Simple API	Gives schema to distributed data
Limitation	Limited performance	No compile-time checks; errors surface at run time

Spark RDD to DataFrame

With the launch of Apache Spark 1.3, the DataFrame API was introduced to address the performance and scaling limitations of RDDs. Previously, RDDs relied on Java serialization to read or write data, which was slow and memory-intensive. A DataFrame carries a schema that describes its data, so Spark already knows the structure and can avoid serializing whole Java objects, which reduces overhead and improves performance.

A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database. It is modeled on data frames in R and Python, but with richer optimizations under the hood.

Spark DataFrame Features

Spark DataFrame comes with many valuable features:

Scalability: It scales from small datasets on a single machine to petabytes of data on a large cluster.

Flexibility: It supports a broad array of data formats (CSV, Avro, Elasticsearch, etc.) and storage systems (HDFS, Hive tables, etc.).

Custom Memory Management: Data is stored off-heap in a binary format, which saves memory and reduces garbage-collection overhead. Java serialization is also avoided because the schema is already known.

Optimized Execution Plans: The Spark Catalyst optimizer builds optimized query plans, and the final plan is executed on RDDs.

Creating a DataFrame

A DataFrame can be created using three methods:

Loading data from structured files, Hive tables, or external databases

The first method of creating a DataFrame is to read data from structured sources, such as JSON, Parquet, or ORC files, or a database over JDBC.

read: DataFrameReader

read returns a DataFrameReader instance.

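// spark.read returns a DataFrameReader
val reader = spark.read

// REPL output:
// reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@59e67a18

// Format-specific shortcuts
reader.parquet("file.parquet")

reader.json("file.json")

// Generic form: name the format, then load
reader.format("libsvm").load("sample_libsvm_data.txt")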

Using Spark DataFrames

Distributed data manipulation can be performed using the DataFrame domain-specific language (DSL). For instance, the following examples use DataFrames to manipulate the demographic details of a group of users:

# Create a new DataFrame that contains "old users" only

old_users = users.filter(users.age > 50)

# Count the number of old users by gender

old_users.groupBy("gender").count()

# Increment everybody's age by 1

old_users.select(old_users.name, old_users.age + 1)

Limitations of Spark DataFrames

Despite these benefits, no technology is without drawbacks. The notable limitations of Spark DataFrames are as follows:

Because the code refers to columns by name, the compiler cannot catch a misspelled or missing attribute. Such errors are detected only at run time, after the query plan has been created.
It works best with Scala; support in Java is more limited.
Once a domain object has been converted into a DataFrame, the original object cannot be regenerated from it.

Bottom line

Though limitations exist and the Dataset API has since evolved, DataFrames remain popular because they extend RDDs with a higher level of abstraction. This is helpful in advanced analytics and machine learning, since DataFrames work directly with MLlib's Machine Learning Pipeline API. Developers can also express complex programs with DataFrames easily. DataFrames therefore continue to be widely used for their processing speed and ease of use.
