Have you ever wondered what makes Apache Spark up to a hundred times faster than Apache Hadoop? Apache Spark is widely known and accepted for its speed and agility, thanks to its in-memory computation, which Apache Hadoop lacks. The concept that has made this possible since Spark's inception is the RDD (Resilient Distributed Dataset). RDDs have the following key properties:
- Immutability: Once created, an RDD cannot be changed; applying transformations to it produces a new RDD.
- Partitioned: RDDs are collections of records that are partitioned and stored across distributed nodes.
- Fault tolerance: RDDs are built through transformations, and Spark records the lineage of those transformations rather than the data itself, so lost partitions can be recomputed.
- Lazy evaluation: Spark transformations are evaluated lazily; nothing is computed until an action triggers the evaluation (see the sketch after this list).
- In-memory: Data can be kept in memory for as long as possible, which speeds up repeated access.
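As a quick illustration of these properties, here is a minimal PySpark sketch (the numbers and the application name are made up for illustration; it assumes a working Spark installation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-properties").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])   # records partitioned across nodes
doubled = numbers.map(lambda x: x * 2)      # transformation: lazy, returns a new RDD
doubled.cache()                             # keep the computed result in memory
print(doubled.collect())                    # action: triggers the actual evaluation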
However, with the release of Spark 1.3, a new API named DataFrame evolved, giving audiences beyond Big Data engineers access to the data. What is different about DataFrames?
| Basis of Difference | Spark RDD | Spark DataFrame |
| --- | --- | --- |
| What is it? | Low-level API | High-level abstraction |
| Execution | Lazy evaluation | Lazy evaluation |
| Data types | Unstructured | Both unstructured and structured |
| Benefit | Simple API | Gives schema to distributed data |
| Limitation | Limited performance | Possibility of failure at run time |
Spark RDD to DataFrame
With the launch of Apache Spark 1.3, a new kind of API was introduced that resolved the performance and scaling limitations of Spark RDDs. Previously, RDDs read and wrote data through Java serialization, a lengthy and cumbersome process. Spark DataFrames resolve this issue: they carry a schema that describes the data, which reduces the serialization burden and improves performance.
A DataFrame can be described as a distributed collection of data organized into named columns. It is essentially equivalent to a table in a relational database. It was developed on the basis of the data frames in R and Python (pandas), but with better optimization features.
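As a rough sketch of what moving from an RDD to a DataFrame looks like in PySpark (assuming an existing SparkSession named spark; the records below are made up for illustration), an RDD of tuples becomes a DataFrame once column names or a schema are attached:

# An RDD of plain tuples has no schema attached to it
people_rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])

# Attaching column names gives Spark a schema for the distributed data
people_df = spark.createDataFrame(people_rdd, ["name", "age"])
people_df.printSchema()
people_df.show()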
Spark DataFrame Features
Spark DataFrame comes with many valuable features:
- Scalability: It allows processing petabytes of data at once.
- Flexibility: It supports a broad array of data formats (csv, Elasticsearch, Avro, etc.) and storage systems (HDFS, Hive tables, etc.)
- Custom Memory Management: Data is stored off-heap in a binary format, which saves memory and avoids garbage-collection overhead. Java serialization is also avoided, as the schema is already known.
- Optimized Execution Plans: The Spark Catalyst optimizer builds and optimizes query plans, which are then executed on RDDs (see the sketch after this list).
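As an illustrative sketch (reusing the hypothetical people_df DataFrame from the earlier example), the plan produced by the Catalyst optimizer can be inspected with explain():

# Catalyst analyzes and optimizes the query plan before it runs on RDDs
adults = people_df.filter(people_df.age > 21).select("name")
adults.explain(True)   # prints the parsed, analyzed, optimized, and physical plans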
Creating a DataFrame
A DataFrame can be created in several ways, for example by loading data from structured files, Hive tables, or external databases.
Loading data from structured files, Hive tables, or external databases
One common way of creating a DataFrame is to read data from structured files such as JSON, Parquet, and ORC, or from external databases via JDBC.
read: DataFrameReader
Calling read on a SparkSession returns a DataFrameReader instance:
val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@59e67a18
The reader can then load data in various formats:
reader.parquet("file.parquet")
reader.json("file.json")
reader.format("libsvm").load("sample_libsvm_data.txt")
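Loading from Hive tables or external databases follows the same pattern. Here is a hedged PySpark sketch (the table name, JDBC URL, and credentials are placeholders, not real endpoints):

# Read a Hive table that is registered in the metastore
sales_df = spark.table("default.sales")

# Read from an external database over JDBC
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "public.users")
    .option("user", "reader")
    .option("password", "secret")
    .load())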
Using Spark DataFrames
Data can be manipulated in a distributed fashion using the DataFrame domain-specific language. For instance, the following examples use DataFrames to manipulate the demographic details of a group of users:
# Create a new DataFrame that contains only users older than 50
old = users.filter(users.age > 50)
# Count these older users by gender
old.groupBy("gender").count()
# Increment everybody's age by 1
old.select(old.name, old.age + 1)
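For context, a minimal self-contained version of these snippets might look like the following (the users data is made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("users-demo").getOrCreate()
users = spark.createDataFrame(
    [("Alice", 61, "F"), ("Bob", 35, "M"), ("Carol", 52, "F")],
    ["name", "age", "gender"])

old = users.filter(users.age > 50)        # users older than 50
old.groupBy("gender").count().show()      # count them by gender
old.select(old.name, old.age + 1).show()  # increment everybody's age by 1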
Limitations of Spark DataFrames
Despite their many benefits, no technology is free of loopholes. Notable limitations of Spark DataFrames are as follows:
- The compiler cannot catch errors, because the code refers to data attributes by name; errors are detected only at run time, when the query plan is analyzed (see the sketch after this list).
- The API works best with Scala, and its support in Java is limited.
- Domain objects cannot be regenerated from a DataFrame once they have been converted into one.
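For example, a query that refers to a misspelled column name raises no compile-time complaint and fails only at run time, when Spark analyzes the query plan. A small sketch, reusing the hypothetical users DataFrame from above:

# 'agee' is a typo for 'age'; Spark raises an AnalysisException only at run
# time, when the query plan is analyzed -- the compiler cannot catch this
users.select("agee").show()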
Bottom line
Though limitations exist and the Dataset API has since evolved, DataFrames remain popular in the technology market because they extend RDDs with a higher level of abstraction. This is helpful in advanced analytics and Machine Learning, as DataFrames can directly use MLlib's Machine Learning Pipeline API, and developers can easily express complex programs with them. Hence, DataFrames are still used by many because of their fast processing speed and ease of use.