What is a Spark Dataframe?
In Spark, Dataframes are distributed collections of data organized into rows and named columns. Each column in a Dataframe has a name and an associated data type. Dataframes are similar to traditional database tables in that they are structured and concise. We can say that Dataframes are like relational database tables, but with better optimization techniques behind them.
Spark Dataframes can be created from various sources, such as Hive tables, log files, external databases, or existing RDDs, and they allow the processing of huge amounts of data.
When Apache Spark 1.3 was launched, it came with a new API called Dataframes, which resolved the performance and scaling limitations that occurred while using RDDs.
When there is not enough storage space in memory or on disk, RDDs do not function properly; they simply exhaust the available resources. Spark RDDs also have no concept of a schema (the structure of a database that defines its objects): they store structured and unstructured data together, which is not efficient.
RDDs do not pass through an input optimization engine that could rewrite computations to run efficiently, which again decreases performance. Nor do RDDs give us much help in debugging errors at runtime.
RDDs store data as a collection of Java objects. This means RDDs rely on serialization (converting an object into a stream of bytes so it can be stored or sent across the network) and garbage collection (an automatic memory-management technique that detects unused objects and frees them from memory). Both are lengthy operations that increase the overhead on the system's memory.
Features of Dataframes
Dataframes were created mainly to overcome the difficulties faced while using RDDs. Some of the features of Dataframes are:
- Use of an Input Optimization Engine: Dataframes make use of input optimization engines, such as the Catalyst Optimizer, to process data efficiently. The same engine serves the Python, Java, Scala, and R Dataframe APIs (see the sketch after this list).
- Handling Structured Data: Dataframes provide a schematic view of data; the data carries meaning with it when it is stored.
- Custom Memory Management: In RDDs, data is stored on the JVM heap, whereas Dataframes can store data off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the garbage collection overhead.
- Flexibility: Dataframes, like RDDs, can support various formats of data, such as CSV files, Cassandra tables, etc.
- Scalability: Dataframes can be integrated with various other Big Data tools, and they allow processing of megabytes to petabytes of data at once.
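As a minimal sketch of the optimization engine at work, in Scala (the app name, column names, and sample rows are all illustrative), calling explain() prints the physical plan Catalyst produces for a query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CatalystDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny in-memory Dataframe; the column names are illustrative
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// Catalyst rewrites this query into one optimized physical plan,
// whichever language API (Scala, Python, Java, R) it was written in
people.filter($"age" > 40).select("name").explain()
```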
How to Create Dataframes
There are many ways to create Dataframes. Here are three of the most commonly used methods:
- Creating Dataframes from JSON Files
Now, what are JSON files? JSON (JavaScript Object Notation) is a lightweight, text-based format that stores data as human-readable key–value pairs.
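For instance, an employee database file might look like this (the field names and values are illustrative):

```json
[
  {"id": 1, "name": "Alice", "department": "Engineering"},
  {"id": 2, "name": "Bob", "department": "Sales"}
]
```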
The JSON above is a simple employee database file that contains two records/rows.
When it comes to Spark, the .json files being loaded are not typical JSON files; you cannot load a normal multi-line JSON document into a Dataframe by default. Spark expects the JSON Lines format, in which every line holds one complete JSON object. The JSON file you want to load should be of the format given below.
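Here is a JSON Lines version of the same illustrative employee records, with one self-contained object per line:

```json
{"id": 1, "name": "Alice", "department": "Engineering"}
{"id": 2, "name": "Bob", "department": "Sales"}
```

Newer Spark versions can also read a normal multi-line JSON document by setting the multiLine option to true.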
JSON files can be loaded into a Dataframe using the read.json function, passing the name of the file you want to load.
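A minimal sketch, assuming the illustrative records above are saved as employees.json (the path is an assumption):

```scala
// `spark` is the SparkSession created earlier
// (in spark-shell, a session named `spark` is available by default)
val employeesDf = spark.read.json("employees.json")  // reads JSON Lines by default

employeesDf.show()
```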
Example: Here, we are loading an Olympic medal-count sheet into a Dataframe. There are 10 fields in total. The function printSchema() prints the schema of the Dataframe.
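A sketch of that example (the file name olympics.json is an assumption, and the field names in the comment are guesses at a few of the 10 fields):

```scala
// `spark` is the SparkSession (available by default in spark-shell)
val olympicsDf = spark.read.json("olympics.json")  // illustrative path

// Prints the inferred schema: 10 fields, e.g. country, year, and medal counts
olympicsDf.printSchema()
```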
- Creating Dataframes from Existing RDDs
Dataframes can also be created from existing RDDs. First, create an RDD, and then load it into a Dataframe using the createDataFrame(rdd) function. In the example below, we first create an RDD containing the numbers from 1 to 10 along with their cubes, and then load that RDD into a Dataframe.
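A minimal sketch of that example (the column names are illustrative):

```scala
// `spark` is the SparkSession; grab its SparkContext to build an RDD
val sc = spark.sparkContext

// An RDD of (number, cube) pairs for the numbers 1 through 10
val cubesRdd = sc.parallelize(1 to 10).map(n => (n, n * n * n))

// Load the RDD into a Dataframe with named columns
val cubesDf = spark.createDataFrame(cubesRdd).toDF("number", "cube")

cubesDf.show()
```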
- Creating Dataframes from .csv Files
You can also create Dataframes by loading .csv files. Here is an example of loading a CSV file into a Dataframe.
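A sketch, assuming a file data.csv whose first line is a header row (the path and options are illustrative):

```scala
// `spark` is the SparkSession (available by default in spark-shell)
val csvDf = spark.read
  .option("header", "true")       // use the first line as column names
  .option("inferSchema", "true")  // infer column types instead of defaulting to strings
  .csv("data.csv")                // illustrative path

csvDf.printSchema()
csvDf.show()
```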
What are Spark Datasets?
Datasets are an extension of the Dataframe APIs in Spark. In addition to the features of Dataframes and RDDs, Datasets provide various other functionalities.
Datasets provide an object-oriented programming interface, which includes the concepts of classes and objects.
Datasets were introduced with the release of Spark 1.6. They provide the convenience of RDDs, the static typing of Scala, and the optimization features of Dataframes.
Datasets are a collection of Java Virtual Machine (JVM) objects, which use Spark's Catalyst Optimizer to provide efficient processing.
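A minimal sketch of the typed Dataset API in Scala (the Employee case class and its fields are illustrative):

```scala
// `spark` is the SparkSession (available by default in spark-shell)
import spark.implicits._

// A strongly typed record: field names and types are checked at compile time
case class Employee(id: Int, name: String, department: String)

val employeesDs = Seq(
  Employee(1, "Alice", "Engineering"),
  Employee(2, "Bob", "Sales")
).toDS()

// A typed transformation: `_.department` is checked by the compiler,
// while Catalyst still optimizes the physical execution plan
employeesDs.filter(_.department == "Engineering").show()
```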
Dataframes vs. RDDs vs. Datasets

| Basis of Difference | Spark RDD | Spark Dataframe | Spark Datasets |
| --- | --- | --- | --- |
| What is it? | Low-level API | Higher-level abstraction | Combination of both RDDs and Dataframes |
| Input optimization engine | Cannot make use of optimization engines | Uses optimization engines to generate logical query plans | Uses the Catalyst Optimizer for input optimization, just like Dataframes |
| Data representation | Distributed across multiple nodes of a cluster | A collection of rows and named columns | An extension of Dataframes, providing the functionalities of both RDDs and Dataframes |
| Benefit | Simple API | Gives a schema to the distributed data | Improves memory usage |
| Immutability and interoperability | RDDs track data lineage information to recover lost data | Once a domain object is transformed into a Dataframe, the original domain object cannot be recovered | RDDs can be regenerated from Datasets |
| Performance limitation | Java serialization and garbage collection overheads | Offers a huge performance improvement over RDDs | Operations are performed on serialized data to improve performance |
Though limitations exist and Datasets evolved after Dataframes, Dataframes remain popular in the technology market. Since the Dataframe is a higher-level abstraction over RDDs, it is helpful in Advanced Analytics and Machine Learning, as it can directly access MLlib's Machine Learning pipeline API. Moreover, developers can execute complex programs using Dataframes easily. Hence, Dataframes are still used by lots of users because of their incredibly fast processing speed and ease of use.