What is a Spark DataFrame?

In Spark, DataFrames are distributed collections of data, organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured and concise. We can say that DataFrames are conceptually equivalent to tables in a relational database, but with richer optimization techniques under the hood.

Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.

Why DataFrames?

When Apache Spark 1.3 was launched, it came with a new API called DataFrames that resolved the limitations of performance and scaling that occur while using RDDs.

RDDs do not cope well when memory or disk space is scarce: performance degrades sharply once storage runs out. Besides, Spark RDDs have no concept of a schema, that is, a description of the structure of the data (column names and types). RDDs store structured and unstructured data alike, so Spark cannot exploit the structure of the data, which is not very efficient.

RDDs have no built-in optimization engine, so Spark cannot rewrite an RDD program to make it run more efficiently. RDDs also make it hard to debug errors at runtime. They store the data as a collection of Java objects.

RDDs rely on serialization (converting an object into a stream of bytes so that it can be stored or sent over the network) and garbage collection (an automatic memory management technique that detects unused objects and frees them from memory). Both are expensive when applied to large numbers of individual objects, which adds considerable overhead to the system.

This was when Spark DataFrames were introduced to overcome the limitations Spark RDDs had. Now, what makes Spark DataFrames so unique? Let’s check out the features of Spark DataFrames that make them so popular.

Features of DataFrames

Some of the unique features of DataFrames are:

  • Use of a Query Optimization Engine: DataFrames use Spark's Catalyst Optimizer to plan and process queries efficiently. The same engine serves the Python, Java, Scala, and R DataFrame APIs.
  • Handling of Structured Data: DataFrames attach a schema to the data, so every column has a name and a type that give the stored values a meaning.
  • Custom Memory Management: RDDs store data as Java objects on the JVM heap, whereas DataFrames can store data in a compact binary format off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the garbage collection overhead.
  • Flexibility: DataFrames, like RDDs, can support various data formats and sources, such as CSV, Cassandra, etc.
  • Scalability: DataFrames can be integrated with various other Big Data tools, and they scale from megabytes to petabytes of data.

Creating DataFrames

There are many ways to create DataFrames. Here are three of the most commonly used methods to create DataFrames:

  • Creating DataFrames from JSON Files

Now, what are JSON files?

JSON, or JavaScript Object Notation, is a lightweight text format for structured data, typically stored in files with the .json extension. It is mainly used to exchange data between web servers and clients. This is what a simple .json file looks like:

[Figure: a sample .json file]
The above JSON is a simple employee database file that contains two records/rows.

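When it comes to Spark, the .json files that are being loaded are not the typical .json files. By default, Spark expects line-delimited JSON, in which every line holds one complete, self-contained JSON object. A regular multi-line JSON file is not parsed correctly unless the multiLine option of the JSON reader is enabled. The JSON file that we want to load should be in the format given below: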

[Figure: a JSON file in the format Spark expects]
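
As a rough illustration of this line-delimited layout, the sketch below converts a regular JSON array into one-object-per-line text using only Python's standard library. The employee records and the helper name to_json_lines are made up for illustration; they are not the records from the original figure.

```python
import json


def to_json_lines(records):
    """Serialize a list of dicts as line-delimited JSON: one object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical employee records, for illustration only.
regular_json = """
[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "age": 25}
]
"""

records = json.loads(regular_json)   # parse the ordinary multi-line JSON array
json_lines = to_json_lines(records)  # each record becomes exactly one line
print(json_lines)
# {"name": "Alice", "age": 30}
# {"name": "Bob", "age": 25}
```

Each printed line is a complete JSON object on its own, which is what lets Spark split a large file across workers and parse the pieces independently.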
JSON files can be loaded into DataFrames using the read.json function of the Spark session, passing the path of the file we want to load.

  • Example:

Here, we are loading an Olympic medal count sheet onto a DataFrame. There are 10 fields in total. The function printSchema() prints the schema of the DataFrame.

[Figure: loading the Olympic medal JSON file into a DataFrame and printing its schema]
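
A minimal PySpark sketch of this step might look like the following. The helper name load_json and the file name in the usage comment are assumptions, not taken from the original screenshot; spark.read.json and printSchema() are standard PySpark calls. The helper expects an already-created SparkSession, so the block itself does not start Spark.

```python
def load_json(spark, path, multi_line=False):
    """Load a JSON file into a DataFrame using the given SparkSession.

    multi_line=False expects one JSON object per line (Spark's default);
    multi_line=True lets Spark parse a regular, multi-line JSON document.
    """
    return spark.read.json(path, multiLine=multi_line)


# Usage sketch (assumes PySpark is installed; the file name is hypothetical):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("medals").getOrCreate()
#   medals = load_json(spark, "olympic_medals.json")
#   medals.printSchema()  # prints every column name with its inferred type
```

Spark infers the schema by scanning the JSON records, so no column definitions need to be supplied up front.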

  • Creating DataFrames from the Existing RDDs

DataFrames can also be created from existing RDDs. First, we create an RDD, and then we convert that RDD into a DataFrame by passing it to the createDataFrame() function of the Spark session.

  • Example:

In the figure below, we first create an RDD that contains the numbers from 1 to 10 and their cubes. Then, we convert that RDD into a DataFrame.

[Figure: creating a DataFrame from an existing RDD]
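
The two steps could be sketched in PySpark roughly as follows. The function names cube_pairs and cubes_dataframe and the column names "number" and "cube" are assumptions chosen for illustration; parallelize and createDataFrame are standard PySpark calls. As before, the Spark session is passed in rather than created inside the block.

```python
def cube_pairs(n):
    """Return [(1, 1), (2, 8), ..., (n, n**3)] as plain Python tuples."""
    return [(i, i ** 3) for i in range(1, n + 1)]


def cubes_dataframe(spark, n=10):
    """Build an RDD of (number, cube) pairs and convert it into a DataFrame."""
    # Step 1: distribute the local list across the cluster as an RDD.
    rdd = spark.sparkContext.parallelize(cube_pairs(n))
    # Step 2: turn the RDD into a DataFrame, naming the two columns.
    return spark.createDataFrame(rdd, ["number", "cube"])


# Usage sketch (assumes PySpark is installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("cubes").getOrCreate()
#   cubes_dataframe(spark).show()
```

Passing a list of column names lets Spark infer the column types from the data in the RDD.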

  • Creating DataFrames from CSV Files

We can also create DataFrames by loading .csv files.

Here is an example of loading a .csv file into a DataFrame.

[Figure: creating a DataFrame from a .csv file]
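
A hedged sketch of loading a CSV file is shown below. The sample data and the helper names write_sample_csv and load_csv are hypothetical; spark.read.csv with the header and inferSchema options is a standard PySpark call. The sample file is written with Python's csv module so that the example has something to read.

```python
import csv


def write_sample_csv(path):
    """Write a tiny CSV file with a header row (hypothetical data)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age"])
        writer.writerow(["Alice", 30])
        writer.writerow(["Bob", 25])


def load_csv(spark, path):
    """Load a CSV file into a DataFrame, using the first row as column names."""
    # inferSchema=True asks Spark to detect column types from the data
    # instead of reading every column as a string.
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage sketch (assumes PySpark is installed; the file name is hypothetical):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("csv-demo").getOrCreate()
#   write_sample_csv("people.csv")
#   load_csv(spark, "people.csv").printSchema()
```

Schema inference costs an extra pass over the data, so for large files it can be worth supplying an explicit schema instead.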

Spark Datasets

Datasets are an extension of the DataFrame API in Spark. On top of the features of DataFrames and RDDs, Datasets add compile-time type safety. The typed Dataset API is available in Scala and Java.

They provide an object-oriented programming interface, in which each record is an instance of a user-defined class.

Datasets were introduced when Spark 1.6 was released. They provide the convenience of RDDs, the static typing of Scala, and the optimization features of DataFrames.

Datasets are collections of Java Virtual Machine (JVM) objects that use Spark's Catalyst Optimizer to provide efficient processing.

DataFrames vs RDDs vs Datasets

| Basis of Difference | Spark RDD | Spark DataFrame | Spark Dataset |
|---|---|---|---|
| What is it? | A low-level API | A high-level abstraction | A combination of both RDDs and DataFrames |
| Input Optimization Engine | Cannot make use of input optimization engines | Uses input optimization engines to generate logical queries | Uses Catalyst Optimizer for input optimization, as DataFrames do |
| Data Representation | Distributed across multiple nodes of a cluster | A collection of rows and named columns | An extension of DataFrames, providing the functionalities of both RDDs and DataFrames |
| Benefit | A simple API | Gives a schema for the distributed data | Improves memory usage |
| Immutability and Interoperability | Tracks data lineage information to recover the lost data | Once transformed into a DataFrame, not possible to get the domain object | Can regenerate RDDs |
| Performance Limitation | Java Serialization and Garbage Collection overheads | Offers huge performance improvement over RDDs | Operations are performed on serialized data to improve performance |

Though a few limitations exist and Datasets have evolved since, DataFrames remain widely used. Because they build on RDDs with a higher level of abstraction, they are helpful in advanced analytics and machine learning: they work directly with MLlib's Machine Learning Pipeline API. Moreover, developers can express complex programs easily using DataFrames. Hence, DataFrames are still used by many users because of their fast processing and ease of use.
