What is a Spark Dataframe?

In Spark, Dataframes are distributed collections of data organized into rows and columns. Each column in a Dataframe has a name and an associated type. Dataframes are similar to traditional database tables in that they are structured and concise. We can say that Dataframes are like relational database tables with better optimization techniques.

Spark Dataframes can be created from various sources, such as Hive tables, log files, external databases, or existing RDDs, and they allow the processing of huge amounts of data.

Why Dataframes?

When Apache Spark 1.3 was launched, it came with a new API called Dataframes, which resolved the performance and scaling limitations that occurred while using RDDs.

When there is not enough space in memory or on disk, RDDs do not function properly and can exhaust the available resources. Spark RDDs also do not have the concept of a schema (the structure of a database that defines its objects); they store structured and unstructured data together, which is not efficient.

RDDs do not use an input optimization engine to rewrite computations so that they run efficiently, which again decreases performance. RDDs also do not allow us to debug errors during runtime.

RDDs store data as a collection of Java objects.

RDDs rely on serialization (converting an object into a stream of bytes so that it can be stored or transferred) and garbage collection (an automatic memory-management technique that detects unused objects and frees them from memory). Both techniques increase the overhead on the system's memory, as they are time-consuming.


Features of Dataframes

The main reason Dataframes were created was to overcome the difficulties faced while using RDDs. Some of the features of Dataframes are:

  • Use of an Input Optimization Engine: Dataframes make use of input optimization engines, such as the Catalyst Optimizer, to process data efficiently. The same engine serves the Python, Java, Scala, and R Dataframe APIs.
  • Handling of Structured Data: Dataframes provide a schematic view of the data, so the data carries meaning about its structure when it is stored.
  • Custom Memory Management: In RDDs, data is stored in memory, whereas Dataframes store data off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the garbage collection overhead.
  • Flexibility: Dataframes, like RDDs, support various formats of data, such as CSV, Cassandra, etc.
  • Scalability: Dataframes can be integrated with various other Big Data tools and allow the processing of megabytes to petabytes of data at once.


Creating Dataframes

There are many ways to create Dataframes. Here are three of the most commonly used methods:

  • Creating Dataframes from JSON Files

Now, what are JSON files?

JSON, or JavaScript Object Notation, is a file format that stores simple data-structure objects in the .json format. It is mainly used to transmit data between web servers and clients. This is how a simple .json file looks:
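For instance, an employee file with two records might contain the following (the names and fields here are illustrative):

    {"employees": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Anna", "lastName": "Smith"}
    ]}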
The above JSON is a simple employee database file which contains two records/rows.

When it comes to Spark, the .json files being loaded are not typical JSON files. You cannot load a normal, multi-line JSON file into a Dataframe directly. Instead, the JSON file you want to load should have one complete JSON object per line (the JSON Lines format), as shown below:
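Each line holds one self-contained JSON object (the values here are illustrative):

    {"firstName": "John", "lastName": "Doe"}
    {"firstName": "Anna", "lastName": "Smith"}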
JSON files can be loaded into a Dataframe using the read.json() function, passing the path of the file you want to load.


Example: Here, we are loading an Olympic medal count sheet onto a Dataframe. There are 10 fields in total. The function printSchema() prints the schema of the Dataframe.

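A minimal sketch of this in the Scala shell, assuming Spark 2.x or later (the file path is hypothetical; printSchema() would list the 10 fields and their types):

    // "spark" is the SparkSession pre-created by the shell
    val olympicsDF = spark.read.json("/data/olympics.json")

    // Print the schema of the Dataframe
    olympicsDF.printSchema()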

  • Creating Dataframes from Existing RDDs

Dataframes can also be created from existing RDDs. First, you create an RDD and then load it onto a Dataframe using the createDataFrame() function. In the example below, we first create an RDD containing the numbers from 1 to 10 and their cubes, and then load that RDD onto a Dataframe.


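A sketch of this in the Scala shell (the variable and column names are illustrative):

    // Create an RDD containing the numbers from 1 to 10 and their cubes
    // ("sc" is the SparkContext pre-created by the shell)
    val cubesRDD = sc.parallelize(1 to 10).map(x => (x, x * x * x))

    // Load the RDD onto a Dataframe with named columns
    val cubesDF = spark.createDataFrame(cubesRDD).toDF("number", "cube")

    cubesDF.show()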

  • Creating Dataframes from .csv Files

You can also create Dataframes by loading .csv files. Here is an example of loading a .csv file onto a Dataframe.
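A minimal sketch, assuming Spark 2.x or later, where CSV support is built in (the file path is illustrative):

    // Load a .csv file into a Dataframe
    val csvDF = spark.read
      .option("header", "true")        // treat the first row as column names
      .option("inferSchema", "true")   // infer column types from the data
      .csv("/data/employees.csv")

    csvDF.show()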

Spark Datasets

Datasets are an extension of the Dataframe APIs in Spark. In addition to the features of Dataframes and RDDs, Datasets provide various other functionalities.

Datasets provide an object-oriented programming interface, which includes the concepts of classes and objects.

Datasets were introduced with the release of Spark 1.6. They provide the convenience of RDDs, the static typing of Scala, and the optimization features of Dataframes.

Datasets are a collection of Java Virtual Machine (JVM) objects that use Spark's Catalyst Optimizer to provide efficient processing.
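A brief sketch of a typed Dataset in the Scala shell (the Person class and its values are illustrative):

    // A case class gives the Dataset its static, compile-time types
    case class Person(name: String, age: Long)

    // The implicits provide the encoders needed to build a Dataset
    import spark.implicits._

    // Create a strongly typed Dataset from a local collection
    val peopleDS = Seq(Person("John", 30), Person("Anna", 25)).toDS()

    // Fields are accessed as plain object members, checked at compile time
    peopleDS.filter(p => p.age > 26).show()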

Dataframes vs. RDDs vs. Datasets

| Basis of Difference | Spark RDD | Spark Dataframe | Spark Datasets |
| --- | --- | --- | --- |
| What is it? | Low-level API | Higher-level abstraction | Combination of both RDDs and Dataframes |
| Input Optimization Engine | Cannot make use of optimization engines | Uses optimization engines to generate logical queries | Uses the Catalyst Optimizer for input optimization, just like Dataframes |
| Data Representation | Distributed across multiple nodes of a cluster | Collection of rows and named columns | An extension of Dataframes, providing the functionalities of both RDDs and Dataframes |
| Benefit | Simple API | Gives a schema to the distributed data | Improves memory usage |
| Immutability and Interoperability | Can track data lineage information to recover lost data | Once transformed into a Dataframe, the original domain object cannot be regenerated | RDDs can be regenerated from Datasets |
| Performance Limitation | Java serialization and garbage collection overheads | Offers a huge performance improvement over RDDs | Operations are performed on serialized data to improve performance |
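To make the interoperability row concrete, here is a brief sketch that continues the hypothetical peopleDS example from above:

    // A Dataset can hand back its underlying RDD at any time
    val peopleRDD = peopleDS.rdd

    // A Dataframe can also be turned back into a typed Dataset
    val peopleDS2 = peopleDS.toDF().as[Person]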

Though limitations exist and Datasets evolved later, Dataframes are still popular in the technology market. Since a Dataframe is an extension of RDDs with a better level of abstraction, it is helpful in Advanced Analytics and Machine Learning, as it can directly access MLlib's Machine Learning pipeline API. Moreover, developers can easily execute complex programs using Dataframes. Hence, Dataframes are still used by lots of users because of their fast processing speed and ease of use.

