- What is Spark DataFrame?
- Spark Datasets
- DataFrames vs RDDs vs Datasets
What is Spark DataFrame?
In Spark, a DataFrame is a distributed collection of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured and concise. In effect, DataFrames behave like relational tables, but with far better optimization techniques behind the scenes.
Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
When Apache Spark 1.3 was launched, it came with a new API called DataFrames, which resolved the performance and scaling limitations that arise while using RDDs.
When memory or disk space runs low, RDDs do not function properly, as the available space gets exhausted. Besides, Spark RDDs have no concept of schema—the structure of a database that defines its objects. RDDs store structured and unstructured data together, which is not very efficient.
Because Spark cannot look inside RDD computations, it cannot optimize them to run more efficiently, and RDDs do not allow us to debug errors during runtime. They store the data as a collection of Java objects.
RDDs rely on serialization (converting an object into a stream of bytes so that it can be shipped across the network or persisted) and garbage collection (an automatic memory management technique that detects unused objects and frees their memory). Both are expensive for large collections of Java objects, which increases the overhead on the memory of the system.
This was when DataFrames were introduced to overcome the limitations Spark RDDs had. Now, what makes Spark DataFrames so unique? Let’s check out the features of Spark DataFrames that make them so popular.
Features of DataFrames
Some of the unique features of DataFrames are:
- Use of Input Optimization Engine: DataFrames make use of input optimization engines, such as the Catalyst Optimizer, to process data efficiently. The same engine serves the Python, Java, Scala, and R DataFrame APIs.
- Handling of Structured Data: DataFrames provide a schematic view of data. Here, the data has some meaning to it when it is being stored.
- Custom Memory Management: In RDDs, the data is stored in memory, whereas DataFrames store data off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the garbage collection overhead.
- Flexibility: DataFrames, like RDDs, can support various formats of data, such as CSV, Cassandra, etc.
- Scalability: DataFrames can be integrated with various other Big Data tools, and they allow processing megabytes to petabytes of data at once.
There are many ways to create DataFrames. Here are three of the most commonly used methods to create DataFrames:
- Creating DataFrames from JSON Files
Now, what are JSON files? JSON (JavaScript Object Notation) is a lightweight, text-based format that stores data as attribute–value pairs.
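A conventional, pretty-printed .json file might look like this (the employee values are chosen to match the examples used later in this tutorial):

```json
{
  "employees": [
    { "id": "1201", "name": "raju",    "age": "23" },
    { "id": "1202", "name": "krishna", "age": "25" }
  ]
}
```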
The above JSON is a simple employee database file that contains two records/rows.
When it comes to Spark, the .json files being loaded are not the typical .json files. We cannot load a normal multi-line JSON document into a DataFrame; by default, Spark expects the file to be in the line-delimited format shown below:
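Concretely, each line of the file must hold one complete, self-contained JSON object (often called the JSON Lines format). A two-record employee file, with field names assumed for illustration, would look like this:

```json
{"id": "1201", "name": "raju", "age": "23"}
{"id": "1202", "name": "krishna", "age": "25"}
```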
JSON files can be loaded onto DataFrames using the read.json() method, passing the path of the file we want to load.
Here, we are loading an Olympic medal count sheet onto a DataFrame. There are 10 fields in total. The function printSchema() prints the schema of the DataFrame.
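The medal-count file itself is not reproduced here, so the spark-shell sketch below assumes a hypothetical line-delimited olympics.json with a handful of the fields, purely to show the pattern; the schema printed for the real file would list all 10:

```scala
// spark-shell session; "olympics.json" is an assumed line-delimited file
scala> val medals = sqlContext.read.json("olympics.json")

scala> medals.printSchema()
root
 |-- country: string (nullable = true)
 |-- gold: long (nullable = true)
 |-- silver: long (nullable = true)
 |-- year: long (nullable = true)
```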
- Creating DataFrames from the Existing RDDs
DataFrames can also be created from the existing RDDs. First, we create an RDD and then load that RDD onto a DataFrame using the createDataFrame(Name_of_the_rdd_file) function.
In the below figure, we are creating an RDD first, which contains numbers from 1 to 10 and their cubes. Then, we will load that RDD onto a DataFrame.
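A spark-shell sketch of these steps (the variable names here are my own):

```scala
// Build an RDD of (number, cube) pairs for the numbers 1 to 10
scala> val cubesRDD = sc.parallelize(1 to 10).map(n => (n, n * n * n))

// Load the RDD onto a DataFrame and name the columns
scala> val cubesDF = sqlContext.createDataFrame(cubesRDD).toDF("number", "cube")

scala> cubesDF.show(3)
+------+----+
|number|cube|
+------+----+
|     1|   1|
|     2|   8|
|     3|  27|
+------+----+
```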
- Creating DataFrames from CSV Files
We can also create DataFrames by loading the .csv files.
Here is an example of loading a .csv file onto a DataFrame.
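In Spark 2.x and later, the built-in CSV reader on the SparkSession entry point (spark) can be used directly; a sketch with an assumed employee.csv file:

```scala
// spark-shell (Spark 2.x+); "employee.csv" is an assumed file with a header row
scala> val csvDF = spark.read
     |   .option("header", "true")       // first line holds the column names
     |   .option("inferSchema", "true")  // guess column types from the data
     |   .csv("employee.csv")

scala> csvDF.printSchema()
```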
For structured data manipulation, Spark DataFrames provide a domain-specific language. Let's understand it through an example of processing structured data using DataFrames, taking a dataset wherein all the details of the employees are stored. Now follow along with the steps for DataFrame operations:
Read JSON Document
Start by reading the JSON document to generate a DataFrame named dfs. You can use the following command to read the JSON document named employee.json. The data we need is shown as a table with the fields age, id, and name.
scala> val dfs = sqlContext.read.json("employee.json")
The output: Field names will be taken automatically from the employee.json file.
dfs: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
Show the Data
To see the data in the DataFrame, use the show() method. The command goes like this:
scala> dfs.show()
The output: You can now see the employee data in a neat tabular format, something like this:
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 23|1201|   raju|
| 25|1202|krishna|
| 30|1203| sanjay|
| 42|1204|  javed|
| 37|1205|prithvi|
+---+----+-------+
Use Age Filter
You can now use the following command to find the employees whose age is below 30 (age < 30).
scala> dfs.filter(dfs("age") < 30).show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 23|1201|   raju|
| 25|1202|krishna|
+---+----+-------+
Applications of DataFrame
Spark DataFrames are widely used in Advanced Analytics and Machine Learning. Data Scientists rely on them for increasingly sophisticated techniques to get their job done, and DataFrames can be used directly in MLlib's ML Pipeline API, where advanced analytics tasks are specified. Added to that, several different programs can run complex user-defined functions on DataFrames.
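As a sketch of what using DataFrames "directly in MLlib's ML Pipeline API" looks like, with column names and the training DataFrame assumed purely for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Assemble the (assumed) "age" column into an MLlib feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age"))
  .setOutputCol("features")

// A regression stage predicting an assumed "salary" label column
val lr = new LinearRegression().setLabelCol("salary")

// Stages chained into a single pipeline that consumes a DataFrame
val pipeline = new Pipeline().setStages(Array(assembler, lr))
// val model = pipeline.fit(trainingDF)  // trainingDF: a DataFrame of employees
```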
Spark Datasets
Datasets are an extension of the DataFrame API in Spark. In addition to the features of DataFrames and RDDs, Datasets provide various other functionalities.
They provide an object-oriented programming interface, which includes the concepts of classes and objects.
Datasets were introduced when Spark 1.6 was released. They provide the convenience of RDDs, the static typing of Scala, and the optimization features of DataFrames.
Datasets are a collection of Java Virtual Machine (JVM) objects that use Spark’s Catalyst Optimizer to provide efficient processing.
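A minimal sketch of this typed interface in the Spark 2.x spark-shell (where spark.implicits._ is pre-imported; the Employee case class and its values are assumed for illustration):

```scala
// A case class gives the Dataset its static, compile-time-checked type
scala> case class Employee(name: String, age: Int)

scala> val ds = Seq(Employee("raju", 23), Employee("krishna", 25)).toDS()

// Filters are plain Scala functions over typed objects, not column strings
scala> ds.filter(_.age > 24).show()
+-------+---+
|   name|age|
+-------+---+
|krishna| 25|
+-------+---+
```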
DataFrames vs RDDs vs Datasets
|Basis of Difference|RDDs|DataFrames|Datasets|
|---|---|---|---|
|What is it?|A low-level API|A high-level abstraction|A combination of both RDDs and DataFrames|
|Input Optimization Engine|Cannot make use of input optimization engines|Uses input optimization engines to generate logical queries|Uses Catalyst Optimizer for input optimization, as DataFrames do|
|Data Representation|Distributed across multiple nodes of a cluster|A collection of rows and named columns|An extension of DataFrames, providing the functionalities of both RDDs and DataFrames|
|Benefit|A simple API|Gives a schema for the distributed data|Improves memory usage|
|Immutability and Interoperability|Tracks data lineage information to recover the lost data|Once transformed into a DataFrame, the domain object cannot be regained|Can regenerate RDDs|
|Performance|Java Serialization and Garbage Collection overheads|Offers a huge performance improvement over RDDs|Operations are performed on serialized data to improve performance|
Though a few limitations exist and Datasets have evolved lately, DataFrames are still popular in the field of technology. Since they build on RDDs with a better level of abstraction, they are helpful in Advanced Analytics and Machine Learning, as they can directly access MLlib's Machine Learning Pipeline API. Moreover, developers can execute complex programs using DataFrames easily. Hence, DataFrames are still used by lots of users because of their incredibly fast processing and ease of use.