- What is Spark DataFrame?
- Spark Datasets
- DataFrames vs RDDs vs Datasets
What is Spark DataFrame?
In Spark, a DataFrame is a distributed collection of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured and concise. In effect, DataFrames behave like relational tables, but with far better optimization techniques behind the scenes.
Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
When Apache Spark 1.3 was launched, it came with a new API called DataFrames, which resolved the performance and scaling limitations that arise while using RDDs.
When memory or disk space runs low, RDDs do not function properly, as the available space gets exhausted. Besides, Spark RDDs have no concept of schema—the structure of a database that defines its objects. RDDs store structured and unstructured data together, which is not very efficient.
Because Spark cannot look inside RDD computations, it cannot optimize them to run more efficiently, and RDDs do not allow us to debug errors during runtime. They store the data as a collection of Java objects.
RDDs rely on serialization (converting an object into a stream of bytes so that it can be shipped across the network or persisted) and garbage collection (an automatic memory management technique that detects unused objects and frees their memory). Both are expensive for large collections of Java objects, which increases the overhead on the memory of the system.
This was when DataFrames were introduced to overcome the limitations Spark RDDs had. Now, what makes Spark DataFrames so unique? Let’s check out the features of Spark DataFrames that make them so popular.
Features of DataFrames
Some of the unique features of DataFrames are:
- Use of Input Optimization Engine: DataFrames make use of input optimization engines, such as the Catalyst Optimizer, to process data efficiently. The same engine serves the Python, Java, Scala, and R DataFrame APIs.
- Handling of Structured Data: DataFrames provide a schematic view of data. Here, the data has some meaning to it when it is being stored.
- Custom Memory Management: In RDDs, the data is stored in memory, whereas DataFrames store data off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the garbage collection overhead.
- Flexibility: DataFrames, like RDDs, can support various formats of data, such as CSV, Cassandra, etc.
- Scalability: DataFrames can be integrated with various other Big Data tools, and they allow processing megabytes to petabytes of data at once.
There are many ways to create DataFrames. Here are three of the most commonly used methods to create DataFrames:
- Creating DataFrames from JSON Files
Now, what are JSON files? JSON (JavaScript Object Notation) is a lightweight, text-based format that stores data as attribute–value pairs.
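A conventional, pretty-printed .json file might look like this (the employee values are chosen to match the examples used later in this tutorial):

```json
{
  "employees": [
    { "id": "1201", "name": "raju",    "age": "23" },
    { "id": "1202", "name": "krishna", "age": "25" }
  ]
}
```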
The above JSON is a simple employee database file that contains two records/rows.
When it comes to Spark, the .json files being loaded are not the typical .json files. We cannot load a normal multi-line JSON document into a DataFrame; by default, Spark expects the file to be in the line-delimited format shown below:
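Concretely, each line of the file must hold one complete, self-contained JSON object (often called the JSON Lines format). A two-record employee file, with field names assumed for illustration, would look like this:

```json
{"id": "1201", "name": "raju", "age": "23"}
{"id": "1202", "name": "krishna", "age": "25"}
```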
JSON files can be loaded onto DataFrames using the read.json() method, passing the path of the file we want to load.
Here, we are loading an Olympic medal count sheet onto a DataFrame. There are 10 fields in total. The function printSchema() prints the schema of the DataFrame.
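The medal-count file itself is not reproduced here, so the spark-shell sketch below assumes a hypothetical line-delimited olympics.json with a handful of the fields, purely to show the pattern; the schema printed for the real file would list all 10:

```scala
// spark-shell session; "olympics.json" is an assumed line-delimited file
scala> val medals = sqlContext.read.json("olympics.json")

scala> medals.printSchema()
root
 |-- country: string (nullable = true)
 |-- gold: long (nullable = true)
 |-- silver: long (nullable = true)
 |-- year: long (nullable = true)
```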
- Creating DataFrames from the Existing RDDs
DataFrames can also be created from the existing RDDs. First, we create an RDD and then load that RDD onto a DataFrame using the createDataFrame(Name_of_the_rdd_file) function.
In the below figure, we are creating an RDD first, which contains numbers from 1 to 10 and their cubes. Then, we will load that RDD onto a DataFrame.
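A spark-shell sketch of these steps (the variable names here are my own):

```scala
// Build an RDD of (number, cube) pairs for the numbers 1 to 10
scala> val cubesRDD = sc.parallelize(1 to 10).map(n => (n, n * n * n))

// Load the RDD onto a DataFrame and name the columns
scala> val cubesDF = sqlContext.createDataFrame(cubesRDD).toDF("number", "cube")

scala> cubesDF.show(3)
+------+----+
|number|cube|
+------+----+
|     1|   1|
|     2|   8|
|     3|  27|
+------+----+
```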
- Creating DataFrames from CSV Files
We can also create DataFrames by loading the .csv files.
Here is an example of loading a .csv file onto a DataFrame.
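In Spark 2.x and later, the built-in CSV reader on the SparkSession entry point (spark) can be used directly; a sketch with an assumed employee.csv file:

```scala
// spark-shell (Spark 2.x+); "employee.csv" is an assumed file with a header row
scala> val csvDF = spark.read
     |   .option("header", "true")       // first line holds the column names
     |   .option("inferSchema", "true")  // guess column types from the data
     |   .csv("employee.csv")

scala> csvDF.printSchema()
```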
For structured data manipulation, Spark DataFrames provide a domain-specific language. Let's understand it through an example of processing structured data using DataFrames, taking a dataset wherein all the details of the employees are stored. Now follow along with the steps for DataFrame operations:
Read JSON Document
Start by reading the JSON document to generate a DataFrame named dfs. You can use the following command to read the JSON document named employee.json. The data we need is shown as a table with the fields age, id, and name.
scala> val dfs = sqlContext.read.json("employee.json")
The output: Field names will be taken automatically from the employee.json file.
dfs: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
Show the Data
To see the data in the DataFrame, use the show() method. The command goes like this:
scala> dfs.show()
The output: You can now see the employee data in a neat tabular format, something like this:
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 23|1201|   raju|
| 25|1202|krishna|
| 30|1203| sanjay|
| 42|1204|  javed|
| 37|1205|prithvi|
+---+----+-------+
Use Age Filter
You can now use the following command to find the employees whose age is below 30 (age < 30).
scala> dfs.filter(dfs("age") < 30).show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 23|1201|   raju|
| 25|1202|krishna|
+---+----+-------+
Applications of DataFrame
Spark DataFrames are widely used in Advanced Analytics and Machine Learning. Data Scientists rely on them for increasingly sophisticated techniques to get their job done, and DataFrames can be used directly in MLlib's ML Pipeline API, where advanced analytics tasks are specified. Added to that, several different programs can run complex user-defined functions on DataFrames.
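As a sketch of what using DataFrames "directly in MLlib's ML Pipeline API" looks like, with column names and the training DataFrame assumed purely for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Assemble the (assumed) "age" column into an MLlib feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age"))
  .setOutputCol("features")

// A regression stage predicting an assumed "salary" label column
val lr = new LinearRegression().setLabelCol("salary")

// Stages chained into a single pipeline that consumes a DataFrame
val pipeline = new Pipeline().setStages(Array(assembler, lr))
// val model = pipeline.fit(trainingDF)  // trainingDF: a DataFrame of employees
```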
Spark Datasets
Datasets are an extension of the DataFrame API in Spark. In addition to the features of DataFrames and RDDs, Datasets provide various other functionalities.
They provide an object-oriented programming interface, which includes the concepts of classes and objects.
Datasets were introduced when Spark 1.6 was released. They provide the convenience of RDDs, the static typing of Scala, and the optimization features of DataFrames.
Datasets are a collection of Java Virtual Machine (JVM) objects that use Spark’s Catalyst Optimizer to provide efficient processing.
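A minimal sketch of this typed interface in the Spark 2.x spark-shell (where spark.implicits._ is pre-imported; the Employee case class and its values are assumed for illustration):

```scala
// A case class gives the Dataset its static, compile-time-checked type
scala> case class Employee(name: String, age: Int)

scala> val ds = Seq(Employee("raju", 23), Employee("krishna", 25)).toDS()

// Filters are plain Scala functions over typed objects, not column strings
scala> ds.filter(_.age > 24).show()
+-------+---+
|   name|age|
+-------+---+
|krishna| 25|
+-------+---+
```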
DataFrames vs RDDs vs Datasets
|Basis of Difference|RDDs|DataFrames|Datasets|
|---|---|---|---|
|What is it?|A low-level API|A high-level abstraction|A combination of both RDDs and DataFrames|
|Input Optimization Engine|Cannot make use of input optimization engines|Uses input optimization engines to generate logical queries|Uses Catalyst Optimizer for input optimization, as DataFrames do|
|Data Representation|Distributed across multiple nodes of a cluster|A collection of rows and named columns|An extension of DataFrames, providing the functionalities of both RDDs and DataFrames|
|Benefit|A simple API|Gives a schema for the distributed data|Improves memory usage|
|Immutability and Interoperability|Tracks data lineage information to recover the lost data|Once transformed into a DataFrame, the domain object cannot be regained|Can regenerate RDDs|
|Performance|Java Serialization and Garbage Collection overheads|Offers a huge performance improvement over RDDs|Operations are performed on serialized data to improve performance|
Though a few limitations exist and Datasets have evolved lately, DataFrames are still popular in the field of technology. Since they build on RDDs with a better level of abstraction, they are helpful in Advanced Analytics and Machine Learning, as they can directly access MLlib's Machine Learning Pipeline API. Moreover, developers can execute complex programs using DataFrames easily. Hence, DataFrames are still used by lots of users because of their incredibly fast processing and ease of use.