Before comparing RDD and DataFrame, note that, given the same data, both abstractions compute and return the same results to the user. They differ in performance and in how they compute the result. Let us first look at what each one does:
RDD (Resilient Distributed Dataset) can be termed the building block of Spark. The final internal computation is always done on RDDs, no matter which abstraction, DataFrame or Dataset, is used, which makes RDD the vital part. One of the biggest advantages of RDD is its simplicity: it provides familiar OOP-style APIs. An RDD can also be cached easily if some data needs to be re-evaluated.
A DataFrame can simply be defined as an abstraction that gives a schema view of the data. We can think of the data in a DataFrame as a table in a database. It works only on structured and semi-structured data, but it offers a large performance improvement over RDDs because of features such as custom memory management and optimized execution plans.
1. RDD provides a more familiar OOP-style programming model with compile-time type safety (in statically typed languages such as Scala and Java), while DataFrame detects a wrong attribute name only at runtime.
2. RDD has no built-in optimization engine, whereas DataFrame queries are optimized by the Catalyst optimizer.
3. In the case of RDD, whenever data needs to be distributed within the cluster or written to disk, it is done using Java serialization by default. DataFrame does not need Java serialization to encode the data.
4. RDD is less efficient than DataFrame because serialization has to be performed on each object individually, which takes more time.
5. RDD is slower than DataFrame at simple grouping and aggregation operations.
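The serialization difference in points 3 and 4 can be illustrated with a small plain-Python model. This is only a sketch and does not use Spark itself. It assumes Python's `pickle` as a stand-in for per-object Java serialization and a simple `struct`-packed column layout as a stand-in for a schema-aware binary encoding. It is not the format Spark actually uses internally, and the sample records and field names are hypothetical.

```python
import pickle
import struct

# Hypothetical sample data: 1,000 (store_id, amount) records.
rows = [{"store_id": i % 10, "amount": float(i)} for i in range(1000)]

# RDD-style model: every record is an opaque object and is serialized on
# its own, so the field names travel with each record.
per_object_size = sum(len(pickle.dumps(row)) for row in rows)

# DataFrame-style model: the schema (int32 store_id, float64 amount) is
# known up front, so each column is packed as raw values with no per-row
# metadata.
n = len(rows)
store_col = struct.pack(f"<{n}i", *(row["store_id"] for row in rows))
amount_col = struct.pack(f"<{n}d", *(row["amount"] for row in rows))
columnar_size = len(store_col) + len(amount_col)

# Decode the columns again to confirm that no information was lost.
decoded = [
    {"store_id": s, "amount": a}
    for s, a in zip(struct.unpack(f"<{n}i", store_col),
                    struct.unpack(f"<{n}d", amount_col))
]

print("per-object bytes:", per_object_size)
print("columnar bytes:  ", columnar_size)
```

In this model the per-object encoding is several times larger because every record repeats its own structure, while the schema-aware encoding stores only the values. The same idea, applied at cluster scale, is one reason the DataFrame path can shuffle and spill data more cheaply.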