+15 votes
5 views
in Big Data Hadoop & Spark by (1.1k points)

I want to know, in simple language, what all the differences between RDDs and DataFrames are.


 

3 Answers

+13 votes
by (13.2k points)
edited by
Before coming to the differences between RDD and DataFrame, we must know that, given the same data, both abstractions will compute and give the same results to the user; they differ in performance and in the way they compute the result. Let us first look at their functionality:

RDD:

It can be termed the building block of Spark. The internal, final computation is always done on RDDs no matter which abstraction (DataFrame or Dataset) is used, so it is a vital part of Spark. One of the most advantageous things about the RDD is its simplicity: it provides us with familiar OOP-style APIs. An RDD can also easily be cached if some data is to be re-evaluated.

DataFrame:

A DataFrame can simply be defined as an abstraction that gives a schema view of data. We can think of the data in a DataFrame like a table in a database. It works only on structured and semi-structured data, and it offers a huge performance improvement over RDDs because of features like custom memory management and optimized execution plans.

Differences

  1. RDD provides a more familiar OOP-style programming model with compile-time type safety, while a DataFrame detects attribute errors only at runtime.

  2. No built-in optimization engine is available for RDDs, while DataFrame queries are optimized by the Catalyst optimizer.

  3. In the case of RDDs, whenever the data needs to be distributed within the cluster or written to disk, it is done using Java serialization. There is no need to use Java serialization to encode the data in the case of DataFrames.

  4. RDDs are less efficient than DataFrames because serialization needs to be performed individually on each object, which takes more time.

  5. RDDs are slower than DataFrames at performing simple grouping and aggregation operations.
0 votes
by (32.3k points)
edited by

  • A data frame is a table, or a two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
  • DataFrame consists of a tabular format due to which it carries additional metadata, which allows Spark to run certain optimizations on the finalized query.
  • An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a black box of data and cannot be optimized, because the operations that can be performed against it are not as constrained.
  • However, you can get an RDD by applying the rdd method on a DataFrame, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
  • In general, it is recommended to use a DataFrame where possible due to the built-in query optimization.

0 votes
by (33.1k points)

DataFrame: 

  • A DataFrame is used for storing data in tables. 
  • It is equivalent to a table in a relational database, but with richer optimizations. 
  • It is a data abstraction and domain-specific language (DSL) applicable to structured and semi-structured data. 
  • It is a distributed collection of data in the form of named columns and rows. 
  • It has a matrix-like structure whose columns may be of different types (numeric, logical, factor, or character). 
  • We can say a DataFrame has a two-dimensional, array-like structure where each column contains the values of one variable and each row contains one set of values for each column. 
  • It combines features of lists and matrices.


RDD: 

  • It is the representation of a set of records: an immutable collection of objects suited to distributed computing. 
  • An RDD is a large collection of data, or an array of references to partitioned objects. 
  • Every dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. 
  • RDDs are fault-tolerant, i.e. self-recovered/recomputed in the case of failure. 
  • The dataset could be data loaded externally by the user, which can be in the form of a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure.

I hope this answer helps you!
