in Big Data Hadoop & Spark by (11.4k points)
I'm wondering what the difference is between an RDD and a DataFrame in Apache Spark. (As of Spark 2.0.0, DataFrame is just a type alias for Dataset[Row].)

1 Answer


DataFrame: A DataFrame stores data in tabular form. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood, because Spark knows the schema and can optimize queries before running them. It is both a data abstraction and a domain-specific language (DSL) for structured and semi-structured data: a distributed collection of rows organized into named columns. Like a data frame in R, it is a two-dimensional structure in which each column holds the values of one variable and may have its own type (for example numeric, boolean, or string), and each row holds one value for each column, much as an R data frame combines features of lists and matrices.
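To make the "named columns plus DSL" idea concrete, here is a minimal sketch. The sample data, the column names, and the local SparkSession are assumptions made up for illustration; in spark-shell the `spark` value already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Local session for illustration only (assumption); spark-shell provides `spark` already.
val spark = SparkSession.builder().master("local[*]").appName("df-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (name, dept, salary).
val employees = Seq(
  ("Ana", "eng", 120.0),
  ("Bo", "eng", 100.0),
  ("Cy", "ops", 80.0)
).toDF("name", "dept", "salary")

// Every column has a name and a type that Spark knows about.
employees.printSchema()

// Declarative, SQL-like query expressed through the DataFrame DSL.
// Spark optimizes the whole plan before anything is executed.
val avgByDept = employees
  .filter(col("salary") > 90.0)
  .groupBy("dept")
  .agg(avg("salary").as("avg_salary"))

avgByDept.show()
```

Because the query is described rather than hand-coded as functions over objects, Spark is free to reorder or prune work, which is where the "better optimization" comes from.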

RDD: RDD stands for Resilient Distributed Dataset, the fundamental data structure of Spark. It lets a programmer perform in-memory computations on large clusters in a fault-tolerant manner. An RDD is an immutable, distributed collection of records, which can be arbitrary objects. Each RDD is logically divided into partitions spread across many servers, so that they can be computed on different nodes of the cluster. The data can be loaded from external sources such as CSV, JSON, or text files, or from a database via JDBC, and an RDD imposes no particular structure on it.
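A small sketch of the RDD model described above: partitioned data, lazy transformations that return new immutable RDDs, and an action that triggers the computation. The numbers and the partition count of 4 are arbitrary assumptions, and the local session is only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption); spark-shell provides `spark` and `sc`.
val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection across 4 partitions (arbitrary choice).
val numbers = sc.parallelize(1 to 10, numSlices = 4)

// Transformations are lazy and each one returns a new, immutable RDD.
val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

// An action such as reduce triggers the distributed computation.
val total = squaresOfEvens.reduce(_ + _)
println(s"partitions = ${numbers.getNumPartitions}, total = $total")

// The lineage printed here is what Spark replays to rebuild a lost partition.
println(squaresOfEvens.toDebugString)
```

Note that the functions passed to `filter` and `map` are opaque to Spark, so it cannot optimize them the way it optimizes DataFrame queries.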

Dataset: A Dataset is a data structure in Spark SQL that is strongly typed and maps to a relational schema. It represents structured queries and uses encoders to convert between JVM objects and Spark's internal representation. It extends the DataFrame API with a type-safe, object-oriented programming interface, and is intended for working with structured and semi-structured data.
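The type-safety point is easier to see in code. This is a minimal sketch, assuming a hypothetical case class `Reading` and a local SparkSession; neither comes from the original answer.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type. Define it at top level (not inside a method)
// so Spark can derive an encoder for it.
case class Reading(id: String, val1: Double, val2: Double)

// Local session for illustration only (assumption); spark-shell provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("ds-sketch").getOrCreate()
import spark.implicits._

val readings: Dataset[Reading] = Seq(
  Reading("first", 2.0, 7.0),
  Reading("second", 3.5, 2.5),
  Reading("third", 7.0, 5.9)
).toDS()

// Typed lambda: r is a Reading, so a misspelled field name fails at compile time.
val bigVal1: Dataset[Reading] = readings.filter((r: Reading) => r.val1 > 3.0)

// A DataFrame is a Dataset[Row]: columns are looked up by name at run time.
val asDataFrame = readings.toDF()

// .as[T] goes back to the typed view when the columns match the case class.
val backToTyped: Dataset[Reading] = asDataFrame.as[Reading]

bigVal1.show()
```

With the DataFrame view, a typo in a column name only shows up when the query is analyzed at run time; with `Dataset[Reading]` the compiler catches it.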

As for converting between the two: yes, you can convert an RDD to a DataFrame and a DataFrame back to an RDD:

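1. RDD to DataFrame with createDataFrame() or .toDF()

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Assumes a spark-shell session, where `spark` (SparkSession) and `sc` (SparkContext) already exist.
val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row("first", 2.0, 7.0),
    Row("second", 3.5, 2.5),
    Row("third", 7.0, 5.9)
  )
)

// An RDD[Row] carries no type information, so createDataFrame needs an explicit schema.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("val1", DoubleType, nullable = true),
  StructField("val2", DoubleType, nullable = true)
))
val df = spark.createDataFrame(rowsRdd, schema)
// For an RDD of tuples or case classes, import spark.implicits._ and call rdd.toDF("id", "val1", "val2") instead.

df.show()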

+------+----+----+
|    id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+

2. DataFrame/Dataset to RDD with .rdd

// `rdd` is a member of Dataset, accessed without parentheses.
val rddFromDf: RDD[Row] = df.rdd // DataFrame to RDD
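For completeness, here is a hedged sketch of the full round trip using `.toDF()` on an RDD of tuples, where the element type is known and no explicit schema is needed. The local SparkSession and the variable names are assumptions for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Local session for illustration only (assumption); spark-shell provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("roundtrip-sketch").getOrCreate()
import spark.implicits._

// RDD of tuples: the element type is known, so .toDF() can infer the schema.
val tupleRdd: RDD[(String, Double, Double)] = spark.sparkContext.parallelize(
  Seq(("first", 2.0, 7.0), ("second", 3.5, 2.5), ("third", 7.0, 5.9))
)
val df = tupleRdd.toDF("id", "val1", "val2")

// DataFrame -> RDD[Row]: fields are read back by name or position.
val rowRdd: RDD[Row] = df.rdd
val ids = rowRdd.map(_.getAs[String]("id")).collect()

// Typed Dataset -> RDD keeps the element type instead of Row.
// For tuple types, columns are matched by position.
val typedRdd: RDD[(String, Double, Double)] = df.as[(String, Double, Double)].rdd
val val1Sum = typedRdd.map(_._2).reduce(_ + _)

println(s"ids = ${ids.mkString(", ")}, sum of val1 = $val1Sum")
```

Going from a DataFrame to an RDD gives up the query optimizations, so it is usually worth staying in the DataFrame or Dataset API unless you need low-level control.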
