
Can anyone tell me the difference between a Dataset and a DataFrame in Spark?

1 Answer


A DataFrame is similar to a table in a relational database: a distributed collection of rows organized into named columns. A Dataset extends the DataFrame API with a type-safe, object-oriented programming interface in the style of the RDD API. Since Spark 2.0, a DataFrame in Scala is simply an alias for Dataset[Row].
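To make the distinction concrete, here is a minimal Scala sketch that builds both from the same data. The Person case class, the object name, and the local master setting are assumptions for illustration, not something from the question.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class, used for illustration only.
case class Person(name: String, age: Int)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    // Local session so the sketch runs without a cluster (assumption).
    val spark = SparkSession.builder()
      .appName("DataFrameVsDataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19))

    // Untyped view: rows with named columns, like a relational table.
    val df: DataFrame = people.toDF()
    df.printSchema()

    // Typed view: every element is a Person.
    val ds: Dataset[Person] = people.toDS()
    val names: Dataset[String] = ds.map(_.name)
    names.show()

    // A DataFrame can be given a type again with as[T].
    val typed: Dataset[Person] = df.as[Person]
    typed.show()

    spark.stop()
  }
}
```

Moving between the two is cheap in both directions: toDF() drops the static type, and as[Person] puts it back as long as the column names and types match the case class.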

With a DataFrame, referring to a column that does not exist in the table is not caught by the compiler; the error only appears at runtime, when Spark analyzes the query. Datasets offer compile-time type safety, so a typed operation that uses a missing field fails to compile. The similarity is that both DataFrame and Dataset can read from the same data sources.
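A small sketch of that difference, assuming a hypothetical Person case class and a deliberately misspelled column name "agee":

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Hypothetical domain class, used for illustration only.
case class Person(name: String, age: Int)

object ColumnErrors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ColumnErrors")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("Ann", 34)).toDS()
    val df = ds.toDF()

    // DataFrame: the misspelled column name compiles fine,
    // then fails at runtime when Spark analyzes the query.
    try {
      df.select("agee").show()
    } catch {
      case e: AnalysisException =>
        println(s"Runtime failure: ${e.getMessage}")
    }

    // Dataset: the lambda is checked against Person by the Scala compiler.
    ds.map(_.age + 1).show()
    // ds.map(_.agee + 1) // does not compile: agee is not a member of Person

    spark.stop()
  }
}
```

Note that the compile-time check only covers typed operations such as map and filter with lambdas. Calling ds.select("agee") on a Dataset still uses a string column name and fails at runtime, just like the DataFrame case.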

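Once an RDD of domain objects is converted into a DataFrame, the original domain object cannot be regenerated, because the DataFrame only holds generic Row objects. A Dataset overcomes this disadvantage: it keeps the element type, so converting it back gives an RDD of the original domain objects. (Both DataFrames and Datasets are immutable, so immutability is not what separates them.)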
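The round trip can be sketched as follows. Again, Person and the object name are hypothetical and only serve the example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical domain class, used for illustration only.
case class Person(name: String, age: Int)

object RoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RoundTrip")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val rdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("Ann", 34)))

    // Through a DataFrame: the element type becomes the generic Row,
    // so fields have to be looked up by name and cast.
    val rows: RDD[Row] = rdd.toDF().rdd
    val nameFromRow: String = rows.first().getAs[String]("name")
    println(nameFromRow)

    // Through a Dataset: the element type stays Person.
    val back: RDD[Person] = rdd.toDS().rdd
    val ageFromPerson: Int = back.first().age
    println(ageFromPerson)

    spark.stop()
  }
}
```

If you already have a DataFrame and want the domain objects back, df.as[Person].rdd gets you there, provided the columns match the case class fields.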

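DataFrames reduce memory usage by using off-heap memory for serialization, which also lowers garbage-collection overhead. Datasets use encoders to convert objects to Spark's internal binary format, which allows many operations to be performed directly on the serialized data without deserializing whole objects.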

DataFrames are available in Java, Python, Scala, and R, whereas Datasets are available only in Scala and Java.
