0 votes
in Big Data Hadoop & Spark by (55.6k points)

Can anyone explain the difference between a Dataset and a DataFrame in Spark?

1 Answer

0 votes
by (119k points)

A DataFrame is a distributed collection of data organized into named columns, with type information for each column. Conceptually, it is equivalent to a table in a relational database or a data frame in R/Python. Execution on a DataFrame is lazy (as with RDDs): transformations are only computed when an action is triggered. DataFrames can be built from several formats such as Avro, CSV, and JSON, and from storage systems such as HDFS, Hive tables, and MySQL.
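To make the lazy execution concrete, here is a minimal Scala sketch. It assumes Spark SQL is on the classpath and runs in local mode; the app name, the sample rows, and the column names `name` and `age` are made up for illustration. Data loaded from a file would typically go through `spark.read` instead of an in-memory `Seq`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame from in-memory data (hypothetical sample rows).
    val people: DataFrame =
      Seq(("Alice", 34), ("Bob", 19), ("Carol", 52)).toDF("name", "age")

    // The schema carries the column names and their types.
    people.printSchema()

    // Transformations are lazy: this only describes the computation.
    val adults: DataFrame = people.filter($"age" >= 21).select("name")

    // An action such as show() triggers the actual execution.
    adults.show()

    spark.stop()
  }
}
```

Nothing is computed when `filter` and `select` are called; Spark only runs the job when `show()` asks for results.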

Datasets are an extension of DataFrames. The Dataset API combines strongly typed and untyped characteristics: a Dataset is a collection of strongly typed JVM objects, whereas a DataFrame is untyped (in Scala, a DataFrame is a Dataset of generic Row objects). Like DataFrames, Datasets use Spark’s Catalyst optimizer, which exposes expressions and data fields to the query planner. Datasets also support data from different sources.
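The typed versus untyped distinction can be sketched in Scala as follows. This is only an illustration under assumptions: the `Person` case class, the sample data, and the local-mode session are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed: a Dataset[Person]; field access is checked at compile time.
    val people: Dataset[Person] =
      Seq(Person("Alice", 34), Person("Bob", 19)).toDS()
    val adults: Dataset[Person] = people.filter(p => p.age >= 21)

    // Untyped: a DataFrame; columns are resolved by name at run time.
    val df: DataFrame = people.toDF()
    val adultNames: DataFrame = df.filter($"age" >= 21).select("name")

    // A DataFrame can be converted back to a typed Dataset.
    val typedAgain: Dataset[Person] = df.as[Person]

    // explain() prints the physical plan produced by the Catalyst optimizer.
    adultNames.explain()

    adults.show()
    typedAgain.show()

    spark.stop()
  }
}
```

With the typed version, a misspelled field such as `p.agee` fails at compile time, while a misspelled column name in the DataFrame version is only reported when the query is analyzed at run time.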

