What is a DataFrame in Spark?

Question

1 Answer

Praveen_1998 · Answer 1 · 2020-06-04T14:06:25+0000

In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually similar to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood (queries are planned by Spark's Catalyst optimizer). DataFrames can be constructed from a variety of sources, such as structured data files, Hive tables, external databases, or existing RDDs.

Here are a few examples of how you can create and use a DataFrame in PySpark:

from pyspark.sql import SparkSession

# Entry point in Spark 2.0+; Hive support is needed to read Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create a DataFrame from the users table in Hive
users = spark.table("users")
# Create a DataFrame from JSON files in S3
# (the older context.load(path, "json") call was removed in Spark 2.0)
logs = spark.read.json("s3n://path_to_data.json")

# Create a new DataFrame of "young users" (age less than 21)
young = users.filter(users.age < 21)
# Alternatively, using pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young.name, young.age + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Left outer join of young users with logs on userId
young.join(logs, logs.userId == young.userId, "left_outer")

