Can anyone explain what a DataFrame is in Spark?

1 Answer

In Spark, a DataFrame is a distributed collection of data organized into rows and named columns. It is conceptually similar to a table in a relational database or a data frame in R or Python (pandas), but with richer optimizations under the hood. DataFrames can be constructed from a variety of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

Here are a few examples of how you can create and use a DataFrame:

# Create a DataFrame from the users table in Hive

users = context.table("users")

# Create a DataFrame from JSON files in S3 (context.load was removed in Spark 2.0; use the reader API)

logs = context.read.json("s3n://path_to_data.json")

# Create a new DataFrame of "young users" (age under 21)

young = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax

young = users[users.age < 21]

# Increment everybody's age by 1 (select returns a new DataFrame; young itself is unchanged)

young.select(young.name, young.age + 1)

# Count the number of young users by gender

young.groupBy("gender").count()

# Join the young users DataFrame with the logs DataFrame

young.join(logs, logs.userId == young.userId, "left_outer")
