in Big Data Hadoop & Spark by (11.4k points)

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of 1 column with multiple rows. There is built in functionality for that in Scalding and I believe in Pandas in Python, but I can't find anything for the new Spark Dataframe.

I assume I can write a custom function of some sort that will do this, but I'm not even sure how to start, especially since I am a novice with Spark. If anyone knows how to do this with built-in functionality, or has suggestions for how to write something in Scala, it would be greatly appreciated.

1 Answer

by (32.3k points)

Spark has provided a pivot function since version 1.6.

Let me give you an example using nycflights13 in CSV format.

nycflights13 is a package that contains information about all flights that departed from NYC airports (EWR, JFK and LGA) in 2013: 336,776 flights in total. To help understand what causes delays, it also includes a number of other useful datasets: airlines, airports, planes and weather.

import org.apache.spark.sql.functions.avg
import sqlContext.implicits._  // enables the $"column" syntax

// Read the flights data, letting Spark infer column types from the CSV.
val flights = sqlContext
  .read
  .format("csv")
  .options(Map("inferSchema" -> "true", "header" -> "true"))
  .load("flights.csv")

// For each (origin, dest, carrier) group, spread the "hour" values into
// columns, filling each cell with the average arrival delay.
flights
  .groupBy($"origin", $"dest", $"carrier")
  .pivot("hour")
  .agg(avg($"arr_delay"))
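One tip: when you call pivot without listing the values, Spark first runs an extra job to compute the distinct values of the pivot column. If you already know them, you can pass them explicitly and skip that pass. A sketch against the same flights DataFrame, assuming the hour column holds the values 0 through 23:

```scala
// Supplying the pivot values up front (hours 0-23 here, an assumption
// about this dataset) avoids the extra distinct-values job.
flights
  .groupBy($"origin", $"dest", $"carrier")
  .pivot("hour", (0 to 23).toSeq)
  .agg(avg($"arr_delay"))
```

Any hour not present in the data simply produces a column of nulls, and any value you leave out of the list is dropped from the result.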
