
I have the following DataFrame:

val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id,
  COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")


I'm trying to convert the rows to Rating objects, but since x(0) returns Any, this fails:

val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))


error: value toInt is not a member of Any

1 Answer


Let's take some dummy data:

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema

// root
//  |-- user_id: integer (nullable = false)
//  |-- category_id: integer (nullable = false)
//  |-- count: long (nullable = false)

Now, here are some ways to access the Row values while keeping the expected types:

1. Pattern matching:

import org.apache.spark.sql.Row
import org.apache.spark.mllib.recommendation.Rating

transactions_with_counts.map {
  case Row(user_id: Int, category_id: Int, rating: Long) =>
    Rating(user_id, category_id, rating)
}

2. Typed get* methods like getInt, getLong:

transactions_with_counts.map(
  r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
)

3. The getAs method, which can use both names and indices:

transactions_with_counts.map(r => Rating(
  r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
))

It can also be used to properly extract user-defined types, including mllib.linalg.Vector. Obviously, accessing by name requires a schema.
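For example, a minimal sketch of pulling a vector column back out of a Row; the featurized DataFrame and its "features" column are hypothetical names, not part of the question:

import org.apache.spark.mllib.linalg.Vector

// Hypothetical: featurized is a DataFrame whose "features" column holds mllib Vectors
val vectors = featurized.map(r => r.getAs[Vector]("features"))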

4. Converting to a statically typed Dataset (Spark 1.6+ / 2.0+):

transactions_with_counts.as[(Int, Int, Long)]
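
From there, a sketch of getting back to Rating objects, assuming spark.implicits._ (or sqlContext.implicits._ on 1.6) is in scope for the tuple encoder and the same mllib Rating class as above:

val ratings = transactions_with_counts
  .as[(Int, Int, Long)]
  .map { case (user_id, category_id, rating) => Rating(user_id, category_id, rating) }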
