
I have a DataFrame that I read from a CSV file with many columns, such as timestamp, steps, heartrate, etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps")

the function sum cannot be resolved.

Am I using the sum function incorrectly?

1 Answer


In order to use these functions, you must import them first:

import org.apache.spark.sql.functions._

And then you can easily use them like this:

val df = CSV.load(args(0))

val sumSteps = df.agg(sum("steps")).first.get(0)

You can also cast the result if you need a concrete type, since get(0) returns Any:

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
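
As a side note, CSV.load looks like a helper from your own project. If you are on Spark 2.x or later, the built-in reader does the same job; a minimal sketch, assuming the file has a header row and that inferSchema gives "steps" a numeric type so sum works on it:

val df = spark.read
  .option("header", "true")      // assumption: first line holds the column names
  .option("inferSchema", "true") // infer numeric types so sum("steps") works
  .csv(args(0))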

Also, in case you want to aggregate multiple columns (e.g. "col1", "col2", ...), you can get all the aggregations at once:

val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
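
If the column list is long, you can also build the aggregation expressions programmatically rather than typing each one out. A minimal sketch, assuming every column except "timestamp" is numeric (column names taken from the question):

// One sum(...) expression per column, skipping the non-numeric timestamp.
val numericCols = df.columns.filterNot(_ == "timestamp")
val aggExprs = numericCols.map(c => sum(c).as(s"sum_$c"))
// agg takes a first Column plus varargs, hence the head/tail split.
val sums = df.agg(aggExprs.head, aggExprs.tail: _*).first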

