
I have a DataFrame that I read from a CSV file with many columns, like timestamp, steps, heart rate, etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use functions like these: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps")

the function sum cannot be resolved.

Am I using the function sum incorrectly?

1 Answer


In order to use these functions, you must import them first:

import org.apache.spark.sql.functions._

And then you can easily use them like this:

val df = CSV.load(args(0))

val sumSteps = df.agg(sum("steps")).first.get(0)
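For context, here is a minimal self-contained sketch (assuming a local SparkSession and a CSV file with a header row; the CSV.load call from the question appears to be a custom helper, so plain spark.read is used instead):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("SumExample")
  .master("local[*]")
  .getOrCreate()

// Read the CSV with a header row and let Spark infer the column types
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(args(0))

// agg(sum(...)) produces a one-row DataFrame; first.get(0) extracts the single value
val sumSteps = df.agg(sum("steps")).first.get(0)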

You can also cast the result if needed:

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
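One caveat: getLong throws a NullPointerException if the aggregated value is null, which happens when the column contains only nulls (or the DataFrame is empty). A defensive variant using coalesce (a sketch, not part of the original answer):

// coalesce replaces a null sum with 0 so getLong never sees a null
val sumSteps: Long = df.agg(coalesce(sum("steps"), lit(0)).cast("long")).first.getLong(0)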

Also, in case you want to sum multiple columns (e.g. "col1", "col2", ...), you can get all the aggregations at once:

val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
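If the list of columns is long or only known at runtime, you can build the aggregation expressions programmatically. A sketch (with functions._ imported as above; note that agg takes a first Column plus varargs, hence the head/tail split):

val cols = Seq("col1", "col2")  // the columns you want to sum
val sumExprs = cols.map(c => sum(c).as(s"sum_$c"))
val sums = df.agg(sumExprs.head, sumExprs.tail: _*).first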

