1 view

Question is pretty much in the title. I can't find any detailed documentation regarding the differences.

I do notice a difference because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using 'cube', I got a lot of null values on the expressions I often grouped by.

by (32.2k points)
• The GROUP BY clause is basically used to group the results of aggregate functions based on a specified column. However, this clause doesn’t perform aggregate operations on multiple levels of a hierarchy. For example, you can calculate the total of all employee salaries for each department in a company (one level of hierarchy) but you cannot calculate the total salary of all employees regardless of the department they work in (two levels of hierarchy).

groupBy is simply an equivalent of the GROUP BY clause in standard SQL.

table.groupBy(\$"foo", \$"bar")

is equivalent to:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

• ROLLUP operators let you extend the functionality of GROUP BY clauses by calculating subtotals and grand totals for a set of columns.

CUBE operator consist of similar functionality as the ROLLUP operator, the CUBE operator has got the similar functionality; however, the CUBE operator is capable of calculating subtotals and grand totals for all permutations of the columns specified in it.

• cube is equivalent to CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:

val df = Seq(("abc", 1L), ("abc", 2L), ("mnp", 2L), ("mnp", 2L)).toDF("x", "y")

df.show

// +---+---+

// |  x| y|

// +---+---+

// |abc|  1|

// |abc|  2|

// |mnp|  2|

// |mnp|  2|

// +---+---+

and you compute cube(x, y) with count as an aggregation:

df.cube(\$"x", \$"y").count.show

// +----+----+-----+

// |   x| y|count|

// +----+----+-----+

// |null|   1| 1| <- count of records where y = 1

// |null|   2| 3| <- count of records where y = 2

// | abc|null|    2| <- count of records where x = abc

// | mnp|   2| 2| <- count of records where x = mnp AND y = 2

// | abc|   1| 1| <- count of records where x = abc AND y = 1

// | abc|   2| 1| <- count of records where x = abc AND y = 2

// |null|null|    4| <- total count of records

// | mnp|null|    2| <- count of records where x = mnp

// +----+----+-----+

As I mentioned earlier similar functioning of cube is similar to rollup which computes hierarchical subtotals from left to right:

df.rollup(\$"x", \$"y").count.show

// +----+----+-----+

// |   x| y|count|

// +----+----+-----+

// | abc|null|    2| <- count where x is fixed to abc

// | mnp|   2| 2| <- count where x is fixed to mnp and y is fixed to  2

// | abc|   1| 1| ...

// | abc|   2| 1| ...

// |null|null|    4| <- count where no column is fixed

// | mnp|null|    2| <- count where x is fixed to mnp

// +----+----+-----+

Now, given below is the result of plain groupBy:

df.groupBy(\$"x", \$"y").count.show

// +---+---+-----+

// |  x| y|count|

// +---+---+-----+

// |abc|  1| 1| <- this is identical to x = abc AND y = 1 in CUBE or ROLLUP

// |abc|  2| 1| <- this is identical to x = abc AND y = 2 in CUBE or ROLLUP

// |mnp|  2| 2| <- this is identical to x = mnp AND y = 2 in CUBE or ROLLUP

// +---+---+-----+