The GROUP BY clause is basically used to group the results of aggregate functions based on a specified column. However, this clause doesn’t perform aggregate operations on multiple levels of a hierarchy. For example, you can calculate the total of all employee salaries for each department in a company (one level of hierarchy) but you cannot calculate the total salary of all employees regardless of the department they work in (two levels of hierarchy).
groupBy is simply an equivalent of the GROUP BY clause in standard SQL.
table.groupBy($"foo", $"bar")
is equivalent to:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
CUBE operator consist of similar functionality as the ROLLUP operator, the CUBE operator has got the similar functionality; however, the CUBE operator is capable of calculating subtotals and grand totals for all permutations of the columns specified in it.
val df = Seq(("abc", 1L), ("abc", 2L), ("mnp", 2L), ("mnp", 2L)).toDF("x", "y")
df.show
// +---+---+
// | x| y|
// +---+---+
// |abc| 1|
// |abc| 2|
// |mnp| 2|
// |mnp| 2|
// +---+---+
and you compute cube(x, y) with count as an aggregation:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | abc|null| 2| <- count of records where x = abc
// | mnp| 2| 2| <- count of records where x = mnp AND y = 2
// | abc| 1| 1| <- count of records where x = abc AND y = 1
// | abc| 2| 1| <- count of records where x = abc AND y = 2
// |null|null| 4| <- total count of records
// | mnp|null| 2| <- count of records where x = mnp
// +----+----+-----+
As I mentioned earlier similar functioning of cube is similar to rollup which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | abc|null| 2| <- count where x is fixed to abc
// | mnp| 2| 2| <- count where x is fixed to mnp and y is fixed to 2
// | abc| 1| 1| ...
// | abc| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | mnp|null| 2| <- count where x is fixed to mnp
// +----+----+-----+
Now, given below is the result of plain groupBy:
df.groupBy($"x", $"y").count.show
// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |abc| 1| 1| <- this is identical to x = abc AND y = 1 in CUBE or ROLLUP
// |abc| 2| 1| <- this is identical to x = abc AND y = 2 in CUBE or ROLLUP
// |mnp| 2| 2| <- this is identical to x = mnp AND y = 2 in CUBE or ROLLUP
// +---+---+-----+