Spark SQL does not support using aggregate functions without an actual aggregation (i.e. inside a grouped mutate). sparklyr translates your commands into a SQL statement, which is why this works in dplyr on an ordinary data frame but doesn't work on a SparkDataFrame.
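You can inspect that translation yourself with show_query(), which prints the generated SQL instead of running it. A minimal sketch, assuming a local Spark connection and that d is the local data frame from the question:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
d_sdf <- copy_to(sc, d, "d_sdf")  # assumed: d is the question's local data frame

# show_query() prints the SQL that dplyr generates for Spark
d_sdf %>%
  group_by(id, x) %>%
  mutate(y = paste(y, collapse = " ")) %>%
  show_query()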
If you look at the SQL part of the error message, you can see that the translated statement is not valid SQL:
== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`
paste gets translated to CONCAT_WS, which joins columns with a separator and has no collapse argument; that is why collapse = ' ' ends up as the invalid ' ' AS "collapse" above. concat, by contrast, would paste columns together with no separator. Neither collapses rows within a group.
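You can check both translations with show_query() as well (a sketch, using the same assumed connection as above):

# paste() joins columns with a separator: CONCAT_WS(' ', `x`, `y`)
d_sdf %>% mutate(z = paste(x, y)) %>% show_query()

# concat() is passed through to Spark SQL and joins columns with no separator
d_sdf %>% mutate(z = concat(x, y)) %>% show_query()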
Spark SQL does have the aggregate functions collect_list and collect_set, but they produce list outputs.
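For instance (a sketch), collect_list on its own gives you an array column rather than a string, which is why it is wrapped in paste below:

# ys is an array<string> column, not a single string
d_sdf %>%
  group_by(id, x) %>%
  summarise(ys = collect_list(y))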
If you do not need the collapsed value repeated on every row, you can combine summarise, collect_list, and paste:
res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_list(y)))
result:
Source:   lazy query [?? x 3]
Database: spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id

     id     x   yconcat
  <chr> <chr>     <chr>
1     1   201       End
2     2   201 Other End
3     1   200 This That
4     2   200       The
If you do want the collapsed value repeated on every original row, you can join this back onto your original data:
d_sdf %>% left_join(res)
result:
Source:   lazy query [?? x 4]
Database: spark connection master=local[8] app=sparklyr local=TRUE

     id     x     y   yconcat
  <chr> <chr> <chr>     <chr>
1     1   200  This This That
2     1   200  That This That
3     2   200   The       The
4     2   201 Other Other End
5     1   201   End       End
6     2   201   End Other End
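As an aside, left_join guesses the join columns from the shared names (id and x here); you can spell them out if you prefer (a sketch):

d_sdf %>% left_join(res, by = c("id", "x"))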