Sparklyr: Use group_by and then concatenate strings from rows in a group

Question

asked Jul 10, 2019 in Data Science by sourav (17.6k points)

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.

Here is a simple example that I think should work but doesn't:

library(sparkylr)
d <- data.frame(id=c("1", "1", "2", "2", "1", "2"),
x=c("200","200","200","201","201","201"),
y=c("This","That","The","Other","End","End"))
d_sdf <- copy_to(sc, d, "d")
d_sdf %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))

What I'd like it to produce is:

Source: local data frame [6 x 3]
Groups: id, x [4]
# A tibble: 6 x 3
id x y
<fctr> <fctr> <chr>
1 1 200 This That
2 1 200 This That
3 2 200 The
4 2 201 Other End
5 1 201 End
6 2 201 Other End

I get the following error:

Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42

Note that the using the same code on a data.frame works fine:

d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))

1 Answer

Shlok Pandey · Answer 1 · 2019-07-12T07:22:12+0000

Spark sql doesn't support it if you use aggregate functions without aggregating.

Sparklyr translates commands to an sql statement, this is the reason why it works in dplyr with an ordinary dataframe but does’nt work in a SparkDataFrame.

If you look at the second bit in the error message,you can see that it is incorrect.

== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`

paste gets translated to CONCAT_WS. concat however would paste columns together.

Here, you can also use collect_list and collect_set, but they produce list outputs.

You can use summarise, collect_list, and paste, if you do not want your result to be replicated by the same row.

res <- d_sdf %>%
group_by(id, x) %>%
summarise( yconcat =paste(collect_list(y)))
result:
Source: lazy query [?? x 3]
Database: spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id
id x y
<chr> <chr> <chr>
1 1 201 End
2 2 201 Other End
3 1 200 This That
4 2 200 The

If you want to have your rows replicated, you can join this back onto your original data.

d_sdf %>% left_join(res)
result:
Source: lazy query [?? x 4]
Database: spark connection master=local[8] app=sparklyr local=TRUE
id x y yconcat
<chr> <chr> <chr> <chr>
1 1 200 This This That
2 1 200 That This That
3 2 200 The The
4 2 201 Other Other End
5 1 201 End End
6 2 201 End Other End

Sparklyr: Use group_by and then concatenate strings from rows in a group

1 Answer

Related questions

Browse Categories