0 votes
1 view
in Data Science by (17.6k points)

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.

Here is a simple example that I think should work but doesn't:

library(sparkylr)

d <- data.frame(id=c("1", "1", "2", "2", "1", "2"), 

             x=c("200","200","200","201","201","201"), 

             y=c("This","That","The","Other","End","End"))

d_sdf <- copy_to(sc, d, "d")

d_sdf %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))

What I'd like it to produce is:

Source: local data frame [6 x 3]

Groups: id, x [4]

# A tibble: 6 x 3

      id      x         y

  <fctr> <fctr>     <chr>

1      1    200 This That

2      1    200 This That

3      2    200       The

4      2    201 Other End

5      1    201       End

6      2    201 Other End

I get the following error:

Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42

Note that the using the same code on a data.frame works fine:

d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))

1 Answer

0 votes
by (38.2k points)

Spark sql doesn't support it if you use aggregate functions without aggregating.

Sparklyr translates commands to an sql statement, this is the reason why it works in dplyr with an ordinary dataframe but does’nt  work in a SparkDataFrame.

 If you look at the second bit in the error message,you can see that it is incorrect.

== SQL ==

SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`

paste gets translated to CONCAT_WS. concat however would paste columns together.

Here, you can also use collect_list and collect_set, but they produce list outputs.

You can use summarise, collect_list, and paste, if you do not want your result to be replicated by the same row.

res <- d_sdf %>% 

      group_by(id, x) %>% 

      summarise( yconcat =paste(collect_list(y)))

result:

Source:     lazy query [?? x 3]

Database:   spark connection master=local[8] app=sparklyr local=TRUE

Grouped by: id

     id     x         y

  <chr> <chr>     <chr>

1     1   201       End

2     2   201 Other End

3     1   200 This That

4     2   200       The

If you want to have your rows replicated, you can join this back onto your original data.

d_sdf %>% left_join(res)

result:

Source:     lazy query [?? x 4]

Database:   spark connection master=local[8] app=sparklyr local=TRUE

     id     x     y   yconcat

  <chr> <chr> <chr>     <chr>

1     1   200  This This That

2     1   200  That This That

3     2   200   The       The

4     2   201 Other Other End

5     1   201   End       End

6     2   201   End Other End

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...