Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in R Programming by (7.3k points)
edited by

I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will have columns Q3.2.1., Q3.2.2., and Q3.2.3.:

df <- data.frame(

  id = 1:10,

  time = as.Date('2009-01-01') + 0:9,

  Q3.2.1. = rnorm(10, 0, 1),

  Q3.2.2. = rnorm(10, 0, 1),

  Q3.2.3. = rnorm(10, 0, 1),

  Q3.3.1. = rnorm(10, 0, 1),

  Q3.3.2. = rnorm(10, 0, 1),

  Q3.3.3. = rnorm(10, 0, 1)

)

# Sample data

   id       time    Q3.2.1.     Q3.2.2.    Q3.2.3.     Q3.3.1.    Q3.3.2.     Q3.3.3.

1   1 2009-01-01 -0.2059165 -0.29177677 -0.7107192  1.52718069 -0.4484351 -1.21550600

2   2 2009-01-02 -0.1981136 -1.19813815  1.1750200 -0.40380049 -1.8376094  1.03588482

3   3 2009-01-03  0.3514795 -0.27425539  1.1171712 -1.02641801 -2.0646661 -0.35353058

...

I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:

   id       time loop_number        Q3.2        Q3.3

1   1 2009-01-01           1 -0.20591649  1.52718069

2   2 2009-01-02           1 -0.19811357 -0.40380049

3   3 2009-01-03           1  0.35147949 -1.02641801

...

11  1 2009-01-01           2 -0.29177677  -0.4484351

12  2 2009-01-02           2 -1.19813815  -1.8376094

13  3 2009-01-03           2 -0.27425539  -2.0646661

...

21  1 2009-01-01           3 -0.71071921 -1.21550600

22  2 2009-01-02           3  1.17501999  1.03588482

23  3 2009-01-03           3  1.11717121 -0.35353058

...

The tidyr library has the gather() function, which works great for combining one set of columns:

library(dplyr)

library(tidyr)

library(stringr)

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 

  mutate(loop_number = str_sub(loop_number,-2,-2)) %>%

  select(id, time, loop_number, Q3.2)

   id       time loop_number        Q3.2

1   1 2009-01-01           1 -0.20591649

2   2 2009-01-02           1 -0.19811357

3   3 2009-01-03           1  0.35147949

...

29  9 2009-01-09           3 -0.58581232

30 10 2009-01-10           3 -2.33393981

The resultant data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly—it successfully makes the two combined columns Q3.2 and Q3.3, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; the combinations will increase substantially for each group of columns in the actual data):

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 

  gather(loop_number, Q3.3, starts_with("Q3.3")) %>%

  mutate(loop_number = str_sub(loop_number,-2,-2))

   id       time loop_number        Q3.2        Q3.3

1   1 2009-01-01           1 -0.20591649  1.52718069

2   2 2009-01-02           1 -0.19811357 -0.40380049

3   3 2009-01-03           1  0.35147949 -1.02641801

...

89  9 2009-01-09           3 -0.58581232 -0.13187024

90 10 2009-01-10           3 -2.33393981 -0.48502131

Is there a way to use multiple calls to gather() like this, combining small subsets of columns like this while maintaining the correct number of rows?

1 Answer

0 votes
by
edited by

To gather multiple sets of columns, you can use the tidyr package as follows:

library("tidyr")

df <- data.frame(

  id = 1:10,

  time = as.Date('2009-01-01') + 0:9,

  Q3.2.1. = rnorm(10, 0, 1),

  Q3.2.2. = rnorm(10, 0, 1),

  Q3.2.3. = rnorm(10, 0, 1),

  Q3.3.1. = rnorm(10, 0, 1),

  Q3.3.2. = rnorm(10, 0, 1),

  Q3.3.3. = rnorm(10, 0, 1)

)

df %>%

  gather(key, value, -id, -time) %>%

  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%

  spread(question, value)

Output:

   id       time loop_number          Q3.2        Q3.3

1   1 2009-01-01           1  2.473868e-01 -0.52260859

2   1 2009-01-01           2 -1.139342e+00  0.64499495

3   1 2009-01-01           3  1.365931e+00 -0.88350400

4   2 2009-01-02           1  5.017121e-01  0.36718054

5   2 2009-01-02           2  1.120268e+00 -2.00949983

6   2 2009-01-02           3 -3.571883e-01 -0.04633724

7   3 2009-01-03           1  1.230658e+00  0.80449116

8   3 2009-01-03           2  1.603380e-01 -2.38161811

9   3 2009-01-03           3  2.903213e-01 -0.08863290

10  4 2009-01-04           1  1.278451e+00  0.55469112

The above code first gathers all the columns, then separates them into question and loop_number using the extract function, and then spreads questions back into columns using the spread function.

Related questions

Browse Categories

...