Back

Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Data Science by (17.6k points)

I have a very large table (1,000,000 X 20) to process and need to do it in a fast way.

For example, There are 2 columns X2 and X3 in my table:

enter image description here

    X1  X2                                          X3

c1  1   100020003001, 100020003002, 100020003003    100020003001, 100020003002, 100020003004

c2  2   100020003001, 100020004002, 100020004003    100020003001, 100020004007, 100020004009

c3  3   100050006003, 100050006001, 100050006001    100050006011, 100050006013, 100050006021

Now I would like to create 2 new columns which contain

1) the common words or the same numbers

For example: [1] "100020003001" "100020003002"

2) the count of the common words or the same numbers

For example: [1] 2

I have tried the method from the below thread, however, the processing time is slow since I did it with for loop:

Count common words in two strings

 library(stringi)

 Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))

Thanks for the help! I am really struggling here...

1 Answer

0 votes
by (41.4k points)

See the code below:

library(tidyverse)

df1 %>%

   mutate(common_words = map2(strsplit(X2, ", "),

                              strsplit(X3, ", "),  

                                   intersect), 

          count = lengths(common_words))

# X1                                       X2                                       X3

#1  1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004

#2  2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009

#3  3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021

#                common_words count

#1 100020003001, 100020003002     2

#2               100020003001     1

#3                                0

Here, we are splitting the 'X2', 'X3' columns by the ‘,’, 

Then, getting the intersect of corresponding list elements with map2 and to ‘count ’ the number of elements in the list, we are using lengths.

 

Or we can also use base R

 

df1$common_words <- Map(intersect, strsplit(df1$X2, ", "), strsplit(df1$X3, ", "))

df1$count <- lengths(df1$common_words)

data

df1 <- structure(list(X1 = 1:3, X2 = c("100020003001, 100020003002, 100020003003", 

"100020003001, 100020004002, 100020004003", "100050006003, 

 100050006001, 100050006001"

 ), X3 = c("100020003001, 100020003002, 100020003004", "100020003001, 

 100020004007, 100020004009", 

 "100050006011, 100050006013, 100050006021")), class = "data.frame", 

  row.names = c("c1", "c2", "c3"))

If you wish to learn more about how to use python for data science, then go through data science python programming course by Intellipaat for more insights.

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

28.4k questions

29.7k answers

500 comments

94.1k users

Browse Categories

...