
in Data Science by (18.4k points)

I am stuck on a text mining assignment. I am working through Zhao's text mining with Twitter. I have downloaded tweets and converted them to a data frame, and now I want to remove from the corpus all terms that occur only once, instead of using the stopword list.

Here is the code:

library(tm)

tf1 <- Corpus(VectorSource(tweets.df$text))
tf1 <- tm_map(tf1, content_transformer(tolower))

removeUser <- function(x) gsub("@[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeUser))

removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeNumPunct))

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeURL))

tf1 <- tm_map(tf1, stripWhitespace)

# Using a TermDocumentMatrix to find the terms with count 1; I don't know any other way
tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))
ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)

tf1Copy <- tf1
tf1List <- setdiff(tf1Copy, ones)
tf1CList <- paste(unlist(tf1List), sep = "", collapse = " ")
tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)

tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))

# Just to test success...
ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)
(ones2)

Error:

Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(senior data scientist global strategy firm
25.0010230541229 48 17 6 6 115 1 186 0 1 en kdnuggets poll primary programming language for analytics data mining data scienc
25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet of everything official statistics in the information age anu june
25.0020229816437 48 17 6 6 115 1 186 0 3 en handling and processing strings in r an ebook in pdf format pages
25.0020229816437 48 17 6 6 115 1 186 0 4 en webinar getting your data into r by hadley wickham am edt june th
25.0020229816437 48 17 6 6 115 1 186 0 5 en before loading the rdmtweets dataset please run librarytwitter to load required package
25.0020229816437 48 17 6 6 115 1 186 0 6 en an infographic on sas vs r vs python datascience via
25.0020229816437 48 17 6 6 115 1 186 0 7 en r is again the kdnuggets poll on top analytics data mining science software
25.0020229816437 48 17 6 6 115 1 186 0 8 en i will run

Warning:

Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''

1 Answer

by (36.8k points)

The code below shows how to find the terms that occur only once and remove them from the corpus, which is what you need.

library(tm)

mytweets <- c("This is a doc", "This is another doc")
corp <- Corpus(VectorSource(mytweets))
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
#   This is another doc
##            ^^^

dtm <- DocumentTermMatrix(corp)
inspect(dtm)
#     Terms
# Docs another doc this
#    1       0   1    1
#    2       1   1    1

(stopwords <- findFreqTerms(dtm, 1, 1))
# [1] "another"

corp <- tm_map(corp, removeWords, stopwords)
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is  doc
##        ^ 'another' is gone
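To apply the same idea to your own corpus, reuse the ones vector you computed from tdmtf1 and pass it (not the pasted-together tf1CList string) as the words argument. Note that removeWords() joins all the supplied words into a single regular expression (you can see the sprintf/paste call in your error message), so a very long list of singleton terms can still trigger the "regular expression is too large" PCRE error; removing the words in smaller chunks works around that. Below is a minimal sketch, not tested against your data, assuming the tf1 corpus and tdmtf1 term-document matrix from your question; the chunk size of 500 is an arbitrary choice.

library(tm)

ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)

# Remove the terms that occur only once, a chunk at a time, so the regular
# expression that removeWords() builds internally stays within the PCRE limit.
chunkSize <- 500
for (i in seq(1, length(ones), by = chunkSize)) {
  chunk <- ones[i:min(i + chunkSize - 1, length(ones))]
  tf1 <- tm_map(tf1, removeWords, chunk)
}

tf1 <- tm_map(tf1, stripWhitespace)

# Rebuild the term-document matrix; no term with frequency 1 should remain,
# so this call should return character(0).
tdmtf1Check <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))
findFreqTerms(tdmtf1Check, lowfreq = 1, highfreq = 1)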

If you want to know more about Data Science, check out Intellipaat's Data Science course, which will help you learn Data Science from scratch.
