I am stuck on a text mining assignment. I am following Zhao's Twitter text mining example. I have downloaded the tweets, converted them to a data frame, and built a corpus. Now I want to remove every term that occurs only once in the corpus, instead of using a stopword list.

Here is the code:

tf1 <- Corpus(VectorSource(tweets.df$text))

tf1 <- tm_map(tf1, content_transformer(tolower))

removeUser <- function(x) gsub("@[[:alnum:]]*", "", x)

tf1 <- tm_map(tf1, content_transformer(removeUser))

removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)

tf1 <- tm_map(tf1, content_transformer(removeNumPunct))

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

tf1 <- tm_map(tf1, content_transformer(removeURL))

tf1 <- tm_map(tf1, stripWhitespace)

# Using a TermDocumentMatrix to find terms with a count of 1; I don't know any other way

tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))

ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)

tf1Copy <- tf1

tf1List <- setdiff(tf1Copy, ones)

tf1CList <- paste(unlist(tf1List),sep="", collapse=" ")

tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)

tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))

#Just to test success...

ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)



Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(senior data scientist global strategy firm

25.0010230541229 48 17 6 6 115 1 186 0 1 en kdnuggets poll primary programming language for analytics data mining data scienc

25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet of everything official statistics in the information age anu june

25.0020229816437 48 17 6 6 115 1 186 0 3 en handling and processing strings in r an ebook in pdf format pages

25.0020229816437 48 17 6 6 115 1 186 0 4 en webinar getting your data into r by hadley wickham am edt june th

25.0020229816437 48 17 6 6 115 1 186 0 5 en before loading the rdmtweets dataset please run librarytwitter to load required package

25.0020229816437 48 17 6 6 115 1 186 0 6 en an infographic on sas vs r vs python datascience via

25.0020229816437 48 17 6 6 115 1 186 0 7 en r is again the kdnuggets poll on top analytics data mining science software

25.0020229816437 48 17 6 6 115 1 186 0 8 en i will run


Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''

1 Answer


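The error comes from how the word list is built, not from removeWords itself. setdiff(tf1Copy, ones) compares the corpus with the term vector, so it returns essentially the whole corpus, and paste(unlist(...), collapse = " ") then glues all of that into one enormous string. removeWords puts that string into a single regular expression, which is why the error message shows tweet text and metadata inside the pattern and why PCRE reports that the expression is too large. Pass the character vector returned by findFreqTerms straight to removeWords instead, as the two-document example in this answer does.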
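To see why a collapsed string fails, here is a rough base R sketch of the pattern that removeWords appears to build, reconstructed from the sprintf call visible in the error message. The function name remove_words_sketch and the sample documents are made up for illustration; this is not tm's actual source.

```r
# Hypothetical stand-in that mimics the pattern shown in the error message:
# every element of `words` becomes one alternative in a single regex.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b",
                     paste(sort(words, decreasing = TRUE), collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

docs <- c("this is a doc", "this is another doc")

# One word per element: a short alternation that matches as intended
remove_words_sketch(docs, c("another"))

# Everything pasted into one string: the whole string is treated as a
# single "word", so nothing useful matches and the pattern grows with
# the size of the corpus
collapsed <- paste(docs, collapse = " ")
remove_words_sketch(docs, collapsed)
```

With a real corpus the collapsed string contains every tweet, so the pattern becomes far larger than the regex engine accepts.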


library(tm)

mytweets <- c("This is a doc", "This is another doc")

corp <- Corpus(VectorSource(mytweets))


# [[1]]

# <<PlainTextDocument (metadata: 7)>>

# This is a doc

# [[2]]

# <<PlainTextDocument (metadata: 7)>>

#   This is another doc

##          ^^^^^^^ 'another' occurs only once in the corpus

dtm <- DocumentTermMatrix(corp)


# Terms

# Docs another doc this

# 1       0   1    1

# 2       1   1    1

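# Terms whose total count is exactly 1 (lowfreq = 1, highfreq = 1)

(stopwords <- findFreqTerms(dtm, 1, 1))

# [1] "another"

# Pass the character vector itself; do not paste() it into one string

corp <- tm_map(corp, removeWords, stopwords)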


# [[1]]

# <<PlainTextDocument (metadata: 7)>>

# This is a doc

# [[2]]

# <<PlainTextDocument (metadata: 7)>>

# This is  doc

##        ^ 'another' is gone
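Applied to a corpus like the one in the question, the same idea might look like the sketch below. The tweets vector is a made-up stand-in for tweets.df$text, and the chunk size of 500 is an arbitrary assumption: with thousands of single-occurrence terms, even a correct word vector can produce a pattern that is too large, so the words are removed in batches.

```r
library(tm)

# Hypothetical stand-in for tweets.df$text from the question
tweets <- c("r is great for data mining",
            "data mining with r and twitter",
            "a webinar on data science")

corp <- Corpus(VectorSource(tweets))
corp <- tm_map(corp, content_transformer(tolower))

# wordLengths = c(1, Inf) keeps short terms such as "r", as in the question
tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf)))

# Character vector of terms whose total count is exactly 1
ones <- findFreqTerms(tdm, lowfreq = 1, highfreq = 1)

# Remove the terms in batches so each generated regex stays small;
# 500 is an arbitrary chunk size, not a documented limit
chunks <- split(ones, ceiling(seq_along(ones) / 500))
for (chunk in chunks) {
  corp <- tm_map(corp, removeWords, chunk)
}
corp <- tm_map(corp, stripWhitespace)

# Check: no single-occurrence terms should remain
tdm2 <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf)))
remaining_ones <- findFreqTerms(tdm2, lowfreq = 1, highfreq = 1)
remaining_ones
```

Removing a term only deletes words that already had a count of 1, so the counts of the remaining terms do not change and a single pass is enough.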

