in Big Data Hadoop & Spark by (11.5k points)
I have a large body of text that I need to turn into a word cloud. I am using the Python library word_cloud, which is quite configurable. The problem is that the text is so large that even a high-end computer cannot finish the task after running for many hours.

The data was originally stored in MongoDB. Because of cursor issues while reading it into a Python list, I exported everything to a plain text file, a single 304 MB .txt file.

How can I handle text of this size? The word_cloud library expects a single string parameter containing the whole text, with words separated by ' ', in order to create the word cloud.

1 Answer

by (24.7k points)

A word cloud, also known as a tag cloud, is a visual representation of text data, typically used to depict keyword metadata on websites or to visualize free-form text. Tags are usually single words, and the importance of each tag is shown by its font size and color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its importance.

Since your data is stored in MongoDB and you are using Python, I assume you have already installed the Python driver and connected to MongoDB.

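Almost everything is already in place. To handle a text this large, you do not need to load the whole file into memory: process the text one line at a time, add each line's word counts to a running total, and generate the cloud from the totals: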

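from collections import Counter

from wordcloud import WordCloud

wordc = WordCloud()
counts_all = Counter()

# Read the file line by line so only one line is held in memory at a time.
with open('path/to/file.txt', 'r') as f:
    for line in f:  # Here you can also use the Cursor
        # process_text returns a dict mapping each word to its count in this line
        counts_line = wordc.process_text(line)
        counts_all.update(counts_line)

# Build the cloud from the accumulated counts instead of one giant string.
wordc.generate_from_frequencies(counts_all)
wordc.to_file('/tmp/wc.png')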
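
If you would rather skip the text export and count straight from the MongoDB cursor, the same accumulate-as-you-go idea can be sketched roughly as follows. This is only an illustration under some assumptions: the field name "text", the helper names iter_texts and count_words, and the sample documents are all made up, and a plain regular expression stands in for process_text, so there is no stopword removal or plural handling here. Only the top_n most frequent words are kept, on the reasoning that WordCloud draws at most max_words words (200 by default) anyway.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator

# Simplified tokenizer (an assumption): runs of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def iter_texts(docs: Iterable[dict], field: str = "text") -> Iterator[str]:
    """Yield the text field of each document, skipping documents without one.

    `docs` can be any iterable of dicts, for example a pymongo cursor.
    The field name "text" is hypothetical; use your own schema's field.
    """
    for doc in docs:
        value = doc.get(field)
        if isinstance(value, str):
            yield value


def count_words(texts: Iterable[str], top_n: int = 200) -> Dict[str, int]:
    """Count lowercased words across all texts, one text at a time."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    # Keep only the most frequent words; the rest would not be drawn anyway.
    return dict(counts.most_common(top_n))


# In-memory stand-in for a cursor, used here only for demonstration.
docs = [
    {"text": "big data big cloud"},
    {"text": "Data cloud data"},
    {"other": 1},  # no text field, so it is skipped
]
frequencies = count_words(iter_texts(docs), top_n=2)
print(frequencies)
```

The resulting dict can be passed to wordc.generate_from_frequencies(frequencies) exactly like counts_all above. With a real database you would pass the cursor itself as docs, for instance the result of collection.find(...), so that documents are fetched in batches rather than collected into a list first.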

...