
I would like to create a corpus/vocabulary from all the (tokenised) texts in one column of my data frame:

User  Text
312   Include details about your goal
41    Describe expected and actual results
421   Include any error messages

What I would like to do is first remove the stopwords and then append all the tokenised words to a single list, i.e.:

my_list=['Include', 'details', 'goal', 'Describe', 'expected', 'actual', 'results', 'Include', 'error', 'messages']

I tried as follows:

df['Text'].apply(lambda x: [item for item in x if item not in stop_words])

but it gives me characters, not words.

1 Answer


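You do not need apply here. In your attempt, each x is a whole string, so the list comprehension iterates over its characters rather than its words. Split each text into words first, concatenate the resulting lists, and then filter out the stop words:

words = df['Text'].str.split().sum()

my_list = [w for w in words if w not in stop_words]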
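The same idea can be sketched without the list concatenation, which may get slow on a large column. This is a minimal sketch, not the only way to do it: the helper name build_word_list and the tiny stop_words set are assumptions made for illustration (in practice you would use your own stop word list), and the lowercase comparison is an added choice so that capitalised stop words are also dropped.

```python
def build_word_list(texts, stop_words):
    """Split each text on whitespace and keep the tokens that are not stop words."""
    words = []
    for text in texts:
        for token in text.split():
            # Compare in lowercase so that e.g. "Any" is treated like "any".
            if token.lower() not in stop_words:
                words.append(token)
    return words


# Hypothetical stop word set, just large enough for this example.
stop_words = {"about", "your", "and", "any"}

# Any iterable of strings works here, including a pandas Series such as df['Text'].
texts = [
    "Include details about your goal",
    "Describe expected and actual results",
    "Include any error messages",
]

my_list = build_word_list(texts, stop_words)
print(my_list)
```

With the data frame from the question, the call would be build_word_list(df['Text'], stop_words), assuming the column contains only strings (no missing values).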
