
I would like to build a corpus/vocabulary from all the (tokenised) texts in one column of my data frame:

User  Text
312   Include details about your goal
41    Describe expected and actual results
421   Include any error messages

I would like to remove the stopwords first and then append all the tokenised words to a single list, i.e.:

my_list=['Include', 'details', 'goal', 'Describe', 'expected', 'actual', 'results', 'Include', 'error', 'messages']

I tried as follows:

df['Text'].apply(lambda x: [item for item in x if item not in stop_words])

but it gives me characters, not words.

1 Answer


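You do not need apply here. Your lambda iterates over each string, which yields characters. Instead, split each text into words, concatenate the resulting lists with sum(), and then filter out the stopwords: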

l = df.Text.str.split(' ').sum()

yourlist = [x for x in l if x not in stop_words]
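For a self-contained check, here is a minimal sketch of the same idea on the sample data from the question. The stop_words set below is a small hand-written placeholder (an assumption, since the question does not show where stop_words comes from), and the comparison is lowercased on the assumption that the stopword list is lowercase. It uses explode() instead of sum(), because summing lists re-copies the accumulated list at every row and can get slow on large frames.

```python
import pandas as pd

# Sample frame mirroring the question.
df = pd.DataFrame({
    "User": [312, 41, 421],
    "Text": [
        "Include details about your goal",
        "Describe expected and actual results",
        "Include any error messages",
    ],
})

# Assumption: a tiny hand-written stopword set, for illustration only.
stop_words = {"about", "your", "and", "any"}

# Iterating over a string yields single characters, which is why the
# lambda in the question returned letters instead of words.
chars = [item for item in "your goal" if item not in stop_words]

# str.split() with no argument splits on any run of whitespace;
# explode() then produces one row per token.
tokens = df["Text"].str.split().explode()

# Keep every token whose lowercase form is not a stopword.
my_list = [word for word in tokens if word.lower() not in stop_words]
print(my_list)
```

If the column can contain missing values, drop them first (for example with dropna()) before filtering, since a missing value has no lower() method.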
