How to get “aggregate” word count from pandas Series elements

Question

asked Jul 10, 2019 in Data Science by sourav (17.6k points)

I'm currently using Python 3.7 in a Jupyter Notebook (v5.6.0) with pandas 0.23.4.

I've written code to tokenize some Japanese words and have successfully applied a word count function that returns the word counts from each row in a pandas Series like so:

0 [(かげ, 20), (モリア, 17), (たち, 15), (お前, 14), (おれ,...
1 [(お前, 11), (ゾロ, 10), (うっ, 10), (たち, 9), (サンジ, ...
2 [(おれ, 11), (男, 6), (てめえ, 6), (お前, 5), (首, 5), ...
3 [(おれ, 19), (たち, 14), (ヨホホホ, 12), (お前, 10), (みん...
4 [(ラブーン, 32), (たち, 14), (おれ, 12), (お前, 12), (船長...
5 [(ヨホホホ, 19), (おれ, 13), (ラブーン, 12), (船長, 11), (...
6 [(わたし, 20), (おれ, 16), (海賊, 9), (お前, 9), (もう, 9...
7 [(たち, 21), (あたし, 15), (宝石, 14), (おれ, 12), (ハッ,...
8 [(おれ, 13), (あれ, 9), (もう, 7), (ヨホホホ, 7), (見え, 7...
9 [(ケイミー, 23), (人魚, 20), (はっち, 14), (おれ, 13), (め...
10 [(ケイミー, 18), (おれ, 17), (め, 14), (たち, 12), (はっち...

From this previously asked question:

Creating a dictionary of word count of multiple text files in a directory

I thought I could use the answer to help with my objective.

I want to consolidate all the above pairs in each row into a single dictionary where the key is the Japanese text and the value is the sum of all the counts for that text across the data set. I thought I could accomplish this with the collections.Counter class by first turning each row in the Series into a dictionary, like this:

vocab_list = []
for i in range(len(wordcount)):
    vocab_list.append(dict(wordcount[i]))

Which gives me the dictionary format that I want, where each row in the Series is now a dictionary, like so:

[{'かげ': 20,
'モリア': 17,
'たち': 15,
'お前': 14,
'おれ': 11,
'もう': 9,
'船長': 7,
'っ': 7,
'七武海': 7,
'言っ': 6, ...

My problem comes when I try to use the sum() function and Counter() to aggregate the totals:

vocab_list = sum(vocab_list, Counter())
print(vocab_list)

Instead of getting the expected "aggregated dictionary", I receive the following error:

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-37-3c66e97f4559> in <module>()
3 vocab_list.append(dict(wordcount[i]))
4
----> 5 vocab_list = sum(vocab_list, Counter())
6 vocab_list
TypeError: unsupported operand type(s) for +: 'Counter' and 'dict'

Could you explain what exactly is wrong in the code and how to fix it?

1 Answer

Shlok Pandey · Answer 1 · 2019-07-12T07:24:50+0000

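The error occurs because your list holds plain dict objects, while Counter only supports + with another Counter. If the elements of your Series are of type Counter, you can aggregate them simply by summing: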

df.agg(sum)
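
To see the type mismatch in isolation, here is a minimal sketch. The `rows` list is made-up sample data in the shape of the question's `vocab_list`, not the real data set, and the `failed` flag exists only to make the outcome visible.

```python
from collections import Counter

# Hypothetical sample rows, shaped like the dicts in the question's vocab_list.
rows = [{'お前': 14, 'たち': 15}, {'お前': 11, 'たち': 9}]

# sum(rows, Counter()) starts by evaluating Counter() + rows[0].
# Counter does not define + with a plain dict, so this raises TypeError.
try:
    Counter() + rows[0]
    failed = False
except TypeError:
    failed = True

# Wrapping each dict in Counter makes every addition Counter + Counter.
total = sum((Counter(d) for d in rows), Counter())

print(failed)
print(total)
```

In the question's loop, the same effect should follow from appending `Counter(dict(wordcount[i]))` instead of `dict(wordcount[i])`.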

Here is a complete illustration using a pandas Series:

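from collections import Counter
import pandas as pd

# Each row is a list of (word, count) pairs, as in the question
df = pd.Series([
    [('かげ', 20), ('男', 17), ('たち', 15), ('お前', 14)],
    [('お前', 11), ('ゾロ', 10), ('うっ', 10), ('たち', 9)],
    [('おれ', 11), ('男', 6), ('てめえ', 6), ('お前', 5), ('首', 5)],
])
# Convert each row into a Counter so that rows can be added together
df = df.apply(lambda x: Counter(dict(x)))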
df
# Out:
# 0 {'かげ': 20, '男': 17, 'たち': 15, 'お前': 14}
# 1 {'お前': 11, 'ゾロ': 10, 'うっ': 10, 'たち': 9}
# 2 {'おれ': 11, '男': 6, 'てめえ': 6, 'お前': 5, '首': 5}
# dtype: object
df.agg(sum)
# Out:
# Counter({'うっ': 10,
# 'おれ': 11,
# 'お前': 30,
# 'かげ': 20,
# 'たち': 24,
# 'てめえ': 6,
# 'ゾロ': 10,
# '男': 23,
# '首': 5})
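
If you would rather not depend on how `agg` handles the builtin `sum` (this may differ between pandas versions, so it is worth checking on your installation), one alternative is to accumulate everything into a single Counter with `update`. This is only a sketch: `wordcount` below is a plain list standing in for the Series from the question, on the assumption that iterating over the Series yields the same row values.

```python
from collections import Counter

# Hypothetical stand-in for the `wordcount` Series: one list of
# (word, count) pairs per row.
wordcount = [
    [('かげ', 20), ('たち', 15), ('お前', 14)],
    [('お前', 11), ('ゾロ', 10), ('たち', 9)],
]

vocab = Counter()
for pairs in wordcount:
    # Given a mapping, Counter.update adds the counts to existing keys
    # rather than overwriting them as dict.update would.
    vocab.update(dict(pairs))

# The two most frequent words across all rows.
print(vocab.most_common(2))
```

Updating one Counter in place also avoids building a new intermediate Counter for every row, which repeated `+` does.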
