
I'm currently using Python 3.7 in a Jupyter Notebook (v5.6.0) with pandas 0.23.4.

I've written code to tokenize some Japanese words and have successfully applied a word count function that returns the word counts from each row in a pandas Series like so:

0       [(かげ, 20), (モリア, 17), (たち, 15), (お前, 14), (おれ,...

1       [(お前, 11), (ゾロ, 10), (うっ, 10), (たち, 9), (サンジ, ...

2       [(おれ, 11), (男, 6), (てめえ, 6), (お前, 5), (首, 5), ...

3       [(おれ, 19), (たち, 14), (ヨホホホ, 12), (お前, 10), (みん...

4       [(ラブーン, 32), (たち, 14), (おれ, 12), (お前, 12), (船長...

5       [(ヨホホホ, 19), (おれ, 13), (ラブーン, 12), (船長, 11), (...

6       [(わたし, 20), (おれ, 16), (海賊, 9), (お前, 9), (もう, 9...

7       [(たち, 21), (あたし, 15), (宝石, 14), (おれ, 12), (ハッ,...

8       [(おれ, 13), (あれ, 9), (もう, 7), (ヨホホホ, 7), (見え, 7...

9       [(ケイミー, 23), (人魚, 20), (はっち, 14), (おれ, 13), (め...

10      [(ケイミー, 18), (おれ, 17), (め, 14), (たち, 12), (はっち... 

From this previously asked question:

Creating a dictionary of word count of multiple text files in a directory

I thought I could use the answer to help with my objective.

I want to consolidate all the above pairs in each row into a single dictionary where the key is the Japanese text and the value is the sum of all the instances of that text across the data set. I thought I could accomplish this with the collections.Counter class by turning each row in the Series into a dictionary, like this:

vocab_list = []

for i in range(len(wordcount)):

    vocab_list.append(dict(wordcount[i]))

This gives me the dictionary format that I want, where each row in the Series is now a dictionary, like so:

[{'かげ': 20,

 'モリア': 17,

 'たち': 15,

 'お前': 14,

 'おれ': 11,

 'もう': 9,

 '船長': 7,

 'っ': 7,

 '七武海': 7,

 '言っ': 6, ...

My problem comes when I try to use the sum() function and Counter() to aggregate the totals:

vocab_list = sum(vocab_list, Counter())

print(vocab_list)

Instead of getting the expected "aggregated dictionary", I receive the following error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-37-3c66e97f4559> in <module>()

      3     vocab_list.append(dict(wordcount[i]))

      4 

----> 5 vocab_list = sum(vocab_list, Counter())

      6 vocab_list

TypeError: unsupported operand type(s) for +: 'Counter' and 'dict'

Could you explain what exactly is wrong in the code and how to fix it?

1 Answer


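The error occurs because sum(vocab_list, Counter()) ends up evaluating Counter() + dict, and Counter only defines + between two Counter objects. If the elements of your Series are Counter objects instead of plain dicts, you can simply aggregate them with sum: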

df.agg(sum)

So, here is an illustration:

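from collections import Counter

import pandas as pd

# Each row is a list of (word, count) pairs, like the Series in the question
df = pd.Series([
    [('かげ', 20), ('男', 17), ('たち', 15), ('お前', 14)],
    [('お前', 11), ('ゾロ', 10), ('うっ', 10), ('たち', 9)],
    [('おれ', 11), ('男', 6), ('てめえ', 6), ('お前', 5), ('首', 5)],
])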

df = df.apply(lambda x: Counter({y[0]:y[1] for y in x}))

df

# Out:

# 0          {'かげ': 20, '男': 17, 'たち': 15, 'お前': 14}

# 1          {'お前': 11, 'ゾロ': 10, 'うっ': 10, 'たち': 9}

# 2    {'おれ': 11, '男': 6, 'てめえ': 6, 'お前': 5, '首': 5}

# dtype: object

df.agg(sum)

# Out:

# Counter({'うっ': 10,

#          'おれ': 11,

#          'お前': 30,

#          'かげ': 20,

#          'たち': 24,

#          'てめえ': 6,

#          'ゾロ': 10,

#          '男': 23,

#          '首': 5})
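If you would rather stay with the plain loop from the question, the same fix works without pandas. The following is only a minimal sketch: rows is a hypothetical stand-in for the wordcount Series (a list of lists of (word, count) pairs), and the other variable names are made up for illustration. It first reproduces the TypeError, then shows two ways around it: converting every row to a Counter before summing, or accumulating with Counter.update, which accepts plain dicts and adds the counts.

```python
from collections import Counter

# Hypothetical stand-in for the `wordcount` Series in the question:
# each row is a list of (word, count) pairs.
rows = [
    [("かげ", 20), ("男", 17), ("たち", 15), ("お前", 14)],
    [("お前", 11), ("ゾロ", 10), ("うっ", 10), ("たち", 9)],
    [("おれ", 11), ("男", 6), ("てめえ", 6), ("お前", 5), ("首", 5)],
]

# Reproduce the original failure: sum() starts from Counter() and then
# tries Counter() + dict, which Counter does not support.
try:
    sum([dict(row) for row in rows], Counter())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: turn every row into a Counter first, so that sum() only
# ever adds Counter + Counter.
total_by_sum = sum((Counter(dict(row)) for row in rows), Counter())

# Option 2: accumulate in place. Counter.update() adds counts (it does
# not overwrite them) and accepts a plain dict.
total_by_update = Counter()
for row in rows:
    total_by_update.update(dict(row))

print(error_message)
print(total_by_sum.most_common(3))
```

Option 2 avoids building an intermediate Counter for every addition, which may matter if the Series has many rows. Note also that Counter + Counter drops entries whose total is zero or negative; that makes no difference for word counts, which are always positive.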
