I'm currently using Python 3.7 in a Jupyter Notebook (v5.6.0) with pandas 0.23.4.

I've written code to tokenize some Japanese words and have successfully applied a word count function that returns the word counts from each row in a pandas Series like so:

0       [(かげ, 20), (モリア, 17), (たち, 15), (お前, 14), (おれ,...
1       [(お前, 11), (ゾロ, 10), (うっ, 10), (たち, 9), (サンジ, ...
2       [(おれ, 11), (男, 6), (てめえ, 6), (お前, 5), (首, 5), ...
3       [(おれ, 19), (たち, 14), (ヨホホホ, 12), (お前, 10), (みん...
4       [(ラブーン, 32), (たち, 14), (おれ, 12), (お前, 12), (船長...
5       [(ヨホホホ, 19), (おれ, 13), (ラブーン, 12), (船長, 11), (...
6       [(わたし, 20), (おれ, 16), (海賊, 9), (お前, 9), (もう, 9...
7       [(たち, 21), (あたし, 15), (宝石, 14), (おれ, 12), (ハッ,...
8       [(おれ, 13), (あれ, 9), (もう, 7), (ヨホホホ, 7), (見え, 7...
9       [(ケイミー, 23), (人魚, 20), (はっち, 14), (おれ, 13), (め...
10      [(ケイミー, 18), (おれ, 17), (め, 14), (たち, 12), (はっち...

From this previously asked question:

Creating a dictionary of word count of multiple text files in a directory

I thought I could use the answer to help with my objective.

I want to consolidate all of the above pairs into a single dictionary where each key is a Japanese word and each value is the total count of that word across the whole data set. I thought I could accomplish this with the collections.Counter class by first turning each row in the Series into a dictionary, like this:

vocab_list = []
for i in range(len(wordcount)):
    vocab_list.append(dict(wordcount[i]))

Which gives me the dictionary format that I want, where each row in the Series is now a dictionary, like so:

[{'かげ': 20,
 'モリア': 17,
 'たち': 15,
 'お前': 14,
 'おれ': 11,
 'もう': 9,
 '船長': 7,
 'っ': 7,
 '七武海': 7,
 '言っ': 6, ...

My problem comes when I try to use the sum() function and Counter() to aggregate the totals:

vocab_list = sum(vocab_list, Counter())
print(vocab_list)

Instead of getting the expected "aggregated dictionary", I receive the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-3c66e97f4559> in <module>()
      3     vocab_list.append(dict(wordcount[i]))
      4 
----> 5 vocab_list = sum(vocab_list, Counter())
      6 vocab_list

TypeError: unsupported operand type(s) for +: 'Counter' and 'dict'

Could you explain what exactly is wrong in the code and how to fix it?

1 Answer


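The error occurs because sum() starts from Counter() and then applies + to each element of vocab_list, but those elements are plain dicts, and a Counter only supports + with another Counter. If the elements of your Series are Counter objects instead, you can simply aggregate with sum: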

df.agg(sum)
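
To see the difference without pandas, here is a minimal sketch using only the standard library. The names row_as_dict, row_as_counter, combined and raised are made up for illustration and do not come from the question.

```python
from collections import Counter

# Hypothetical single row, first as a plain dict and then as a Counter.
row_as_dict = {"お前": 14, "たち": 15}
row_as_counter = Counter(row_as_dict)

# Counter + Counter is defined: counts are added key by key.
combined = Counter({"お前": 11}) + row_as_counter

# Counter + dict is not defined. This is the first addition that
# sum(vocab_list, Counter()) attempts when vocab_list holds plain dicts.
try:
    Counter() + row_as_dict
    raised = False
except TypeError:
    raised = True
```

So wrapping each row in Counter(...) rather than dict(...) before summing should be enough to make the original sum(vocab_list, Counter()) call work.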

So, here is an illustration:

from collections import Counter
import pandas as pd
df = pd.Series([[('かげ', 20), ('男', 17), ('たち', 15), ('お前', 14)],
                [('お前', 11), ('ゾロ', 10), ('うっ', 10), ('たち', 9)],
                [('おれ', 11), ('男', 6), ('てめえ', 6), ('お前', 5), ('首', 5)]])
# Turn each row's (word, count) pairs into a Counter so rows can be added.
df = df.apply(lambda x: Counter({y[0]: y[1] for y in x}))
df

# Out:
# 0          {'かげ': 20, '男': 17, 'たち': 15, 'お前': 14}
# 1          {'お前': 11, 'ゾロ': 10, 'うっ': 10, 'たち': 9}
# 2    {'おれ': 11, '男': 6, 'てめえ': 6, 'お前': 5, '首': 5}
# dtype: object

df.agg(sum)

# Out:
# Counter({'うっ': 10,
#          'おれ': 11,
#          'お前': 30,
#          'かげ': 20,
#          'たち': 24,
#          'てめえ': 6,
#          'ゾロ': 10,
#          '男': 23,
#          '首': 5})
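
If you would rather not rely on pandas for the aggregation step, the same totals can be built with a plain loop. This is only a sketch under assumptions: rows is a hypothetical stand-in for the wordcount Series from the question, and total_counts is a helper name invented here.

```python
from collections import Counter

# Hypothetical stand-in for the `wordcount` Series:
# one list of (word, count) pairs per row.
rows = [
    [("かげ", 20), ("男", 17), ("たち", 15), ("お前", 14)],
    [("お前", 11), ("ゾロ", 10), ("うっ", 10), ("たち", 9)],
    [("おれ", 11), ("男", 6), ("てめえ", 6), ("お前", 5), ("首", 5)],
]

def total_counts(rows):
    """Sum the per-row (word, count) pairs into one Counter."""
    totals = Counter()
    for pairs in rows:
        for word, count in pairs:
            # Missing keys default to 0 in a Counter, so += is safe.
            totals[word] += count
    return totals

totals = total_counts(rows)
```

Since iterating over a pandas Series yields its values, passing the Series itself in place of rows should behave the same way. Call dict(totals) at the end if you need a plain dictionary.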
