Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

Question

asked Oct 5, 2019 in Data Science by ashely (50.2k points)

I have two specific questions:

Why is grouper more efficient than count? I expected the count to be more efficient, as it is implemented in C. The superior performance of grouper persists even if the number of columns is increased from 2 to 4.

Why does value_counter underperform grouper by so much? Is this due to the cost of constructing a list, or series from the list?

I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy arrays versus a dictionary comprehension:

x, z = grouper(df), count(df)
%timeit x[x.values > 10] # 749µs
%timeit {k: v for k, v in z.items() if v > 10} # 9.37ms

However, the focus of my question is on the performance of building comparable results in a series versus dictionary. My C knowledge is limited, yet I would appreciate any answer which can point to the logic underlying these methods.

Benchmarking code

import pandas as pd
import numpy as np
from collections import Counter
np.random.seed(0)
m, n = 1000, 100000
df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})
def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()
def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)
def count(df):
    return Counter(zip(df.A.values, df.B.values))
x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)
assert (x == y) & (y == z), "Dictionary mismatch!"
for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:
    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})
    print(m, n)
    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)

Benchmarking results

Run on python 3.6.2, pandas 0.20.3, numpy 1.13.1
Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.
Key: g = grouper, v = value_counter, c = count.
m n g v c
100 10000 2.91 18.30 8.41
1000 10000 4.10 27.20 6.98[1]
100 100000 17.90 130.00 84.50
1000 100000 43.90 309.00 93.50
1 This is not a typo.

1 Answer

vinita · Answer 1 · 2019-10-05T07:02:26+0000

As an example special to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.

In [2]: a = np.random.randint(10**4, size=10**6)
   ...: b = np.random.randint(10**4, size=10**6)
   ...: a_list = a.tolist()
   ...: b_list = b.tolist()
In [3]: %timeit Counter(zip(a, b))
455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit Counter(zip(a_list, b_list))
334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference between these two timings gives you a reasonable estimate of the overhead.

This isn't quite the conclusion of the story though. Constructing a groupby object in pandas includes some overhead too, at least as related to this problem since there's some groupby metadata that isn't strictly required just to get the size, whereas Counter does the one singular thing you care about. Usually, this overhead is far less than the overhead associated with Counter, but from some instantaneous experimentation, I've found that you can actually get marginally better execution from Counter when the bulk of your groups just contains single elements.

Consider the following timings (sort=False) that go along the spectrum of few large groups <--> many small groups:

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()
def count(df):
    return Counter(zip(df.A.values, df.B.values))
for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:
    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})
    print(m, n)
    %timeit grouper(df)
    %timeit count(df)

Which delivers me the following table:

m grouper counter
10 62.9 ms 315 ms
10**3 191 ms 535 ms
10**7 514 ms 459 ms

Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.

Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Related questions

Browse Categories

Popular Courses

Top Tutorials

Top Articles

Top Interview Questions