Explore Courses Blog Tutorials Interview Questions
0 votes
in Data Science by (50.2k points)

I have two specific questions:

Why is grouper more efficient than count? I expected the count to be more efficient, as it is implemented in C. The superior performance of grouper persists even if the number of columns is increased from 2 to 4.

Why does value_counter underperform grouper by so much? Is this due to the cost of constructing a list, or series from the list?

I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy arrays versus a dictionary comprehension:

x, z = grouper(df), count(df)

%timeit x[x.values > 10]                        # 749µs

%timeit {k: v for k, v in z.items() if v > 10}  # 9.37ms

However, the focus of my question is on the performance of building comparable results in a series versus dictionary. My C knowledge is limited, yet I would appreciate any answer which can point to the logic underlying these methods.

Benchmarking code

import pandas as pd

import numpy as np

from collections import Counter


m, n = 1000, 100000

df = pd.DataFrame({'A': np.random.randint(0, m, n),

                   'B': np.random.randint(0, m, n)})

def grouper(df):

    return df.groupby(['A', 'B'], sort=False).size()

def value_counter(df):

    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)

def count(df):

    return Counter(zip(df.A.values, df.B.values))

x = value_counter(df).to_dict()

y = grouper(df).to_dict()

z = count(df)

assert (x == y) & (y == z), "Dictionary mismatch!"

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),

                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)

    %timeit value_counter(df)

    %timeit count(df)

Benchmarking results

Run on python 3.6.2, pandas 0.20.3, numpy 1.13.1

Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.

Key: g = grouper, v = value_counter, c = count.

m           n g       v c

100     10000   2.91 18.30    8.41

1000    10000   4.10 27.20    6.98[1]

100    100000   17.90 130.00   84.50

1000   100000   43.90 309.00   93.50

1 This is not a typo.

1 Answer

0 votes
by (108k points)

As an example special to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.

In [2]: a = np.random.randint(10**4, size=10**6)

   ...: b = np.random.randint(10**4, size=10**6)

   ...: a_list = a.tolist()

   ...: b_list = b.tolist()

In [3]: %timeit Counter(zip(a, b))

455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit Counter(zip(a_list, b_list))

334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference between these two timings gives you a reasonable estimate of the overhead.

This isn't quite the conclusion of the story though. Constructing a groupby object in pandas includes some overhead too, at least as related to this problem since there's some groupby metadata that isn't strictly required just to get the size, whereas Counter does the one singular thing you care about. Usually, this overhead is far less than the overhead associated with Counter, but from some instantaneous experimentation, I've found that you can actually get marginally better execution from Counter when the bulk of your groups just contains single elements.

Consider the following timings (sort=False) that go along the spectrum of few large groups <--> many small groups:

def grouper(df):

    return df.groupby(['A', 'B'], sort=False).size()

def count(df):

    return Counter(zip(df.A.values, df.B.values))

for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),

                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)

    %timeit count(df)

Which delivers me the following table:

m       grouper   counter

10      62.9 ms   315 ms

10**3    191 ms 535 ms

10**7    514 ms 459 ms

Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.

Browse Categories