I have two specific questions:
1. Why is grouper more efficient than count? I expected count to be more efficient, as it is implemented in C. The superior performance of grouper persists even when the number of columns is increased from 2 to 4.
2. Why does value_counter underperform grouper by so much? Is this due to the cost of constructing the list, or of constructing a Series from the list?
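For what it's worth, here is a minimal sketch of how I would probe the second question: timing each stage of value_counter separately, using the df defined in the benchmarking code below. This decomposition is my own and is not part of the benchmarks.

tuples = list(zip(df.A, df.B))         # build the list of (A, B) tuples
ser = pd.Series(tuples)                # wrap them in an object-dtype Series
counts = ser.value_counts(sort=False)  # the actual counting

%timeit list(zip(df.A, df.B))
%timeit pd.Series(tuples)
%timeit ser.value_counts(sort=False)

If the first two lines dominate, the overhead lies in materializing Python tuples and boxing them into a Series, rather than in the counting itself.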
I understand the outputs are different, and that this should also inform the choice. For example, filtering by count is more efficient with contiguous NumPy arrays (the grouper result) than with a dictionary comprehension (the count result):
x, z = grouper(df), count(df)

%timeit x[x.values > 10]                        # 749 µs (boolean mask on the Series)
%timeit {k: v for k, v in z.items() if v > 10}  # 9.37 ms (dict comprehension)
However, the focus of my question is on the performance of building comparable results, in a Series versus a dictionary. My knowledge of C is limited, but I would appreciate any answer that can point to the logic underlying these methods.
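For an apples-to-apples output, the Counter result can also be wrapped in a Series; pd.Series accepts a dict (and Counter is a dict subclass), so this one-liner, again my own sketch rather than part of the benchmarks, measures Counter plus the conversion:

%timeit pd.Series(count(df))   # Counter, then dict -> Series conversion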
Benchmarking code
import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(0)

m, n = 1000, 100000
df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)

def count(df):
    return Counter(zip(df.A.values, df.B.values))

# sanity check: all three methods produce identical counts
x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)

assert (x == y) & (y == z), "Dictionary mismatch!"

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)
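The %timeit lines require IPython/Jupyter. For completeness, a rough plain-Python equivalent of the timing loop using the standard timeit module would look like this (a sketch only; the number/repeat settings are my own choice, not what %timeit uses):

import timeit

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:
    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})
    for func in (grouper, value_counter, count):
        # best of 3 repeats, 10 calls each, reported per call in ms
        t = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
        print(m, n, func.__name__, f'{t * 1000:.2f} ms')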
Benchmarking results
Run on Python 3.6.2, pandas 0.20.3, NumPy 1.13.1.
Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4 GB RAM.
Key: g = grouper, v = value_counter, c = count. All times are in milliseconds.
   m       n       g        v        c
 100   10000    2.91    18.30     8.41
1000   10000    4.10    27.20     6.98 [1]
 100  100000   17.90   130.00    84.50
1000  100000   43.90   309.00    93.50
[1] This is not a typo: count really is faster here with m = 1000 than with m = 100.