I often use the Pandas mask and where methods for cleaner logic when conditionally updating values in a series. However, in relatively performance-critical code I notice a significant slowdown relative to numpy.where.
While I'm happy to accept this for specific cases, I'm interested to know:
- Do the Pandas mask / where methods offer any additional functionality, apart from the inplace / errors / try_cast parameters? I understand those 3 parameters but rarely use them. For example, I have no idea what the level parameter refers to.
- Is there any non-trivial counter-example where mask / where outperforms numpy.where? If such an example exists, it could influence how I choose appropriate methods going forward.
For reference, here's some benchmarking on Pandas 0.19.2 / Python 3.6.0:
import numpy as np
import pandas as pd

np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
assert (df[0].mask(df[0] > 0.5, 1).values == np.where(df[0] > 0.5, 1, df[0])).all()
%timeit df[0].mask(df[0] > 0.5, 1) # 145 ms per loop
%timeit np.where(df[0] > 0.5, 1, df[0]) # 113 ms per loop
The performance appears to diverge further for non-scalar values:
%timeit df[0].mask(df[0] > 0.5, df[0]*2) # 338 ms per loop
%timeit np.where(df[0] > 0.5, df[0]*2, df[0]) # 153 ms per loop
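For anyone wanting to reproduce the comparison outside IPython, a minimal sketch using the standard timeit module might look like the following. The helper names (make_series, mask_version, numpy_version) and the smaller array size are my own assumptions for illustration, and timings will vary by machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_series(n, seed=0):
    """Build a float Series of n uniform random values (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    return pd.Series(rng.random_sample(n))


def mask_version(s):
    # Pandas: replace values where the condition is True; returns a Series.
    return s.mask(s > 0.5, s * 2)


def numpy_version(s):
    # NumPy: same selection, but returns a plain ndarray (the index is dropped).
    return np.where(s > 0.5, s * 2, s)


s = make_series(100000)  # assumption: far smaller than the 10,000,000 rows above

# Both approaches should produce identical values.
assert np.array_equal(mask_version(s).values, numpy_version(s))

for name, func in [("Series.mask", mask_version), ("np.where", numpy_version)]:
    # Take the best of 3 repeats, each averaging over 10 calls.
    best = min(timeit.repeat(lambda: func(s), number=10, repeat=3)) / 10
    print(f"{name}: {best * 1e3:.3f} ms per call")
```

Note that the two results are not the same type: mask returns a Series that keeps the original index, while numpy.where returns a bare array.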