
I often use Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to numpy.where.

While I'm happy to accept this for specific cases, I'm interested to know:

  1. Do Pandas mask / where methods offer any additional functionality, apart from the inplace / errors / try_cast parameters? I understand those 3 parameters but rarely use them. For example, I have no idea what the level parameter refers to.
  2. Is there any non-trivial counter-example where mask / where outperforms numpy.where? If such an example exists, it could influence how I choose appropriate methods going forward.

For reference, here's some benchmarking on Pandas 0.19.2 / Python 3.6.0:

import numpy as np
import pandas as pd

np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))

# Sanity check: both approaches produce identical results
assert (df[0].mask(df[0] > 0.5, 1).values == np.where(df[0] > 0.5, 1, df[0])).all()

%timeit df[0].mask(df[0] > 0.5, 1)       # 145 ms per loop
%timeit np.where(df[0] > 0.5, 1, df[0])  # 113 ms per loop

The performance appears to diverge further for non-scalar values:

%timeit df[0].mask(df[0] > 0.5, df[0]*2)       # 338 ms per loop
%timeit np.where(df[0] > 0.5, df[0]*2, df[0])  # 153 ms per loop

1 Answer


Pandas has the potential to be at least slightly faster than NumPy in some cases. However, pandas' somewhat opaque handling of data copying makes it hard to predict when that potential is overshadowed by unnecessary copies. When where/mask is the actual bottleneck, I would use numba or cython to improve performance.

The idea is to start from the

np.where(df[0] > 0.5, df[0]*2, df[0])

version and eliminate the need to materialize the temporary array df[0]*2.
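As a concrete illustration of that idea, here is a minimal pure-NumPy sketch (the function name double_where_gt_half is mine, not from the original post; a numba- or cython-compiled loop could fuse the operations further):

```python
import numpy as np

def double_where_gt_half(arr):
    """Equivalent to np.where(arr > 0.5, arr * 2, arr), but without
    materializing a full-size arr * 2 temporary array."""
    out = arr.copy()    # one copy of the input instead of two temporaries
    mask = arr > 0.5
    out[mask] *= 2      # double only the selected elements, in place
    return out
```

For example, double_where_gt_half(df[0].values) matches the np.where expression above while only writing to the masked elements.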
