In the following, male_trips is a big pandas DataFrame and stations is a small one. For each station id, I'd like to know how many male trips started there. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == station_id ) for station_id in stations['id'] ]
How should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips = pandas.read_pickle('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
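As a quick sanity check (not part of the timing), something like this should confirm that the two approaches return identical counts once value_counts is sorted by station id:

vc = male_trips.start_station_id.value_counts().sort_index()    # counts, reordered by station id
gb = male_trips.groupby('start_station_id').size().sort_index() # counts, keyed by station id
assert (vc == gb).all()                                         # same result either way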
Note that, at these speeds, when exploring data interactively, value_counts is marginally quicker to type and easier to remember!
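And for completeness, to get back to the original question (one count per row of stations, in the same order as stations['id'], with 0 for station ids that never appear in male_trips), something along these lines should work with the column names above:

counts = male_trips.groupby('start_station_id').size()
# align with stations['id']: same ordering as the original list comprehension,
# filling stations with no male trips with 0
mc = counts.reindex(stations['id'], fill_value=0).tolist()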