Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (50.2k points)

In the following, male_trips is a big pandas data frame and stations are a small pandas data frame. For each station id, I'd like to know how many male trips took place. The following does the job, but takes a long time:

mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]

how should I go about this instead?

Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:

from timeit import Timer

setup = "import pandas; male_trips=pandas.load('maletrips')"

a  = "male_trips.start_station_id.value_counts()"

b = "male_trips.groupby('start_station_id').size()"

Timer(a,setup).timeit(100)

Timer(b,setup).timeit(100)

and here is the result:

In [4]: Timer(a,setup).timeit(100) # <- this is value_counts

Out[4]: 9.709594964981079

In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size

Out[5]: 1.5574288368225098

Note that, at this speed, for exploring data typing value_counts is marginally quicker and less remembering!

1 Answer

0 votes
by (108k points)

This is what the pandas.Series.value_counts the method is for:

count_series = male_trips.start_station_id.value_counts()

It should be straight-forward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:

count_series = (

                male_trips[male_trips.start_station_id.isin(stations.id.values)]

                    .start_station_id

                    .value_counts()

               )

and this will only give counts for station IDs actually found in stations.id.

Browse Categories

...