
I need to count viewers by hour and show the result as a histogram. I have some experience with Matplotlib, but I can't work out the most pragmatic way to group the times by hour.

First I read the data from a JSON file into a pandas DataFrame, then pull out the two relevant columns, like this:

import pandas as pd

data = pd.read_json('data/data.json')
session_duration = pd.to_datetime(data.session_duration, unit='s').dt.time
time = pd.to_datetime(data.time, format='%H:%M:%S').dt.time
viewers = []
for x, y in zip(time, session_duration):
    viewers.append({str(x):str(y)})

EDIT: The source file looks like this, leaving out the irrelevant parts.

{
    "time": "00:00:09",
    "session_duration": 91
},
{
    "time": "00:00:16",
    "session_duration": 29
},
{
    "time": "00:00:33",
    "session_duration": 102
},
{
    "time": "00:00:35",
    "session_duration": 203
}

Note that the session_duration is in seconds.

I have to distinguish two types of viewers:

  • Those who spent less than 1 minute on the stream
  • Those who spent 1 minute or more on the stream
For that I do:
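from datetime import datetime

viewers_more_than_1min = []
viewers_less_than_1min = []
one_minute = datetime.strptime('00:01:00', '%H:%M:%S').time()
for element in viewers:
    for time, session_duration in element.items():
        # Compare the parsed duration against the one-minute threshold
        if datetime.strptime(session_duration, '%H:%M:%S').time() >= one_minute:
            viewers_more_than_1min.append(element)
        else:
            viewers_less_than_1min.append(element)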
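Since session_duration is already a number of seconds in the source file, the same split can probably be done on the integers directly, without converting to time strings and parsing them back. A minimal sketch, assuming records shaped like the source file excerpt; split_by_duration and its threshold_s parameter are hypothetical names, not part of the original post.

```python
# Sample records copied from the source file excerpt in the question.
records = [
    {"time": "00:00:09", "session_duration": 91},
    {"time": "00:00:16", "session_duration": 29},
    {"time": "00:00:33", "session_duration": 102},
    {"time": "00:00:35", "session_duration": 203},
]


def split_by_duration(records, threshold_s=60):
    """Return (shorter, longer): sessions below threshold_s seconds, then the rest.

    Hypothetical helper; a session of exactly threshold_s seconds lands in
    `longer`, mirroring the `>=` comparison used in the question's loop.
    """
    shorter, longer = [], []
    for rec in records:
        if rec["session_duration"] >= threshold_s:
            longer.append(rec)
        else:
            shorter.append(rec)
    return shorter, longer


viewers_less_than_1min, viewers_more_than_1min = split_by_duration(records)
```

Here each record keeps its original keys, so the time of day is still available for the per-hour grouping later.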
As a result I have my values in a list of dictionaries of the form {time: session_duration}, where the key is the time when the session ended and the value is the time spent watching.
[{'00:00:09': '00:01:31'},
 {'00:00:16': '00:00:29'},
 {'00:00:33': '00:01:42'},
 {'00:00:35': '00:03:23'},
 {'00:00:36': '00:00:32'},
 {'00:00:37': '00:04:47'},
 {'00:00:47': '00:00:42'},
 {'00:00:53': '00:00:44'},
 {'00:00:56': '00:00:28'},
 {'00:00:58': '00:01:17'},
 {'00:01:04': '00:01:16'},
 {'00:01:09': '00:00:46'},
 {'00:01:29': '00:01:07'},
 {'00:01:31': '00:01:02'},
 {'00:01:32': '00:01:01'},
 {'00:01:32': '00:00:36'},
 {'00:01:37': '00:03:03'},
 {'00:01:49': '00:00:57'},
 {'00:02:01': '00:02:15'},
 {'00:02:18': '00:01:16'}]
As a final step I wish to create a histogram with Matplotlib showing the viewer count per hour for each of the two viewer types mentioned above. I assume it would go something like this:
import matplotlib.pyplot as plt
import datetime as dt
hours = [(dt.time(i).strftime('%H:00')) for i in range(24)]
plt.xlabel('Hour')
plt.ylabel('Viewer count')
plt.bar(hours, sorted_viewcount_byhour)
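The sorted_viewcount_byhour list passed to plt.bar is not defined anywhere in the question. One way it might be built is to take the hour from each time key and count occurrences, filling hours without viewers with zero so the list lines up with the 24 labels. A sketch under that assumption; count_by_hour is a hypothetical helper name, and the 13:05:00 entry is made-up sample data added only so that a second hour appears.

```python
from collections import Counter


def count_by_hour(viewers):
    """Count viewers per hour of day.

    `viewers` is a list of {time_str: duration_str} dicts, as in the question.
    Returns a list of 24 integers, index 0 for 00:00-00:59 and so on.
    """
    counts = Counter()
    for element in viewers:
        for time_str in element:
            # 'HH:MM:SS' -> integer hour
            counts[int(time_str.split(":")[0])] += 1
    # One entry per hour, zero where nobody was watching.
    return [counts.get(hour, 0) for hour in range(24)]


sample = [
    {"00:00:09": "00:01:31"},
    {"00:00:16": "00:00:29"},
    {"13:05:00": "00:02:00"},  # illustrative entry, not from the question
]
sorted_viewcount_byhour = count_by_hour(sample)
```

Calling count_by_hour once on viewers_less_than_1min and once on viewers_more_than_1min would give one 24-element list per viewer type, each of which can be passed to plt.bar alongside hours.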

1 Answer


Refer to this code:

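import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('data/data.json')
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
# timedelta is a more appropriate data type for session_duration
df['session_duration'] = pd.to_timedelta(df['session_duration'], unit='s')
# Sessions shorter than one minute (the question's code puts exactly 60 s in the longer group)
df_short_duration = df[df['session_duration'].dt.total_seconds() < 60]
# Count the short sessions in each hour of the day; size() returns one count per hour
df_hist = df_short_duration.groupby(df_short_duration['time'].dt.hour).size()
# Now just plot df_hist as a bar chart using matplotlib
plt.bar(df_hist.index, df_hist.values)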
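The code above covers only the short sessions, while the question asks for both viewer types per hour. A possible extension is sketched below: it labels each row as short or long, groups by hour and label, and reindexes to all 24 hours so empty hours show as zero. The inline DataFrame stands in for data/data.json; its first four rows come from the question and the 14:10:00 row is invented for illustration. The hour and kind column names are assumptions.

```python
import pandas as pd

# Stand-in for pd.read_json('data/data.json'); the last row is made up.
df = pd.DataFrame({
    "time": ["00:00:09", "00:00:16", "00:00:33", "00:00:35", "14:10:00"],
    "session_duration": [91, 29, 102, 203, 45],  # seconds
})

# Hour of day (0..23) parsed from the 'HH:MM:SS' string.
df["hour"] = pd.to_datetime(df["time"], format="%H:%M:%S").dt.hour

# 'long' for sessions of one minute or more, 'short' otherwise.
df["kind"] = df["session_duration"].ge(60).map({True: "long", False: "short"})

# Rows: hour of day; columns: 'short' / 'long'; values: viewer counts.
counts = (
    df.groupby(["hour", "kind"])
      .size()
      .unstack(fill_value=0)
      .reindex(index=range(24), columns=["short", "long"], fill_value=0)
)

short_counts = counts["short"].tolist()
long_counts = counts["long"].tolist()
```

With Matplotlib installed, counts.plot.bar() should draw the two viewer types as grouped bars per hour; the two lists can equally be passed to plt.bar by hand.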
