
I need to group viewers by hour for a histogram. I have some experience with Matplotlib, but I can't work out the most practical way to group the times by hour.

First I read the data from a JSON file, then extract the two relevant fields with pandas, like this:

import pandas as pd

data = pd.read_json('data/data.json')
session_duration = pd.to_datetime(data.session_duration, unit='s').dt.time
time = pd.to_datetime(data.time, format='%H:%M:%S').dt.time

viewers = []
for x, y in zip(time, session_duration):
    viewers.append({str(x): str(y)})

EDIT: The source file looks like this, leaving out the irrelevant parts.

{
    "time": "00:00:09",
    "session_duration": 91
},
{
    "time": "00:00:16",
    "session_duration": 29
},
{
    "time": "00:00:33",
    "session_duration": 102
},
{
    "time": "00:00:35",
    "session_duration": 203
}

Note that the session_duration is in seconds.

I have to distinguish two types of viewers:

  • Those who spent less than 1 minute on the stream
  • Those who spent 1 minute or more on the stream
For that I do:
from datetime import datetime

viewers_more_than_1min = []
viewers_less_than_1min = []
one_minute = datetime.strptime('00:01:00', '%H:%M:%S').time()

for element in viewers:
    for end_time, session_duration in element.items():
        if datetime.strptime(session_duration, '%H:%M:%S').time() >= one_minute:
            viewers_more_than_1min.append(element)
        else:
            viewers_less_than_1min.append(element)
As a result I have a list of single-entry dictionaries of the form {time: session_duration}, where the key is the time when the viewer's session ended and the value is the time spent watching:
[{'00:00:09': '00:01:31'},
 {'00:00:16': '00:00:29'},
 {'00:00:33': '00:01:42'},
 {'00:00:35': '00:03:23'},
 {'00:00:36': '00:00:32'},
 {'00:00:37': '00:04:47'},
 {'00:00:47': '00:00:42'},
 {'00:00:53': '00:00:44'},
 {'00:00:56': '00:00:28'},
 {'00:00:58': '00:01:17'},
 {'00:01:04': '00:01:16'},
 {'00:01:09': '00:00:46'},
 {'00:01:29': '00:01:07'},
 {'00:01:31': '00:01:02'},
 {'00:01:32': '00:01:01'},
 {'00:01:32': '00:00:36'},
 {'00:01:37': '00:03:03'},
 {'00:01:49': '00:00:57'},
 {'00:02:01': '00:02:15'},
 {'00:02:18': '00:01:16'}]
As a final step I want to create a histogram with Matplotlib showing the viewer count per hour for each of the two viewer types mentioned above. I assume it would go something like this:
import matplotlib.pyplot as plt
import datetime as dt
hours = [(dt.time(i).strftime('%H:00')) for i in range(24)]
plt.xlabel('Hour')
plt.ylabel('Viewer count')
plt.bar(hours, sorted_viewcount_byhour)
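
The name sorted_viewcount_byhour is not defined in the snippet above. A minimal sketch of how such a per-hour list could be built from the list of single-entry dictionaries, assuming every key is an 'HH:MM:SS' string; the helper name count_by_hour and the 13:05:00 entry in the sample are made up for illustration:

```python
from collections import Counter


def count_by_hour(viewers):
    """Return a 24-element list with the number of sessions per hour of day.

    `viewers` is a list of one-entry dicts {end_time: duration},
    where both values are 'HH:MM:SS' strings.
    """
    # The first two characters of 'HH:MM:SS' are the hour.
    counts = Counter(
        int(end_time[:2]) for entry in viewers for end_time in entry
    )
    # Fill in zero for hours that have no sessions, so the list lines up
    # with the 24 labels in `hours`.
    return [counts.get(hour, 0) for hour in range(24)]


# Two entries from the list above plus one hypothetical afternoon entry.
sample = [
    {'00:00:09': '00:01:31'},
    {'00:00:16': '00:00:29'},
    {'13:05:00': '00:02:00'},
]
sorted_viewcount_byhour = count_by_hour(sample)
```

Calling the same helper once on viewers_less_than_1min and once on viewers_more_than_1min would give one list per viewer type.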

1 Answer


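The following code builds the hourly counts for the short sessions:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('data/data.json')
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')

# timedelta is a more appropriate data type for session_duration
df['session_duration'] = pd.to_timedelta(df['session_duration'], unit='s')

# sessions shorter than one minute (the question's code counts exactly 60 s as long)
df_short_duration = df[df['session_duration'].dt.total_seconds() < 60]

# count the sessions per hour of day; size() returns a Series indexed by hour
df_hist = df_short_duration.groupby(df_short_duration['time'].dt.hour).size()

# now plot df_hist as a bar chart using matplotlib
plt.bar(df_hist.index, df_hist.values)
plt.show()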
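
The code above covers only the short sessions, while the question asks for both viewer types. A minimal sketch of counting both groups per hour, using the four records shown in the question in place of the JSON file; the column names short and long are chosen here for illustration, and exactly 60 seconds is treated as long, as in the question's own code:

```python
import pandas as pd

# The four records from the question, standing in for data/data.json.
records = [
    {"time": "00:00:09", "session_duration": 91},
    {"time": "00:00:16", "session_duration": 29},
    {"time": "00:00:33", "session_duration": 102},
    {"time": "00:00:35", "session_duration": 203},
]
df = pd.DataFrame(records)

# Hour of day at which each session ended.
hour = pd.to_datetime(df["time"], format="%H:%M:%S").dt.hour

# session_duration is already in seconds, so it can be compared directly.
is_long = df["session_duration"] >= 60

# One column per viewer type, one row per hour that occurs in the data.
counts = pd.DataFrame({
    "short": hour[~is_long].value_counts(),
    "long": hour[is_long].value_counts(),
})

# Make sure all 24 hours are present, with 0 where there were no sessions.
counts = counts.reindex(range(24)).fillna(0).astype(int)
```

With Matplotlib installed, counts.plot.bar() should then draw the two viewer types as grouped bars, one pair per hour.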
