Filtering pandas dataframe by date to count views for timeline of programs

Question

asked Jul 27, 2019 in Data Science by sourav (17.6k points)

I need to count viewers by program for a streaming channel from a json logfile. I identify the programs by their starttimes, such as:

So far I have two Dataframes like this:

The first one contains all the timestamps from the logfile

viewers_from_log = pd.read_json('sqllog.json', encoding='UTF-8')
# Convert date string to pandas datetime object:
viewers_from_log['time'] = pd.to_datetime(viewers_from_log['time'])

enter image description here

Source JSON file:

[
{
"logid": 191605,
"time": "0:00:17"
},
{
"logid": 191607,
"time": "0:00:26"
},
{
"logid": 191611,
"time": "0:01:20"
}
]

The second contains the starting times and titles of the programs

programs_start_time = pd.DataFrame.from_dict('programs.json', orient='index')

enter image description here

Source JSON file:

{
"2019-05-29": [
{
"title": "\"Amiről a kövek mesélnek\"",
"startTime_dt": "2019-05-29T00:00:40Z"
},
{
"title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
"startTime_dt": "2019-05-29T00:22:44Z"
},
{
"title": "Gubancok",
"startTime_dt": "2019-05-29T00:48:08Z"
}
]
}

So what I need to do is to count the entries / program in the log file and link them to the program titles.

1 Answer

Shlok Pandey · Answer 1 · 2019-08-01T06:20:36+0000

1.Slice log data for each date range from program data and get the shape.

2.Next add column for program data with results:

import pandas as pd
# setup test data
log_data = {'Time': ['2019-05-30 00:00:26', '2019-05-30 00:00:50', '2019-05-30 00:05:50','2019-05-30 00:23:26']}
log_data = pd.DataFrame(data=log_data)
program_data = {'Time': ['2019-05-30 00:00:00', '2019-05-30 00:22:44'],
'Program': ['Program 1', 'Program 2']}
program_data = pd.DataFrame(data=program_data)
counts = []
for index, row in program_data.iterrows():
# get counts on selected range
try:
log_range = log_data[(log_data['Time'] > program_data.loc[index].values[0]) & (log_data['Time'] < program_data.loc[index+1].values[0])]
counts.append(log_range.shape[0])
except:
log_range = log_data[log_data['Time'] > program_data.loc[index].values[0]]
counts.append(log_range.shape[0])
# add additional column with collected counts
program_data['Counts'] = counts

Output:

Time Program Counts
0 2019-05-30 00:00:00 Program 1 3
1 2019-05-30 00:22:44 Program 2 1

Filtering pandas dataframe by date to count views for timeline of programs

1 Answer

Related questions

Browse Categories

Browse By Domains

Popular Courses

Popular Tutorials

Popular Resources