Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

I need to count viewers by program for a streaming channel from a json logfile. I identify the programs by their starttimes, such as:

So far I have two Dataframes like this:

The first one contains all the timestamps from the logfile

viewers_from_log = pd.read_json('sqllog.json', encoding='UTF-8')

# Convert date string to pandas datetime object:

viewers_from_log['time'] = pd.to_datetime(viewers_from_log['time'])

enter image description here

Source JSON file:

[

    {

        "logid": 191605,

        "time": "0:00:17"

    },

    {

        "logid": 191607,

        "time": "0:00:26"

    },

    {

        "logid": 191611,

        "time": "0:01:20"

    }

]

The second contains the starting times and titles of the programs

programs_start_time = pd.DataFrame.from_dict('programs.json', orient='index')

enter image description here

Source JSON file:

{

    "2019-05-29": [

        {

            "title": "\"Amiről a kövek mesélnek\"",

            "startTime_dt": "2019-05-29T00:00:40Z"

        },

        {

            "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",

            "startTime_dt": "2019-05-29T00:22:44Z"

        },

        {

            "title": "Gubancok",

            "startTime_dt": "2019-05-29T00:48:08Z"

        }

    ]

}

So what I need to do is to count the entries / program in the log file and link them to the program titles.

1 Answer

0 votes
by (41.4k points)

1.Slice log data for each date range from program data and get the shape.

2.Next add column for program data with results:

import pandas as pd

# setup test data

log_data = {'Time': ['2019-05-30 00:00:26', '2019-05-30 00:00:50', '2019-05-30 00:05:50','2019-05-30 00:23:26']}

log_data = pd.DataFrame(data=log_data)

program_data = {'Time': ['2019-05-30 00:00:00', '2019-05-30 00:22:44'],

            'Program': ['Program 1', 'Program 2']}

program_data = pd.DataFrame(data=program_data)

counts = []

for index, row in program_data.iterrows():

    # get counts on selected range

    try:

        log_range = log_data[(log_data['Time'] > program_data.loc[index].values[0]) & (log_data['Time'] < program_data.loc[index+1].values[0])]

        counts.append(log_range.shape[0])

    except:

        log_range = log_data[log_data['Time'] > program_data.loc[index].values[0]]

        counts.append(log_range.shape[0])

 # add additional column with collected counts

 program_data['Counts'] = counts

Output:

                  Time    Program  Counts

 0  2019-05-30 00:00:00  Program 1       3

 1  2019-05-30 00:22:44  Program 2       1

Browse Categories

...