Back

Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Data Science by (17.6k points)

I have online user data with the following information,

df.head()

USER Timestamp  day_of_week Busi_days   Busi_hours

AAD 2017-07-11 09:31:44 TRUE    TRUE

AAD 2017-07-11 23:24:43 TRUE    FALSE

AAD 2017-07-12 13:24:43 TRUE    TRUE

SAP 2017-07-23 14:24:34 FALSE   FALSE

SAP 2017-07-24 16:58:49 TRUE    TRUE

YAS 2017-07-31 21:10:35 TRUE    FALSE

I wanted to compute the activity of USER column and create three new columns namely: 1. Activity: With the information based on how active the user is, meaning if the same user clicked more than twice then call it TRUE else false. 2.Multiple_days: If the user clicked the website on the more one day, if the same user clicked more than 2 days call the column TRUE else FALSE. 3. Busniess_days: whether the user clicked on the weekdays, if the user clicked the website on businesses days within business hours then call it True else FALSE

I have the following script performing the above-mentioned task, but its really slow for my huge data frame my data frame is 117Mb in size. Any better solutions would be great

My try:

df.Timestamp = pd.to_datetime(df.Timestamp)

df['date'] = [x.date() for x in df.Timestamp]

target_df = pd.DataFrame()

target_df['USER'] = df.USER.unique()

a = df.groupby(['USER', 'date']).size()

a = a[a>1]

UID=pd.DataFrame(a).reset_index().USER.values

target_df['Active'] = [True if x in UID else False for x in target_df.USER.values]

a = df.groupby('USER')['Timestamp'].nunique()

a = a[a>1]

UUID2=pd.DataFrame(a).reset_index().USER.values 

target_df['Multiple_days'] = [True if x in UUID2 else False for x in target_df.USER.values]

a = df[(df.Busi_days==True)&(df.Busi_hours==True)].USER.unique()

target_df['Busi_weekday'] = [True if x in a else False for x in target_df.USER.values]

target_df.head()

USER Active  Multiple_days   Busi_weekday

AAD TRUE    TRUE    TRUE

SAP FALSE   TRUE    FALSE

YAS FALSE   FALSE   FALSE

1 Answer

0 votes
by (41.4k points)

Use this below code which uses custom functi

on:

pr

int (df1)

  USER   Timestamp day_of_week  Busi_days  Busi_hours

0  AAD  2017-07-11    09:31:44       True        True

1  AAD  2017-07-11    23:24:43       True       False

2  AAD  2017-07-12    13:24:43       True        True

3  SAP  2017-07-23    14:24:34      False       False

4  SAP  2017-07-24    16:58:49       True        True

5  YAS  2017-07-31    21:10:35       True       False

def func(df, time_col, user_col):

    df[time_col] = pd.to_datetime(df[time_col])

    df['date'] = df[time_col].dt.floor('d')

    u = df.USER.unique()

    a = df.groupby([user_col, 'date']).size().reset_index(level=1, drop=True)

    a = a[a>1]

    target_df = (a[~a.index.duplicated()]

                    .astype(bool).reindex(u, fill_value=False).to_frame(name='Active'))

    a = df.groupby(user_col)[time_col].nunique()

    target_df['Multiple_days'] = a[a>1].astype(bool).reindex(u, fill_value=False)

    a = df.loc[(df.Busi_days==True)&(df.Busi_hours==True), user_col].unique()

    target_df['Busi_weekday'] = target_df.index.isin(a)

    return target_df

#inputs are name of DataFrame, column for timestamp and column for user    

print (func(df1, 'Timestamp', 'USER'))

      Active  Multiple_days  Busi_weekday

USER                                     

AAD     True           True          True

SAP    False           True          True

YAS    False          False         False

If you wish to learn Data Science then visit this Data Science Course.

Browse Categories

...