0 votes
1 view
in Data Science by (17.6k points)

I have online user data with the following information,

df.head()

USER Timestamp  day_of_week Busi_days   Busi_hours

AAD 2017-07-11 09:31:44 TRUE    TRUE

AAD 2017-07-11 23:24:43 TRUE    FALSE

AAD 2017-07-12 13:24:43 TRUE    TRUE

SAP 2017-07-23 14:24:34 FALSE   FALSE

SAP 2017-07-24 16:58:49 TRUE    TRUE

YAS 2017-07-31 21:10:35 TRUE    FALSE

I wanted to compute the activity of USER column and create three new columns namely: 1. Activity: With the information based on how active the user is, meaning if the same user clicked more than twice then call it TRUE else false. 2.Multiple_days: If the user clicked the website on the more one day, if the same user clicked more than 2 days call the column TRUE else FALSE. 3. Busniess_days: whether the user clicked on the weekdays, if the user clicked the website on businesses days within business hours then call it True else FALSE

I have the following script performing the above-mentioned task, but its really slow for my huge data frame my data frame is 117Mb in size. Any better solutions would be great

My try:

df.Timestamp = pd.to_datetime(df.Timestamp)

df['date'] = [x.date() for x in df.Timestamp]

target_df = pd.DataFrame()

target_df['USER'] = df.USER.unique()

a = df.groupby(['USER', 'date']).size()

a = a[a>1]

UID=pd.DataFrame(a).reset_index().USER.values

target_df['Active'] = [True if x in UID else False for x in target_df.USER.values]

a = df.groupby('USER')['Timestamp'].nunique()

a = a[a>1]

UUID2=pd.DataFrame(a).reset_index().USER.values 

target_df['Multiple_days'] = [True if x in UUID2 else False for x in target_df.USER.values]

a = df[(df.Busi_days==True)&(df.Busi_hours==True)].USER.unique()

target_df['Busi_weekday'] = [True if x in a else False for x in target_df.USER.values]

target_df.head()

USER Active  Multiple_days   Busi_weekday

AAD TRUE    TRUE    TRUE

SAP FALSE   TRUE    FALSE

YAS FALSE   FALSE   FALSE

1 Answer

0 votes
by (32.5k points)

Use this below code which uses custom functi

on:

pr

int (df1)

  USER   Timestamp day_of_week  Busi_days  Busi_hours

0  AAD  2017-07-11    09:31:44       True        True

1  AAD  2017-07-11    23:24:43       True       False

2  AAD  2017-07-12    13:24:43       True        True

3  SAP  2017-07-23    14:24:34      False       False

4  SAP  2017-07-24    16:58:49       True        True

5  YAS  2017-07-31    21:10:35       True       False

def func(df, time_col, user_col):

    df[time_col] = pd.to_datetime(df[time_col])

    df['date'] = df[time_col].dt.floor('d')

    u = df.USER.unique()

    a = df.groupby([user_col, 'date']).size().reset_index(level=1, drop=True)

    a = a[a>1]

    target_df = (a[~a.index.duplicated()]

                    .astype(bool).reindex(u, fill_value=False).to_frame(name='Active'))

    a = df.groupby(user_col)[time_col].nunique()

    target_df['Multiple_days'] = a[a>1].astype(bool).reindex(u, fill_value=False)

    a = df.loc[(df.Busi_days==True)&(df.Busi_hours==True), user_col].unique()

    target_df['Busi_weekday'] = target_df.index.isin(a)

    return target_df

#inputs are name of DataFrame, column for timestamp and column for user    

print (func(df1, 'Timestamp', 'USER'))

      Active  Multiple_days  Busi_weekday

USER                                     

AAD     True           True          True

SAP    False           True          True

YAS    False          False         False

If you wish to learn Data Science then visit this Data Science Course.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...