I have online user data with the following information,
df.head()
USER Timestamp day_of_week Busi_days Busi_hours
AAD 2017-07-11 09:31:44 TRUE TRUE
AAD 2017-07-11 23:24:43 TRUE FALSE
AAD 2017-07-12 13:24:43 TRUE TRUE
SAP 2017-07-23 14:24:34 FALSE FALSE
SAP 2017-07-24 16:58:49 TRUE TRUE
YAS 2017-07-31 21:10:35 TRUE FALSE
I wanted to compute the activity of USER column and create three new columns namely: 1. Activity: With the information based on how active the user is, meaning if the same user clicked more than twice then call it TRUE else false. 2.Multiple_days: If the user clicked the website on the more one day, if the same user clicked more than 2 days call the column TRUE else FALSE. 3. Busniess_days: whether the user clicked on the weekdays, if the user clicked the website on businesses days within business hours then call it True else FALSE
I have the following script performing the above-mentioned task, but its really slow for my huge data frame my data frame is 117Mb in size. Any better solutions would be great
My try:
df.Timestamp = pd.to_datetime(df.Timestamp)
df['date'] = [x.date() for x in df.Timestamp]
target_df = pd.DataFrame()
target_df['USER'] = df.USER.unique()
a = df.groupby(['USER', 'date']).size()
a = a[a>1]
UID=pd.DataFrame(a).reset_index().USER.values
target_df['Active'] = [True if x in UID else False for x in target_df.USER.values]
a = df.groupby('USER')['Timestamp'].nunique()
a = a[a>1]
UUID2=pd.DataFrame(a).reset_index().USER.values
target_df['Multiple_days'] = [True if x in UUID2 else False for x in target_df.USER.values]
a = df[(df.Busi_days==True)&(df.Busi_hours==True)].USER.unique()
target_df['Busi_weekday'] = [True if x in a else False for x in target_df.USER.values]
target_df.head()
USER Active Multiple_days Busi_weekday
AAD TRUE TRUE TRUE
SAP FALSE TRUE FALSE
YAS FALSE FALSE FALSE