I have a 2-column DataFrame, column-1 corresponds to customer, column-2 corresponds to the city this customer has visited. The DataFrame looks like the following:
print(df)
customer visited_city
0 John London
1 Mary Melbourne
2 Steve Paris
3 John New_York
4 Peter New_York
5 Mary London
6 John Melbourne
7 John New_York
I would like to convert the above DataFrame into a row-vector format, such that each row represents a unique user with the row vector indicating the cities visited.
print(wide_format_df)
London Melbourne New_York Paris
John 1.0 1.0 1.0 0.0
Mary 1.0 1.0 0.0 0.0
Steve 0.0 0.0 0.0 1.0
Peter 0.0 0.0 1.0 0.0
Below is the code I used to generate the wide format. It iterates through each user one by one. I was wondering is there any more efficient way to do so?
import pandas as pd
import numpy as np
UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()
X = []
for customer in unique_customers:
x = np.zeros(p)
city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
x[visited_idx] = 1
X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df