
I have a table that looks like this:

id   feature_1  feature_2  feature_3
1       A          x          r
23      A          v          r
56      B          z          r

I want to compute some sort of Jaccard distance between my rows (ids) in an efficient way, without converting each feature into one-hot-encoded columns (there are too many possible values).

How can I get the pairwise similarity between rows? Something like this:

id_1  id_2  similarity  num_of_features  distance
1     23       2              3            2/3
1     56       1              3            1/3
23    56       1              3            1/3

My code:

import itertools
import pandas as pd

def create_pairs(ids_list):
    pairs_list = []
    for (p1, p2) in itertools.combinations(ids_list, 2):
        pairs_list.append([p1, p2])
    return pairs_list

def get_distance(id_1, id_2, df):
    ????
    return distance

ids_list = list(df['id'].unique())
pairs_list = create_pairs(ids_list=ids_list)
pairs_df = pd.DataFrame(pairs_list, columns=['id_1', 'id_2'])

distance_list = []
for [id_1, id_2] in pairs_list:
    distance = get_distance(id_1=id_1, id_2=id_2, df=df)
    distance_list.append(distance)
pairs_df['distance'] = distance_list


Try this; it will be faster:

def get_distance(id_1, id_2, df):
    # df must be indexed by id and contain only the feature columns,
    # so that df.loc[id] returns that row's feature values
    similarity_count = (df.loc[id_1] == df.loc[id_2]).sum()
    distance = 1 - similarity_count / len(df.columns)
    return distance
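A minimal end-to-end sketch of this approach, assuming the sample data from the question and that the DataFrame is indexed by id so `df.loc[id]` returns only the feature columns:

```python
import itertools

import pandas as pd

# Sample data from the question, indexed by id so that df.loc[id]
# returns only that row's feature values
df = pd.DataFrame({
    "id": [1, 23, 56],
    "feature_1": ["A", "A", "B"],
    "feature_2": ["x", "v", "z"],
    "feature_3": ["r", "r", "r"],
}).set_index("id")

def get_distance(id_1, id_2, df):
    # Count features with identical values in the two rows
    similarity_count = (df.loc[id_1] == df.loc[id_2]).sum()
    return 1 - similarity_count / len(df.columns)

rows = []
for id_1, id_2 in itertools.combinations(df.index, 2):
    similarity = int((df.loc[id_1] == df.loc[id_2]).sum())
    rows.append((id_1, id_2, similarity, len(df.columns),
                 get_distance(id_1, id_2, df)))

pairs_df = pd.DataFrame(rows, columns=["id_1", "id_2", "similarity",
                                       "num_of_features", "distance"])
print(pairs_df)
```

Here distance is 1 - similarity/num_of_features: ids 1 and 23 share 2 of 3 feature values, giving a distance of 1/3.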
