0 votes
in Data Science by (18.4k points)

I have a table that looks like this:

id   feature_1  feature_2  feature_3
1    A          x          r
23   A          v          r
56   B          z          r

I want to compute some sort of Jaccard distance between my rows (ids) efficiently, without one-hot encoding the features (each one has a lot of possible values).

How can I get the number of matching features for each pair of rows? Something like this:

id_1  id_2  similarity  num_of_features  distance
1     23    2           3                2/3
1     56    1           3                1/3
23    56    1           3                1/3

My code:

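import itertools
import pandas as pd

def create_pairs(ids_list):
    # Build every unordered pair of ids.
    pairs_list = []
    for (p1, p2) in itertools.combinations(ids_list, 2):
        pair = [p1, p2]
        pairs_list.append(pair)
    return pairs_list

def get_distance(id_1, id_2, df):
    # ... compute `distance` for rows id_1 and id_2 ...
    return distance

ids_list = list(df['id'].unique())
pairs_list = create_pairs(ids_list=ids_list)
pairs_df = pd.DataFrame(pairs_list, columns=['id_1', 'id_2'])

# Call get_distance once per pair and collect the results.
distance_list = []
for [id_1, id_2] in pairs_list:
    distance = get_distance(id_1=id_1, id_2=id_2, df=df)
    distance_list.append(distance)

pairs_df['distance'] = distance_list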

1 Answer

0 votes
by (36.8k points)

Try this; it should be faster:

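def get_distance(id_1, id_2, df):
    # Assumes df is indexed by id, e.g. df = df.set_index('id'),
    # so that every remaining column is a feature.
    similarity_count = (df.loc[id_1] == df.loc[id_2]).sum()
    # Fraction of features that differ between the two rows.
    distance = 1 - similarity_count / len(df.columns)
    return distance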
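
If calling get_distance once per pair is still too slow, the comparison can probably be done for all pairs at once. The following is only a sketch, not part of the original answer: the function name pairwise_similarity and its id_col parameter are made up for illustration, and the sample frame just reproduces the table from the question. It uses the same definition as above (distance = 1 - similarity / num_of_features), whereas the "distance" column in the question's expected output shows the matching fraction itself.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 23, 56],
    "feature_1": ["A", "A", "B"],
    "feature_2": ["x", "v", "z"],
    "feature_3": ["r", "r", "r"],
})

def pairwise_similarity(df, id_col="id"):
    """Hypothetical helper: count matching features for every pair of rows."""
    features = df.set_index(id_col)
    values = features.to_numpy()
    ids = features.index.to_numpy()
    # Row positions of every unordered pair (i < j).
    i, j = np.triu_indices(len(ids), k=1)
    # Compare all pairs at once, then count matches per pair.
    similarity = (values[i] == values[j]).sum(axis=1)
    n_features = values.shape[1]
    return pd.DataFrame({
        "id_1": ids[i],
        "id_2": ids[j],
        "similarity": similarity,
        "num_of_features": n_features,
        "distance": 1 - similarity / n_features,
    })

pairs_df = pairwise_similarity(df)
print(pairs_df)
```

Note that this builds an intermediate array with one row per pair, so memory grows with the square of the number of ids; for a very large table the pairs would need to be processed in chunks.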
