0 votes
1 view
in Data Science by (13.1k points)

1.I am working with two csv files and imported as dataframe, df1 and df2

2.df1 has 50000 rows and df2 has 150000 rows.

3.I want to compare (iterate through each row) the 'time' of df2 with df1, find the difference in time and return the values of all column corresponding to similar row, save it in df3 (time synchronization)

4.For example, 35427949712 (of 'time' in df1) is nearest or equal to 35427949712 (of 'time' in df2), So I would like to return the contents to df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save in df3

5.For this i used two techniques, shown in code.

6.Code 1 takes very long time to execute 72 hours which is not practice since i have lot of csv files

7.Code 2 gives me "memory error" and kernel dies.

Would be great, if I get a more robust solution for the problem considering computational time, memory and power(Intel Core i7-6700HQ, 8 GB Ram)

Here is the sample data,

import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860], 

                    'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],

                   'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})

df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705], 

                    'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],

                    'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})

df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])

Code1

 for index, row in df1.iterrows():

    min=100000

    for indexer, rows in df2.iterrows():

        if abs(float(row['time'])-float(rows['time']))<min:

            min = abs(float(row['time'])-float(rows['time']))

            #storing the position 

            pos = indexer

    df3.loc[index,'time'] = df1['time'][pos]

    df3.loc[index,'velocity_x'] = df1['velocity_x'][pos]

    df3.loc[index,'yaw'] = df1['yaw'][pos]

    df3.loc[index,'velocity'] = df2['velocity'][pos]

    df3.loc[index,'yawrate'] = df2['yawrate'][pos]

Code2

df1['key'] = 1

df2['key'] = 1

df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)

df = df2.merge(df1, on='key', how ='left').reset_index()

df['diff'] = df.apply(lambda x: abs(x['time']  - x['time_x']), axis=1)

df.sort_values(by=['time', 'diff'], inplace=True)

df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]

1 Answer

0 votes
by (19.9k points)

You can use pandas.merge_asof. It allows you to combine 2 DataFrames on a key, in this case time, without the requirement that they are an exact match. 

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

The condition is that you need to sort things for merge_asof to work.

import pandas as pd

pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')

#          time  velocity   yawrate  velocity_x       yaw

#0  35427009860   12.6556 -0.074351     12.5451 -0.078781

#1  35427029728   12.6556 -0.074351     12.5451 -0.078781

#2  35427049705   12.6444 -0.074351     12.5451 -0.078781

#3  35427929709   12.6583 -0.075049     12.5351 -0.079489

#4  35427949712   12.6556 -0.075049     12.5401 -0.079591

The results depend on which dataframe we use, either left or right.So, we need to select it carefully.

...