Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (18.4k points)

I have two data frames with the closest matching DateTime index, sometimes matching. An object is to merge two of them using one index as a reference and appending a second to the closest matching (within 1 minute) on the first one.

My code and output:

import pandas as pd

masterdf = pd.DataFrame({"AA":[77.368607,77.491655,77.425134,76.490991]})

masterdf.index = ['2019-10-01 07:52:07','2019-10-01 07:53:01','2019-10-01 07:53:54','2019-10-01 07:54:47']

masterdf.index.name = 'datetime'

slavedf = pd.DataFrame({"BB":[50,60,70,80]})

slavedf.index = ['2019-10-01 07:53:00','2019-10-01 07:53:54','2019-10-01 10:54:47','2019-10-01 10:00:00']

slavedf.index.name = 'datetime'

maindf = masterdf.merge(slavedf,left_index=True,right_index=True)

Present output:

masterdf = 

                            AA

datetime                      

2019-10-01 07:52:07  77.368607

2019-10-01 07:53:01  77.491655

2019-10-01 07:53:54  77.425134

2019-10-01 07:54:47  76.490991

slavedf = 

                     BB

datetime               

2019-10-01 07:53:00  50

2019-10-01 07:53:54  60

2019-10-01 10:54:47  70

2019-10-01 10:00:00  80

maindf = 

datetime                   AA         BB

2019-10-01 07:53:54    77.425134      60

Expected output:

maindf = 

datetime                   AA          BB

2019-10-01 07:53:01    77.491655       50

2019-10-01 07:53:54    77.425134       60

How do I achieve this?

1 Answer

0 votes
by (36.8k points)

Logic here use a merge_asof , we need to adjust it due to , merge_asof will use the 2nd dataframe mutiple times , then we need additional key here is datetime to drop a duplicate

masterdf.index=pd.to_datetime(masterdf.index)

masterdf=masterdf.sort_index().reset_index()

slavedf.index=pd.to_datetime(slavedf.index)

slavedf=slavedf.sort_index().reset_index()

slavedf['datetime2']=slavedf['datetime']

slavedf['key']=slavedf.index

newdf=pd.merge_asof(masterdf,slavedf,on='datetime',tolerance=pd.Timedelta('60s'),direction='nearest')

newdf['diff']=(newdf.datetime-newdf.datetime2).abs()

newdf=newdf.sort_values('diff').drop_duplicates('key')

newdf

Out[35]: 

             datetime         AA  BB           datetime2     diff

2 2019-10-01 07:53:54  77.425134  60 2019-10-01 07:53:54 00:00:00

1 2019-10-01 07:53:01  77.491655  50 2019-10-01 07:53:00 00:00:01

If you are a beginner and want to know more about Data Science the do check out the Data Science course

Related questions

Browse Categories

...