How to modify DataFrames so that they only have rows with shared index values in Pandas?

Question

asked Mar 28, 2020 in Data Science by blackindya (18.4k points)

I am working on data science using python pandas. where I have two datasets that are of CSV format, I am trying to select only those columns which are baring the same column number. I am trying to do this to find the correlation in the data frames.

To achieve my desired results, I have been working on the code from a couple of hours and I have been failed. I am providing the code below which I have tried

common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
print(normalized_house_index_data.index.values)
df1= df1.take(common_dates)
df2= df2.take(common_dates)

But I am getting the error telling that 'index out of bound' as shown below, even though my datasets have the same number of columns

[numpy.datetime64('2000-12-31T00:00:00.000000000'), numpy.datetime64('2001-12-31T00:00:00.000000000'), numpy.datetime64('2004-12-31T00:00:00.000000000'), numpy.datetime64('2003-12-31T00:00:00.000000000'), #and more values
And for df1.index.values, I am getting this
['2000-12-31T00:00:00.000000000' '2001-12-31T00:00:00.000000000'
'2002-12-31T00:00:00.000000000' '2003-12-31T00:00:00.000000000' #and more values
And for df2.index.values, I am getting this
['1947-12-31T00:00:00.000000000' '1948-12-31T00:00:00.000000000'
#lots of values
'1997-12-31T00:00:00.000000000' '1998-12-31T00:00:00.000000000'
'1999-12-31T00:00:00.000000000' '2000-12-31T00:00:00.000000000'
'2001-12-31T00:00:00.000000000' '2002-12-31T00:00:00.000000000'
#more values

I tried converting the date format to string format thinking that there may be some kind of typo mistake. But it is throwing me an error telling "invalid literal for int()" I even tried converting them to int, even then i am getting the error.

So I decide to try in another way to achieve my results using the code below:

droplist = []
for date, value in df1.iterrows():
if date not in common_dates:
droplist.append(date)
df1= df1.drop(droplist)

I am not able to get results even when I try with a different approach can anyone tell me where am I going wrong and how to approach this problem.

1 Answer

supriya · Answer 1 · 2020-03-28T09:53:53+0000

You can use the Index.Intersection for finding the common indexing columns.

a = pd.DatetimeIndex(['2000-12-31T00:00:00.000000000',
'2001-12-31T00:00:00.000000000',
'2002-12-31T00:00:00.000000000',
'2003-12-31T00:00:00.000000000'])
b = pd.DatetimeIndex(['1947-12-31T00:00:00.000000000',
'1948-12-31T00:00:00.000000000',
'1997-12-31T00:00:00.000000000',
'1998-12-31T00:00:00.000000000',
'1999-12-31T00:00:00.000000000',
'2000-12-31T00:00:00.000000000',
'2001-12-31T00:00:00.000000000',
'2002-12-31T00:00:00.000000000'])
df1 = pd.DataFrame(index=a)
df2 = pd.DataFrame(index=b)
common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
[numpy.datetime64('2000-12-31T00:00:00.000000000'),
numpy.datetime64('2001-12-31T00:00:00.000000000'),
numpy.datetime64('2002-12-31T00:00:00.000000000')]

You can also use the loc function in order to select columns using numbers as shown below in the below code.

common_dates = df1.index.intersection(df2.index)
print(common_dates)
DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31'],
dtype='datetime64[ns]', freq='A-DEC')
df1= df1.loc[common_dates]
df2= df2.loc[common_dates]
print (df1)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]
print (df2)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]

Improve your knowledge in data science from scratch by click on the click Data Science

How to modify DataFrames so that they only have rows with shared index values in Pandas?

1 Answer

Related questions

Browse Categories