Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (18.4k points)

I am working on data science using python pandas. where I have two datasets that are of CSV format, I am trying to select only those columns which are baring the same column number. I am trying to do this to find the correlation in the data frames.

To achieve my desired results, I have been working on the code from a couple of hours and I have been failed. I am providing the code below which I have tried 

common_dates = list(set(df1.index.values).intersection(df2.index.values))

print(common_dates)

print(normalized_house_index_data.index.values)

df1= df1.take(common_dates)

df2= df2.take(common_dates)

But I am getting the error telling that 'index out of bound' as shown below, even though my datasets have the same number of columns 

[numpy.datetime64('2000-12-31T00:00:00.000000000'), numpy.datetime64('2001-12-31T00:00:00.000000000'), numpy.datetime64('2004-12-31T00:00:00.000000000'), numpy.datetime64('2003-12-31T00:00:00.000000000'), #and more values

And for df1.index.values, I am getting this

['2000-12-31T00:00:00.000000000' '2001-12-31T00:00:00.000000000'

 '2002-12-31T00:00:00.000000000' '2003-12-31T00:00:00.000000000' #and more values

And for df2.index.values, I am getting this

['1947-12-31T00:00:00.000000000' '1948-12-31T00:00:00.000000000'

#lots of values

 '1997-12-31T00:00:00.000000000' '1998-12-31T00:00:00.000000000'

 '1999-12-31T00:00:00.000000000' '2000-12-31T00:00:00.000000000'

 '2001-12-31T00:00:00.000000000' '2002-12-31T00:00:00.000000000'

#more values

I tried converting the date format to string format thinking that there may be some kind of typo mistake. But it is throwing me an error telling "invalid literal for int()" I even tried converting them to int, even then i am getting the error.

So I decide to try in another way to achieve my results using the code below:

droplist = []

for date, value in df1.iterrows():

    if date not in common_dates:

        droplist.append(date)

df1= df1.drop(droplist)

I am not able to get results even when I try with a different approach can anyone tell me where am I going wrong and how to approach this problem.

1 Answer

0 votes
by (36.8k points)

You can use the Index.Intersection for finding the common indexing columns.

a = pd.DatetimeIndex(['2000-12-31T00:00:00.000000000',

                      '2001-12-31T00:00:00.000000000',

                      '2002-12-31T00:00:00.000000000', 

                      '2003-12-31T00:00:00.000000000'])

b = pd.DatetimeIndex(['1947-12-31T00:00:00.000000000',

                      '1948-12-31T00:00:00.000000000',

                      '1997-12-31T00:00:00.000000000',

                      '1998-12-31T00:00:00.000000000',

                      '1999-12-31T00:00:00.000000000',

                      '2000-12-31T00:00:00.000000000',

                      '2001-12-31T00:00:00.000000000',

                      '2002-12-31T00:00:00.000000000'])

df1 = pd.DataFrame(index=a)

df2 = pd.DataFrame(index=b)

common_dates = list(set(df1.index.values).intersection(df2.index.values))

print(common_dates)

[numpy.datetime64('2000-12-31T00:00:00.000000000'), 

 numpy.datetime64('2001-12-31T00:00:00.000000000'), 

 numpy.datetime64('2002-12-31T00:00:00.000000000')]

You can also use the loc function in order to select columns using numbers as shown below in the below code.

common_dates = df1.index.intersection(df2.index)

print(common_dates)

DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31'], 

              dtype='datetime64[ns]', freq='A-DEC')

df1= df1.loc[common_dates]

df2= df2.loc[common_dates]

print (df1)

Empty DataFrame

Columns: []

Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]

print (df2)

Empty DataFrame

Columns: []

Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]

Improve your knowledge in data science from scratch by click on the click Data Science

Browse Categories

...