0 votes
in Data Science by (50.2k points)

I am curious why a simple concatenation of two data frames in pandas:

shape: (66441, 1)

dtypes: prediction    int64

dtype: object

isnull().sum(): prediction    0

dtype: int64

shape: (66441, 1)

dtypes: CUSTOMER_ID    int64

dtype: object

isnull().sum(): CUSTOMER_ID    0

dtype: int64

of the same shape and both without NaN values

foo = pd.concat([initId, ypred], join='outer', axis=1)

print(foo.shape)

print(foo.isnull().sum())

can result in a lot of NaN values if joined.

(83384, 2)

CUSTOMER_ID    16943

prediction     16943

How can I fix this problem and prevent NaN values from being introduced?

Trying to reproduce it like this:

aaa  = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])

print(aaa)

bbb  = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])

print(bbb)

pd.concat([aaa, bbb], axis=1)

failed, i.e. it worked just fine and no NaN values were introduced.

1 Answer

0 votes
by (108k points)

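The problem is that the two data frames have different index values. concat with axis=1 aligns rows on their index labels, so wherever a label exists in only one of the frames, the column coming from the other frame gets NaN: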

aaa  = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])

print(aaa)

    prediction

4            0

5            1

8            0

7            1

10           0

12           0

bbb  = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])

print(bbb)

   groundTruth

0            0

1            0

2            1

3            0

4            1

5            1

print (pd.concat([aaa, bbb], axis=1))

    prediction  groundTruth

0          NaN          0.0

1          NaN          0.0

2          NaN          1.0

3          NaN          0.0

4          0.0          1.0

5          1.0          1.0

7          1.0          NaN

8          0.0          NaN

10         0.0          NaN

12         0.0          NaN
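
One way to confirm that misaligned indexes are the cause, before calling concat, is to compare the two indexes directly. The sketch below reuses aaa and bbb from this answer; the names same_index, only_in_aaa, only_in_bbb and union_size are only illustrative.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'], index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# False means concat(axis=1) has to align on labels and may introduce NaN
same_index = aaa.index.equals(bbb.index)

# Labels present in only one frame; each one yields a NaN in the other frame's column
only_in_aaa = aaa.index.difference(bbb.index)
only_in_bbb = bbb.index.difference(aaa.index)

# With the default join='outer', the result has one row per label in the union
union_size = len(aaa.index.union(bbb.index))

joined = pd.concat([aaa, bbb], axis=1)
print(same_index, list(only_in_aaa), list(only_in_bbb), union_size, joined.shape)
```

The same arithmetic is consistent with the numbers in the question: 83384 - 66441 = 16943, which matches the NaN count reported for each column, as would be expected if 16943 labels of each frame were missing from the other.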

So the solution is to call reset_index on both data frames, provided the index values are not needed:

aaa.reset_index(drop=True, inplace=True)

bbb.reset_index(drop=True, inplace=True)

print(aaa)

   prediction

0           0

1           1

2           0

3           1

4           0

5           0

print(bbb)

   groundTruth

0            0

1            0

2            1

3            0

4            1

5            1

print (pd.concat([aaa, bbb], axis=1))

   prediction  groundTruth

0           0            0

1           1            0

2           0            1

3           1            0

4           0            1

5           0            1
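
If the index of one frame does carry meaning and should be kept, a possible alternative is to match rows by position and reuse that index for the other frame. This is a minimal sketch, assuming both frames have the same number of rows in the same order; bbb_aligned and combined are illustrative names, and the approach is not part of the original answer.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'], index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# Assumption: row i of bbb belongs to row i of aaa
assert len(aaa) == len(bbb)

# Give bbb the index of aaa (returns a new frame; bbb itself is unchanged)
bbb_aligned = bbb.set_axis(aaa.index, axis=0)

# Both frames now share the same labels, so no NaN is introduced
combined = pd.concat([aaa, bbb_aligned], axis=1)
print(combined)
```

Unlike reset_index, this keeps the labels 4, 5, 8, 7, 10, 12 in the result. It is only safe when the positional correspondence between the two frames is actually known to hold.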


by (140 points)
@vinita I tried this, but it still gives me NaN values in one of the columns. Here is my code:


ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['model', 'vehicleType', 'brand']]))
train_x_encoded.columns = ohe.get_feature_names(['model', 'vehicleType', 'brand'])
train_x.drop(['model', 'vehicleType', 'brand'], axis=1, inplace=True)
train_x = train_x.reset_index(drop=True)
train_x_encoded = train_x_encoded.reset_index(drop=True)
train_x_final = pd.concat([train_x_encoded, train_x], axis=1)
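
Because both frames in this snippet are reset before concat, it may be worth checking whether the NaN values were already present in one of the inputs. The sketch below is one way to narrow that down; nan_report is a hypothetical helper and the toy frames are made-up stand-ins for train_x_encoded and train_x, not data from the post.

```python
import pandas as pd

def nan_report(left, right):
    """Hypothetical helper: summarise possible NaN sources for concat(axis=1)."""
    return {
        'left_nan': int(left.isnull().sum().sum()),    # NaN already in the left frame
        'right_nan': int(right.isnull().sum().sum()),  # NaN already in the right frame
        'same_index': left.index.equals(right.index),  # False means alignment adds NaN
    }

# Toy stand-ins (made-up data): the second frame already contains one missing value
encoded = pd.DataFrame({'brand_a': [1.0, 0.0, 1.0], 'brand_b': [0.0, 1.0, 0.0]})
rest = pd.DataFrame({'price': [100.0, float('nan'), 300.0]}, index=[10, 11, 12])
rest = rest.reset_index(drop=True)

report = nan_report(encoded, rest)
print(report)
```

If same_index is True and one of the inputs already reports missing values, those values existed before concat, and resetting the index cannot remove them.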
