
I have two CSV files (a training set and a test set). Several of the columns contain NaN values (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).

I start by replacing the NaN values in each of those columns with a placeholder value suited to the column. Then I apply LabelEncoder to convert the text columns into numeric codes. When I then try to apply OneHotEncoder to the categorical columns, I get the error shown below. I also tried passing the columns one at a time to the OneHotEncoder constructor, but I get the same error for every column.

My end goal is to predict the return values, but I am stuck at the data preprocessing step because of this error. How do I solve it?

I am using Python 3.6 with pandas and scikit-learn for data processing.

Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing NaN values here
train_data['status'] = train_data['status'].fillna(2.0)
train_data['hedge_value'] = train_data['hedge_value'].fillna(2.0)
train_data['indicator_code'] = train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id'] = train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id'] = train_data['desk_id'].fillna('DSK99999999')
train_data['office_id'] = train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================

# Encoding categorical data, i.e. text data, since calculation happens on numbers only,
# so having text like Country name, Purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])

# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================

# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error

Traceback (most recent call last):
  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

1 Answer


You can use the pandas function pd.isnull to check which columns still contain NaN values:

pd.isnull(train_data).sum() > 0
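If only the names of the affected columns are wanted, the same check can be narrowed down. The following is a minimal sketch on a toy DataFrame; the values are made up for illustration and stand in for the real train.csv, and the variable names nan_counts and nan_columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are invented for illustration.
train_data = pd.DataFrame({
    "desk_id": ["DSK1", None, "DSK3"],
    "sold": [100.0, np.nan, 300.0],
    "libor_rate": [0.5, 0.7, np.nan],
    "return": [0.01, 0.02, 0.03],
})

# Number of missing values in each column
nan_counts = train_data.isnull().sum()

# Keep only the columns that have at least one missing value
nan_columns = nan_counts[nan_counts > 0].index.tolist()
print(nan_columns)
```

Printing nan_counts as well shows how many rows are affected per column, which can help when choosing between filling the values and dropping the rows.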

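In the result below, sold, libor_rate and bought are True, which means they still contain NaN values that the fillna calls in the question do not cover. After handling these remaining NaN values, the code will work fine.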

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
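One possible way to handle the three remaining columns is sketched below. It assumes that filling with the column mean is acceptable for sold, bought and libor_rate (the same strategy as the commented-out Imputer in the question); the right choice depends on the data. The toy values are invented. The encoding step uses pandas.get_dummies as a swapped-in alternative to the LabelEncoder plus OneHotEncoder combination, because the categorical_features argument of OneHotEncoder was removed in later scikit-learn releases.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; values are invented for illustration only.
train_data = pd.DataFrame({
    "country_code": ["T", "N", "T", "Z"],
    "sold": [100.0, np.nan, 300.0, 200.0],
    "bought": [90.0, 110.0, np.nan, 100.0],
    "libor_rate": [0.5, np.nan, 0.7, 0.6],
    "return": [0.01, 0.02, 0.03, 0.04],
})

# Assumption: mean imputation is acceptable for these numeric columns.
numeric_cols = ["sold", "bought", "libor_rate"]
train_data[numeric_cols] = train_data[numeric_cols].fillna(
    train_data[numeric_cols].mean()
)

# Confirm that no missing values are left before encoding
assert not train_data.isnull().values.any()

# One-hot encode the text column directly with pandas (no LabelEncoder needed)
features = train_data.drop(columns="return")
encoded = pd.get_dummies(features, columns=["country_code"], dtype=float)

x_train = encoded.values
y_train = train_data["return"].values
print(encoded.columns.tolist())
```

Whatever fill values are computed on the training set should be reused on the test set, so that both files are preprocessed the same way.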
