
I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).

I start by replacing the NaN values in each of these columns with a placeholder value suited to that column. Then I use LabelEncoder to convert the text data into numerical data. When I then try to apply OneHotEncoder to the categorical data, I get the error shown below. I also tried passing the columns one by one to the OneHotEncoder constructor, but I get the same error for every column.

Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?

I am using Python 3.6 with pandas and scikit-learn for data processing.

Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing Nan values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================

# Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like
# Country name, Purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])

# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================
# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error

Traceback (most recent call last):
  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

1 Answer


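You filled the NaN values in six columns, but other columns still contain missing values. As the traceback shows, OneHotEncoder runs check_array on the whole array, not only on the columns listed in categorical_features, so any NaN left anywhere in x_train raises this ValueError.

You can use pandas to flag the columns that still contain NaN:

pd.isnull(train_data).sum() > 0

In the result below, sold, libor_rate and bought are True, so they still contain NaN. After handling those columns as well, the code will work fine.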

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
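
As a rough sketch of what "handling" the remaining columns could look like, the snippet below lists the columns that still contain NaN and fills each one with its own median. The small DataFrame is a hypothetical stand-in for train.csv (only the column names come from the question), and median filling is just one assumed strategy that only suits numeric columns; the mean, a constant, or dropping the rows may fit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv; the values are made up for illustration.
train_data = pd.DataFrame({
    "portfolio_id": ["PF1", "PF2", "PF3", "PF4"],
    "sold": [100.0, np.nan, 300.0, 400.0],
    "libor_rate": [0.5, 0.7, np.nan, 0.9],
    "bought": [90.0, 210.0, np.nan, 380.0],
    "return": [0.01, 0.02, 0.03, 0.04],
})

# Names of the columns that still contain at least one NaN.
nan_columns = train_data.columns[train_data.isnull().any()].tolist()
print(nan_columns)

# Fill each of those columns with its own median (assumes they are numeric).
for column in nan_columns:
    train_data[column] = train_data[column].fillna(train_data[column].median())

# Count the NaN values left; this should be zero before calling OneHotEncoder.
remaining_nan = int(train_data.isnull().sum().sum())
print(remaining_nan)
```

Do the filling on train_data before you build x_train with .values, so that the array passed to OneHotEncoder no longer contains NaN.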
