I'm working on the Kaggle House Prices competition, and the dataset has a lot of categorical data. I'm trying to set some of them as ordered categories like this:

for col in ordered_category_rating_cols:
    data[col] = data[col].astype(pd.api.types.CategoricalDtype(ordered=True, categories=["GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf", "NA"]))

However, when I pass the data into model.fit(), it throws this error (the full stack trace is below):

ValueError: could not convert string to float: 'GLQ'

By stripping out a bunch of columns, I narrowed it down to one - but if I print the dtype for that, it looks correct:

> train_x["BsmtFinType1"].dtype

> CategoricalDtype(categories=['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'NA'], ordered=True)

I've searched high and low, but can't find any solution to this. Do I need to do something to tell Keras to treat the categories as floats?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-144-c86afee8eb19> in <module>()
      4     batch_size=128,
      5     epochs=6,
----> 6     validation_split=0.1
      7 )
      8

3 frames

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    778           validation_steps=validation_steps,
    779           validation_freq=validation_freq,
--> 780           steps_name='steps_per_epoch')
    781
    782   def evaluate(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
    361
    362         # Get outputs.
--> 363         batch_outs = f(ins_batch)
    364         if not isinstance(batch_outs, list):
    365           batch_outs = [batch_outs]

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3275         tensor_type = dtypes_module.as_dtype(tensor.dtype)
   3276         array_vals.append(np.asarray(value,
-> 3277                                      dtype=tensor_type.as_numpy_dtype))
   3278
   3279     if self.feed_dict:

/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539
    540

ValueError: could not convert string to float: 'GLQ'

1 Answer


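A categorical column still holds its string labels, so Keras cannot turn it into a float array. Convert the categorical columns to their integer codes first: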

import pandas as pd

df = pd.DataFrame(data={"gender":["male","female"]})

# Cast to category, then replace each label with its integer code
df['gender'] = df['gender'].astype('category').cat.codes

# df now contains:
   gender
0       1
1       0

# If multiple columns contain categorical data (they must already have the category dtype)
category_columns = list(df.select_dtypes(['category']).columns)
df[category_columns] = df[category_columns].apply(lambda x: x.cat.codes)
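
Applied to the ordered column from the question, the same idea might look like the sketch below. The four-row frame is a made-up stand-in for the competition data; only the column name and the category list come from the question. With an ordered CategoricalDtype, each code is the value's position in the category list, so the ordering carries over into the numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real training data; only the column name
# and the category list are taken from the question.
levels = ["GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf", "NA"]
data = pd.DataFrame({"BsmtFinType1": ["GLQ", "Unf", "NA", "ALQ"]})

ordered_dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)
data["BsmtFinType1"] = data["BsmtFinType1"].astype(ordered_dtype)

# Replace every categorical column with its integer codes.
# A code is the value's position in `levels`; a value that is missing
# or not listed in `levels` gets the code -1.
category_columns = data.select_dtypes(["category"]).columns
for col in category_columns:
    data[col] = data[col].cat.codes

# Keras needs a purely numeric array, so convert explicitly to float32.
train_x = data.to_numpy(dtype="float32")
```

One thing to check in the real data: pd.read_csv treats the string "NA" as a missing value by default, so those rows would end up with the code -1 rather than 6 unless the file is read with keep_default_na=False or the missing values are filled in beforehand.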

...