
I'm sorry if the title of the question is not clear; I could not sum up the problem in one line.

Here are simplified datasets to explain the problem. The training set contains many more categories than the test set, so after OneHotEncoding the two sets end up with a different number of columns. How can I handle this?

Training Set

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
| 200   | SE2      |
| 300   | SE3      |
+-------+----------+

Training set after OneHotEncoding

+-------+-----------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 | DummyCat3 |
+-------+-----------+-----------+-----------+
| 100   | 1         | 0         | 0         |
| 200   | 0         | 1         | 0         |
| 300   | 0         | 0         | 1         |
+-------+-----------+-----------+-----------+

Test Set

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
| 200   | SE1      |
| 300   | SE2      |
+-------+----------+

Test set after OneHotEncoding

+-------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 |
+-------+-----------+-----------+
| 100   | 1         | 0         |
| 200   | 1         | 0         |
| 300   | 0         | 1         |
+-------+-----------+-----------+
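The mismatch above can be reproduced with a minimal sketch; here the dummy columns are created with pandas.get_dummies, which is an assumption, since the question doesn't show the encoding step, but any encoder fitted separately on each set behaves the same way:

```python
import pandas as pd

# Hypothetical frames mirroring the tables above
train = pd.DataFrame({"Value": [100, 200, 300], "Category": ["SE1", "SE2", "SE3"]})
test = pd.DataFrame({"Value": [100, 200, 300], "Category": ["SE1", "SE1", "SE2"]})

# Encoding each frame independently: each call only sees its own categories
x_train = pd.get_dummies(train, columns=["Category"])
x_test = pd.get_dummies(test, columns=["Category"])

print(x_train.shape)  # (3, 4) -- Value plus three dummy columns
print(x_test.shape)   # (3, 3) -- Value plus only two dummy columns
```

Because the test set never mentions SE3, its encoding simply has no column for it, which is exactly the width mismatch described above.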

As you can see, the training set after OneHotEncoding has shape (3, 4) while the test set after OneHotEncoding has shape (3, 3). Because of this, when I run the following code (y_train is a 1-D target array of shape (3,)):

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)
x_pred = regressor.predict(x_test)

I get the error at the predict() call. The dimensions in the traceback are large because they come from my real datasets, not the simplified examples above.

Traceback (most recent call last):
  File "<ipython-input-2-5bac76b24742>", line 30, in <module>
    x_pred = regressor.predict(x_test)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
    return self._decision_function(X)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
    dense_output=True) + self.intercept_
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)
ValueError: shapes (4801,2236) and (4033,) not aligned: 2236 (dim 1) != 4033 (dim 0)


Transform x_test the same way x_train was transformed, using the same OneHotEncoder object that was fit() on x_train:

x_test = onehotencoder.transform(x_test)
x_pred = regressor.predict(x_test)

The assumption here is that fit_transform() was used on the test data. fit_transform() discards the previously learnt categories and re-fits the OneHotEncoder, so it sees only the two distinct values present in the test column and produces an output with a different shape. Calling transform() instead keeps the category mapping learnt from the training data, so the test set gets the same columns as the training set.
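A minimal end-to-end sketch of this fix, using the toy tables from the question as placeholder data and a made-up y_train; handle_unknown="ignore" is an extra safeguard so that categories never seen during fitting encode to all zeros instead of raising an error:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Placeholder frames mirroring the tables in the question
train = pd.DataFrame({"Value": [100, 200, 300], "Category": ["SE1", "SE2", "SE3"]})
test = pd.DataFrame({"Value": [100, 200, 300], "Category": ["SE1", "SE1", "SE2"]})
y_train = np.array([1.0, 2.0, 3.0])  # hypothetical target, shape (3,)

# Fit the encoder ONCE, on the training categories only
encoder = OneHotEncoder(handle_unknown="ignore")
cat_train = encoder.fit_transform(train[["Category"]]).toarray()
# transform(), NOT fit_transform(): reuse the categories learnt from training
cat_test = encoder.transform(test[["Category"]]).toarray()

# Recombine the numeric column with the dummy columns
x_train = np.column_stack([train[["Value"]].to_numpy(), cat_train])
x_test = np.column_stack([test[["Value"]].to_numpy(), cat_test])

print(x_train.shape, x_test.shape)  # (3, 4) (3, 4) -- widths now match

regressor = LinearRegression()
regressor.fit(x_train, y_train)
x_pred = regressor.predict(x_test)  # no shape error
```

The key design point is that the encoder is part of the fitted model: it is fit once on training data and only ever applied (via transform) to new data, exactly like the regressor itself.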
