0 votes
1 view
in AI and Deep Learning by (43.2k points)

I'm trying to put together a linear regression model but some of my featured are not numerical e.g. "Car Colour" whereas others are e.g. "Engine Size". In non-numerical cases, I'm unsure of how to represent this when adding as an input feature. The only way i could think of doing this would be to represent each color with a different value e.g. (red = 1, blue = 2, green = 3...) however this doesn't seem acceptable as this implies that green is "better" than red.

Can anybody help... I'm implementing this in Java so I'd appreciate an algorithm expressed in this language or to be language independent.

1 Answer

0 votes
by (92.8k points)

To encode the car color -I am guessing car color can take only 3 values: red, blue, green.

You can encode it as follows:











In the above table, Green will become the reference level. In your situation, if your color takes 'n' values then you will need to include n-1 dummy variables.

You can get the implementation in Java in the Weka filter NominalToBinary, this will create n variables for n categories.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !