in Data Science by (17.6k points)

Do we always need to remove one column after one-hot encoding to prevent multicollinearity? A comment on the solution here (https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic/comments#138896) says:

@Kevin Chang You need to delete one column of the dummy variables to avoid multicollinearity. It is a state of very high correlation among the columns (independent variables), meaning that one can be predicted from the others. It is therefore a type of disturbance in the data, and if it is present, the statistical conclusions drawn from the data may not be reliable.

The solution here, however, does not cater for multicollinearity: https://www.kaggle.com/sharmasanthosh/allstate-claims-severity/exploratory-study-on-ml-algorithms

May I know whether this is a must, or in what situations we need to cater for it?

1 Answer

by (41.4k points)

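The short answer is yes, at least for a linear model fitted with an intercept. A full set of dummy columns sums to 1 in every row, so each column can be predicted exactly from the others; dropping one column removes that redundancy.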
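To illustrate the dummy-column case from the question, here is a minimal sketch using pandas and NumPy. The column name `embarked` and its values are made-up toy data, not taken from either Kaggle kernel. It compares the rank of the design matrix (intercept plus dummies) with and without `drop_first=True`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical feature with three levels.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S", "C", "Q", "S"]})

# Full one-hot encoding (3 columns) vs. encoding with one level dropped (2 columns).
full = pd.get_dummies(df["embarked"], dtype=float)
reduced = pd.get_dummies(df["embarked"], drop_first=True, dtype=float)

def design_rank(dummies):
    """Return (rank, number of columns) of the matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]

full_rank, full_cols = design_rank(full)
reduced_rank, reduced_cols = design_rank(reduced)

# If rank < number of columns, the columns are perfectly collinear.
print("full:   ", full_rank, "of", full_cols)
print("reduced:", reduced_rank, "of", reduced_cols)
```

With all three dummies plus an intercept, the matrix has four columns but only rank three; after dropping one dummy, the rank equals the number of columns.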

Multicollinearity can be prevented by removing highly correlated predictors from the model. If two or more predictors have a high variance inflation factor (VIF), remove one of them: they supply redundant information, so removing one of the correlated predictors usually does not noticeably reduce the R-squared.
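As a rough sketch of the VIF check, the snippet below computes VIF_j = 1 / (1 - R²_j) with plain NumPy, where R²_j comes from regressing predictor j on the remaining predictors plus an intercept. The helper name `vif` and the synthetic predictors `x1`, `x2`, `x3` are assumptions made for illustration only:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        # A perfect fit means the column is fully redundant: report infinity.
        out[j] = np.inf if 1.0 - r2 < 1e-12 else 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)              # unrelated predictor

vifs_all = vif(np.column_stack([x1, x2, x3]))
vifs_dropped = vif(np.column_stack([x1, x3]))  # x2 removed

print("with x2:   ", vifs_all)
print("without x2:", vifs_dropped)
```

Here `x1` and `x2` show very large VIFs while `x3` stays close to 1; once `x2` is removed, the remaining VIFs are all close to 1.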

Alternatively, you can use Partial Least Squares (PLS) regression or principal components regression (regression on the components from Principal Components Analysis). Both replace the original predictors with a smaller set of uncorrelated components.
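A minimal sketch of the principal-components route, written with NumPy only: the predictors are centred, projected onto the first `k` principal directions, and the response is regressed on those component scores. The data, the choice `k = 2`, and all variable names are illustrative assumptions. PLS differs in that it also uses the response when choosing its components, which this sketch does not do.

```python
import numpy as np

# Synthetic data (illustrative only): x1 and x2 are strongly correlated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + 0.1 * rng.normal(size=n)

# Centre the predictors; the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the first k components and compute their scores.
k = 2
scores = Xc @ Vt[:k].T

# The component scores are uncorrelated by construction.
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]

# Ordinary least squares of y on the scores plus an intercept.
Z = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ coef
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

print("correlation between component scores:", score_corr)
print("R-squared using", k, "components:", r2)
```

Three correlated predictors are reduced to two uncorrelated components, and the fit on this toy data remains close to what the original predictors could achieve.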

 
