0 votes
2 views
in Machine Learning by (19k points)

I'm working through some examples of Linear Regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and the results are puzzling.

I'm using the Boston housing dataset, and prepping it this way:

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# load the data
boston = load_boston()
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target

I'm currently trying to reason about the results I get from the following scenarios:

  • Initializing Linear Regression with the parameter normalize=True vs using Normalizer
  • Initializing Linear Regression with the parameter fit_intercept=False, with and without standardization.

Collectively, I find the results confusing.

Here's how I'm setting everything up:

# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]

normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)

# now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)

I have some questions that I can't reconcile:

  • Why is there absolutely no difference between the first two models? It appears that setting normalize=True does nothing. I can understand having predictions and R^2 values that are the same, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.
  • I don't understand why the model using Normalizer causes such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all.
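For reference, here is a minimal sketch of how I'm comparing the models side by side (it assumes the reg1-reg5, X, y, normal_X and scaled_X variables defined above); the coefficients and R^2 scores it prints are what I'm referring to:

# Minimal sketch: print each model's coefficients, intercept and R^2 side by side
models = {
    'reg1 (raw X)': (reg1, X),
    'reg2 (normalize=True)': (reg2, X),
    'reg3 (Normalizer)': (reg3, normal_X),
    'reg4 (StandardScaler)': (reg4, scaled_X),
    'reg5 (scaled, no intercept)': (reg5, scaled_X),
}
for name, (model, features) in models.items():
    print(name, model.coef_.ravel(), model.intercept_, model.score(features, y))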

1 Answer

0 votes
by (33.1k points)

I am assuming that by the first two models you mean reg1 and reg2. Let us know if that is not the case.

A linear regression has the same predictive power whether you normalize the data or not, so using normalize=True has no impact on the predictions. One way to see this is that the normalization here is a column-wise linear operation on each feature ((x - a)/b), and a linear transformation of the features does not change what a linear regression can fit; it only rescales the coefficients. Notice that this statement is not true for Lasso/Ridge/ElasticNet, where the penalty depends on the scale of the coefficients.
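If you want to verify this directly, here is a minimal sketch on synthetic data (the variable names are illustrative, not taken from your code) showing that fitting on standardized features leaves the predictions and R^2 unchanged, as long as an intercept is fitted:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]  # features on very different scales
y = 5 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(size=200)

raw = LinearRegression().fit(X, y)
X_std = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_std, y)

# predictions (and hence R^2) agree even though the coefficients differ
np.testing.assert_allclose(raw.predict(X), scaled.predict(X_std), atol=1e-8)
print(raw.score(X, y), scaled.score(X_std, y))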

So, why aren't the coefficients different? Well, normalize=True also takes into account that what the user normally wants is coefficients on the original features, not the normalized features. As such, it rescales the coefficients back to the original feature scale. One way to check that this makes sense is to use a simpler example:

# two features, normally distributed with sigma=10
x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)

# y is related to each of them plus some noise
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)
X = np.array([x1, x2]).T  # X has two columns

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)

# check that the coefficients are the same and equal to [2, 1]
np.testing.assert_allclose(reg1.coef_, reg2.coef_)
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.01)

This confirms that both methods correctly capture the real signal relating [x1, x2] to y, namely the coefficients 2 and 1 respectively. For more details on this, study the Linear Regression blog.
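As for the Normalizer part of your question: Normalizer rescales each row (each sample) to unit norm, whereas StandardScaler rescales each column (each feature). Dividing every sample by its own norm is not a fixed per-feature transformation, so the regression is fitted on genuinely different data, which is why its coefficients look so radically different. A minimal sketch of the difference, on toy data just for illustration:

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0]])

# Normalizer: each ROW is divided by its own L2 norm, a per-sample operation;
# here both rows end up identical, so information is lost
print(Normalizer().fit_transform(X))

# StandardScaler: each COLUMN is centred and scaled, a per-feature linear operation
print(StandardScaler().fit_transform(X))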
