
I'm working through some examples of Linear Regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and the results are puzzling.

I'm using the Boston housing dataset, and prepping it this way:

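import numpy as np
import pandas as pd
# Assumption: `boston` comes from load_boston, which was removed in scikit-learn 1.2
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

boston = load_boston()
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target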

I'm currently trying to reason about the results I get from the following scenarios:

• Initializing Linear Regression with the parameter normalize=True vs using Normalizer
• Initializing Linear Regression with the parameter fit_intercept = False with and without standardization.

Collectively, I find the results confusing.

Here's how I'm setting everything up:

# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)

# Now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)

I have some questions that I can't reconcile:

• Why is there absolutely no difference between the first two models? It appears that setting normalize=True does nothing. I can understand having predictions and R^2 values that are the same, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.
• I don't understand why the model using Normalizer causes such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all.


I am assuming that by the first two models you mean reg1 and reg2. Let us know if that is not the case.

A linear regression has the same predictive power whether or not you normalize the data, so using normalize=True has no impact on the predictions. One way to understand this is that column-wise normalization is an affine operation on each column, (x - a)/b, and applying such a transformation to the features of a linear regression does not change the fit or its predictions; it only changes the values of the coefficients. Note that this is not true for regularized models such as Lasso, Ridge, or ElasticNet, whose penalty depends on the scale of the coefficients.
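To make that concrete, here is a minimal sketch on synthetic data (not the Boston set). It uses StandardScaler as a stand-in for column-wise rescaling, and the variable names and the Ridge alpha=100.0 are just illustrative choices. It checks that plain least squares gives the same predictions on raw and scaled features, that dividing the scaled coefficients by the scaler's scale_ recovers the raw ones, and that Ridge does not share this invariance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features with sigma=10, y = 3 + 2*x1 + 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(0, 10, size=(100, 2))
y = 3 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

# Column-wise rescaling: each column becomes (x - mean) / std
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Ordinary least squares on raw and on scaled features
ols_raw = LinearRegression().fit(X, y)
ols_scaled = LinearRegression().fit(X_scaled, y)

# Predictions agree; coefficients differ only by the per-column scale
same_predictions = np.allclose(ols_raw.predict(X), ols_scaled.predict(X_scaled))
coefs_back_on_raw_scale = ols_scaled.coef_ / scaler.scale_
same_coefs_after_rescaling = np.allclose(coefs_back_on_raw_scale, ols_raw.coef_)

# Ridge penalizes coefficient size, so the feature scale changes the fit
ridge_raw = Ridge(alpha=100.0).fit(X, y)
ridge_scaled = Ridge(alpha=100.0).fit(X_scaled, y)
ridge_same_predictions = np.allclose(
    ridge_raw.predict(X), ridge_scaled.predict(X_scaled)
)

print(same_predictions, same_coefs_after_rescaling, ridge_same_predictions)
```

Dividing by scale_ is the same kind of back-transformation that lets a model fitted on rescaled columns report coefficients in the units of the original features.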

So, why aren't the coefficients different? Well, normalize=True also takes into account that what the user normally wants is the coefficients on the original features, not the normalised features. As such, it adjusts the coefficients. One way to check that this makes sense is to use a simpler example:

# fix the seed so the check below is reproducible
np.random.seed(0)

# two features, normally distributed with sigma=10
x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)

# y is related to each of them plus some noise
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)
X = np.array([x1, x2]).T  # X has two columns

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)

# check that the coefficients are the same and approximately equal to [2, 1]
np.testing.assert_allclose(reg1.coef_, reg2.coef_)
# rtol=0.05 leaves room for sampling noise in the estimates (0.01 is too tight here)
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.05)

This confirms that both models recover the true relationship between [x1, x2] and y, namely coefficients of 2 and 1 respectively. For more details on this, study the Linear Regression blog.
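On the Normalizer part of the question: Normalizer rescales each row (sample) to unit norm, whereas StandardScaler and the column-wise normalization discussed above rescale each column (feature). Row-wise rescaling is not an affine transformation of the features, so the regression sees genuinely different inputs. A rough sketch on the same kind of synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer, StandardScaler

# Synthetic data: two features with sigma=10, y = 3 + 2*x1 + 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(0, 10, size=(100, 2))
y = 3 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

normal_X = Normalizer().fit_transform(X)      # each ROW scaled to unit L2 norm
scaled_X = StandardScaler().fit_transform(X)  # each COLUMN to mean 0, std 1

# Normalizer: every sample has length 1
row_norms = np.linalg.norm(normal_X, axis=1)

# StandardScaler: every feature has mean 0 and standard deviation 1
col_means = scaled_X.mean(axis=0)
col_stds = scaled_X.std(axis=0)

# Fitting on row-normalized data gives a different model, not a re-expressed one
pred_raw = LinearRegression().fit(X, y).predict(X)
pred_normal = LinearRegression().fit(normal_X, y).predict(normal_X)
predictions_match = np.allclose(pred_raw, pred_normal)

print(row_norms[:3], col_means, col_stds, predictions_match)
```

Because Normalizer discards each sample's magnitude, the coefficients fitted on normal_X are not comparable to those from the raw or standardized features.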