0 votes
1 view
in Machine Learning by (19k points)

Given some randomly generated data with

2 columns,

50 rows and

integer range between 0-100

With R, the Poisson glm and diagnostics plot can be achieved as such:

> col=2

> row=50

> range=0:100

> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))

> model <- glm(X2 ~ X1, data = df, family = poisson)

> glm.diag.plots(model)

In Python, this would give me the line predictor vs residual plot:

import numpy as np

import pandas as pd

import statsmodels.formula.api

from statsmodels.genmod.families import Poisson

import seaborn as sns

import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))

df.rename(columns={0:'X1', 1:'X2'}, inplace=True)

glm = statsmodels.formula.api.gee

model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())

results = model.fit()

And to plot the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)

model_residuals = results.resid # model residuals

model_abs_resid = np.abs(model_residuals)  # absolute residuals


 

plot_lm_1 = plt.figure(1)

plot_lm_1.set_figheight(8)

plot_lm_1.set_figwidth(12)

plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_xlabel('Line Predictor')

plot_lm_1.axes[0].set_ylabel('Residuals')

plt.show()

But when I try to get the cook statistics,

# cook's distance, from statsmodels internals

model_cooks = results.get_influence().cooks_distance[0]

it threw an error saying:

AttributeError                            Traceback (most recent call last)

<ipython-input-66-0f2bedfa1741> in <module>()

      4 model_residuals = results.resid

      5 # normalized residuals

----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal

      7 # absolute squared normalized residuals

      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)

     33             pass

     34 

---> 35         obj = getattr(results, attr)

     36         data = results.model.data

     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to plot out all 4 diagnostic plots in Python like in R?

How do I retrieve the cook statistics of the fitted model results in Python using statsmodels?

1 Answer

0 votes
by (33.2k points)

A generalized estimating equations API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use the following code:

import pandas as pd

import statsmodels.api as sm

# Read data generated in R using pandas or something similar

df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X

X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)

y = df.X2

# Fit GLM in statsmodels using Poisson link function

sm.GLM(y, X, family = Poisson()).fit().summary()

Below is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.

from __future__ import division, print_function

import numpy as np

import pandas as pd

import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


 

def _weight_matrix(fitted_model):

    return np.diag(fitted_model.fittedvalues)


 

def _hessian(X, W):

   

    return -np.dot(X.T, np.dot(W, X))


 

def _hat_matrix(X, W):

    # W^(1/2)

    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)

    XtWX     = -_hessian(X = X, W = W)

    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X

    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)

    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


 

def main():

    # Load data and separate into X and y

    df = pd.read_csv(PATH)

    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )

    y  = df.X2

    # Fit model

    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix

    W = _weight_matrix(model)

    # Hat matrix

    H   = _hat_matrix(X, W)

    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals

    r = model.resid_pearson

    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))

    # Note: dispersion is 1 since we aren't modeling overdispersion

    cooks_d = (r/(1 - hii))**2 * hii/(1*2)

Hope this answer helps you! To know more about Poisson's Regression, go through Data Science Course.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...