Back

Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Machine Learning by (19k points)

Given some randomly generated data with

2 columns,

50 rows and

integer range between 0-100

With R, the Poisson glm and diagnostics plot can be achieved as such:

> col=2

> row=50

> range=0:100

> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))

> model <- glm(X2 ~ X1, data = df, family = poisson)

> glm.diag.plots(model)

In Python, this would give me the line predictor vs residual plot:

import numpy as np

import pandas as pd

import statsmodels.formula.api

from statsmodels.genmod.families import Poisson

import seaborn as sns

import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))

df.rename(columns={0:'X1', 1:'X2'}, inplace=True)

glm = statsmodels.formula.api.gee

model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())

results = model.fit()

And to plot the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)

model_residuals = results.resid # model residuals

model_abs_resid = np.abs(model_residuals)  # absolute residuals


 

plot_lm_1 = plt.figure(1)

plot_lm_1.set_figheight(8)

plot_lm_1.set_figwidth(12)

plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_xlabel('Line Predictor')

plot_lm_1.axes[0].set_ylabel('Residuals')

plt.show()

But when I try to get the cook statistics,

# cook's distance, from statsmodels internals

model_cooks = results.get_influence().cooks_distance[0]

it threw an error saying:

AttributeError                            Traceback (most recent call last)

<ipython-input-66-0f2bedfa1741> in <module>()

      4 model_residuals = results.resid

      5 # normalized residuals

----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal

      7 # absolute squared normalized residuals

      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)

     33             pass

     34 

---> 35         obj = getattr(results, attr)

     36         data = results.model.data

     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to plot out all 4 diagnostic plots in Python like in R?

How do I retrieve the cook statistics of the fitted model results in Python using statsmodels?

1 Answer

0 votes
by (33.1k points)

A generalized estimating equations API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use the following code:

import pandas as pd

import statsmodels.api as sm

# Read data generated in R using pandas or something similar

df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X

X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)

y = df.X2

# Fit GLM in statsmodels using Poisson link function

sm.GLM(y, X, family = Poisson()).fit().summary()

Below is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.

from __future__ import division, print_function

import numpy as np

import pandas as pd

import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


 

def _weight_matrix(fitted_model):

    return np.diag(fitted_model.fittedvalues)


 

def _hessian(X, W):

   

    return -np.dot(X.T, np.dot(W, X))


 

def _hat_matrix(X, W):

    # W^(1/2)

    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)

    XtWX     = -_hessian(X = X, W = W)

    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X

    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)

    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


 

def main():

    # Load data and separate into X and y

    df = pd.read_csv(PATH)

    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )

    y  = df.X2

    # Fit model

    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix

    W = _weight_matrix(model)

    # Hat matrix

    H   = _hat_matrix(X, W)

    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals

    r = model.resid_pearson

    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))

    # Note: dispersion is 1 since we aren't modeling overdispersion

    cooks_d = (r/(1 - hii))**2 * hii/(1*2)

Hope this answer helps you! To know more about Poisson's Regression, go through Data Science Course.

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

28.4k questions

29.7k answers

500 comments

94k users

Browse Categories

...