I got good use out of pandas' MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was gutted completely with pandas 0.20.

The question of how to run rolling OLS regression in an efficient manner has been asked several times (here, for instance), but phrased a little broadly and left without a great answer, in my view.

Here are my questions:

How can I best mimic the basic framework of pandas' MovingOLS? The most attractive feature of this class was the ability to view multiple methods/attributes as separate time series (coefficients, r-squared, t-statistics, etc.) without needing to re-run the regression. For example, you could create something like model = pd.MovingOLS(y, x) and then call .t_stat, .rmse, .std_err, and the like. In the example below, by contrast, I don't see a way around computing each statistic separately. Is there a method that doesn't involve creating sliding/rolling "blocks" (strides) and running regressions or using linear algebra to get model parameters for each?

More broadly, what's going on under the hood in pandas that prevents rolling.apply from taking more complex functions?* When you create a .rolling object, in layman's terms, what's going on internally--is it fundamentally different from looping over each window and creating a higher-dimensional array as I'm doing below?

*Namely, func passed to .apply:

Must produce a single value from an ndarray input. *args and **kwargs are passed to the function.
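To make the single-value restriction concrete, here is a small sketch on made-up numbers (the series `s` and the statistic names are illustrative, not from the question). It shows the kind of function rolling.apply accepts, and one possible way to collect several statistics per window by iterating over the windows explicitly:

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative.
s = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])
window = 3

# rolling.apply expects a function that maps one window to one scalar.
rolling_mean = s.rolling(window).apply(np.mean, raw=True)

# For several statistics per window, one option is to walk over the
# windows yourself and assemble the results afterwards.
rows = {}
for end in range(window, len(s) + 1):
    block = s.iloc[end - window:end]
    rows[s.index[end - 1]] = {
        "mean": block.mean(),
        "range": block.max() - block.min(),
    }
stats = pd.DataFrame.from_dict(rows, orient="index")
```

The explicit loop is slower than a built-in rolling method, but it places no limit on what each window returns.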

Here's where I'm currently at with some sample data, regressing percentage changes in the trade weighted dollar on interest rate spreads and the price of copper. (This doesn't make a ton of sense; just picked these randomly.) I've taken it out of a class-based implementation and tried to strip it down to a simpler script.

from datetime import date

import numpy as np
import pandas as pd
from pandas_datareader.data import DataReader
import statsmodels.formula.api as smf

syms = {'TWEXBMTH' : 'usd',
        'T10Y2YM' : 'term_spread',
        'PCOPPUSDM' : 'copper'
       }

start = date(2000, 1, 1)
data = (DataReader(list(syms.keys()), 'fred', start)
        .pct_change()
        .dropna())
data = data.rename(columns = syms)
data = data.assign(intercept = 1.) # required by statsmodels OLS

def sliding_windows(x, window):
    """Create rolling/sliding windows of length ~window~.

    Given an array of shape (y, z), it will return "blocks" of shape
    (y - window + 1, window, z)."""
    return np.array([x[i:i + window] for i
                    in range(0, x.shape[0] - window + 1)])

data.head(3)
Out[33]: 
                 usd  term_spread    copper  intercept
DATE                                                  
2000-02-01  0.012573    -1.409091 -0.019972        1.0
2000-03-01 -0.000079     2.000000 -0.037202        1.0
2000-04-01  0.005642     0.518519 -0.033275        1.0

window = 36
wins = sliding_windows(data.values, window=window)
y, x = wins[:, :, 0], wins[:, :, 1:]

coefs = []
for endog, exog in zip(y, x):
    model = smf.OLS(endog, exog).fit()
    # The full set of model attributes gets lost with each loop
    coefs.append(model.params)

df = pd.DataFrame(coefs, columns=data.iloc[:, 1:].columns,
                  index=data.index[window - 1:])

df.head(3) # rolling 36m coefficients
Out[70]: 
            term_spread    copper  intercept
DATE                                        
2003-01-01    -0.000122 -0.018426   0.001937
2003-02-01     0.000391 -0.015740   0.001597
2003-03-01     0.000655 -0.016811   0.001546
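As an aside on the sliding_windows helper above: it copies every window into a new array. Assuming NumPy 1.20 or later, numpy.lib.stride_tricks.sliding_window_view can produce the same blocks as a read-only view on the original data. The sketch below uses a hypothetical name, sliding_windows_view, and a toy array rather than the FRED data:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_windows_view(x, window):
    """Strided (no-copy) counterpart of sliding_windows.

    Maps an array of shape (y, z) to blocks of shape
    (y - window + 1, window, z).
    """
    # sliding_window_view appends the window axis last, giving
    # (y - window + 1, z, window), so swap the last two axes.
    return sliding_window_view(x, window, axis=0).swapaxes(1, 2)

# Toy input: 6 observations of 2 variables.
arr = np.arange(12.0).reshape(6, 2)
blocks = sliding_windows_view(arr, window=3)

# Same blocks as the list-comprehension approach, built by copying.
looped = np.array([arr[i:i + 3] for i in range(arr.shape[0] - 3 + 1)])
```

Because the result is a view, it is read-only by default. Call .copy() on it if the blocks need to be modified.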

1 Answer


I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

OLS : static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.

RollingOLS : rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimension NumPy arrays.

PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
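To illustrate the general idea of matrix algebra plus broadcasting, here is a minimal NumPy sketch that fits every window in one batched solve and keeps several statistics at once. It is not the module's actual code: the function name rolling_ols, the returned keys, and the synthetic data are all assumptions made for illustration, and it assumes NumPy 1.20+ for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_ols(y, x, window):
    """OLS with an intercept on every rolling window (hypothetical helper).

    y has shape (n,), x has shape (n, k). Every returned array has
    n - window + 1 rows, one per window.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    X = np.column_stack([np.ones(n), x])                        # (n, k+1)
    Xw = sliding_window_view(X, window, axis=0).swapaxes(1, 2)  # (m, window, k+1)
    yw = sliding_window_view(y, window)                         # (m, window)

    # Normal equations for all windows at once: (X'X) b = X'y.
    XtX = Xw.swapaxes(1, 2) @ Xw                                # (m, k+1, k+1)
    Xty = Xw.swapaxes(1, 2) @ yw[..., None]                     # (m, k+1, 1)
    coef = np.linalg.solve(XtX, Xty)[..., 0]                    # (m, k+1)

    resid = yw - (Xw @ coef[..., None])[..., 0]
    ss_res = (resid ** 2).sum(axis=1)
    ss_tot = ((yw - yw.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    sigma2 = ss_res / (window - (k + 1))                        # residual variance
    var_diag = np.diagonal(np.linalg.inv(XtX), axis1=1, axis2=2)
    std_err = np.sqrt(sigma2[:, None] * var_diag)               # (m, k+1)
    return {
        "alpha": coef[:, 0],
        "beta": coef[:, 1:],
        "rsq": 1.0 - ss_res / ss_tot,
        "std_err": std_err,
        "t_stat": coef / std_err,
    }

# Synthetic data, purely for demonstration.
rng = np.random.default_rng(0)
n, window = 50, 12
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([2.0, -3.0]) + rng.normal(scale=0.5, size=n)
res = rolling_ols(y, x, window)
```

Solving the normal equations directly is compact but can be numerically fragile when regressors are nearly collinear. A QR or least-squares solver per window is the more robust choice in that case.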

An example:

import urllib.parse

import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd",
    "T10Y2YM" : "term_spread",
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)

print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64
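
The question also asks whether window "blocks" can be avoided altogether. For the special case of a single regressor this appears possible with pandas' built-in rolling moments, since the slope is the rolling covariance divided by the rolling variance. The sketch below uses synthetic data and does not extend to several regressors:

```python
import numpy as np
import pandas as pd

# Synthetic single-regressor data, purely illustrative.
rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=60))
y = 0.5 + 2.0 * x + pd.Series(rng.normal(scale=0.1, size=60))
window = 12

# Slope = cov(x, y) / var(x); both use the same default ddof,
# so the normalization cancels.
beta = x.rolling(window).cov(y) / x.rolling(window).var()

# Intercept = mean(y) - slope * mean(x), window by window.
alpha = y.rolling(window).mean() - beta * x.rolling(window).mean()
```

Each of these is an ordinary pandas Series aligned to the original index, with NaN for the first window - 1 observations.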
