Pandas rolling regression: alternatives to looping

Question

asked Oct 5, 2019 in Data Science by sourav (17.6k points)

I got good use out of pandas' MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was gutted completely with pandas 0.20.

The question of how to run rolling OLS regression in an efficient manner has been asked several times (here, for instance), but phrased a little broadly and left without a great answer, in my view.

Here are my questions:

How can I best mimic the basic framework of pandas' MovingOLS? The most attractive feature of this class was the ability to view multiple methods/attributes as separate time series (i.e. coefficients, r-squared, t-statistics, etc.) without needing to re-run the regression. For example, you could create something like model = pd.MovingOLS(y, x) and then call .t_stat, .rmse, .std_err, and the like. In the example below, by contrast, I don't see a way around computing each statistic separately. Is there a method that doesn't involve creating sliding/rolling "blocks" (strides) and running regressions/using linear algebra to get model parameters for each?

More broadly, what's going on under the hood in pandas that makes rolling.apply not able to take more complex functions?* When you create a .rolling object, in layman's terms, what's going on internally--is it fundamentally different from looping over each window and creating a higher-dimensional array as I'm doing below?

*Namely, func passed to .apply:

Must produce a single value from an ndarray input.

*args and **kwargs are passed to the function.
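To make that constraint concrete, here is a small sketch on synthetic data (the series and the helper name window_slope are illustrative only). With raw=True, .rolling(...).apply hands each window to the function as a 1-D ndarray, and the function has to reduce it to one float. Each column is processed on its own, so the function never sees y and x side by side.

```python
import numpy as np
import pandas as pd

# Synthetic series; the values are arbitrary and only for illustration.
s = pd.Series(np.arange(10, dtype=float))


def window_slope(values):
    """Slope of a straight-line fit of one window against 0..n-1 (a scalar)."""
    t = np.arange(values.size)
    return np.polyfit(t, values, 1)[0]


# raw=True passes each window as a 1-D ndarray rather than a Series.
slopes = s.rolling(window=4).apply(window_slope, raw=True)

# A function that returns several values (slope and intercept) is rejected,
# because each window's result has to fit into a single float slot.
try:
    s.rolling(window=4).apply(
        lambda v: np.polyfit(np.arange(v.size), v, 1), raw=True
    )
    multi_value_ok = True
except (TypeError, ValueError):
    multi_value_ok = False
```

The first window - 1 entries of slopes are NaN because those windows are incomplete.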

Here's where I'm currently at with some sample data, regressing percentage changes in the trade weighted dollar on interest rate spreads and the price of copper. (This doesn't make a ton of sense; just picked these randomly.) I've taken it out of a class-based implementation and tried to strip it down to a simpler script.

from datetime import date

import numpy as np
import pandas as pd
import statsmodels.api as sm
from pandas_datareader.data import DataReader

syms = {'TWEXBMTH': 'usd',
        'T10Y2YM': 'term_spread',
        'PCOPPUSDM': 'copper'}

start = date(2000, 1, 1)
data = (DataReader(syms.keys(), 'fred', start)
        .pct_change()
        .dropna())
data = data.rename(columns=syms)
data = data.assign(intercept=1.)  # required by statsmodels OLS


def sliding_windows(x, window):
    """Create rolling/sliding windows of length ~window~.

    Given an array of shape (y, z), it will return "blocks" of shape
    (y - window + 1, window, z)."""
    return np.array([x[i:i + window] for i
                     in range(0, x.shape[0] - window + 1)])

data.head(3)
Out[33]:
                 usd  term_spread    copper  intercept
DATE
2000-02-01  0.012573    -1.409091 -0.019972        1.0
2000-03-01 -0.000079     2.000000 -0.037202        1.0
2000-04-01  0.005642     0.518519 -0.033275        1.0

window = 36
wins = sliding_windows(data.values, window=window)
y, x = wins[:, :, 0], wins[:, :, 1:]

coefs = []
for endog, exog in zip(y, x):
    # OLS lives in statsmodels.api; the formula module only exposes ols()
    model = sm.OLS(endog, exog).fit()
    # The full set of model attributes gets lost with each loop
    coefs.append(model.params)

df = pd.DataFrame(coefs, columns=data.iloc[:, 1:].columns,
                  index=data.index[window - 1:])

df.head(3)  # rolling 36m coefficients
Out[70]:
            term_spread    copper  intercept
DATE
2003-01-01    -0.000122 -0.018426   0.001937
2003-02-01     0.000391 -0.015740   0.001597
2003-03-01     0.000655 -0.016811   0.001546

1 Answer

Shlok Pandey · Answer 1 · 2019-10-05T10:57:16+0000

I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

OLS : static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.

RollingOLS : rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.

PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
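As a rough illustration of the matrix-algebra approach (a minimal sketch, not the module's actual code), the windows can be stacked into a 3-D array and the normal equations solved for every window in one batched call. The function name rolling_ols_coefs and the synthetic data are assumptions made for this sketch, and sliding_window_view needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_ols_coefs(y, x, window):
    """Return OLS coefficients and R-squared for every rolling window.

    y : (n,) array; x : (n, k) array that already contains an intercept column.
    Output shapes: coefs (n - window + 1, k), rsq (n - window + 1,).
    """
    # sliding_window_view puts the window axis last; move it to the middle.
    xw = sliding_window_view(x, window, axis=0).transpose(0, 2, 1)  # (m, window, k)
    yw = sliding_window_view(y, window)                             # (m, window)

    # Normal equations (X'X) b = X'y, solved for all m windows at once.
    xt = xw.transpose(0, 2, 1)                                      # (m, k, window)
    xtx = xt @ xw                                                   # (m, k, k)
    xty = xt @ yw[..., None]                                        # (m, k, 1)
    coefs = np.linalg.solve(xtx, xty)[..., 0]                       # (m, k)

    # Centered R-squared per window (meaningful because x has an intercept).
    resid = yw - (xw @ coefs[..., None])[..., 0]
    ss_res = (resid ** 2).sum(axis=1)
    ss_tot = ((yw - yw.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return coefs, 1.0 - ss_res / ss_tot


# Synthetic example: two regressors plus a constant, with a little noise.
rng = np.random.default_rng(0)
n = 60
x = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
y = x @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)

coefs, rsq = rolling_ols_coefs(y, x, window=12)
```

Solving the normal equations directly is compact but can be numerically fragile when regressors are nearly collinear; a QR or least-squares solver per window is the more robust choice in that case.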

An example:

import urllib.parse

import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH": "usd",
    "T10Y2YM": "term_spread",
    "GOLDAMGBD228NLBM": "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64
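
For readers who want to see how several statistics can come out of one pass and be labelled by date, here is a minimal sketch that assumes nothing about pyfinance's internals. The function rolling_ols_frames, its return layout, and the synthetic data are all hypothetical; unlike model.beta above, the frames here include the intercept column.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_ols_frames(y, x, window):
    """Rolling OLS of Series y on DataFrame x (an intercept is appended here).

    Returns a dict of DataFrames (beta, std_err, t_stat), each indexed by the
    last date of its window.
    """
    xa = np.column_stack([x.to_numpy(dtype=float), np.ones(len(x))])
    names = list(x.columns) + ["intercept"]

    xw = sliding_window_view(xa, window, axis=0).transpose(0, 2, 1)       # (m, window, k)
    yw = sliding_window_view(y.to_numpy(dtype=float), window)[..., None]  # (m, window, 1)

    xt = xw.transpose(0, 2, 1)
    xtx_inv = np.linalg.inv(xt @ xw)                  # (m, k, k)
    beta = xtx_inv @ (xt @ yw)                        # (m, k, 1)
    resid = yw - xw @ beta                            # (m, window, 1)

    dof = window - xa.shape[1]                        # residual degrees of freedom
    sigma2 = (resid ** 2).sum(axis=(1, 2)) / dof      # (m,)
    # Var(beta_j) = sigma^2 * [(X'X)^-1]_jj for each window.
    std_err = np.sqrt(sigma2[:, None] * np.diagonal(xtx_inv, axis1=1, axis2=2))
    beta = beta[..., 0]

    index = y.index[window - 1:]                      # label by window end date
    arrays = {"beta": beta, "std_err": std_err, "t_stat": beta / std_err}
    return {k: pd.DataFrame(v, index=index, columns=names) for k, v in arrays.items()}


# Synthetic monthly data, for illustration only.
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
rng = np.random.default_rng(1)
x = pd.DataFrame(rng.normal(size=(48, 2)), index=idx, columns=["a", "b"])
noise = pd.Series(rng.normal(scale=0.05, size=48), index=idx)
y = 0.3 * x["a"] - 0.7 * x["b"] + 1.0 + noise

out = rolling_ols_frames(y, x, window=12)
```

Each frame in out starts at the twelfth date because the first complete window ends there.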
