Day 4: Cointegration

time series
code
analysis
Author

Robert W. Walker

Published

July 24, 2025

Slides

A replication

This is an example with an ECM from a model presented by DeBoef and Keele in their excellent paper “Taking Time Seriously: Dynamic Regression.” The data come from a paper Gilmour and Wolbrecht.

use "https://github.com/robertwwalker/Essex-Data/raw/main/dgw.dta", clear
* These data have already been time set:
tsset
* The dependent variable in this example is Congressional Approval
*Table 2
reg capp l.capp econexp nytavg kg hb vetoes override intrasum mbill 
bgodfrey, lag(1 2 3)
library(haven); library(tidyverse); library(fpp3)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr

── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

✔ tsibble     1.1.6     ✔ feasts      0.4.1
✔ tsibbledata 0.4.1     ✔ fable       0.4.1

── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date()    masks base::date()
✖ dplyr::filter()      masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ tsibble::setdiff()   masks base::setdiff()
✖ tsibble::union()     masks base::union()
dgw.data <- read_stata("https://github.com/robertwwalker/Essex-Data/raw/main/dgw.dta")
dgw.data <- dgw.data %>% filter(!is.na(year))
dgw.data <- dgw.data %>% mutate(date = paste0(year," Q",quarter, sep="")) %>% mutate(dateQ = yearquarter(date))
dgw.ts <- dgw.data %>% as_tsibble(index=dateQ)
# A dynamic linear model
dgw.ts %>% model(TSLM(capp~lag(capp,1)+econexp+nytavg+kg+hb+vetoes+override+intrasum+mbills)) %>% report()
Series: capp 
Model: TSLM 

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1137 -1.3908 -0.1014  1.5446  7.0873 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.37793    3.33043   2.816  0.00634 ** 
lag(capp, 1)  0.79605    0.05166  15.409  < 2e-16 ***
econexp       0.07173    0.03090   2.321  0.02323 *  
nytavg        0.20716    0.06982   2.967  0.00413 ** 
kg           -1.29097    1.08646  -1.188  0.23881    
hb           -4.68564    1.69975  -2.757  0.00746 ** 
vetoes        0.24382    0.08898   2.740  0.00781 ** 
override     -0.99198    0.55252  -1.795  0.07697 .  
intrasum     -0.16907    0.12199  -1.386  0.17021    
mbills       -0.43939    0.28270  -1.554  0.12469    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.976 on 69 degrees of freedom
Multiple R-squared: 0.885,  Adjusted R-squared: 0.8699
F-statistic: 58.97 on 9 and 69 DF, p-value: < 2.22e-16
*Table 2 with Pres. Approval
reg capp l.capp p_prap econexp nytavg kg hb vetoes override intrasum mbill 
bgodfrey, lag(1 2 3)
dgw.ts %>% model(TSLM(capp~lag(capp,1)+p_prap+econexp+nytavg+kg+hb+vetoes+override+intrasum+mbills)) %>% report()
Series: capp 
Model: TSLM 

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4892 -1.3161 -0.1978  1.7633  6.3624 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.13865    3.38461   2.996  0.00382 ** 
lag(capp, 1)  0.77084    0.05585  13.803  < 2e-16 ***
p_prap        0.04708    0.04024   1.170  0.24610    
econexp       0.08161    0.03195   2.554  0.01290 *  
nytavg        0.20342    0.06971   2.918  0.00477 ** 
kg           -1.61049    1.11745  -1.441  0.15411    
hb           -4.87343    1.70280  -2.862  0.00559 ** 
vetoes        0.24012    0.08880   2.704  0.00865 ** 
override     -0.96229    0.55163  -1.744  0.08560 .  
intrasum     -0.14809    0.12298  -1.204  0.23267    
mbills       -0.46652    0.28290  -1.649  0.10375    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.968 on 68 degrees of freedom
Multiple R-squared: 0.8872, Adjusted R-squared: 0.8706
F-statistic:  53.5 on 10 and 68 DF, p-value: < 2.22e-16

The Breusch-Godfrey test

sapply(c(1:3), function(x) {lmtest::bgtest(lm(capp~lag(capp,1)+p_prap+econexp+nytavg+kg+hb+vetoes+override+intrasum+mbills, data=dgw.ts), order = x)})
             [,1]                                                          
statistic    0.5038368                                                     
parameter    1                                                             
method       "Breusch-Godfrey test for serial correlation of order up to 1"
p.value      0.4778191                                                     
data.name    character,2                                                   
coefficients numeric,12                                                    
vcov         numeric,144                                                   
             [,2]                                                          
statistic    1.043041                                                      
parameter    2                                                             
method       "Breusch-Godfrey test for serial correlation of order up to 2"
p.value      0.5936172                                                     
data.name    character,2                                                   
coefficients numeric,13                                                    
vcov         numeric,169                                                   
             [,3]                                                          
statistic    1.531487                                                      
parameter    3                                                             
method       "Breusch-Godfrey test for serial correlation of order up to 3"
p.value      0.6750225                                                     
data.name    character,2                                                   
coefficients numeric,14                                                    
vcov         numeric,196                                                   
* An Alternate Measure
* ADL
reg capp l.capp p_prap l.p_prap econexp l.econexp nytavg l.nytavg kg hb vetoes override intrasum mbill 
fitstat
bgodfrey, lag(1 2 3)
dgw.ts %>% model(TSLM(capp~lag(capp, 1)+p_prap+lag(p_prap, 1)+econexp+lag(econexp,1)+nytavg+lag(nytavg, 1)+kg+hb+vetoes+override+intrasum+mbills)) %>% gg_tsresiduals()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

reg capp l.capp p_prap l.p_prap econexp l.econexp nytavg l.nytavg l2.nytavg kg hb vetoes override intrasum mbill 
fitstat
test l.p_prap 
summary(lm(capp~ dplyr::lag(capp, 1)+p_prap+dplyr::lag(p_prap, 1)+econexp+ dplyr::lag(econexp, 1)+ nytavg+ dplyr::lag(nytavg, 1) + dplyr::lag(nytavg, 2)+ kg+hb+vetoes+override+intrasum+mbills, data=dgw.data))

Call:
lm(formula = capp ~ dplyr::lag(capp, 1) + p_prap + dplyr::lag(p_prap, 
    1) + econexp + dplyr::lag(econexp, 1) + nytavg + dplyr::lag(nytavg, 
    1) + dplyr::lag(nytavg, 2) + kg + hb + vetoes + override + 
    intrasum + mbills, data = dgw.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2813 -1.1587 -0.2843  1.5055  5.5623 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            13.255444   4.027775   3.291  0.00164 ** 
dplyr::lag(capp, 1)     0.726775   0.065844  11.038 2.31e-16 ***
p_prap                  0.138223   0.057373   2.409  0.01892 *  
dplyr::lag(p_prap, 1)  -0.093813   0.055274  -1.697  0.09459 .  
econexp                 0.005848   0.070969   0.082  0.93459    
dplyr::lag(econexp, 1)  0.082163   0.072223   1.138  0.25959    
nytavg                  0.192404   0.071465   2.692  0.00908 ** 
dplyr::lag(nytavg, 1)   0.023006   0.069422   0.331  0.74144    
dplyr::lag(nytavg, 2)   0.138535   0.078110   1.774  0.08097 .  
kg                     -1.566093   1.122181  -1.396  0.16774    
hb                     -4.340598   1.715171  -2.531  0.01389 *  
vetoes                  0.283759   0.101779   2.788  0.00700 ** 
override               -1.061008   0.580564  -1.828  0.07235 .  
intrasum               -0.107917   0.122254  -0.883  0.38074    
mbills                 -0.715715   0.304646  -2.349  0.02196 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.914 on 63 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8992,    Adjusted R-squared:  0.8768 
F-statistic: 40.14 on 14 and 63 DF,  p-value: < 2.2e-16
anova(lm(capp~ dplyr::lag(capp, 1)+p_prap+econexp+ dplyr::lag(econexp, 1)+ nytavg+ dplyr::lag(nytavg, 1) + dplyr::lag(nytavg, 2)+ kg+hb+vetoes+override+intrasum+mbills, data=dgw.data), lm(capp~ dplyr::lag(capp, 1)+p_prap+dplyr::lag(p_prap, 1)+econexp+ dplyr::lag(econexp, 1)+ nytavg+ dplyr::lag(nytavg, 1) + dplyr::lag(nytavg, 2)+ kg+hb+vetoes+override+intrasum+mbills, data=dgw.data)) 
Analysis of Variance Table

Model 1: capp ~ dplyr::lag(capp, 1) + p_prap + econexp + dplyr::lag(econexp, 
    1) + nytavg + dplyr::lag(nytavg, 1) + dplyr::lag(nytavg, 
    2) + kg + hb + vetoes + override + intrasum + mbills
Model 2: capp ~ dplyr::lag(capp, 1) + p_prap + dplyr::lag(p_prap, 1) + 
    econexp + dplyr::lag(econexp, 1) + nytavg + dplyr::lag(nytavg, 
    1) + dplyr::lag(nytavg, 2) + kg + hb + vetoes + override + 
    intrasum + mbills
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     64 559.29                              
2     63 534.83  1    24.454 2.8806 0.09459 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*ECM
reg d.capp l.capp d.p_prap l.p_prap d.econexp l.econexp d.nytavg l.nytavg kg hb vetoes override intrasum mbill 
bgodfrey, lag(1 2 3)
summary(lm(difference(capp)~dplyr::lag(capp, 1) + difference(p_prap) +  dplyr::lag(p_prap, 1)+ difference(econexp)+lag(econexp, 1)+difference(nytavg) + dplyr::lag(nytavg) +  kg +  hb +  vetoes + override +  intrasum +  mbills, data=dgw.data))

Call:
lm(formula = difference(capp) ~ dplyr::lag(capp, 1) + difference(p_prap) + 
    dplyr::lag(p_prap, 1) + difference(econexp) + lag(econexp, 
    1) + difference(nytavg) + dplyr::lag(nytavg) + kg + hb + 
    vetoes + override + intrasum + mbills, data = dgw.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0575 -1.2991 -0.2338  1.6723  6.0297 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            9.92398    3.59485   2.761 0.007490 ** 
dplyr::lag(capp, 1)   -0.22348    0.05979  -3.738 0.000394 ***
difference(p_prap)     0.11480    0.05548   2.069 0.042508 *  
dplyr::lag(p_prap, 1)  0.02166    0.04356   0.497 0.620667    
difference(econexp)    0.02214    0.07027   0.315 0.753732    
lag(econexp, 1)        0.08156    0.03311   2.463 0.016420 *  
difference(nytavg)     0.18074    0.07179   2.518 0.014290 *  
dplyr::lag(nytavg)     0.22229    0.09152   2.429 0.017921 *  
kg                    -1.36022    1.12380  -1.210 0.230519    
hb                    -4.24191    1.72938  -2.453 0.016865 *  
vetoes                 0.20605    0.09261   2.225 0.029567 *  
override              -0.75686    0.55956  -1.353 0.180870    
intrasum              -0.12549    0.12277  -1.022 0.310504    
mbills                -0.53209    0.28906  -1.841 0.070217 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.939 on 65 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.4115,    Adjusted R-squared:  0.2938 
F-statistic: 3.496 on 13 and 65 DF,  p-value: 0.0003862
*Bewley
ivreg capp p_prap d.p_prap econexp d.econexp nytavg d.nytavg kg hb vetoes override intrasum mbill (d.capp = l.capp p_prap l.p_prap econexp l.econexp nytavg l.nytavg)

Cointegration and Error Correction

Studying long-run relationships between two variables \(Y_t\) and \(X_t\).
Question: do Y and X travel together thru time in dynamic equilibrium?

Methodology developed by Engle and Granger 1987

  1. Determine if \(Y_t\) and \(X_t\) are nonstationary

  2. If \(Y_t\) and \(X_t\) are nonstationary, regress \(Y_t\) on \(X_t\) and save residuals. This is called the cointegrating regression

  3. Do unit-root test on residuals - if residuals are stationary, \(Y_t\) and \(X_t\) cointegrate (travel together over time). If residuals are not stationary then \(Y_t\) and \(X_t\) do not cointegrate

Engle and Granger show that if \(Y_t\) and \(X_t\) cointegrate, then their relationship can be specified as an error correction model:

\[ (1-L)Y_t = B_0 + B_1(1-L)X_t + a(\underbrace{Y_{t-1} - c_0 - c_1X_{t-1}}_{\textrm{Cointegration residual, lagged}}) + e_t \]

In this model \(B_1\) captures the short-run effect of X on Y

\((Y_{t-1} -c_0 - c_1X_{t-1})\) is called the error correction mechanism and it captures the long-run effect of X on Y. In the Engle-Granger “two-step” methodology, the error correction mechanism is measured by the residuals from the cointegrating regression lagged back 1 period

The coefficient \(a\) is an adjustment parameter which is expected to be less than 1 in absolute value and to carry a negative sign. The larger the absolute value of a, the stronger the error correction mechanism. a indicates the speed with which a shock to Y is eroded by the cointegrating realtionship between Y and X

The error correction model is estimated by OLS regression.

Note that cointegration does not imply causation, and it is up to the investigator to determine if \(Y_t\) or \(X_t\) is the dependent variable and to defend the implicit assumption of 1-way causation.

ADL and Cointegration

The error correction model is a restricted autoregressive distributed lag (ADL) (1,1) model (see Hendry, 1995):

\[ (1-L)Yt = B_0 + B_1(1-L)X_t + a(Y_{t-1} - C_0 - C_1X_{t-1}) + e_t \] \[ Y_t - Y_{t-1} = B_0 + B_1X_t - B_1X_{t-1} + aY_{t-1} + aC_0 - aC_1X_{t-1} + e_t \] \[ Y_t = B_0 + aC_0 + Y_{t-1} + aY_{t-1} + B_1 X_t - (B_1 + aC_1)X_{t-1} + e_t \]

if Y and X cointegrate then a should be less than 1 in absolute value and carry a negative sign

ADL (1,1) model - 1 lag of Y and 1 lag of X

\[ Y_t = B_0 -aC_0 + (1-a)Y_{t-1} + B_1 X_t - (B_1 -aC_1)X_{t-1} + e_t \]

if a = -.5, this implies that the coefficient on aYt-1 = (1-.5) = .5, so 50% of a shock to Y at time t will erode in each subsequent period.

Reference: David Hendry. 1995. Dynamic Econometrics. Oxford: Oxford University Press