
Forecasting: Principles and Practice, 3rd ed. (Ch. 7: Time series regression models)

https://otexts.com/fpp3/

Chapter 7 Time series regression models

In this chapter we discuss regression models. The basic concept is that we forecast the time series of interest \(y\) assuming that it has a linear relationship with other time series \(x\).

The forecast variable \(y\) is sometimes also called the regressand, dependent or explained variable. The predictor variables \(x\) are sometimes also called the regressors, independent or explanatory variables.

7.1 The linear model

Simple linear regression

\[ y_t = \beta_0 + \beta_1 x_t + \varepsilon_t. \]

\(\beta_0\) denotes the intercept; \(\beta_1\) denotes the slope

The coefficients \(\beta_0\) and \(\beta_1\) denote the intercept and the slope of the line respectively. The intercept \(\beta_0\) represents the predicted value of \(y\) when \(x=0\). The slope \(\beta_1\) represents the average predicted change in \(y\) resulting from a one unit increase in \(x\).
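As a quick numerical illustration (a Python sketch with made-up growth rates, not the book's US data), the least squares intercept and slope of a simple regression can be computed in closed form:

```python
import numpy as np

# Hypothetical quarterly growth rates (percentage points); not the book's data.
income = np.array([0.8, 1.2, -0.3, 0.5, 1.0, 0.2, -0.5, 0.9])
consumption = np.array([0.7, 0.9, 0.1, 0.6, 0.8, 0.4, 0.0, 0.8])

# Closed-form least squares estimates for y_t = b0 + b1 * x_t + e_t:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
b1 = np.cov(income, consumption, ddof=1)[0, 1] / np.var(income, ddof=1)
b0 = consumption.mean() - b1 * income.mean()
print(b0, b1)  # a positive slope, as in the consumption example
```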

Example: US consumption expenditure

Quarterly growth in consumption expenditure. X-axis: quarterly income growth; Y-axis: quarterly consumption growth

Interpreting the regression results:

\[ \hat{y}_t=0.54 + 0.27x_t. \]
  • A slope greater than 0 means income and consumption are positively related
  • The slope says that 1% income growth increases consumption growth by 0.27% on average

The fitted line has a positive slope, reflecting the positive relationship between income and consumption. The slope coefficient shows that a one unit increase in \(x\) (a 1 percentage point increase in personal disposable income) results on average in 0.27 units increase in \(y\) (an average increase of 0.27 percentage points in personal consumption expenditure).

  • Plugging \(x_t=1\) into the fitted equation: 1% income growth predicts consumption growth of about 0.82% (intercept plus slope, using unrounded coefficients)

Multiple linear regression

\[ \begin{equation} y_t = \beta_{0} + \beta_{1} x_{1,t} + \beta_{2} x_{2,t} + \cdots + \beta_{k} x_{k,t} + \varepsilon_t, \tag{7.1} \end{equation} \]

The coefficients \(\beta_1,\dots,\beta_k\) measure the effect of each predictor after taking into account the effects of all the other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.

Example: US consumption expenditure

Add 3 more variables (industrial production, unemployment, savings) and observe each one's effect on the consumption growth rate.

Interpreting the scatterplot matrix:

The scatterplots show positive relationships with income and industrial production, and negative relationships with savings and unemployment. The strength of these relationships is shown by the correlation coefficients across the first row. The remaining scatterplots and correlation coefficients show the relationships between the predictors.

Assumptions

First, we assume that the model is a reasonable approximation to reality; that is, the relationship between the forecast variable and the predictor variables satisfies this linear equation.

Second, we make the following assumptions about the errors \((\varepsilon_{1},\dots,\varepsilon_{T})\):

  • they have mean zero; otherwise the forecasts will be systematically biased.
  • they are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.
  • they are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.

It is also useful to have the errors being normally distributed with a constant variance \(\sigma^2\) in order to easily produce prediction intervals.

7.2 Least squares estimation

Efficiently choose the best coefficients by minimising the sum of squared errors.

The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors. That is, we choose the values of \(\beta_0,\beta_1,\dots,\beta_k\) that minimise

\[ \sum_{t=1}^T \varepsilon_t^2 = \sum_{t=1}^T (y_t - \beta_{0} - \beta_{1} x_{1,t} - \beta_{2} x_{2,t} - \cdots - \beta_{k} x_{k,t})^2. \]

This is called least squares estimation because it gives the least value for the sum of squared errors. Finding the best estimates of the coefficients is often called “fitting” the model to the data, or sometimes “learning” or “training” the model.
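A minimal sketch in Python (not the book's R code) of fitting by minimising the sum of squared errors; the data are simulated and the "true" coefficients below are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 200, 4
X = rng.normal(size=(T, k))                             # hypothetical predictors
beta_true = np.array([0.25, 0.74, 0.05, -0.17, -0.05])  # intercept first (made up)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=T)

# Least squares: choose the coefficients that minimise sum_t e_t^2
Xd = np.column_stack([np.ones(T), X])                   # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat.round(2))                                # close to beta_true
```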

Example: US consumption expenditure

fit_consMR <- us_change %>%
  model(tslm = TSLM(Consumption ~ Income + Production +
                                    Unemployment + Savings))
report(fit_consMR)
#> Series: Consumption
#> Model: TSLM
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.9055 -0.1582 -0.0361  0.1362  1.1547
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.25311    0.03447    7.34  5.7e-12 ***
#> Income        0.74058    0.04012   18.46  < 2e-16 ***
#> Production    0.04717    0.02314    2.04    0.043 *
#> Unemployment -0.17469    0.09551   -1.83    0.069 .
#> Savings      -0.05289    0.00292  -18.09  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.31 on 193 degrees of freedom
#> Multiple R-squared: 0.768,   Adjusted R-squared: 0.763
#> F-statistic:  160 on 4 and 193 DF, p-value: <2e-16

Interpreting the regression results:

For forecasting purposes, the final two columns are of limited interest. The “t value” is the ratio of an estimated \(\beta\) coefficient to its standard error and the last column gives the p-value: the probability of the estimated \(\beta\) coefficient being as large as it is if there was no real relationship between consumption and the corresponding predictor.

This is useful when studying the effect of each predictor, but is not particularly useful for forecasting.

Fitted values

\[ \begin{equation} \hat{y}_t = \hat\beta_{0} + \hat\beta_{1} x_{1,t} + \hat\beta_{2} x_{2,t} + \cdots + \hat\beta_{k} x_{k,t}. \tag{7.2} \end{equation} \]

Plugging in the values of \(x_{1,t},\dots,x_{k,t}\) for \(t=1,\dots,T\) returns predictions of \(y_t\) within the training set, referred to as fitted values. Note that these are predictions of the data used to estimate the model, not genuine forecasts of future values of \(y\).

Goodness-of-fit

A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, or \(R^2\).

\[ R^2 = \frac{\sum(\hat{y}_{t}-\bar{y})^2}{\sum(y_{t}-\bar{y})^2}, \]

It represents the proportion of the variation in the dependent variable \(y\) explained by the independent variables \(x\)

where the summations are over all observations. Thus, it reflects the proportion of variation in the forecast variable that is accounted for (or explained) by the regression model.

For a simple linear model, \(R^2\) equals the square of the correlation between \(x\) and \(y\).

\(R^2\) lies between 0 and 1. Closer to 1 means a stronger relationship; closer to 0 a weaker one.

In simple linear regression, the value of \(R^2\) is also equal to the square of the correlation between \(y\) and \(x\) (provided an intercept has been included).
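This equivalence is easy to check numerically; a Python sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 + 0.3 * x + rng.normal(scale=0.2, size=100)

# Simple regression with an intercept
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# R^2 from the definition vs the squared correlation
r2 = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)
corr_sq = np.corrcoef(x, y)[0, 1] ** 2
print(r2, corr_sq)  # the two values agree
```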

Note!!! Chasing \(R^2\) too hard risks over-fitting. Rather than obsessing over \(R^2\) on the training data, validate the model's forecasting performance on the test data

The \(R^2\) value is used frequently, though often incorrectly, in forecasting. The value of \(R^2\) will never decrease when adding an extra predictor to the model and this can lead to over-fitting. There are no set rules for what is a good \(R^2\) value, and typical values of \(R^2\) depend on the type of data used. Validating a model’s forecasting performance on the test data is much better than measuring the \(R^2\) value on the training data.

Standard error of the regression

Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the “residual standard error.” This is shown in the above output with the value 0.31.

\[ \begin{equation} \hat{\sigma}_e=\sqrt{\frac{1}{T-k-1}\sum_{t=1}^{T}{e_t^2}}, \tag{7.3} \end{equation} \]

(Explanation of the formula skipped.)

The standard error will be used when generating prediction intervals, discussed in Section 7.6.
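A sketch of the computation in Python (simulated data; the true error standard deviation is set to 0.3, so \(\hat{\sigma}_e\) should land nearby):

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 198, 4                       # same T and k as the consumption example
X = np.column_stack([np.ones(T), rng.normal(size=(T, k))])
y = X @ np.array([0.25, 0.74, 0.05, -0.17, -0.05]) + rng.normal(scale=0.3, size=T)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat
sigma_hat = np.sqrt(np.sum(e**2) / (T - k - 1))   # divide by T - k - 1, eq. (7.3)
print(round(sigma_hat, 3))                        # close to the true 0.3
```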

7.3 Evaluating the regression model

This section explains how to evaluate a trained model from several angles.

The term residuals refers only to fitting (training), never to test data

\[ e_t = y_t - \hat{y}_t \]

The residuals have some useful properties including the following two (these hold automatically for any least squares fit that includes an intercept, not only a perfect fit):

\[ \sum_{t=1}^{T}{e_t}=0 \quad\text{and}\quad \sum_{t=1}^{T}{x_{k,t}e_t}=0\qquad\text{for all $k$}. \]

After selecting the regression variables and fitting a regression model, it is necessary to plot the residuals to check that the assumptions of the model have been satisfied. There are a series of plots that should be produced in order to check different aspects of the fitted model and the underlying assumptions. We will now discuss each of them in turn.
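Both properties can be verified numerically for any least squares fit with an intercept; in this Python sketch `X.T @ e` computes \(\sum_t x_{k,t}e_t\) for every column of the design matrix at once:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# Residuals sum to zero and are orthogonal to every predictor column
print(e.sum(), X.T @ e)  # all numerically zero
```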

ACF plot of residuals

With time series data, observations from the same past period are often similar, so after fitting, the residuals often show autocorrelation.

With time series data, it is highly likely that the value of a variable observed in the current time period will be similar to its value in the previous period, or even the period before that, and so on. Therefore when fitting a regression model to time series data, it is common to find autocorrelation in the residuals.

(I don't follow the purpose of this passage)

The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong,” but they will usually have larger prediction intervals than they need to. Therefore we should always look at an ACF plot of the residuals.

Histogram of residuals

Examine the distribution of the residuals: the point is not the forecasts themselves but the prediction intervals.

It is always a good idea to check whether the residuals are normally distributed. As we explained earlier, this is not essential for forecasting, but it does make the calculation of prediction intervals much easier.

The histogram shows the skewness of the residuals, which can also affect the coverage probability of the prediction intervals.

The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.

Residual plots against predictors

The author never pins down what counts as a "systematic pattern"...

We would expect the residuals to be randomly scattered without showing any systematic patterns. A simple and quick way to check this is to examine scatterplots of the residuals against each of the predictor variables. If these scatterplots show a pattern, then the relationship may be nonlinear and the model will need to be modified accordingly. See Section 7.7 for a discussion of nonlinear regression.

It is also necessary to plot the residuals against any predictors that are not in the model. If any of these show a pattern, then the corresponding predictor may need to be added to the model (possibly in a nonlinear form).

Residual plots against fitted values

heteroscedasticity: non-constant error variance

homoscedastic -> the residual property we want to see: the variability of the residuals is constant

A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be heteroscedasticity in the errors which means that the variance of the residuals may not be constant. If this problem occurs, a transformation of the forecast variable such as a logarithm or square root may be required (see Section 3.1). ... The random scatter suggests the errors are homoscedastic.

Outliers and influential observations

Observations that strongly influence the coefficients are called influential observations; they are usually also outliers.

Observations that take extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimated coefficients of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the \(x\) direction.

Spurious regression 假性迴歸

The author's example compares annual air passenger traffic in Australia with annual rice production in Guinea.

Regressing non-stationary time series can lead to spurious regressions.

Cases of spurious regression might appear to give reasonable short-term forecasts, but they will generally not continue to work into the future.

High \(R^2\) and high residual autocorrelation can be signs of spurious regression.

7.4 Some useful predictors

Trend

For a trending time series, use time \(t\) itself as one of the predictors.

It is common for time series data to be trending. A linear trend can be modelled by simply using \(x_{1,t}=t\) as a predictor,

\[ y_{t}= \beta_0+\beta_1t+\varepsilon_t, \]

where \(t=1,\dots,T\).
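A Python sketch of using \(x_{1,t}=t\) as the predictor (the series and its true slope are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 40
t = np.arange(1, T + 1)                  # the trend predictor x_{1,t} = t
y = 10 + 0.5 * t + rng.normal(size=T)    # hypothetical trending series

X = np.column_stack([np.ones(T), t])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat.round(2))                 # estimates near the true (10, 0.5)
```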

Dummy variable

Uses:

  1. Encoding categorical variables as 0/1
  2. Handling special events (outliers), e.g. forecasting tourist numbers in Brazil while accounting for the effect of the 2016 Rio Olympics

A dummy variable is also known as an “indicator variable.” A dummy variable can also be used to account for an outlier in the data.

Seasonal dummy variables (not yet fully understood)

Suppose we want to use Monday through Sunday as a predictor: only 6 dummy variables are needed.

Sunday is represented by all the dummy variables being 0.

Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case Sunday) is captured by the intercept, and is specified when the dummy variables are all set to zero.

dummy variable trap

Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the “dummy variable trap”, because it will cause the regression to fail. There will be one too many parameters to estimate when an intercept is also included. The general rule is to use one fewer dummy variables than categories. So for quarterly data, use three dummy variables; for monthly data, use 11 dummy variables; and for daily data, use six dummy variables, and so on.
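A Python sketch of the rule and the trap, using a hypothetical day-of-week index (0 = Monday, ..., 6 = Sunday):

```python
import numpy as np

T = 21
dow = np.arange(T) % 7   # hypothetical day-of-week for T daily observations

# Six dummies for seven categories; Sunday (6) is the reference category,
# captured by the intercept when all six dummies are zero.
six = np.column_stack([(dow == d).astype(float) for d in range(6)])
X_ok = np.column_stack([np.ones(T), six])

# Adding a seventh dummy makes the dummy columns sum to the intercept
# column: perfect collinearity, i.e. the dummy variable trap.
seven = np.column_stack([(dow == d).astype(float) for d in range(7)])
X_trap = np.column_stack([np.ones(T), seven])

# X_trap has 8 columns but only rank 7: one parameter too many to estimate
print(np.linalg.matrix_rank(X_ok), np.linalg.matrix_rank(X_trap))
```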

Example: Australian quarterly beer production

Another way to present forecast performance. X-axis: actual values; Y-axis: fitted values

Trading days

The number of trading days in a month can vary considerably and can have a substantial effect on sales data.

To allow for this, the number of trading days in each month can be included as a predictor.

My takeaway: counts and quantities can also serve as predictor variables

\[ \begin{align*} x_{1} &= \text{number of Mondays in month;} \\ x_{2} &= \text{number of Tuesdays in month;} \\ & \vdots \\ x_{7} &= \text{number of Sundays in month.} \end{align*} \]

Distributed lags

Advertising expenditure as a predictor for forecasting advertising effectiveness.

Because the impact of advertising persists beyond the campaign period, lags must be considered.

In the formula below, each \(x\) represents advertising at a different point in the past.

\[ \begin{align*} x_{1} &= \text{advertising for previous month;} \\ x_{2} &= \text{advertising for two months previously;} \\ & \vdots \\ x_{m} &= \text{advertising for $m$ months previously.} \end{align*} \]
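A sketch of building the lag columns in Python (the advertising figures are made up; each row of `lags` holds \(x_1,\dots,x_m\) for one month):

```python
import numpy as np

ad = np.array([5.0, 6.0, 4.0, 7.0, 8.0, 6.5, 5.5, 7.5])  # hypothetical monthly spend
m = 3                                                     # number of lags to use
T = len(ad)

# Row for month t (t = m, ..., T-1) holds ad[t-1], ad[t-2], ..., ad[t-m]
lags = np.column_stack([ad[m - j: T - j] for j in range(1, m + 1)])
print(lags.shape)  # the first m months are lost to the lags
```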

Easter

Easter differs from most holidays because it is not held on the same date each year, and its effect can last for several days. In this case, the dummy is 1 when the holiday falls in the given period and 0 otherwise.

With monthly data, if Easter falls in March the dummy is 1 in March; if it falls in April the dummy is 1 in April. When Easter starts in March and finishes in April, the dummy is split proportionally between the two months.

Fourier series

Another option for seasonal data: Fourier series

An alternative to using seasonal dummy variables, especially for long seasonal periods, is to use Fourier terms.

A regression model containing Fourier terms is often called a harmonic regression because the successive Fourier terms represent harmonics of the first two Fourier terms.
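A minimal Python sketch of generating the sine and cosine columns; `fourier_terms` is a hypothetical helper written for illustration, not the book's `fourier()` function:

```python
import numpy as np

def fourier_terms(t, m, K):
    """Columns sin(2*pi*k*t/m), cos(2*pi*k*t/m) for harmonics k = 1..K."""
    cols = []
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * t / m))
        cols.append(np.cos(2 * np.pi * k * t / m))
    return np.column_stack(cols)

t = np.arange(1, 25)          # two years of monthly data, hypothetical
X = fourier_terms(t, m=12, K=3)
print(X.shape)                # three sine/cosine pairs -> 6 columns
```

The columns repeat every \(m\) observations, which is what makes them a compact substitute for seasonal dummies when the period is long.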

7.5 Selecting predictors

A common but not recommended way to pick predictors is to look only at the relationship between an individual \(x\) and \(y\).

A common approach that is not recommended is to plot the forecast variable against a particular predictor and if there is no noticeable relationship, drop that predictor from the model. This is invalid because it is not always possible to see the relationship from a scatterplot, especially when the effects of other predictors have not been accounted for.

Another common but invalid approach: run a multiple regression on all predictors and drop any with a p-value above 0.05. Statistical significance does not always indicate predictive value, and correlations among predictors can make p-values misleading.

Another common approach which is also invalid is to do a multiple linear regression on all the predictors and disregard all variables whose p-values are greater than 0.05. To start with, statistical significance does not always indicate predictive value. Even if forecasting is not the goal, this is not a good strategy because the p-values can be misleading when two or more predictors are correlated with each other (see Section 7.8).

Instead, methods based on forecast accuracy should be used.

Five measures of predictive ability are introduced below.

Adjusted \(R^2\)

Drawbacks of the \(R^2\) from Section 7.2:

  • it only measures fitting ability, not forecasting performance.
  • \(R^2\) makes no allowance for "degrees of freedom": more variables always push \(R^2\) higher, which leads to over-fitting.
  • minimising the SSE is equivalent to maximising \(R^2\), so it is unsuitable for judging forecast performance.
\[ \text{SSE} = \sum_{t=1}^T e_{t}^2. \]

The corrected version of \(R^2\): \(\bar{R}^2\)

  • adding a variable no longer necessarily raises the score, though the tendency remains.
  • maximising \(\bar{R}^2\) is equivalent to minimising the standard error \(\hat{\sigma}_e\)
\[ \bar{R}^2 = 1-(1-R^2)\frac{T-1}{T-k-1}, \]

where \(T\) is the number of observations and \(k\) is the number of predictors.
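The formula can be checked against the output in Section 7.2, where \(R^2 = 0.768\) with \(T = 198\) observations and \(k = 4\) predictors:

```python
def adjusted_r2(r2, T, k):
    """R-bar^2 = 1 - (1 - R^2) * (T - 1) / (T - k - 1)."""
    return 1 - (1 - r2) * (T - 1) / (T - k - 1)

# Values from the US consumption regression output in Section 7.2
print(round(adjusted_r2(0.768, 198, 4), 3))  # 0.763, matching the report() output
```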

Cross-validation

Akaike’s Information Criterion

For large values of \(T\), minimising the AIC is equivalent to minimising the CV value.

Corrected Akaike’s Information Criterion

For small values of \(T\), the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed,

Schwarz’s Bayesian Information Criterion

Which measure should we use?

Consequently, we recommend that one of the AICc, AIC, or CV statistics be used, each of which has forecasting as their objective. If the value of \(T\) is large enough, they will all lead to the same model. In most of the examples in this book, we use the AICc value to select the forecasting model.

Example: US consumption

Stepwise regression (choosing predictors efficiently)

With 40 predictors there are \(2^{40}\) possible model combinations to fit!

If there are a large number of predictors, it is not possible to fit all possible models. For example, 40 predictors leads to \(2^{40}\) (more than 1 trillion) possible models!

How to choose the necessary predictors efficiently:

  • backwards stepwise regression: remove predictors one at a time
  • forward stepwise regression: add predictors one at a time
  • hybrid: a combination of the two approaches

The author gives no concrete implementation, only the reference James et al. (2014)

Beware of inference after selecting predictors

If you care about the statistical significance of the predictors, beware that any procedure involving predictor selection invalidates the assumptions behind the p-values.

The methods in this section only apply to forecasting with the model; they do not help if you want to study the effect of some \(x\) on \(y\).

If you do wish to look at the statistical significance of the predictors, beware that any procedure involving selecting predictors first will invalidate the assumptions behind the p-values. The procedures we recommend for selecting predictors are helpful when the model is used for forecasting; they are not helpful if you wish to study the effect of any predictor on the forecast variable.

7.6 Forecasting with regression (predicting future values of \(y\))

Ex-ante versus ex-post forecasts

pass (couldn't work out the purpose of this distinction)

Example: Australian quarterly beer production

pass

Scenario based forecasting

Feed human-specified scenarios (\(x\)) into the model and let it answer with the outcome (\(\hat{y}\))

In this setting, the forecaster assumes possible scenarios for the predictor variables that are of interest.

For example, policy makers may be interested in comparing 1% income growth plus 0.5% savings growth against a 1% income decline plus a 0.5% savings decline.

For example, a US policy maker may be interested in comparing the predicted change in consumption when there is a constant growth of 1% and 0.5% respectively for income and savings with no change in the unemployment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample.

Building a predictive regression model

Not interested in the rest of this walkthrough, pass

7.7 Nonlinear regression

In this section the author mainly uses a piecewise linear function to demonstrate nonlinear behaviour.

\[\begin{align*} x_{1,t} & = t \\ x_{2,t} &= (t-\tau)_+ = \left\{ \begin{array}{ll} 0 & \text{if } t < \tau\\ t-\tau & \text{if } t \ge \tau \end{array}\right. \end{align*}\]
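A Python sketch of the two predictor columns (the knot \(\tau\) is arbitrary here):

```python
import numpy as np

T, tau = 10, 6
t = np.arange(1, T + 1)

x1 = t.astype(float)                       # x_{1,t} = t
x2 = np.maximum(t - tau, 0).astype(float)  # x_{2,t} = (t - tau)_+, the "bend"
print(x2)  # zero up to the knot, then rising with slope 1
```

Regressing \(y\) on both columns lets the fitted slope change from \(\beta_1\) to \(\beta_1+\beta_2\) at \(t=\tau\).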

Example: Boston marathon winning times

Note that the knots are chosen subjectively, so choosing knot positions after inspecting the historical data can lead to over-fitting.

We should warn here that subjective identification of knots can lead to over-fitting, which can be detrimental to the forecast performance of a model, and should be performed with caution.

7.8 Correlation, causation and forecasting

Correlation is not causation

A variable \(x\) can be very useful for forecasting \(y\), but that does not mean \(x\) causes \(y\); it may be that \(y\) causes \(x\), or that the relationship between the variables is more complicated.

It is important not to confuse correlation with causation, or causation with forecasting. A variable \(x\) may be useful for forecasting a variable y, but that does not mean \(x\) is causing \(y\). It is possible that \(x\) is causing \(y\), but it may be that \(y\) is causing \(x\), or that the relationship between them is more complicated than simple causality.

Example: model the number of drownings at a beach resort using ice-cream sales.

Such a model can give reasonable forecasts, not because ice-cream causes drowning, but because people eat more ice-cream on hot days, when they are also more likely to go swimming. The two variables (ice-cream sales and drownings) are correlated, but neither causes the other; both are caused by a third variable (temperature). This is an example of confounding: an omitted variable drives changes in both the response variable and at least one predictor.

An omitted variable is one left out of the model. Without it we can hardly establish the causal relationships among the variables, but that does not necessarily make forecasting harder

We describe a variable that is not included in our forecasting model as a confounder when it influences both the response variable and at least one predictor variable. Confounding makes it difficult to determine what variables are causing changes in other variables, but it does not necessarily make forecasting more difficult.

Another example: predicting afternoon rainfall from the number of cyclists on the road in the morning. When there are fewer cyclists than usual, rain later in the day is more likely. The model can give reasonable forecasts, not because cyclists prevent rain, but because the weather forecast affects whether people cycle. Here there is a causal relationship, but in the opposite direction to the forecasting model: the number of cyclists drops because rain is forecast. That is, \(y\) (rainfall) affects \(x\) (cyclists).

The point is that correlations are very useful for forecasting, even when there is no causality, when the causality between \(x\) and \(y\) runs the other way, or when there is confounding.

It is important to understand that correlations are useful for forecasting, even when there is no causal relationship between the two variables, or when the causality runs in the opposite direction to the model, or when there is confounding.

If the causal mechanism can be determined, an even better model is of course possible.

However, often a better model is possible if a causal mechanism can be determined. A better model for drownings will probably include temperatures and visitor numbers and exclude ice-cream sales. A good forecasting model for rainfall will not include cyclists, but it will include atmospheric observations from the previous few days.

Forecasting with correlated predictors

Correlated predictors by themselves do not hurt forecasting.

They do cause problems in scenario forecasting, and when analysing the contribution of any single predictor.

Having correlated predictors is not really a problem for forecasting, as we can still compute forecasts without needing to separate out the effects of the predictors. However, it becomes a problem with scenario forecasting as the scenarios should take account of the relationships between predictors. It is also a problem if some historical analysis of the contributions of various predictors is required.

Multicollinearity and forecasting

Multicollinearity can occur when two predictors are highly correlated with each other (that is, they have a correlation coefficient close to +1 or -1). It can also occur when a linear combination of predictors is highly correlated with another linear combination of predictors.

If you don't care about the contribution of each predictor, and future predictor values will stay within their historical ranges, there is no need to worry about multicollinearity; only (near-)perfect correlation is a concern.

Note that if you are using good statistical software, if you are not interested in the specific contributions of each predictor, and if the future values of your predictor variables are within their historical ranges, there is nothing to worry about — multicollinearity is not a problem except when there is perfect correlation.

7.9 Matrix formulation (pass)

pass