
Forecasting: Principles and Practice 3rd (ch1-ch6)

https://otexts.com/fpp3/

1.6 The basic steps in a forecasting task

Step 1: Problem definition.

Understand all the people who will be involved in the forecasting process.

A forecaster needs to spend time talking to everyone who will be involved in collecting data, maintaining databases, and using the forecasts for future planning.

Step 2: Gathering information.

Step 3: Preliminary (exploratory) analysis.

Step 4: Choosing and fitting models.

Step 5: Using and evaluating a forecasting model.

1.7 The statistical forecasting perspective

When we obtain a forecast, we are estimating the middle of the range of possible values the random variable could take. Often, a forecast is accompanied by a prediction interval giving a range of values the random variable could take with relatively high probability.

Rather than plotting individual possible futures as shown in Figure 1.2, we usually show these prediction intervals instead. Figure 1.3 shows 80% and 95% intervals for the future Australian international visitors.

The set of values that this random variable could take, along with their relative probabilities, is known as the “probability distribution” of \(y_{t} |\mathcal{I}\). In forecasting, we call this the forecast distribution.

When we talk about a forecast, we usually mean the average value of the forecast distribution.

When we talk about the “forecast,” we usually mean the average value of the forecast distribution, and we put a “hat” over \(y\) to show this. Thus, we write the forecast of \(y_{t}\) as \(\hat{y}_{t}\), meaning the average of the possible values that \(y_{t}\) could take given everything we know.

2.7 Lag plots

The figure below shows Australian beer production from 2000 Q1 to 2010 Q1.

The same data shown as lag plots: the x-axis is the lag-k production and the y-axis is the original production.

From these plots we can see that at k = 4 and k = 8 (the same quarter in earlier years) production is strongly positively correlated, reflecting the seasonality. In contrast, at k = 2 and k = 6 there is a strong negative correlation, because high Q4 production is matched against low Q2 production.

Here the colours indicate the quarter of the variable on the vertical axis. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data. The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2).

2.8 Autocorrelation

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

\[ r_{k}=\frac{\sum^{T}_{t=k+1}{(y_{t}-\overline{y})(y_{t-k}-\overline{y})}}{\sum^{T}_{t=1}{(y_{t}-\overline{y})^2}} \]
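
As a minimal sketch (not from the book), the coefficient \(r_k\) can be computed directly from this definition, assuming the series is held in a NumPy array:

```python
import numpy as np

def autocorr(y, k):
    """Lag-k autocorrelation r_k as defined above (full-series variance in the denominator)."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    num = np.sum((y[k:] - y_bar) * (y[:-k] - y_bar))   # sum over t = k+1, ..., T
    den = np.sum((y - y_bar) ** 2)                     # sum over t = 1, ..., T
    return num / den

# Hypothetical quarterly series with a strong period-4 pattern
y = np.tile([5.0, 2.0, 3.0, 8.0], 10)
print(autocorr(y, 4))  # close to 1: strong positive correlation at the seasonal lag
print(autocorr(y, 2))  # negative: peaks are paired with troughs
```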

Trend and seasonality in ACF plots

Spotting a trend in the ACF plot: the autocorrelations at small lags are large and positive, and decrease slowly as the lag increases.

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.

The plot above contains both trend and seasonality.

Seasonality: the ACF also shows a scalloped shape.

The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality.

2.9 White Noise

A time series with no autocorrelation is called white noise.

Time series that show no autocorrelation are called white noise.

When 95% of the spikes lie within the blue lines (\(±\frac{2}{\sqrt{T}}\)), the series can be treated as white noise.

For a white noise series, we expect 95% of the spikes in the ACF to lie within \(±\frac{2}{\sqrt{T}}\) where \(T\) is the length of the time series.
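
A rough numerical check of this rule, as a sketch with simulated data (variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=200)            # white noise: i.i.d. values, so T = 200
bound = 2 / np.sqrt(len(y))         # the approximate 95% limits drawn in ACF plots

def autocorr(y, k):
    y_bar = y.mean()
    return np.sum((y[k:] - y_bar) * (y[:-k] - y_bar)) / np.sum((y - y_bar) ** 2)

spikes = [autocorr(y, k) for k in range(1, 11)]
inside = sum(abs(r) <= bound for r in spikes)
print(f"{inside}/10 spikes within ±{bound:.3f}")   # roughly 95% should lie inside for white noise
```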

Chapter 3 Time series decomposition

In Section 2.3 we discussed three types of time series patterns: trend, seasonality and cycles. When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (often just called the trend for simplicity).

Thus we can think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series).

3.1 Transformations and adjustments

Adjustments and transformations simplify the patterns in the historical data, either by removing known sources of variation or by making the pattern more consistent across the whole data set; simpler patterns are usually easier to model and forecast.

The purpose of these adjustments and transformations is to simplify the patterns in the historical data by removing known sources of variation, or by making the pattern more consistent across the whole data set. Simpler patterns are usually easier to model and lead to more accurate forecasts.

Calendar adjustments (related to the calendar)

Some of the variation seen in seasonal data may be due to simple calendar effects.

For example, if you are studying the total monthly sales of a retail store, then besides the seasonal variation across the year, there will also be variation between months simply because the months contain different numbers of trading days. This variation is easily removed by computing average sales per trading day in each month, rather than the monthly total; the calendar variation is then effectively eliminated.

Population adjustments (related to population change)

consider the data per person (or per thousand people, or per million people) rather than the total.

Example: studying the number of hospital beds in a particular region over time.

For data that are affected by population changes, it is best to express them in per-capita terms.

Example: GDP per-capita.

It is possible for the total number of beds to increase, but the number of beds per thousand people to decrease. This occurs when the population is increasing faster than the number of hospital beds. For most data that are affected by population changes, it is best to use per-capita data rather than the totals.

Inflation adjustments (related to the value of money)

Data which are affected by the value of money are best adjusted before modelling.

Example: Consumer Price Index (or CPI).

Mathematical transformations

Logarithms are useful because they are interpretable: changes in a log value are relative (or percentage) changes on the original scale.

Example: Box-Cox transformations
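
A minimal sketch using SciPy's `boxcox`, which chooses \(\lambda\) by maximum likelihood rather than the Guerrero method used in the book; the series below is simulated for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical positive series whose variation grows with its level (multiplicative-looking)
rng = np.random.default_rng(1)
t = np.arange(1, 121)
y = np.exp(0.02 * t) * (1 + 0.1 * rng.standard_normal(t.size))

y_transformed, lam = stats.boxcox(y)   # lambda selected by maximum likelihood
print(f"lambda = {lam:.2f}")           # a value near 0 suggests a log transformation
```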

3.2 Time series components (Seasonality, Trend, Remainder)

\[ y_{t}=S_{t}+T_{t}+R_{t} \]
\[ y_{t}=S_{t} \times T_{t} \times R_{t} \]

\(y_{t}\) is the data, \(S_{t}\) is the seasonal component, \(T_{t}\) is the trend-cycle component, and \(R_{t}\) is the remainder component, all at period \(t\).

What is meant by the “level” of the time series?

The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series.

When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.

\[ y_{t}=S_{t} \times T_{t} \times R_{t} \text{ is equivalent to } \log{y_{t}}=\log{S_{t}}+\log{T_{t}}+\log{R_{t}} \]

Example: STL decomposition

Seasonally adjusted data

If the seasonal component is removed from the original data, the resulting values are the “seasonally adjusted” data.

If the variation due to seasonality is not of primary interest, the seasonally adjusted series can be useful.

3.3 Moving averages

The first step in a classical decomposition is to use a moving average method to estimate the trend-cycle.

Moving average smoothing

A moving average of order \({m}\) can be written as

\[ \hat{T}_{t} = \frac{1}{m}\sum^{k}_{j=-k}{y_{t+j}} \]

where \(m=2k+1\). That is, the estimate of the trend-cycle at time \(t\) is obtained by averaging values of the time series within \(k\) periods of \(t\).
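
A minimal sketch of this centred average with pandas (assuming an odd order \(m\) so the window is symmetric):

```python
import pandas as pd

def centred_ma(y, m):
    """Order-m moving average: the mean of the m observations centred on each time t."""
    return pd.Series(y).rolling(window=m, center=True).mean()

y = [10, 12, 9, 14, 16, 13, 18, 20, 17, 22]
print(centred_ma(y, 5))   # NaN for the first and last k = 2 positions, where no estimate exists
```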

Moving averages of moving averages

It is possible to apply a moving average to a moving average. One reason for doing this is to make an even-order moving average symmetric.

Example: \(2 \times 4 \text{-MA}\)

\[ \hat{T}_{t} = \frac{1}{2}[\frac{1}{4}(y_{t-2} + y_{t-1} + y_{t} + y_{t+1}) + \frac{1}{4}(y_{t-1} + y_{t} + y_{t+1} + y_{t+2})] \]
\[ = \frac{1}{8}y_{t-2}+\frac{1}{4}y_{t-1}+\frac{1}{4}y_{t}+\frac{1}{4}y_{t+1}+\frac{1}{8}y_{t+2} \]
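
A sketch showing that chaining a 4-MA and a 2-MA gives the same weights as the expansion above (the quarterly numbers are made up):

```python
import numpy as np
import pandas as pd

y = pd.Series([15, 25, 32, 44, 17, 27, 34, 46, 19, 29, 36, 48])   # hypothetical quarterly data

# Two-pass version: a 4-MA followed by a 2-MA, then re-centred on time t
ma4 = y.rolling(window=4).mean()
ma_2x4 = ma4.rolling(window=2).mean().shift(-2)

# Single-pass version with the weights 1/8, 1/4, 1/4, 1/4, 1/8
weights = np.array([1, 2, 2, 2, 1]) / 8
ma_direct = y.rolling(window=5, center=True).apply(lambda w: np.dot(w, weights), raw=True)

print(pd.concat({"2x4-MA": ma_2x4, "direct": ma_direct}, axis=1))   # the two columns agree
```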

Estimating the trend-cycle with seasonal data

pass

Weighted moving averages

pass

3.4 Classical decomposition

In classical decomposition, we assume that the seasonal component is constant from year to year. For multiplicative seasonality, the \(m\) values that form the seasonal component are sometimes called the “seasonal indices.”
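
A sketch of classical decomposition via `seasonal_decompose` from statsmodels, which performs a moving-average decomposition of this kind; the monthly series below is simulated for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2015-01-01", periods=72, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(50 + 0.5 * np.arange(72)                          # trend
              + 10 * np.sin(2 * np.pi * np.arange(72) / 12)     # fixed seasonal pattern
              + rng.normal(scale=2, size=72), index=idx)

result = seasonal_decompose(y, model="additive", period=12)
print(result.seasonal.head(12))   # the 12 seasonal values repeat identically every year
print(result.trend.head(8))       # NaN at the ends: no trend-cycle estimate there
```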

Comments on classical decomposition

Limitations of the classical method:

  • The estimate of the trend-cycle is unavailable for the first few and last few observations.
  • The trend-cycle estimate tends to over-smooth rapid rises and falls in the data.
  • Classical decomposition methods assume that the seasonal component repeats from year to year.
  • Occasionally, the values of the time series in a small number of periods may be particularly unusual.

3.5 Methods used by official statistics agencies

X-11 method

In particular, trend-cycle estimates are available for all observations including the end points, and the seasonal component is allowed to vary slowly over time. X-11 also handles trading day variation, holiday effects and the effects of known predictors.

Compared with STL and classical decomposition, the X-11 trend-cycle better captures the sudden (irregular) effect of the 2007-2008 global financial crisis.

SEATS method

pass

3.6 STL decomposition

Advantages of STL

  • Unlike SEATS and X-11, STL will handle any type of seasonality, not only monthly and quarterly data.
  • The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
  • The smoothness of the trend-cycle can also be controlled by the user.
  • It can be robust to outliers.
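
A minimal sketch of STL in Python via statsmodels (the series is simulated; the book itself uses the R feasts/fable packages):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2010-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(100 + 0.3 * np.arange(120)
              + 8 * np.sin(2 * np.pi * np.arange(120) / 12)
              + rng.normal(scale=3, size=120), index=idx)

res = STL(y, period=12, robust=True).fit()   # robust=True downweights outliers
seasonally_adjusted = y - res.seasonal       # remove the (slowly changing) seasonal component
print(res.trend.head())
```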

4.1 Some simple statistics

4.2 ACF features

All the autocorrelations of a series can be considered features of that series. We can also summarise the autocorrelations to produce new features; for example, the sum of the first ten squared autocorrelation coefficients is a useful summary of how much autocorrelation there is in a series, regardless of lag.

4.3 STL Features

Recall that the decomposition is written as

\[ y_t = T_t + S_t + R_t \]

For strongly trended data, the seasonally adjusted data should have much more variation than the remainder component.

Strength of the trend:

\[ F_T = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(T_t+R_t)}\right) \]

The strength of seasonality is defined similarly, but with respect to the detrended data rather than the seasonally adjusted data:

Strength of the seasonality:

\[ F_S = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_{t}+R_t)}\right) \]
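
A sketch computing both strengths from an STL fit with statsmodels (reusing a simulated monthly series; this is not the book's R code):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2010-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(100 + 0.3 * np.arange(120)
              + 8 * np.sin(2 * np.pi * np.arange(120) / 12)
              + rng.normal(scale=3, size=120), index=idx)

res = STL(y, period=12).fit()
T_t, S_t, R_t = res.trend, res.seasonal, res.resid

F_T = max(0, 1 - R_t.var() / (T_t + R_t).var())   # strength of trend
F_S = max(0, 1 - R_t.var() / (S_t + R_t).var())   # strength of seasonality
print(f"trend strength = {F_T:.2f}, seasonal strength = {F_S:.2f}")   # both lie in [0, 1]
```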

In the figure below, each panel corresponds to a purpose of travel across the regions. The x-axis is the strength of trend and the y-axis is the strength of seasonality.

Observations from the figure above:

  • Holiday travel shows clear seasonality in every region.
  • Strong trends tend to appear in Western Australia and Victoria.

The figure below supports the strong seasonality observed for New South Wales.

4.4 Other features

pass (overview of the package's other features)

4.5 Exploring Australian tourism data

Chapter 5 The forecaster’s toolbox

5.1 A tidy forecasting workflow

  • Data preparation (tidy)
  • Plot the data (visualise)
  • Define a model (specify)
  • Train the model (estimate)
  • Check model performance (evaluate)
  • Produce forecasts (forecast)

5.2 Some simple forecasting methods (useful as benchmarks)

Mean method

The average of the historical data.

Naïve method

The last observation in the historical data.

Seasonal naïve method

Particularly useful for seasonal data.

A similar method is useful for highly seasonal data.

Figure below: each forecast takes the value from the same quarter of the previous year.

Drift method

The last observation plus the average change over time (the drift).

A variation on the naïve method is to allow the forecasts to increase or decrease over time, where the amount of change over time (called the drift) is set to be the average change seen in the historical data.

This is equivalent to drawing a line between the first and last observations, and extrapolating it into the future.
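
A sketch of the four benchmark methods as plain functions producing h-step forecasts (hypothetical helpers, not the fable model objects used in the book):

```python
import numpy as np

def mean_forecast(y, h):
    return np.full(h, np.mean(y))                 # Mean method: the historical average

def naive_forecast(y, h):
    return np.full(h, y[-1])                      # Naïve method: the last observation

def seasonal_naive_forecast(y, h, m):
    # Seasonal naïve: the value from the same season of the last observed cycle
    return np.array([y[len(y) - m + (i % m)] for i in range(h)])

def drift_forecast(y, h):
    slope = (y[-1] - y[0]) / (len(y) - 1)         # average change in the historical data
    return y[-1] + slope * np.arange(1, h + 1)    # the line through the first and last points

y = np.array([445, 453, 410, 430, 447, 455, 418, 438], dtype=float)   # two hypothetical years of quarterly data
print(seasonal_naive_forecast(y, h=4, m=4))   # repeats the last four quarters
print(drift_forecast(y, h=4))
```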

Example: Australian quarterly beer production

For data with clear seasonality, the seasonal naïve forecasts are more accurate.

Example: Google’s daily closing stock price

None of the simple methods above gives accurate forecasts here.

The simple methods serve mainly as benchmarks: any newly developed method must beat them, otherwise it is not worth considering.

Sometimes one of these simple methods will be the best forecasting method available; but in many cases, these methods will serve as benchmarks rather than the method of choice. That is, any forecasting methods we develop will be compared to these simple methods to ensure that the new method is better than these simple alternatives. If not, the new method is not worth considering.

5.3 Fitted values and residuals

Fitted values

Each observation in a time series can be forecast using all previous observations. We call these fitted values and they are denoted by \(\hat{y}_{t|t-1}\), meaning the forecast of \(y_{t}\) based on observations \(y_{1},\dots,y_{t-1}\) .

Residuals

The “residuals” in a time series model are what is left over after fitting a model. The residuals are equal to the difference between the observations and the corresponding fitted values:

\[ e_{t} = y_{t}-\hat{y}_{t}. \]

If the data have been transformed first (e.g. \(\log{y_{t}}\)), the residuals computed on the transformed scale are called innovation residuals.

If a transformation has been used in the model, then it is often useful to look at residuals on the transformed scale. We call these innovation residuals.

5.4 Residual diagnostics

A good forecasting method yields innovation residuals with the following four properties (the first two are essential; the last two, listed further below, are useful but not necessary):

  • The residuals are uncorrelated. If there is correlation among them, there is still information that should be incorporated into the forecasting model.
  • The residuals have zero mean. If the mean is not zero, the forecasts are biased.

These checks only show whether a method still has room for improvement; they are not a way to choose between forecasting methods.

Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method.

If the forecasts are biased, subtracting the residual mean \(m\) from all forecasts restores the zero-mean property. Fixing the correlation problem is covered in Chapter 10.

  • homoscedasticity assumption
  • normality assumption

It is difficult to ensure that the residuals also satisfy these two properties.

Sometimes applying a Box-Cox transformation may assist with these properties, but otherwise there is usually little that you can do to ensure that your innovation residuals have constant variance and a normal distribution.

Example: Forecasting Google daily closing stock prices

Forecasting with the naïve method.

Observations:

Top panel: the residual series; apart from one outlier, the variation is fairly stable.

Bottom left: the ACF, used to check for correlation; there is no significant autocorrelation.

Bottom right: the residual distribution; although the mean is close to zero, the right tail is too long for the residuals to be normal.

(Note to self: I do not fully understand the meaning of the following statement.)

Consequently, forecasts from this method will probably be quite good, but prediction intervals that are computed assuming a normal distribution may be inaccurate.

Portmanteau tests for autocorrelation

pass (somewhat hard to follow)

5.5 Distributional forecasts and prediction intervals

Prediction intervals

Shows how to work out the probable range of values for the next forecast.

A prediction interval gives an interval within which we expect \(y_{t}\) to lie with a specified probability.

Assuming that the distribution of future observations is normal, a 95% prediction interval for the \(h\)-step forecast is

\[ \hat{y}_{T+h|T} \pm 1.96 \hat\sigma_h. \]

More generally, a prediction interval can be written as

\[ \hat{y}_{T+h|T} \pm c \hat\sigma_h, \]

where the multiplier \(c\) depends on the coverage probability.

How is the multiplier \(c\) for a given coverage (e.g. 95%) obtained? Look it up in a table.

\(\hat\sigma_h\): the standard deviation of the \(h\)-step forecast distribution. Different forecasting methods compute it in different ways.

One-step prediction intervals

\[ \begin{equation} \hat{\sigma} = \sqrt{\frac{1}{T-K}\sum_{t=1}^T e_t^2}, \tag{5.1} \end{equation} \]

Taking the naïve forecast of the Google stock price as an example: the forecast of the next value is simply the last observed value. Here, the forecast of the next closing price is 758.88 and the standard deviation of the residuals is 11.19.

A 95% prediction interval for the next value is:

\[ 758.88 \pm 1.96(11.19) = [736.9, 780.8]. \]

An 80% prediction interval for the next value is:

\[ 758.88 \pm 1.28(11.19) = [744.5, 773.2]. \]
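
A sketch of the same calculation for the naïve method (the divisor in \(\hat\sigma\) is simplified to the number of available residuals):

```python
import numpy as np

def naive_one_step_interval(y, c=1.96):
    """Approximate one-step prediction interval for the naïve method."""
    y = np.asarray(y, dtype=float)
    e = y[1:] - y[:-1]                    # naïve residuals; the first one is unavailable
    sigma_hat = np.sqrt(np.mean(e ** 2))  # standard deviation of the residuals
    y_hat = y[-1]                         # naïve point forecast: the last observation
    return y_hat - c * sigma_hat, y_hat + c * sigma_hat

# c = 1.96 gives the 95% interval; c = 1.28 gives the 80% interval.
```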

Multi-step prediction intervals

Benchmark methods

Prediction intervals from bootstrapped residuals

Bootstrapping repeatedly resamples the historical residuals and uses the samples to simulate future values.

Bootstrapping does not require the residuals to be normally distributed.

When a normal distribution for the residuals is an unreasonable assumption, one alternative is to use bootstrapping, which only assumes that the residuals are uncorrelated with constant variance.

\[ y_t = \hat{y}_{t|t-1} + e_t \]

It assumes that future errors will be similar to past errors.

Assuming future errors will be similar to past errors, we can replace \(e_{T+1}\) by sampling from the collection of errors we have seen in the past (i.e., the residuals).

The figure below shows five simulated sample paths:
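
A sketch of generating bootstrapped sample paths for the naïve method; percentiles of the simulated paths then give the prediction intervals (all numbers here are made up):

```python
import numpy as np

def bootstrap_paths(y, h=30, n_paths=1000, seed=0):
    """Simulate future sample paths for the naïve method by resampling past residuals."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    residuals = y[1:] - y[:-1]                   # naïve residuals
    paths = np.empty((n_paths, h))
    for i in range(n_paths):
        level = y[-1]
        for t in range(h):
            level += rng.choice(residuals)       # y_{T+t} = y_{T+t-1} + sampled error
            paths[i, t] = level
    return paths

y = 100 + np.cumsum(np.random.default_rng(1).normal(size=200))   # hypothetical price-like series
paths = bootstrap_paths(y)
lower, upper = np.percentile(paths, [2.5, 97.5], axis=0)         # 95% bootstrap interval per horizon
print(lower[:3], upper[:3])
```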

5.6 Forecasting using transformations

This section covers the issues that arise when forecasting data that have been transformed, in particular when reversing the transformation.

When forecasting from a model with transformations, we first produce forecasts of the transformed data. Then, we need to reverse the transformation (or back-transform) to obtain forecasts on the original scale.

Bias adjustments

The back-transformed point forecast is not the mean but the median of the forecast distribution.

One issue with using mathematical transformations such as Box-Cox transformations is that the back-transformed point forecast will not be the mean of the forecast distribution. In fact, it will usually be the median of the forecast distribution (assuming that the distribution on the transformed space is symmetric).

In some situations you want the mean instead, for example when summing sales forecasts from individual regions to form a forecast for the whole country.

The median is unaffected by the skewness of the back-transformed distribution, but the mean is pulled upward by it.

For many purposes, this is acceptable, although the mean is usually preferable. For example, you may wish to add up sales forecasts from various regions to form a forecast for the whole country. But medians do not add up, whereas means do.

bias-adjusted

pass

5.7 Forecasting with decomposition

First decompose the data, then forecast each component series with a separate method.

To forecast a decomposed time series, we forecast the seasonal component, \(\hat{S}_{t}\), and the seasonally adjusted component \(A_{t}\), separately.

Then recombine the component forecasts as required.

5.8 Evaluating point forecast accuracy

Forecast errors

Distinguishing residuals from forecast errors:

Residuals: errors on the training set, based on one-step forecasts.

Forecast errors: errors on the test set, possibly from multi-step forecasts.

Note that forecast errors are different from residuals in two ways. First, residuals are calculated on the training set while forecast errors are calculated on the test set. Second, residuals are based on one-step forecasts while forecast errors can involve multi-step forecasts.

Scale-dependent errors

The two most commonly used scale-dependent measures are based on the absolute errors or squared errors:

\[ \begin{align*} \text{Mean absolute error: MAE} & = \text{mean}(|e_{t}|),\\ \text{Root mean squared error: RMSE} & = \sqrt{\text{mean}(e_{t}^2)}. \end{align*} \]

A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Consequently, the RMSE is also widely used, despite being more difficult to interpret.
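
A minimal sketch computing both measures from test-set forecast errors (the numbers are hypothetical):

```python
import numpy as np

actual   = np.array([526, 524, 532, 542], dtype=float)   # hypothetical test-set observations
forecast = np.array([530, 530, 530, 530], dtype=float)   # e.g. a mean-method forecast
e = actual - forecast                                    # forecast errors

mae  = np.mean(np.abs(e))          # Mean absolute error
rmse = np.sqrt(np.mean(e ** 2))    # Root mean squared error
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```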

Percentage errors

MAPE
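
Writing the percentage error as \(p_{t} = 100 e_{t}/y_{t}\), the mean absolute percentage error is

\[ \text{MAPE} = \text{mean}(|p_{t}|). \]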

Drawback: measures based on percentage errors are infinite or undefined if \(y_t=0\) for any \(t\) in the period of interest, and take extreme values if any \(y_t\) is close to zero.

Scaled errors

pass (skipped for now)

5.9 Evaluating distributional forecast accuracy (skipped)

pass

5.10 Time series cross-validation

The forecast accuracy is computed by averaging over the test sets. This procedure is sometimes known as “evaluation on a rolling forecasting origin” because the “origin” at which the forecast is based rolls forward in time.

Suppose that we are interested in models that produce good 4-step-ahead forecasts. Then the corresponding diagram is shown below.
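
A sketch of this rolling-origin procedure for a naïve one-step forecast (the series and minimum training size are made up; for 4-step-ahead evaluation, compare against the observation four steps after each origin instead):

```python
import numpy as np

y = 50 + np.cumsum(np.random.default_rng(0).normal(size=100))   # hypothetical series
min_train = 20                                                   # smallest training window
errors = []
for origin in range(min_train, len(y) - 1):
    train = y[: origin + 1]            # all observations up to and including the origin
    forecast = train[-1]               # naïve one-step-ahead forecast
    errors.append(y[origin + 1] - forecast)

rmse = np.sqrt(np.mean(np.square(errors)))   # accuracy averaged over all rolling test points
print(f"cross-validated RMSE = {rmse:.2f}")
```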

Chapter 6 Judgmental forecasts (skipped)

pass