
D2L-CH8-Recurrent Neural Networks

8. Recurrent Neural Networks

In short, while convolutional neural networks can efficiently process spatial information, recurrent neural networks are designed to better handle sequential information. These networks introduce state variables to store past information, which, together with the current inputs, determine the current outputs.

8.1 Sequence Models

Which factors influence user behavior?

  • Anchoring: for example, after a movie wins an Oscar, its ratings rise for a period of time.
  • Hedonic adaptation: for example, after watching many good movies, viewers expect a lot from the next one, so an ordinary movie may be rated as a bad one.
  • Seasonality: viewing behavior is affected by the season (e.g., holiday movies).
  • In some cases, a movie becomes unpopular because of the misconduct of its director or actors.

8.1.1 Statistical Tools

Autoregressive Models

As the data keeps growing, we need an efficient way to compute \(p(x_t \mid x_{t-1}, \ldots, x_1)\):

  1. Assume that observations too far in the past are unnecessary and use only data within a fixed window for the computation. Such models are called autoregressive models (a minimal sketch follows the equation below).
  2. Keep some summary \(h_t\) of the past observations and update \(h_t\) alongside the prediction \(\hat{x}_t\). These are latent autoregressive models; LSTMs and GRUs are examples.

Regardless of what we do, we will thus get an estimate of the entire time series via

\[p(x_1, \ldots, x_T) = \prod_{t=1}^T p(x_t \mid x_{t-1}, \ldots, x_1).\]
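As an illustration of the first strategy above (a fixed window of past observations), here is a minimal TF2 sketch; it assumes a synthetic noisy sine-wave series and a hypothetical window size tau = 4, and simply regresses \(x_t\) on the previous \(\tau\) observations with a small MLP.

import tensorflow as tf

# Synthetic series, for illustration only: a noisy sine wave.
T = 1000
time = tf.range(1.0, T + 1)
x = tf.sin(0.01 * time) + tf.random.normal([T], stddev=0.2)

# Fixed-window autoregressive features: predict x_t from (x_{t-tau}, ..., x_{t-1}).
tau = 4
features = tf.stack([x[i:T - tau + i] for i in range(tau)], axis=1)
labels = x[tau:]

# A small MLP stands in for p(x_t | x_{t-tau}, ..., x_{t-1}).
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='relu'),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(features, labels, epochs=5, batch_size=16, verbose=0)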

Markov Model

Markov condition?

first order Markov model?

dynamic programming?

Causality

Summary

  • Sequence models require specialized statistical tools for estimation. Two popular choices are autoregressive models and latent-variable autoregressive models.
  • As we predict further in time, the errors accumulate and the quality of the estimates degrades, often dramatically.
  • There is quite a difference in difficulty between interpolation and extrapolation. Consequently, if you have a time series, always respect the temporal order of the data when training, i.e., never train on future data.
  • For causal models (e.g., time going forward), estimating the forward direction is typically a lot easier than the reverse direction.

8.2 Text Preprocessing

Text preprocessing; skipped.

8.3 Language Models and the Dataset

\[w_t \sim p(w_t \mid w_{t-1}, \ldots, w_1)\]

8.3.1 Estimating a Language Model

\[p(w_1, w_2, \ldots, w_T) = p(w_1) \prod_{t=2}^T p(w_t \mid w_1, \ldots, w_{t-1}).\]

In order to compute the language model, we need to calculate the probability of words and the conditional probability of a word given the previous few words.

An intuitive approach is to directly count how often individual words and word combinations occur, as shown below. However, this only works when words and word pairs appear frequently; in general, word combinations, especially combinations of three or more words, do not co-occur very often.

\[\hat{p}(\mathrm{is} \mid \mathrm{Statistics}) = \frac{n(\mathrm{Statistics~is})}{n(\mathrm{Statistics})}.\]

Laplace smoothing alleviates this problem by adding a small constant to every count.

\[\begin{aligned} \hat{p}(w) & = \frac{n(w) + \epsilon_1/m}{n + \epsilon_1}, \\ \hat{p}(w' \mid w) & = \frac{n(w, w') + \epsilon_2 \hat{p}(w')}{n(w) + \epsilon_2}, \\ \hat{p}(w'' \mid w',w) & = \frac{n(w, w',w'') + \epsilon_3 \hat{p}(w'')}{n(w, w') + \epsilon_3}. \end{aligned}\]

Here the coefficients \(\epsilon_i > 0\) determine how much we use the estimate for a shorter sequence as a fill-in for longer ones. Moreover, \(m\) is the total number of words we encounter.
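A rough sketch of the counting approach with Laplace smoothing, assuming a toy token list, hypothetical values for \(\epsilon_1\) and \(\epsilon_2\), and interpreting \(m\) as the vocabulary size (only the unigram and bigram estimates from the formulas above are shown):

from collections import Counter

# Toy corpus, for illustration only.
tokens = 'statistics is fun and statistics is useful'.split()

unigram = Counter(tokens)                        # n(w)
bigram = Counter(zip(tokens[:-1], tokens[1:]))   # n(w, w')
n, m = len(tokens), len(unigram)                 # total tokens and vocabulary size
eps1, eps2 = 1.0, 1.0                            # hypothetical smoothing constants

def p_uni(w):
    # \hat{p}(w) = (n(w) + eps1/m) / (n + eps1)
    return (unigram[w] + eps1 / m) / (n + eps1)

def p_bi(w2, w1):
    # \hat{p}(w' | w) = (n(w, w') + eps2 * \hat{p}(w')) / (n(w) + eps2)
    return (bigram[(w1, w2)] + eps2 * p_uni(w2)) / (unigram[w1] + eps2)

print(p_bi('is', 'statistics'))   # smoothed estimate of p(is | statistics)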

However, this approach still has the following drawbacks:

  1. We must store the counts of all words and word combinations.
  2. It ignores the meaning of words; for example, "cat" and "feline" should appear in related contexts.
  3. Long word sequences are almost certainly novel, so a model that simply counts the frequencies of previously seen word sequences is bound to perform poorly.

8.3.2 Markov Models and n-grams

The authors bring up the Markov property again.

A distribution over sequences satisfies the Markov property of first order if \(p(w_{t+1} \mid w_t, \ldots, w_1) = p(w_{t+1} \mid w_t)\).

Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence:

\[ \begin{aligned} p(w_1, w_2, w_3, w_4) &= p(w_1) p(w_2) p(w_3) p(w_4),\\ p(w_1, w_2, w_3, w_4) &= p(w_1) p(w_2 \mid w_1) p(w_3 \mid w_2) p(w_4 \mid w_3),\\ p(w_1, w_2, w_3, w_4) &= p(w_1) p(w_2 \mid w_1) p(w_3 \mid w_1, w_2) p(w_4 \mid w_2, w_3). \end{aligned} \]
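For instance, under the first-order (bigram) approximation in the second line above, the probability of a hypothetical four-word sequence "Statistics is really fun" factorizes as

\[p(\mathrm{Statistics}, \mathrm{is}, \mathrm{really}, \mathrm{fun}) \approx p(\mathrm{Statistics})\, p(\mathrm{is} \mid \mathrm{Statistics})\, p(\mathrm{really} \mid \mathrm{is})\, p(\mathrm{fun} \mid \mathrm{really}),\]

where each factor can be estimated from counts as in the previous section.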

The Chinese edition also explains the relationship between Markov models and n-gram models: the three approximations above correspond to the unigram, bigram, and trigram models, respectively.

8.3.3 Natural Language Statistics

MXNet implementation only; no TF2 version, skipped.

8.3.4 Training Data Preparation

Skipped.

Summary

  • Language models are an important technology for natural language processing.
  • \(n\)-grams provide a convenient model for dealing with long sequences by truncating the dependence.
  • Long sequences suffer from the problem that they occur very rarely or never.
  • Zipf's law governs the word distribution for not only unigrams but also the other \(n\)-grams.
  • There is a lot of structure but not enough frequency to deal with infrequent word combinations efficiently via Laplace smoothing.
  • The main choices for sequence partitioning are picking between consecutive and random sequences.
  • Given the overall document length, it is usually acceptable to be slightly wasteful with the documents and discard half-empty minibatches.

8.4 Recurrent Neural Networks

To capture more of the history with n-grams, we must increase n, but the drawback is that the number of model parameters then grows exponentially with n.

We introduced \(n\)-gram models, where the conditional probability of word \(x_t\) at position \(t\) only depends on the \(n-1\) previous words. If we want to check the possible effect of words earlier than \(t-(n-1)\) on \(x_t\), we need to increase \(n\). However, the number of model parameters would also increase exponentially with it, as we need to store \(|V|^n\) numbers for a vocabulary \(V\).
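To make the exponential growth concrete: with a vocabulary of \(|V| = 10{,}000\) words, a trigram model (\(n = 3\)) would in principle have to store

\[|V|^n = 10{,}000^3 = 10^{12}\]

numbers, which is clearly impractical.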

A better approach is therefore to use a latent variable model.

Hence, rather than modeling \(p(x_t \mid x_{t-1}, \ldots, x_{t-n+1})\) it is preferable to use a latent variable model in which we have

\[p(x_t \mid x_{t-1}, \ldots, x_1) \approx p(x_t \mid x_{t-1}, h_{t}).\]

\(h_{t}\) is a latent variable (also called a hidden variable, hidden state, or hidden state variable), which stores the historical sequence information.

\[h_t = f(x_{t}, h_{t-1}).\]

The figure in Section 8.1.1 makes the latent variable model easier to understand.

8.4.1 Recurrent Networks Without Hidden States

Without hidden states, this is simply an MLP: the hidden layer and output are computed from the current input alone.

\[\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h).\]
\[\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q.\]

\(q\) denotes the number of outputs (e.g., the number of categories if this is a classification task).

8.4.2 Recurrent Networks with Hidden States

Specifically, the calculation of the hidden variable of the current timestep is determined by the input of the current timestep together with the hidden variable of the previous timestep:

\[\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h).\]
\[\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.\]


\((\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh})\) is equivalent to concatenating \(\mathbf{X}_t\) and \(\mathbf{H}_{t-1}\) along the columns and multiplying the result by the concatenation of \(\mathbf{W}_{xh}\) and \(\mathbf{W}_{hh}\) along the rows, as the following snippet verifies:

import tensorflow as tf

# The two expressions below give the same result: [X, H] times [W_xh; W_hh]
# equals X @ W_xh + H @ W_hh.
X, W_xh = tf.random.normal(shape=(3, 1)), tf.random.normal(shape=(1, 4))
H, W_hh = tf.random.normal(shape=(3, 4)), tf.random.normal(shape=(4, 4))
tf.matmul(X, W_xh) + tf.matmul(H, W_hh)
tf.matmul(tf.concat([X, H], axis=1), tf.concat([W_xh, W_hh], axis=0))
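Building on the snippet above, here is a minimal sketch of one forward step implementing the two equations of this section, with hypothetical sizes: batch size 3, one input feature, 4 hidden units, and \(q = 2\) outputs.

import tensorflow as tf

batch_size, num_inputs, num_hiddens, num_outputs = 3, 1, 4, 2   # hypothetical sizes

X_t = tf.random.normal((batch_size, num_inputs))   # current input X_t
H_prev = tf.zeros((batch_size, num_hiddens))       # previous hidden state H_{t-1}

W_xh = tf.random.normal((num_inputs, num_hiddens))
W_hh = tf.random.normal((num_hiddens, num_hiddens))
b_h = tf.zeros((num_hiddens,))
W_hq = tf.random.normal((num_hiddens, num_outputs))
b_q = tf.zeros((num_outputs,))

# H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh here
H_t = tf.tanh(tf.matmul(X_t, W_xh) + tf.matmul(H_prev, W_hh) + b_h)
# O_t = H_t W_hq + b_q
O_t = tf.matmul(H_t, W_hq) + b_q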

8.4.3 Steps in a Language Model

8.4.4 Perplexity

How to evaluate a language model: perplexity.

\[\mathrm{PPL} := \exp\left(-\frac{1}{n} \sum_{t=1}^n \log p(x_t \mid x_{t-1}, \ldots, x_1)\right).\]

Perplexity generalizes the cross-entropy loss.

Note that perplexity naturally generalizes the notion of the cross-entropy loss defined when we introduced softmax regression. That is, for a single symbol both definitions are identical, bar the fact that one is the exponential of the other.
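A minimal sketch of this relationship, assuming hypothetical per-timestep probabilities assigned by a model to the tokens that were actually observed:

import tensorflow as tf

# Hypothetical probabilities p(x_t | x_{t-1}, ..., x_1) of the observed tokens.
probs = tf.constant([0.3, 0.5, 0.1, 0.4])

cross_entropy = -tf.reduce_mean(tf.math.log(probs))   # average negative log-likelihood
ppl = tf.exp(cross_entropy)                           # perplexity = exp(cross-entropy)
print(float(ppl))   # about 3.59; a perfect model would give 1.0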

Summary

  • A network that uses recurrent computation is called a recurrent neural network (RNN).
  • The hidden state of the RNN can capture historical information of the sequence up to the current timestep.
  • The number of RNN model parameters does not grow as the number of timesteps increases.
  • We can create language models using a character-level RNN.

8.5 Implementation of Recurrent Neural Networks from Scratch

8.5.1 One-hot Encoding

8.5.2 Initializing the Model Parameters

8.5.3 RNN Model

8.5.4 Prediction

8.5.5 Gradient Clipping

To address exploding gradients, the strategy is essentially to control the magnitude of the gradients.

One alternative is to clip the gradients by projecting them back to a ball of a given radius, say \(\theta\) via

\[\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.\]
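A minimal sketch of this projection, assuming a hypothetical list of gradient tensors and a threshold \(\theta\); TensorFlow's built-in tf.clip_by_global_norm implements essentially the same idea.

import tensorflow as tf

def grad_clipping(grads, theta):
    # g <- min(1, theta / ||g||) g, where ||g|| is the global norm over all gradients
    norm = tf.sqrt(sum(tf.reduce_sum(g ** 2) for g in grads))
    factor = tf.minimum(1.0, theta / norm)
    return [g * factor for g in grads]

# Usage on hypothetical gradients:
grads = [tf.random.normal((4, 4)), tf.random.normal((4,))]
clipped = grad_clipping(grads, theta=1.0)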

8.5.6 Training

TF2 version: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter06_RNN/6.4_rnn-scratch


Note when the hidden state is initialized: with random sampling, the hidden state is re-initialized before each minibatch update.


Summary

  • Sequence models need state initialization for training.
  • When carrying the hidden state across minibatches, you need to detach the gradients, to ensure that automatic differentiation does not propagate effects beyond the current sample.
  • A simple RNN language model consists of an encoder, an RNN model, and a decoder.
  • Gradient clipping prevents gradient explosion (but it cannot fix vanishing gradients).
  • Perplexity calibrates model performance across different sequence lengths. It is the exponentiated average of the cross-entropy loss.
  • Sequential partitioning typically leads to better models.

8.6 Concise Implementation of Recurrent Neural Networks

TF2 code: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter06_RNN/6.5_rnn-keras

8.7 Backpropagation Through Time

Forward propagation in a recurrent neural network is relatively straightforward. Backpropagation through time is actually a specific application of backpropagation in recurrent neural networks. It requires us to expand the recurrent neural network one timestep at a time to obtain the dependencies between model variables and parameters. Then, based on the chain rule, we apply backpropagation to compute and store gradients. Since sequences can be rather long, the dependency can be rather lengthy. For instance, for a sequence of 1000 characters, the first symbol could potentially have significant influence on the symbol at position 1000. This is not really computationally feasible (it takes too long and requires too much memory) and it requires over 1000 matrix-vector products before we would arrive at that very elusive gradient. This is a process fraught with computational and statistical uncertainty. In the following we will elucidate what happens and how to address this in practice.

8.7.1 A Simplified Recurrent Network


I do not yet fully understand this section on how to optimize the backpropagation strategy for RNNs.


While we can use the chain rule to compute \(\partial_w h_t\) recursively, this chain can get very long whenever \(t\) is large. Let us discuss a number of strategies for dealing with this problem.

  • Compute the full sum. This is very slow, and the gradients can blow up.

  • Truncate the sum after \(\tau\) steps. This is what we have been discussing so far. This leads to an approximation of the true gradient, simply by terminating the sum above at \(\partial_w h_{t-\tau}\). The approximation error is thus given by \(\partial_h f(x_t, h_{t-1}, w) \partial_w h_{t-1}\) (multiplied by a product of gradients involving \(\partial_h f\)). In practice this works quite well. It is what is commonly referred to as truncated BPTT (backpropagation through time). One of the consequences of this is that the model focuses primarily on short-term influence rather than long-term consequences. This is actually desirable, since it biases the estimate towards simpler and more stable models.

  • Randomized Truncation. Last we can replace \(\partial_{w_h} h_t\) by a random variable which is correct in expectation but which truncates the sequence. This is achieved by using a sequence of \(\xi_t\) where \(E[\xi_t] = 1\) and \(P(\xi_t = 0) = 1-\pi\) and furthermore \(P(\xi_t = \pi^{-1}) = \pi\). We use this to replace the gradient:

\[z_t = \partial_w f(x_t, h_{t-1}, w) + \xi_t \partial_h f(x_t, h_{t-1}, w) \partial_w h_{t-1}.\]

It follows from the definition of \(\xi_t\) that \(E[z_t] = \partial_w h_t\). Whenever \(\xi_t = 0\) the expansion terminates at that point. This leads to a weighted sum of sequences of varying lengths where long sequences are rare but appropriately overweighted. Tallec and Ollivier (2017) proposed this in their paper. Unfortunately, while appealing in theory, the model does not work much better than simple truncation, most likely due to a number of factors. First, the effect of an observation after a number of backpropagation steps into the past is quite sufficient to capture dependencies in practice. Second, the increased variance counteracts the fact that the gradient is more accurate. Third, we actually want models that have only a short range of interaction. Hence, BPTT has a slight regularizing effect which can be desirable.
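A minimal sketch of the random variable \(\xi_t\) described above, assuming a hypothetical value \(\pi = 0.2\): it equals \(\pi^{-1}\) with probability \(\pi\) and 0 otherwise, so its expectation is 1.

import tensorflow as tf

pi = 0.2   # hypothetical probability of continuing the expansion
# xi_t = 1/pi with probability pi, and 0 with probability 1 - pi
xi = tf.where(tf.random.uniform([100000]) < pi, 1.0 / pi, 0.0)
print(float(tf.reduce_mean(xi)))   # close to 1, confirming E[xi_t] = 1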

8.7.2 The Computational Graph


The RNN computational graph is worth memorizing!


8.7.3 BPTT in Detail


I need to spend more time revisiting this section to understand how to compute

\[\partial_{\mathbf{W}_{hh}} L, \quad \partial_{\mathbf{W}_{hx}} L, \quad \partial_{\mathbf{W}_{oh}} L.\]

Summary

  • Backpropagation through time is merely an application of backprop to sequence models with a hidden state.
  • Truncation is needed for computational convenience and numerical stability.
  • High powers of matrices can lead to divergent and vanishing eigenvalues. This manifests itself in the form of exploding or vanishing gradients.
  • For efficient computation, intermediate values are cached.