
D2L-CH9-Modern Recurrent Neural Networks

9. Modern Recurrent Neural Networks

9.1 Gated Recurrent Units (GRU)

From the previous chapter we know that the repeated matrix multiplications in an RNN lead to vanishing or exploding gradients.

The authors describe three scenarios in which gradient anomalies arise:

  • We might encounter a situation where an early observation is highly significant for predicting all future observations. For example, the first input of a sequence contains a checksum, and the model's task is to predict whether the checksum is correct. The first input is clearly important, so we need a memory-cell mechanism that preserves this early information; otherwise we would have to assign a very large gradient to that first observation, since it affects all subsequent ones.
  • We might encounter situations where some symbols carry no pertinent observation. For example, in sentiment analysis of web pages, the content contains many irrelevant HTML symbols; we need a mechanism to skip such inputs in the latent state representation.
  • We might encounter situations where there is a logical break between parts of a sequence. For instance, there may be a transition between chapters of a book, or between a bear and a bull market for securities. In such cases it is best to have a way to reset the internal state representation.

9.1.1 Gating the Hidden State

For instance, if the first symbol is of great importance we will learn not to update the hidden state after the first observation. Likewise, we will learn to skip irrelevant temporary observations. Last, we will learn to reset the latent state whenever needed.

Reset Gates and Update Gates

The first things we need to introduce are the reset and update gates. We engineer them to be vectors with entries in \((0, 1)\) such that we can perform convex combinations.

The activation function is a sigmoid, which guarantees outputs in the interval \((0, 1)\).

For instance, a reset variable would allow us to control how much of the previous state we might still want to remember. Likewise, an update variable would allow us to control how much of the new state is just a copy of the old state.

For a given timestep \(t\), the minibatch input is \(\mathbf{X}_t \in \mathbb{R}^{n \times d}\) (number of examples: \(n\), number of inputs: \(d\)) and the hidden state of the last timestep is \(\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}\) (number of hidden states: \(h\)). Then, the reset gate \(\mathbf{R}_t \in \mathbb{R}^{n \times h}\) and update gate \(\mathbf{Z}_t \in \mathbb{R}^{n \times h}\) are computed as follows:

\[ \begin{aligned} \mathbf{R}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r),\\ \mathbf{Z}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z). \end{aligned} \]

Reset Gates in Action

The conventional RNN updates its hidden state as follows:

\[\mathbf{H}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1}\mathbf{W}_{hh} + \mathbf{b}_h).\]

Adding the reset gate yields the candidate hidden state:

\[\tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \left(\mathbf{R}_t \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{hh} + \mathbf{b}_h).\]

If the entries of \(\mathbf{R}_t\) are close to 1, the result is close to a conventional RNN. If the entries of \(\mathbf{R}_t\) are close to 0, the candidate hidden state is the output of an MLP with \(\mathbf{X}_t\) as input; any pre-existing hidden state is reset to defaults.

The reset gate produces the candidate hidden state, which the update gate uses next.

The element-wise product \(\odot\) is the key operation here.


Update Gates in Action

\[\mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t.\]

Whenever the update gate \(\mathbf{Z}_t\) is close to \(1\), we simply retain the old state \(\mathbf{H}_{t-1}\) (the left term of the sum is kept and the right term vanishes). In this case the information from \(\mathbf{X}_t\) is essentially ignored, effectively skipping timestep \(t\) in the dependency chain.

In contrast, whenever \(\mathbf{Z}_t\) is close to \(0\), the new latent state \(\mathbf{H}_t\) approaches the candidate latent state \(\tilde{\mathbf{H}}_t\).

These designs can help us cope with the vanishing gradient problem in RNNs and better capture dependencies for time series with large timestep distances. In summary, GRUs have the following two distinguishing features:

  • Reset gates help capture short-term dependencies in time series.
  • Update gates help capture long-term dependencies in time series.

9.1.2 Implementation from Scratch
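
The equations above map almost line-for-line onto code. Below is a minimal NumPy sketch of a single GRU step (not the book's implementation; the parameter names `W_xr`, `W_hr`, `b_r`, etc. simply mirror the formulas and are assumed to be pre-initialized):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, H_prev, params):
    """One GRU timestep following the equations in 9.1.1.

    X_t: (n, d) minibatch input; H_prev: (n, h) previous hidden state.
    params: dict of weights/biases, e.g. W_xr (d, h), W_hr (h, h), b_r (h,).
    """
    # Reset and update gates, entries in (0, 1)
    R_t = sigmoid(X_t @ params["W_xr"] + H_prev @ params["W_hr"] + params["b_r"])
    Z_t = sigmoid(X_t @ params["W_xz"] + H_prev @ params["W_hz"] + params["b_z"])
    # Candidate hidden state: the reset gate scales how much of H_prev enters
    H_tilde = np.tanh(X_t @ params["W_xh"]
                      + (R_t * H_prev) @ params["W_hh"] + params["b_h"])
    # Update gate interpolates between the old state and the candidate
    return Z_t * H_prev + (1 - Z_t) * H_tilde
```

Note that `R_t * H_prev` and the interpolation with `Z_t` are element-wise products, matching the \(\odot\) in the formulas.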

9.1.3 Concise Implementation

TF2 code: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter06_RNN/6.7_gru?id=_673-%e4%bb%8e%e9%9b%b6%e5%bc%80%e5%a7%8b%e5%ae%9e%e7%8e%b0
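
For reference, a concise version in tf.keras might look like the sketch below (illustrative only; `vocab_size` and `num_hiddens` are placeholder names, and the linked tutorial's exact code may differ):

```python
import tensorflow as tf

vocab_size, num_hiddens = 28, 256  # placeholder sizes
model = tf.keras.Sequential([
    # Input shape: (batch, timesteps, vocab_size), e.g. one-hot encoded characters
    tf.keras.layers.GRU(num_hiddens, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),
])
```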

Summary

  • Gated recurrent neural networks are better at capturing dependencies for time series with large timestep distances.
  • Reset gates help capture short-term dependencies in time series.
  • Update gates help capture long-term dependencies in time series.
  • GRUs contain basic RNNs as their extreme case whenever the reset gate is switched on. They can also skip subsequences as needed by turning on the update gate.

9.2 Long Short Term Memory (LSTM)

The LSTM (1997) appeared roughly 20 years before the GRU (2014), but it is more complex than the GRU.

9.2.1 Gated Memory Cells

Input Gates, Forget Gates, and Output Gates

Just like with GRUs, the data feeding into the LSTM gates is the input at the current timestep \(\mathbf{X}_t\) and the hidden state of the previous timestep \(\mathbf{H}_{t-1}\).

These inputs are processed by a fully connected layer and a sigmoid activation function to compute the values of the input, forget, and output gates. As a result, the three gates all output values in the range \((0, 1)\).

\[ \begin{aligned} \mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\ \mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\ \mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o), \end{aligned} \]

Candidate Memory Cell

The candidate memory cell uses \(\tanh\), so its values lie in the interval \((-1, 1)\).

\[\tilde{\mathbf{C}}_t = \text{tanh}(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c).\]

Memory Cell

In the GRU, a single mechanism governs both what to take in and what to forget. In the LSTM these are separate: \(\mathbf{I}_t\) controls how much new information is admitted via \(\tilde{\mathbf{C}}_t\), and \(\mathbf{F}_t\) controls how much of \(\mathbf{C}_{t-1}\) is retained.

\[\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t.\]

If the forget gate is always approximately \(1\) and the input gate is always approximately \(0\), the past memory cell \(\mathbf{C}_{t-1}\) is preserved over time and passed on to the current timestep.

Hidden States

In the LSTM, the hidden state is simply a gated version of the \(\tanh\) of the memory cell. This ensures that the values of \(\mathbf{H}_t\) are always in the interval \((-1, 1)\). Whenever the output gate is \(1\) we effectively pass all memory information through to the predictor, whereas for output \(0\) we retain all the information only within the memory cell and perform no further processing.

\[\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t).\]

9.2.2 Implementation from Scratch
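
As with the GRU, the LSTM equations above translate directly into a minimal NumPy sketch of a single timestep (again not the book's code; parameter names mirror the formulas and are assumed to be pre-initialized):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM timestep following the equations in 9.2.1.

    X_t: (n, d) minibatch input; H_prev, C_prev: (n, h) previous hidden state
    and memory cell. params holds W_xi, W_hi, b_i, ... as in the formulas.
    """
    # Input, forget, and output gates, entries in (0, 1)
    I_t = sigmoid(X_t @ params["W_xi"] + H_prev @ params["W_hi"] + params["b_i"])
    F_t = sigmoid(X_t @ params["W_xf"] + H_prev @ params["W_hf"] + params["b_f"])
    O_t = sigmoid(X_t @ params["W_xo"] + H_prev @ params["W_ho"] + params["b_o"])
    # Candidate memory cell, entries in (-1, 1)
    C_tilde = np.tanh(X_t @ params["W_xc"] + H_prev @ params["W_hc"] + params["b_c"])
    # Forget gate keeps part of the old cell; input gate admits part of the candidate
    C_t = F_t * C_prev + I_t * C_tilde
    # Hidden state: gated tanh of the memory cell, entries in (-1, 1)
    H_t = O_t * np.tanh(C_t)
    return H_t, C_t
```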

9.2.3 Concise Implementation

TF2 code: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter06_RNN/6.8_lstm?id=_683-%e4%bb%8e%e9%9b%b6%e5%bc%80%e5%a7%8b%e5%ae%9e%e7%8e%b0

Summary

  • LSTMs have three types of gates: input gates, forget gates, and output gates which control the flow of information.
  • The hidden layer output of LSTM includes hidden states and memory cells. Only hidden states are passed into the output layer. Memory cells are entirely internal.
  • LSTMs can cope with vanishing and exploding gradients.

9.3 Deep Recurrent Neural Networks

  • We could add extra nonlinearity to the gating mechanisms.
  • We could stack multiple layers of LSTMs on top of each other.

9.3.1 Functional Dependencies

\[\begin{aligned} \mathbf{H}_t^{(1)} & = f_1\left(\mathbf{X}_t, \mathbf{H}_{t-1}^{(1)}\right), \\ \mathbf{H}_t^{(l)} & = f_l\left(\mathbf{H}_t^{(l-1)}, \mathbf{H}_{t-1}^{(l)}\right). \end{aligned}\]
\[\mathbf{O}_t = g \left(\mathbf{H}_t^{(L)}\right).\]
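
These functional dependencies amount to a loop over layers at each timestep: layer 1 consumes \(\mathbf{X}_t\), each higher layer consumes the hidden state of the layer below it, and every layer also carries its own state across time. A minimal sketch, reusing any single-layer cell such as the hypothetical `gru_step` above:

```python
def deep_rnn_step(X_t, H_prev_layers, layer_params, step_fn):
    """One timestep of an L-layer RNN.

    H_prev_layers: list of per-layer hidden states from timestep t-1.
    step_fn: any single-layer cell function, e.g. the gru_step sketch above.
    """
    H_layers = []
    inp = X_t                    # layer 1 sees the raw input X_t
    for H_prev, params in zip(H_prev_layers, layer_params):
        H_t_l = step_fn(inp, H_prev, params)
        H_layers.append(H_t_l)
        inp = H_t_l              # layer l+1 sees layer l's hidden state
    return H_layers              # the output layer reads H_layers[-1]
```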

9.3.2 Concise Implementation

The TF2 notes have not been updated for this section.
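
A rough guess at what a concise deep-RNN model looks like in tf.keras (placeholder sizes; not the book's code): stacking recurrent layers simply means every layer but possibly the last returns its full sequence of hidden states.

```python
import tensorflow as tf

num_hiddens, vocab_size = 256, 28  # placeholder sizes
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(num_hiddens, return_sequences=True),  # layer 1
    tf.keras.layers.LSTM(num_hiddens, return_sequences=True),  # layer 2
    tf.keras.layers.Dense(vocab_size),
])
```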

Summary

  • In deep recurrent neural networks, hidden state information is passed to the next timestep of the current layer and the current timestep of the next layer.
  • There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or regular RNNs. Conveniently these models are all available as parts of the rnn module in Gluon.
  • Initialization of the models requires care. Overall, deep RNNs require a considerable amount of work (such as tuning the learning rate and gradient clipping) to ensure proper convergence.

9.4 Bidirectional Recurrent Neural Networks

Instead of running an RNN only in the forward mode starting from the first symbol, we start another one from the last symbol running from back to front.

9.4.1 Dynamic Programming

We assume that there exists some latent variable \(h_t\) which governs the emissions \(x_t\) that we observe via \(p(x_t \mid h_t)\). Moreover, the transitions \(h_t \to h_{t+1}\) are given by some state transition probability \(p(h_{t+1} \mid h_{t})\).

Thus, for a sequence of \(T\) observations we have the following joint probability distribution over observed and hidden states:

\[p(x, h) = p(h_1) p(x_1 \mid h_1) \prod_{t=2}^T p(h_t \mid h_{t-1}) p(x_t \mid h_t).\]

Dynamic programming is then used to reduce the computation time, yielding the following formula:

\[p(x_j \mid x_{-j}) \propto \sum_{h_j} \pi_j(h_j) \rho_j(h_j) p(x_j \mid h_j).\]
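
The messages \(\pi_j\) and \(\rho_j\) come from the forward and backward recursions of this hidden Markov model. A minimal NumPy sketch under those assumptions (the matrix names `A` for transitions and `B` for emissions are mine, not the book's):

```python
import numpy as np

def forward_backward(pi0, A, B, obs):
    """Unnormalized forward/backward messages for an HMM.

    pi0: (K,) initial distribution p(h_1).
    A:   (K, K) transitions, A[i, j] = p(h_{t+1}=j | h_t=i).
    B:   (K, V) emissions,   B[i, v] = p(x_t=v | h_t=i).
    obs: list of observed symbol indices x_1, ..., x_T.
    """
    T, K = len(obs), len(pi0)
    pi = np.zeros((T, K))
    rho = np.ones((T, K))            # rho_T(h_T) = 1 by definition
    pi[0] = pi0                      # pi_1(h_1) = p(h_1)
    for t in range(T - 1):           # forward: pi_{t+1} from pi_t
        pi[t + 1] = (pi[t] * B[:, obs[t]]) @ A
    for t in range(T - 2, -1, -1):   # backward: rho_t from rho_{t+1}
        rho[t] = A @ (B[:, obs[t + 1]] * rho[t + 1])
    return pi, rho

# p(x_j | x_{-j}) is then proportional to (pi[j] * rho[j] * B[:, obs[j]]).sum()
```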

9.4.2 Bidirectional Model

Note that \(\mathbf{H}_t\) is obtained by concatenating \(\overrightarrow{\mathbf{H}}_t\) and \(\overleftarrow{\mathbf{H}}_t\).

Then we concatenate the forward and backward hidden states \(\overrightarrow{\mathbf{H}}_t\) and \(\overleftarrow{\mathbf{H}}_t\) to obtain the hidden state \(\mathbf{H}_t \in \mathbb{R}^{n \times 2h}\) and feed it to the output layer. In deep bidirectional RNNs, the information is passed on as input to the next bidirectional layer. Last, the output layer computes the output \(\mathbf{O}_t \in \mathbb{R}^{n \times q}\) (number of outputs: \(q\)):

\[\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.\]
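
A minimal sketch of the concatenation and output step, assuming the forward and backward hidden states have already been computed by two separate RNNs (function and parameter names are illustrative):

```python
import numpy as np

def birnn_output(H_fwd_t, H_bwd_t, W_hq, b_q):
    """Combine forward/backward hidden states and compute the output at timestep t.

    H_fwd_t, H_bwd_t: (n, h) hidden states from the two directions.
    W_hq: (2h, q) output weights; b_q: (q,) bias.
    """
    H_t = np.concatenate([H_fwd_t, H_bwd_t], axis=1)  # (n, 2h)
    return H_t @ W_hq + b_q                           # O_t, shape (n, q)
```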

Computational Cost and Applications

During training we have access to both past and future data, but at test time we only have past data, so accuracy may drop considerably.

One of the key features of a bidirectional RNN is that information from both ends of the sequence is used to estimate the output. That is, we use information from both future and past observations to predict the current one (a smoothing scenario). In the case of language models this is not quite what we want. After all, we do not have the luxury of knowing the next to next symbol when predicting the next one. Hence, if we were to use a bidirectional RNN naively we would not get a very good accuracy: during training we have past and future data to estimate the present. During test time we only have past data and thus poor accuracy (we will illustrate this in an experiment below).

Both a forward pass and a backward pass are required, so computation is slow.

To add insult to injury, bidirectional RNNs are also exceedingly slow. The main reasons for this are that they require both a forward and a backward pass and that the backward pass is dependent on the outcomes of the forward pass. Hence, gradients will have a very long dependency chain.

Applications of BiRNNs

In practice bidirectional layers are used very sparingly and only for a narrow set of applications, such as filling in missing words, annotating tokens (e.g., for named entity recognition), or encoding sequences wholesale as a step in a sequence processing pipeline (e.g., for machine translation). In short, handle with care!

Training a Bidirectional RNN for the Wrong Application

No TF2 version is provided.
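
A rough guess at how this experiment could be set up in tf.keras (placeholder sizes; not the book's code) is to wrap a recurrent layer in `tf.keras.layers.Bidirectional` and train it as a language model:

```python
import tensorflow as tf

num_hiddens, vocab_size = 128, 28  # placeholder sizes
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(num_hiddens, return_sequences=True)),
    tf.keras.layers.Dense(vocab_size),
])
# Trained to predict the next token, this model leaks future context during
# training but has no future context when generating, hence the poor results.
```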

Summary

  • In bidirectional recurrent neural networks, the hidden state for each timestep is simultaneously determined by the data prior to and after the current timestep.
  • Bidirectional RNNs bear a striking resemblance with the forward-backward algorithm in graphical models.
  • Bidirectional RNNs are mostly useful for sequence embedding and the estimation of observations given bidirectional context.
  • Bidirectional RNNs are very costly to train due to long gradient chains.

9.5 Machine Translation and the Dataset

MXNet implementation; skipped.

9.6 Encoder-Decoder Architecture

The architecture is partitioned into two parts, the encoder and the decoder. The encoder's role is to encode the inputs into state, which often contains several tensors. Then the state is passed into the decoder to generate the outputs. In machine translation, the encoder transforms a source sentence, e.g., "Hello world.", into state, e.g., a vector, that captures its semantic information. The decoder then uses this state to generate the translated target sentence, e.g., "Bonjour le monde.".
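
A minimal Python sketch of this split as an interface (method names follow the description above and are illustrative, not necessarily the book's exact API):

```python
class Encoder:
    """Encodes a variable-length input into a state."""
    def __call__(self, X, *args):
        raise NotImplementedError

class Decoder:
    """Generates outputs from the encoder's state."""
    def init_state(self, enc_outputs, *args):
        raise NotImplementedError
    def __call__(self, X, state):
        raise NotImplementedError

class EncoderDecoder:
    """Chains an encoder and a decoder, e.g. for machine translation."""
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder
    def __call__(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
```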