
D2L-CH7-Modern Convolutional Neural Networks

7. Modern Convolutional Neural Networks

7.1 Deep Convolutional Neural Networks (AlexNet)

7.1.1 Learning Feature Representation

The authors list several classic feature-representation algorithms:

  • SIFT
  • SURF
  • HOG
  • Bags of visual words

7.1.2 AlexNet

In 2012, AlexNet burst onto the scene. The model is named after the first author of the paper, Alex Krizhevsky [1]. AlexNet used an 8-layer convolutional neural network and won the ImageNet 2012 image-recognition challenge by a large margin. It demonstrated for the first time that learned features can surpass manually designed features, overturning the prevailing state of computer vision research.

Architecture

Images in the ImageNet competition are much larger than MNIST images, so the first conv layer uses an 11x11 window. The book gives no particular reasoning for the rest of the architecture.

Activation Functions

Sigmoid is replaced with ReLU.

Benefit 1: ReLU is simple to compute; unlike sigmoid, it requires no exponentiation.

Benefit 2: it mitigates the vanishing-gradient problem. When the sigmoid output is very close to 0 or 1, its gradient approaches 0, so backpropagation struggles to keep updating some of the parameters. By contrast, ReLU's gradient is 1 whenever the input is positive. With poorly chosen initial parameters, sigmoid can therefore leave the model badly trained.
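A small sketch (TF2 assumed) that makes the comparison concrete by checking the gradients of sigmoid and ReLU at a few sample inputs:

import tensorflow as tf

# Compare gradients of sigmoid and ReLU at a few pre-activation values
x = tf.Variable([-5.0, 0.5, 5.0])

with tf.GradientTape(persistent=True) as tape:
    y_sigmoid = tf.math.sigmoid(x)
    y_relu = tf.nn.relu(x)

print(tape.gradient(y_sigmoid, x))  # roughly [0.0066, 0.235, 0.0066]: tiny for large |x|
print(tape.gradient(y_relu, x))     # [0., 1., 1.]: exactly 1 for positive inputs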

TF2 Code

https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.6_alexnet
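For reference, a minimal Keras sketch of the simplified single-stream AlexNet used in the book's code; treat it as an approximation rather than the original paper's exact configuration, and note that the 10-class output assumes Fashion-MNIST as in the linked code:

import tensorflow as tf

# Simplified AlexNet (single-GPU variant, as in the book); 10 output classes assumed
alexnet = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(96, kernel_size=11, strides=4, activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(256, kernel_size=5, padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(384, kernel_size=3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(384, kernel_size=3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(256, kernel_size=3, padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),
])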

Summary

Three keys behind AlexNet's strong image-classification performance:

  1. Data augmentation (not covered in depth in this chapter)
  2. Dropout
  3. ReLU

7.2 Networks Using Blocks (VGG)

The idea of blocks originally came from VGG; it lets new network architectures reuse these repeated structures.

The idea of using blocks first emerged from the Visual Geometry Group (VGG) at Oxford University, in their eponymously-named VGG network. It is easy to implement these repeated structures in code with any modern deep learning framework by using loops and subroutines.
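A minimal TF2 sketch of such a reusable block, following the book's vgg_block idea (num_convs 3x3 convolutions with the same channel count, then a 2x2 max pooling that halves the resolution); the conv_arch list for VGG-11 is included as a usage hint:

import tensorflow as tf

def vgg_block(num_convs, num_channels):
    # num_convs 3x3 conv layers with the same channel count, then a 2x2 max pool
    blk = tf.keras.models.Sequential()
    for _ in range(num_convs):
        blk.add(tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                       padding='same', activation='relu'))
    blk.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
    return blk

# VGG-11: 5 blocks, specified as (num_convs, num_channels) per block
conv_arch = [(1, 64), (1, 128), (2, 256), (2, 512), (2, 512)]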

TF2 Code

https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.7_vgg

Summary

VGG-11 builds the network from 5 reusable convolutional blocks. Different VGG models are defined by varying the number of convolutional layers and output channels in each block.

7.3 Network in Network (NiN)

LeNet, AlexNet, and VGG, discussed earlier, all process the final representations with dense layers, but this may lose spatial information.

NiN's new idea: drop the dense layers and use \(1\times1\) convolutional layers to play the role of fully connected layers, then use global average pooling to convert the result into the final classification output.

7.3.1 NiN Blocks
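A minimal TF2 sketch of a NiN block as the book describes it: one ordinary convolution followed by two \(1\times1\) convolutions, each followed by ReLU:

import tensorflow as tf

def nin_block(num_channels, kernel_size, strides, padding):
    # one ordinary conv, then two 1x1 convs acting as per-pixel fully connected layers
    blk = tf.keras.models.Sequential()
    blk.add(tf.keras.layers.Conv2D(num_channels, kernel_size,
                                   strides=strides, padding=padding,
                                   activation='relu'))
    blk.add(tf.keras.layers.Conv2D(num_channels, kernel_size=1, activation='relu'))
    blk.add(tf.keras.layers.Conv2D(num_channels, kernel_size=1, activation='relu'))
    return blk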

7.3.2 NiN Model

Besides using NiN blocks, NiN's design differs from AlexNet in another significant way: NiN removes AlexNet's final 3 fully connected layers. Instead, it uses a NiN block whose number of output channels equals the number of label classes, followed by a global average pooling layer that averages all elements in each channel and feeds the result directly into classification. The global average pooling layer here is simply an average pooling layer whose window shape equals the spatial shape of the input. The benefit of this design is a much smaller parameter count, which mitigates overfitting. However, it sometimes increases the training time needed to obtain an effective model.
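A minimal sketch of that final classification head (10 classes assumed; the layer sizes here are illustrative only): the last NiN block outputs one channel per class, and global average pooling collapses each channel's feature map into a single score:

import tensorflow as tf

# Final layers of a NiN-style network, assuming 10 classes: the last NiN block's
# output channels equal the class count, then global average pooling.
head = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(10, kernel_size=3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(10, kernel_size=1, activation='relu'),
    tf.keras.layers.Conv2D(10, kernel_size=1, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),   # output shape (batch, 10)
])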

Summary

  1. NiN blocks use \(1\times1\) conv layers
  2. NiN removes the FC layers and replaces them with global average pooling (GAP)
  3. Removing the FC layers greatly reduces the parameter count and helps avoid overfitting

7.4 Networks with Parallel Concatenations (GoogLeNet)

Earlier mainstream networks used convolution kernels ranging from as small as 1×1 to as large as 11×11.

GoogLeNet's new idea is that a single block can combine kernels of several different sizes.

One focus of the paper was to address the question of which sized convolutional kernels are best. After all, previous popular networks employed choices as small as 1×1 and as large as 11×11. One insight in this paper was that sometimes it can be advantageous to employ a combination of variously-sized kernels.

7.4.1 Inception Blocks

The Inception block has 4 parallel paths. The first 3 paths use convolutional layers with window sizes of 1×1, 3×3, and 5×5 to extract information at different spatial scales; the middle 2 of these paths first apply a 1×1 convolution to the input to reduce the number of input channels and lower the model complexity. The 4th path uses a 3×3 max pooling layer followed by a 1×1 convolutional layer to change the number of channels. All 4 paths use suitable padding so that the input and output have the same height and width. Finally, the outputs of the 4 paths are concatenated along the channel dimension and passed to the next layer.

The parameters usually tuned in an Inception block are the output channel counts of each layer.

The commonly-tuned parameters of the Inception block are the number of output channels per layer.
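A minimal functional-style sketch of an Inception block in TF2; the per-path channel counts (c1 for path 1, the pairs c2 and c3 for paths 2 and 3, c4 for path 4) are hypothetical parameter names for exactly the quantities the quote above says get tuned:

import tensorflow as tf

def inception_block(x, c1, c2, c3, c4):
    # Path 1: 1x1 conv
    p1 = tf.keras.layers.Conv2D(c1, 1, activation='relu')(x)
    # Path 2: 1x1 conv to reduce channels, then 3x3 conv
    p2 = tf.keras.layers.Conv2D(c2[0], 1, activation='relu')(x)
    p2 = tf.keras.layers.Conv2D(c2[1], 3, padding='same', activation='relu')(p2)
    # Path 3: 1x1 conv to reduce channels, then 5x5 conv
    p3 = tf.keras.layers.Conv2D(c3[0], 1, activation='relu')(x)
    p3 = tf.keras.layers.Conv2D(c3[1], 5, padding='same', activation='relu')(p3)
    # Path 4: 3x3 max pooling, then 1x1 conv
    p4 = tf.keras.layers.MaxPool2D(3, strides=1, padding='same')(x)
    p4 = tf.keras.layers.Conv2D(c4, 1, activation='relu')(p4)
    # Concatenate the 4 paths along the channel dimension
    return tf.keras.layers.Concatenate()([p1, p2, p3, p4])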

7.4.2 GoogLeNet Model

TF2 Code

https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.9_googlenet

Summary

  • An Inception block is equivalent to a subnetwork containing 4 parallel paths
  • The channel allocation ratios of the Inception blocks in GoogLeNet were obtained through extensive experiments on the ImageNet dataset
  • GoogLeNet and its successors were for a time among the most efficient models on ImageNet: at similar test accuracy, their computational complexity is often lower

7.5 Batch Normalization

7.5.1 Training Deep Networks

Generally speaking, standardizing the input data is effective enough for shallow models: as training proceeds and the parameters of each layer are updated, the outputs near the output layer rarely change drastically. For deep neural networks, however, even when the input data has been standardized, parameter updates during training can still easily cause drastic changes in the outputs near the output layer. This numerical instability usually makes it hard to train an effective deep model.

Batch normalization was proposed precisely to meet this challenge of training deep models. During training, batch normalization uses the mean and standard deviation of the minibatch to continuously adjust the intermediate outputs of the network, making the intermediate outputs of every layer numerically more stable. Batch normalization and the residual networks introduced in the next section provide two important lines of thinking for training and designing deep models.

\[\mathrm{BN}(\mathbf{x}) = \mathbf{\gamma} \odot \frac{\mathbf{x} - \hat{\mathbf{\mu}}_\mathcal{B}}{\hat{\mathbf{\sigma}}_\mathcal{B}} + \mathbf{\beta}\]

\(\mathbf{x}\) is the pre-activation output of the previous layer (\(\mathbf{x}=\mathbf{W}\mathbf{u}+\mathbf{b}\)); it is normalized and then adjusted by a learnable scale \(\mathbf{\gamma}\) and offset \(\mathbf{\beta}\).

Because the choice of unit variance (vs some other magic number) is an arbitrary choice, we commonly include coordinate-wise scaling coefficients \(\mathbf{\gamma}\) and offsets \(\mathbf{\beta}\).

The mean and variance of a minibatch of inputs \(\mathcal{B}\) are computed as follows:

\[ \hat{\mathbf{\mu}}_\mathcal{B} \leftarrow \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x} \]
\[ \hat{\mathbf{\sigma}}_\mathcal{B}^2 \leftarrow \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\mathbf{\mu}}_{\mathcal{B}})^2 + \epsilon \]

Adding \(\epsilon\) to the variance prevents division by zero during the subsequent normalization.
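A minimal sketch of the training-mode computation for a fully connected layer (TF2 assumed), with \(\epsilon\) in the divide-by-zero-guard role described above:

import tensorflow as tf

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features); statistics are computed over the minibatch dimension
    mu = tf.reduce_mean(x, axis=0)
    var = tf.reduce_mean(tf.square(x - mu), axis=0)
    x_hat = (x - mu) / tf.sqrt(var + eps)   # eps guards against division by zero
    return gamma * x_hat + beta             # learnable scale and offset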


Minibatch size matters a lot for BN: because the mean and variance are estimated from the batch, larger batches give more accurate estimates.

One takeaway here is that when applying BN, the choice of minibatch size may be even more significant than without BN.

With BN in place, one can try raising the learning rate to speed up training.

One piece of practitioner's intuition/wisdom is that BN seems to allow for more aggressive learning rates.

7.5.2 Batch Normalization Layers

Fully-Connected Layers

\[\mathbf{h} = \phi(\mathrm{BN}_{\mathbf{\beta}, \mathbf{\gamma}}(f_{\mathbf{\theta}}(\mathbf{x}) ) ) \]

Recall that mean and variance are computed on the same minibatch \(\mathcal{B}\) on which the transformation is applied. Also recall that the scaling coefficient \(\mathbf{\gamma}\) and the offset \(\mathbf{\beta}\) are parameters that need to be learned jointly with the more familiar parameters \(\mathbf{\theta}\).

Convolutional Layers

BN is applied to convolutional layers differently than to fully connected layers!

Each channel has its own scale and shift.

When the convolution has multiple output channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has its own scale and shift parameters, both of which are scalars.

Assume that our minibatches contain \(m\) examples each and that for each channel, the output of the convolution has height \(p\) and width \(q\). For convolutional layers, we carry out each batch normalization over the \(m \cdot p \cdot q\) elements per output channel simultaneously. Thus we collect the values over all spatial locations when computing the mean and variance and consequently (within a given channel) apply the same \(\hat{\mathbf{\mu}}\) and \(\hat{\mathbf{\sigma}}\) to normalize the values at each spatial location.
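In code this corresponds to reducing over the batch and both spatial axes, so that only one mean/variance pair remains per channel (a sketch, NHWC layout assumed):

import tensorflow as tf

x = tf.random.normal([32, 28, 28, 16])            # (m, p, q, channels) in NHWC
mu = tf.reduce_mean(x, axis=[0, 1, 2])            # shape (16,): one mean per channel
var = tf.reduce_mean(tf.square(x - mu), axis=[0, 1, 2])
x_hat = (x - mu) / tf.sqrt(var + 1e-5)            # same mu/var at every spatial location
# gamma and beta (not shown) would likewise be one scalar per channel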

The behavior during training and during prediction (testing) is not the same.

As we mentioned earlier, BN typically behaves differently in training mode and prediction mode. First, the noise in \(\mathbf{\mu}\) and \(\mathbf{\sigma}\) arising from estimating each on minibatches is no longer desirable once we have trained the model. Second, we might not have the luxury of computing per-batch normalization statistics, e.g., we might need to apply our model to make one prediction at a time.

Typically, after training, we use the entire dataset to compute stable estimates of the activation statistics and then fix them at prediction time. Consequently, BN behaves differently during training and at test time. Recall that dropout also exhibits this characteristic.

TF2 Code

https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.10_batch-norm

Reading the code makes the usage of BatchNorm clearer.

The example below is LeNet with BN inserted: BatchNormalization first, then the activation.

import tensorflow as tf

# LeNet with a BatchNormalization layer inserted before each activation
net = tf.keras.models.Sequential()
net.add(tf.keras.layers.Conv2D(filters=6, kernel_size=5))
net.add(tf.keras.layers.BatchNormalization())
net.add(tf.keras.layers.Activation('sigmoid'))
net.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
net.add(tf.keras.layers.Conv2D(filters=16, kernel_size=5))
net.add(tf.keras.layers.BatchNormalization())
net.add(tf.keras.layers.Activation('sigmoid'))
net.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
net.add(tf.keras.layers.Flatten())
net.add(tf.keras.layers.Dense(120))
net.add(tf.keras.layers.BatchNormalization())
net.add(tf.keras.layers.Activation('sigmoid'))
net.add(tf.keras.layers.Dense(84))
net.add(tf.keras.layers.BatchNormalization())
net.add(tf.keras.layers.Activation('sigmoid'))
net.add(tf.keras.layers.Dense(10, activation='sigmoid'))
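A typical way to train and evaluate this network (a hedged sketch assuming the net defined above and Fashion-MNIST, as in the linked code; the optimizer, batch size, and epoch count here are illustrative only). Keras uses minibatch statistics inside fit and the accumulated moving averages inside evaluate/predict, which is exactly the training-versus-prediction difference discussed above:

# Fashion-MNIST, reshaped to (N, 28, 28, 1) and scaled to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0

net.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
net.fit(x_train, y_train, batch_size=128, epochs=5, validation_split=0.1)
net.evaluate(x_test, y_test)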

7.5.6 Controversy

There is no clear mathematical proof that BN's effectiveness comes from reducing internal covariate shift.

Summary

  • During model training, batch normalization continuously adjusts the intermediate output of the neural network by utilizing the mean and standard deviation of the minibatch, so that the values of the intermediate output in each layer throughout the neural network are more stable.
  • The batch normalization methods for fully connected layers and convolutional layers are slightly different.
  • Like a dropout layer, batch normalization layers have different computation results in training mode and prediction mode.
  • Batch Normalization has many beneficial side effects, primarily that of regularization. On the other hand, the original motivation of reducing covariate shift seems not to be a valid explanation.

7.6 Residual Networks (ResNet)


This article also offers an interesting take on ResNet: https://zhuanlan.zhihu.com/p/42833949

Viewed from the perspective of backpropagation, ResNet helps avoid vanishing gradients. For a residual block \(h(\mathbf{x}) = f(\mathbf{x}) + \mathbf{x}\):

\(\frac{dh}{dx}=\frac{d(f+x)}{dx}=1+\frac{df}{dx}\)

so even when \(\frac{df}{dx}\) is very small, the gradient flowing back through the block stays close to 1.


7.6.1 Function Classes

At the heart of ResNet is the idea that every additional layer should contain the identity function as one of its elements. This means that if we can train the newly-added layer into an identity mapping \(f(\mathbf{x}) = \mathbf{x}\), the new model will be as effective as the original model. As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors. Even better, the identity function rather than the null \(f(\mathbf{x}) = 0\) should be the simplest function within a layer.

7.6.2 Residual Blocks

There are 2 kinds of ResNet blocks; the variant on the right of the book's figure adds a 1x1 convolution to adjust the dimensions (height, width, or channels) before the subsequent addition.
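A minimal TF2 sketch of a residual block with the optional 1x1 convolution on the shortcut (used when the height/width or the channel count changes):

import tensorflow as tf

class Residual(tf.keras.layers.Layer):
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, 3, padding='same',
                                            strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, 3, padding='same')
        # 1x1 conv on the shortcut to match shape (height/width/channels) if needed
        self.conv3 = (tf.keras.layers.Conv2D(num_channels, 1, strides=strides)
                      if use_1x1conv else None)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, x):
        y = tf.keras.activations.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3 is not None:
            x = self.conv3(x)
        return tf.keras.activations.relu(y + x)   # the residual addition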

7.6.3 ResNet Model

pass

Summary

  • Residual blocks allow for a parametrization relative to the identity function \(f(\mathbf{x}) = \mathbf{x}\).
  • Adding residual blocks increases the function complexity in a well-defined manner.
  • We can train an effective deep neural network by having residual blocks pass through cross-layer data channels.
  • ResNet had a major influence on the design of subsequent deep neural networks, both convolutional and sequential in nature.

7.7 Densely Connected Networks (DenseNet)

7.7.1 Function Decomposition

The key point is that it decomposes the function into increasingly higher order terms. In a similar vein, ResNet decomposes functions into

\[f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}).\]

That is, ResNet decomposes \(f\) into a simple linear term and a more complex nonlinear one. What if we want to go beyond two terms? A solution was proposed by Huang et al. (2017) in the form of DenseNet, an architecture that reported record performance on the ImageNet dataset.

The key difference between ResNet and DenseNet: ResNet sums at the end, while DenseNet concatenates.

As shown in fig. 7.7.1, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated rather than added. As a result we perform a mapping from \(\mathbf{x}\) to its values after applying an increasingly complex sequence of functions.

\[\mathbf{x} \to \left[\mathbf{x}, f_1(\mathbf{x}), f_2(\mathbf{x}, f_1(\mathbf{x})), f_3(\mathbf{x}, f_1(\mathbf{x}), f_2(\mathbf{x}, f_1(\mathbf{x})), \ldots\right].\]

In the end, all these functions are combined in an MLP to reduce the number of features again. In terms of implementation this is quite simple---rather than adding terms, we concatenate them.

The name DenseNet reflects what the book's figure shows: each layer is densely connected to all of the preceding layers.

The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers.

DenseNet = dense blocks + transition layers

Dense blocks handle the concatenation.

Transition layers control the number of channels so that it does not grow too large.

The main components that compose a DenseNet are dense blocks and transition layers. The former defines how the inputs and outputs are concatenated, while the latter controls the number of channels so that it is not too large.

7.7.2 Dense Blocks

TF2 Code: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.12_densenet?id=_5121-%e7%a8%a0%e5%af%86%e5%9d%97
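A minimal sketch of a dense block in TF2, assuming the BN-ReLU-Conv ordering used in the book's implementation; each inner block concatenates its input and output on the channel dimension:

import tensorflow as tf

class ConvBlock(tf.keras.layers.Layer):
    def __init__(self, num_channels):
        super().__init__()
        self.bn = tf.keras.layers.BatchNormalization()
        self.conv = tf.keras.layers.Conv2D(num_channels, 3, padding='same')

    def call(self, x):
        y = self.conv(tf.keras.activations.relu(self.bn(x)))
        # concatenate input and output on the channel dimension
        return tf.keras.layers.concatenate([x, y], axis=-1)

class DenseBlock(tf.keras.layers.Layer):
    def __init__(self, num_convs, num_channels):
        super().__init__()
        self.blocks = [ConvBlock(num_channels) for _ in range(num_convs)]

    def call(self, x):
        for blk in self.blocks:          # channels grow by num_channels each step
            x = blk(x)
        return x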

7.7.3 Transition Layers

The authors use a \(1\times 1\) convolutional layer to reduce the number of channels, together with an average pooling layer to reduce the height and width.

Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model. A transition layer is used to control the complexity of the model. It reduces the number of channels by using the \(1\times 1\) convolutional layer and halves the height and width by using an average pooling layer with a stride of 2, further reducing the complexity of the model.

TF2 Code: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.12_densenet?id=_5122-%e8%bf%87%e6%b8%a1%e5%b1%82
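And a matching sketch of a transition layer, following the description above: BN, ReLU, a \(1\times1\) convolution to shrink the channel count, then 2x2 average pooling with stride 2 to halve the height and width:

import tensorflow as tf

def transition_block(num_channels):
    # shrink channels with a 1x1 conv, halve height/width with stride-2 average pooling
    return tf.keras.models.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.Conv2D(num_channels, kernel_size=1),  # reduce channels
        tf.keras.layers.AvgPool2D(pool_size=2, strides=2),    # halve height/width
    ])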

7.7.4 DenseNet Model

TF2 Code: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter05_CNN/5.12_densenet?id=_5123-densenet%e6%a8%a1%e5%9e%8b

Summary

  • In terms of cross-layer connections, unlike ResNet, where inputs and outputs are added together, DenseNet concatenates inputs and outputs on the channel dimension.
  • The main units that compose DenseNet are dense blocks and transition layers.
  • We need to keep the dimensionality under control when composing the network by adding transition layers that shrink the number of channels again.