
D2L-CH6-Convolutional Neural Networks

6. Convolutional Neural Networks

6.1 From Dense Layers to Convolutions

Flattening the pixels and feeding them to a regression model is impractical.

For instance, let us return to our running example of distinguishing cats from dogs. Say that we do a thorough job in data collection, collecting an annotated set of high-quality 1-megapixel photographs. This means that the input into a network has 1 million dimensions. Even an aggressive reduction to 1,000 hidden dimensions would require a dense (fully-connected) layer to support \(10^9\) parameters. Unless we have an extremely large dataset (perhaps billions?), lots of GPUs, a talent for extreme distributed optimization, and an extraordinary amount of patience, learning the parameters of this network may turn out to be impossible.
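Spelling out the arithmetic behind that figure:

\[10^6 \text{ inputs} \times 10^3 \text{ hidden units} = 10^9 \text{ weights},\]

before even counting the bias terms.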

6.1.1 Invariances

Returning to images, the intuitions we have been discussing can be made more concrete, yielding a few key principles for building neural networks for computer vision:

  1. Our vision systems should, in some sense, respond similarly to the same object regardless of where it appears in the image (translation invariance).
  2. Our vision systems should, in some sense, focus on local regions, without regard for what else is happening in the image at greater distances (locality).

Let us see how this translates into mathematics.

6.1.2 Constraining the MLP

I do not fully follow the explanation of the formulas in this section; the key takeaway is that it applies the translation invariance and locality principles from the previous section.

Conclusion: a layer only needs to consider local information.

This, in a nutshell, is the convolutional layer. When the local region (also called a receptive field) is small, the difference as compared to a fully-connected network can be dramatic. While previously we might have required billions of parameters to represent just a single layer in an image-processing network, we now typically need just a few hundred. The price that we pay for this drastic modification is that our features will be translation invariant and that our layer can only take local information into account. All learning depends on imposing inductive bias. When that bias agrees with reality, we get sample-efficient models that generalize well to unseen data. But of course, if those biases do not agree with reality, e.g., if images turned out not to be translation invariant, our models may not generalize well.
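To put the "few hundred" in numbers (my own back-of-the-envelope count, not the book's): a convolutional layer with a \(k_h \times k_w\) kernel, \(c_i\) input channels, and \(c_o\) output channels has \(k_h \times k_w \times c_i \times c_o + c_o\) parameters. For example, a \(5 \times 5\) kernel mapping 3 input channels to 8 output channels needs \(5 \cdot 5 \cdot 3 \cdot 8 + 8 = 608\) parameters, regardless of the image size.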

6.1.3 Convolutions

In Chinese, \(\circledast\) is called 摺積 (convolution).

\[[f \circledast g](x) = \int_{\mathbb{R}^d} f(z) g(x-z) dz.\]

That is, we measure the overlap between \(f\) and \(g\) when one function is "flipped" and shifted by \(x\). Whenever we have discrete objects, the integral turns into a sum. For instance, for vectors in \(\ell_2\), i.e., the set of square-summable infinite-dimensional vectors with index running over \(\mathbb{Z}\), we obtain the following definition.

\[[f \circledast g](i) = \sum_a f(a) g(i-a).\]
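A tiny sketch (mine, not the book's) that evaluates this sum for finite vectors, treating \(f\) and \(g\) as zero outside their support. Note how the index \(i-a\) runs backwards over \(g\): that is exactly the "flip" distinguishing convolution from cross-correlation.

def conv1d(f, g):
    # [f ⊛ g](i) = sum_a f(a) g(i - a); f and g are finite lists, zero elsewhere.
    n = len(f) + len(g) - 1
    out = []
    for i in range(n):
        s = 0.0
        for a in range(len(f)):
            if 0 <= i - a < len(g):
                s += f[a] * g[i - a]
        out.append(s)
    return out

conv1d([1., 2., 3.], [0., 1., 0.5])
# [0.0, 1.0, 2.5, 4.0, 1.5] (same as cross-correlating f with g reversed)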

The book introduces cross-correlation through a rather involved derivation.

6.1.4 Waldo Revisited

The book introduces channels (feature maps) through a similarly involved discussion.

Summary

  • Translation invariance in images implies that all patches of an image will be treated in the same manner.
  • Locality means that only a small neighborhood of pixels will be used for computation.
  • Channels on input and output allow for meaningful feature analysis.

6.2 Convolutions for Images

The code below follows the TensorFlow 2 version.

6.2.1 The Cross-Correlation Operator

The output size is given by the input size \(H \times W\) minus the size of the convolutional kernel \(h \times w\): \((H-h+1) \times (W-w+1)\).

This is the case since we need enough space to 'shift' the convolutional kernel across the image (later we will see how to keep the size unchanged by padding the image with zeros around its boundary so that there is enough space to shift the kernel).

import tensorflow as tf

def corr2d(X, K):
    """Compute the 2D cross-correlation of input X with kernel K."""
    h, w = K.shape
    Y = tf.Variable(tf.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Sum of the elementwise product of the current window with K.
            Y[i, j].assign(tf.cast(tf.reduce_sum(X[i:i+h, j:j+w] * K), dtype=tf.float32))
    return Y


X = tf.constant([[0,1,2], [3,4,5], [6,7,8]])
K = tf.constant([[0,1], [2,3]])
corr2d(X, K)

<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
       [37., 43.]], dtype=float32)>

6.2.2 Convolutional Layers

class Conv2D(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.units = units  # unused below; kept as in the notes

    def build(self, kernel_size):
        # NOTE: Keras normally passes the *input shape* to build(); this class
        # instead expects the kernel shape to be passed in by hand.
        self.w = self.add_weight(name='w',
                                 shape=kernel_size,
                                 initializer=tf.random_normal_initializer())
        self.b = self.add_weight(name='b',
                                 shape=(1,),
                                 initializer=tf.random_normal_initializer())

    def call(self, inputs):
        return corr2d(inputs, self.w) + self.b
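A minimal usage sketch (my own; not in the notes). Because build() here expects the kernel shape rather than the input shape Keras would normally pass, we build the layer by hand and invoke call() directly instead of conv(X):

conv = Conv2D(units=1)
conv.build(kernel_size=(1, 2))    # allocate w with shape (1, 2) and a scalar b
Y = conv.call(tf.ones((6, 8)))    # cross-correlate, then add the bias; shape (6, 7)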

6.2.3 Object Edge Detection in Images

The kernel computes differences between adjacent pixels, capturing edges.

X = tf.Variable(tf.ones((6,8)))
X[:, 2:6].assign(tf.zeros(X[:,2:6].shape))
X
<tf.Variable 'Variable:0' shape=(6, 8) dtype=float32, numpy=
array([[1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.]], dtype=float32)>
K = tf.constant([[1, -1]], dtype=tf.float32)
Y = corr2d(X, K)
Y
<tf.Variable 'Variable:0' shape=(6, 7) dtype=float32, numpy=
array([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.]], dtype=float32)>

6.2.4 Learning a Kernel

Although we constructed the Conv2D class above, corr2d assigns to individual elements ([i, j] =) and therefore cannot be differentiated automatically. Below we use the Conv2D class provided by tf.keras.layers to implement this example instead.

X = tf.reshape(X, (1, 6, 8, 1))
Y = tf.reshape(Y, (1, 6, 7, 1))

# input_shape = (samples, rows, cols, channels)
conv2d = tf.keras.layers.Conv2D(1, (1, 2))
Y.shape
TensorShape([1, 6, 7, 1])
Y_hat = conv2d(X)   # the first call builds the layer's (1, 2, 1, 1) kernel
lr = 3e-2
for i in range(10):
    with tf.GradientTape(watch_accessed_variables=False) as g:
        g.watch(conv2d.weights[0])
        Y_hat = conv2d(X)
        l = (abs(Y_hat - Y)) ** 2
    # Take the gradient outside the tape context and apply a manual SGD step.
    dl = g.gradient(l, conv2d.weights[0])
    update = tf.multiply(lr, dl)
    updated_weights = conv2d.get_weights()
    updated_weights[0] = conv2d.weights[0] - update
    conv2d.set_weights(updated_weights)

    if (i + 1) % 2 == 0:
        print('batch %d, loss %.3f' % (i + 1, tf.reduce_sum(l)))
batch 2, loss 0.235
batch 4, loss 0.041
batch 6, loss 0.008
batch 8, loss 0.002
batch 10, loss 0.000
tf.reshape(conv2d.get_weights()[0],(1,2))

<tf.Tensor: id=1012, shape=(1, 2), dtype=float32, numpy=array([[ 0.99903595, -0.9960023 ]], dtype=float32)>

6.2.5 Cross-Correlation and Convolution

Summary

  • The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation. In its simplest form, this performs a cross-correlation operation on the two-dimensional input data and the kernel, and then adds a bias.
  • We can design a kernel to detect edges in images.
  • We can learn the kernel through data.

6.3 Padding and Stride

In several cases we might want to use particular techniques, padding and strides, to control the size of the output:

  • In general, since kernels generally have width and height greater than \(1\), after applying many successive convolutions we will wind up with an output that is much smaller than our input. If we start with a \(240 \times 240\) pixel image, \(10\) layers of \(5 \times 5\) convolutions reduce the image to \(200 \times 200\) pixels, slicing off \(30\%\) of the image and with it obliterating any interesting information on the boundaries of the original image (see the quick check after this list). Padding handles this issue.
  • In some cases, we want to reduce the resolution drastically if say we find our original input resolution to be unwieldy. Strides can help in these instances.
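A quick check of the first bullet: each \(5 \times 5\) convolution removes \(k - 1 = 4\) pixels in each dimension, so

\[240 - 10 \times 4 = 200, \qquad 1 - \frac{200^2}{240^2} \approx 0.31,\]

i.e. roughly \(30\%\) of the pixels are gone after ten layers.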

6.3.1 Padding

def comp_conv2d(conv2d, X):
    X = tf.reshape(X,(1,) + X.shape + (1,))
    Y = conv2d(X)
    #input_shape = (samples, rows, cols, channels)
    return tf.reshape(Y,Y.shape[1:3])

conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same')
X = tf.random.uniform(shape=(8,8))
comp_conv2d(conv2d,X).shape

TensorShape([8, 8])

NOTE

Understanding X = tf.reshape(X, (1,) + X.shape + (1,))

The new X has shape (1, 8, 8, 1); the reshape makes it match Conv2D's expected input format: (samples, rows, cols, channels).

The position of the channels axis can be changed via the data_format argument (the default is channels_last).
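For reference, the general result from the book: with \(p_h\) total rows and \(p_w\) total columns of zero padding, the output shape becomes

\[(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1).\]

With stride 1, padding='same' chooses \(p_h = k_h - 1\) and \(p_w = k_w - 1\), which is why the \(8 \times 8\) input above stays \(8 \times 8\).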


6.3.2 Stride


In previous examples, we default to sliding one pixel at a time. However, sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one pixel at a time, skipping the intermediate locations.

Below we set the stride in both height and width to 2, halving the input's height and width.

conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same', strides=2)
comp_conv2d(conv2d, X).shape

TensorShape([4, 4])

Next, a slightly more complicated example.

conv2d = tf.keras.layers.Conv2D(1, kernel_size=(3,5), padding='valid', strides=(3,4))
comp_conv2d(conv2d, X).shape

TensorShape([2, 1])
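With both padding and stride, the book's general formula is \(\lfloor (n_h - k_h + p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w)/s_w \rfloor\). Below is a small helper (a sketch; out_dim is my own name) that reproduces the shapes above under TensorFlow's padding conventions:

import math

def out_dim(n, k, s, padding):
    # TF 'same': ceil(n / s); TF 'valid': floor((n - k) / s) + 1.
    if padding == 'same':
        return math.ceil(n / s)
    return (n - k) // s + 1

out_dim(8, 3, 2, 'same'), out_dim(8, 3, 2, 'same')    # (4, 4)
out_dim(8, 3, 3, 'valid'), out_dim(8, 5, 4, 'valid')  # (2, 1)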

Summary

  • Padding can increase the height and width of the output. This is often used to give the output the same height and width as the input.
  • The stride can reduce the resolution of the output, for example reducing the height and width of the output to only \(1/n\) of the height and width of the input (\(n\) is an integer greater than \(1\)).
  • Padding and stride can be used to adjust the dimensionality of the data effectively.

6.4 Multiple Input and Output Channels

def corr2d(X, K):
    h, w = K.shape
    # Promote a 1-D input to a column so the 2-D window logic below still works.
    if len(X.shape) <= 1:
        X = tf.reshape(X, (X.shape[0], 1))
    Y = tf.Variable(tf.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j].assign(tf.cast(tf.reduce_sum(X[i:i+h, j:j+w] * K), dtype=tf.float32))
    return Y

6.4.1 Multiple Input Channels

def corr2d_multi_in(X, K):
    return tf.reduce_sum([corr2d(X[i], K[i]) for i in range(X.shape[0])],axis=0)

X = tf.constant([[[0,1,2],[3,4,5],[6,7,8]],
                 [[1,2,3],[4,5,6],[7,8,9]]])
K = tf.constant([[[0,1],[2,3]],
                 [[1,2],[3,4]]])

corr2d_multi_in(X, K)

<tf.Tensor: id=145, shape=(2, 2), dtype=float32, numpy=
array([[ 56.,  72.],
       [104., 120.]], dtype=float32)>
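Sanity-checking the top-left entry by hand: channel 0 contributes \(0\cdot0 + 1\cdot1 + 3\cdot2 + 4\cdot3 = 19\), channel 1 contributes \(1\cdot1 + 2\cdot2 + 4\cdot3 + 5\cdot4 = 37\), and \(19 + 37 = 56\), matching the output above.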

6.4.2 Multiple Output Channels


Regardless of the number of input channels, so far we always ended up with one output channel. However, as we discussed earlier, it turns out to be essential to have multiple channels at each layer. In the most popular neural network architectures, we actually increase the channel dimension as we go higher up in the neural network, typically downsampling to trade off spatial resolution for greater channel depth.


NOTE

For multiple output channels, the kernel must have shape \(c_o\times c_i\times k_h\times k_w\), where

  • \(c_o\): number of output channels
  • \(c_i\): number of input channels
  • \(k_h\): height of the kernel
  • \(k_w\): width of the kernel


def corr2d_multi_in_out(X, K):
    return tf.stack([corr2d_multi_in(X, k) for k in K],axis=0)


K = tf.stack([K, K+1, K+2],axis=0)
K.shape

TensorShape([3, 2, 2, 2])
corr2d_multi_in_out(X, K)

<tf.Tensor: id=592, shape=(3, 2, 2), dtype=float32, numpy=
array([[[ 56.,  72.],
        [104., 120.]],

       [[ 76., 100.],
        [148., 172.]],

       [[ 96., 128.],
        [192., 224.]]], dtype=float32)>

6.4.3 \(1 \times 1\) Convolutional Layer

Note that the input and output have the same height and width. Each element of the output is a weighted sum, across channels, of the input elements at the same spatial position. If we treat the channel dimension as the feature dimension and the elements along the height and width dimensions as data examples, then the \(1\times 1\) convolutional layer is equivalent to a fully-connected layer.

You could think of the \(1\times 1\) convolutional layer as constituting a fully-connected layer applied at every single pixel location to transform the \(c_i\) corresponding input values into \(c_o\) output values. Because this is still a convolutional layer, the weights are tied across pixel locations. Thus the \(1\times 1\) convolutional layer requires \(c_o\times c_i\) weights (plus the bias terms).

def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = tf.reshape(X,(c_i, h * w))
    K = tf.reshape(K,(c_o, c_i))
    Y = tf.matmul(K, X)
    return tf.reshape(Y, (c_o, h, w))
X = tf.random.uniform((3,3,3))
K = tf.random.uniform((2,3,1,1))

Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)

tf.norm(Y1-Y2) < 1e-6
<tf.Tensor: id=1392, shape=(), dtype=bool, numpy=True>

Summary

  • Multiple channels can be used to extend the model parameters of the convolutional layer.
  • The \(1\times 1\) convolutional layer is equivalent to the fully-connected layer, when applied on a per pixel basis.
  • The \(1\times 1\) convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.

6.5 Pooling


This section introduces pooling layers, which serve the dual purposes of mitigating the sensitivity of convolutional layers to location and of spatially downsampling representations.

6.5.1 Maximum Pooling and Average Pooling

Like a convolutional layer, a pooling layer has a pooling window; unlike a convolutional layer, it has no filter.

Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no filter). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max pooling for short) and average pooling, respectively.
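The book also implements this from scratch; here is a minimal sketch in the same style as corr2d above (pool2d and mode are names I chose to mirror the book's version):

def pool2d(X, pool_size, mode='max'):
    # Slide a (p_h, p_w) window over X; take the max or mean of each window.
    p_h, p_w = pool_size
    Y = tf.Variable(tf.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            window = X[i:i + p_h, j:j + p_w]
            if mode == 'max':
                Y[i, j].assign(tf.reduce_max(window))
            else:  # 'avg'
                Y[i, j].assign(tf.reduce_mean(window))
    return Y

X = tf.constant([[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]])
pool2d(X, (2, 2))           # [[4., 5.], [7., 8.]]
pool2d(X, (2, 2), 'avg')    # [[2., 3.], [5., 6.]]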

6.5.2 Padding and Stride

# X is not defined in this snippet; judging from the pooled outputs printed in
# 6.5.3 below, it is the 4x4 range tensor from the book:
X = tf.reshape(tf.range(16), (1, 4, 4, 1))
pool2d = tf.keras.layers.MaxPool2D(pool_size=[3, 3], padding='same', strides=2)
pool2d(X)

6.5.3 Multiple Channels


When processing multi-channel input data, the pooling layer pools each input channel separately, rather than summing the inputs over channels as a convolutional layer does. This means that the number of output channels for the pooling layer is the same as the number of input channels.

X = tf.stack([X, X+1], axis=3)
X = tf.reshape(X, (2,4,4,1))
X.shape
TensorShape([2, 4, 4, 1])
pool2d = tf.keras.layers.MaxPool2D(3, padding='same', strides=2)
pool2d(X)
<tf.Tensor: id=120, shape=(2, 2, 2, 1), dtype=int32, numpy=
array([[[[ 5],
         [ 6]],

        [[ 7],
         [ 8]]],


       [[[13],
         [14]],

        [[15],
         [16]]]])>

Summary

  • Taking the input elements in the pooling window, the maximum pooling operation assigns the maximum value as the output and the average pooling operation assigns the average value as the output.
  • One of the major functions of a pooling layer is to alleviate the excessive sensitivity of the convolutional layer to location.
  • We can specify the padding and stride for the pooling layer.
  • Maximum pooling, combined with a stride larger than \(1\), can be used to reduce the resolution.
  • The pooling layer's number of output channels is the same as the number of input channels.

6.6 Convolutional Neural Networks (LeNet)

A convolutional neural network is a network that contains convolutional layers. In this section we introduce an early convolutional neural network used to recognize handwritten digit images: LeNet [1]. The name comes from Yann LeCun, the first author of the LeNet paper. LeNet demonstrated that convolutional neural networks trained by gradient descent could achieve what was then the state of the art in handwritten digit recognition. This foundational work first put convolutional neural networks on the stage and made them known to the world.

Only part of the code is listed here; for details see github/trickygo.

net = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=6, kernel_size=5, activation='sigmoid', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(filters=16, kernel_size=5, activation='sigmoid'),
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='sigmoid'),
    tf.keras.layers.Dense(84, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='sigmoid')
])

X = tf.random.uniform((1,28,28,1))
for layer in net.layers:
    X = layer(X)
    print(layer.name, 'output shape\t', X.shape)
conv2d output shape     (1, 24, 24, 6)
max_pooling2d output shape     (1, 12, 12, 6)
conv2d_1 output shape     (1, 8, 8, 16)
max_pooling2d_1 output shape     (1, 4, 4, 16)
flatten output shape     (1, 256)
dense output shape     (1, 120)
dense_1 output shape     (1, 84)
dense_2 output shape     (1, 10)
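The notes stop at the shape check. Below is a minimal training sketch (my own, not from the notes), assuming Fashion-MNIST as in the book; the learning rate follows the book's LeNet experiment, and the other hyperparameters are illustrative:

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0

net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.9),
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
net.fit(x_train, y_train, epochs=5, batch_size=256,
        validation_data=(x_test, y_test))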

Summary

  • A convolutional neural network (ConvNet for short) is a network that uses convolutional layers.
  • In a ConvNet we alternate between convolutions, nonlinearities and often also pooling operations.
  • Ultimately the resolution is reduced prior to emitting an output via one (or more) dense layers.
  • LeNet was the first successful deployment of such a network.