D2L-CH4-Multilayer Perceptrons
4 Multilayer Perceptrons
4.1 Multilayer Perceptrons
4.1.1 Hidden Layers
\(\hat{y}=\text{softmax}(Wx+b)\)
Softmax regression still amounts to a linear transformation of the inputs (an affine map followed by softmax).
Linearity implies the weaker assumption of monotonicity: any increase in a feature's value must either always increase or always decrease the model's output.
The author's examples: to predict loan repayment from income, take the logarithm of income as a feature; to predict health risk from body temperature, use the distance from 37°C as a feature.
And yet despite the apparent absurdity of linearity here, as compared to our previous examples, it's less obvious that we could address the problem with a simple preprocessing fix. That is because the significance of any pixel depends in complex ways on its context (the values of the surrounding pixels). While there might exist a representation of our data that would take into account the relevant interactions among our features (and on top of which a linear model would be suitable), we simply do not know how to calculate it by hand. With deep neural networks, we used observational data to jointly learn both a representation (via hidden layers) and a linear predictor that acts upon that representation.
Incorporating Hidden Layers
We can think of the first L − 1 layers as our representation and the final layer as our linear predictor. This architecture is commonly called a multilayer perceptron, often abbreviated as MLP.
From Linear to Nonlinear
\(h=W_{1}x+b_{1}\)
\(o=W_{2}h+b_{2}\)
\(\hat{y}=\text{softmax}(o)\)
Without a nonlinear activation function, the MLP is equivalent to a single-layer model.
We can view the equivalence formally by proving that for any values of the weights, we can just collapse out the hidden layer, yielding an equivalent single-layer model with parameters \(W=W_{2}W_{1}\) and \(b=W_{2}b_{1}+b_{2}\).
Adding a nonlinear activation function makes this collapse to a single-layer model impossible.
\(h= \sigma (W_{1}x+b_{1})\)
\(o=W_{2}h+b_{2}\)
\(\hat{y}=\text{softmax}(o)\)
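A minimal NumPy sketch (sizes and values are illustrative, not from the book) verifying the collapse claim — without an activation, the two layers equal one layer with \(W=W_{2}W_{1}\), \(b=W_{2}b_{1}+b_{2}\):

import numpy as np

d, h, q = 4, 3, 2                         # illustrative input, hidden, output sizes
W1, b1 = np.random.randn(h, d), np.random.randn(h)
W2, b2 = np.random.randn(q, h), np.random.randn(q)
x = np.random.randn(d)

two_layer = W2 @ (W1 @ x + b1) + b2       # two stacked linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, one_layer))  # True: the hidden layer collapses

sigma = lambda z: np.maximum(z, 0)        # any nonlinearity breaks the collapse
print(np.allclose(W2 @ sigma(W1 @ x + b1) + b2, one_layer))  # generally False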
Even a single-hidden-layer network, given enough nodes (possibly absurdly many) and the right set of weights, can model any function at all. Actually learning that function is the hard part. You might think of your neural network as being a bit like the C programming language. The language, like any other modern language, is capable of expressing any computable program. But actually coming up with a program that meets your specifications is the hard part.
Vectorization and Minibatch
Softmax is a row-wise operation.
Activation functions, however, are not only row-wise but also componentwise (applied elementwise).
Batch normalization is an exception, covered in Section 7.5.
4.1.2 Activation Functions
ReLU Function
\(\text{ReLU}(x) = \text{max}(x, 0)\)
\(\text{pReLU}(x) = \text{max}(0, x) + \alpha \text{min}(0, x)\)
[Figures: ReLU and the gradient of ReLU]
The reason for using the ReLU is that its derivatives are particularly well behaved: either they vanish or they just let the argument through. This makes optimization better behaved, and it mitigated the well-documented problem of vanishing gradients that plagued previous versions of neural networks (more on this later).
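A small sketch of ReLU and pReLU in plain NumPy (the α value below is illustrative; in practice it is a learned parameter):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

def prelu(x, alpha=0.1):                 # alpha = 0.1 is illustrative only
    return np.maximum(0, x) + alpha * np.minimum(0, x)

def relu_grad(x):
    return (x > 0).astype(np.float32)    # gradient: 0 for x < 0, 1 for x > 0

x = np.linspace(-3, 3, 7)
print(relu(x))
print(relu_grad(x))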
Sigmoid Function
The sigmoid is often called a squashing function: it squashes any input in the range (-inf, inf) to some value in the range (0, 1).
\(\text{sigmoid}(x)=\frac{1}{1+\text{exp}(-x)}\)
\(\frac{d}{dx}\text{sigmoid}(x)=\frac{\text{exp}(-x)}{(1+\text{exp}(-x))^{2}}=\text{sigmoid}(x)(1-\text{sigmoid}(x))\)
You can think of the sigmoid as a special case of the softmax. However, the sigmoid has mostly been replaced by the simpler and more easily trainable ReLU for most uses in hidden layers.
[Figures: sigmoid and the gradient of sigmoid]
Tanh Function
Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs, transforming them into elements on the interval between -1 and 1:
\(\text{tanh}(x)=\frac{1-\text{exp}(-2x)}{1+\text{exp}(-2x)}\)
\(\frac{d}{dx}\text{tanh}(x)=1-\text{tanh}^2 (x)\)
[Figures: tanh and the gradient of tanh]
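A quick numerical check of the two derivative identities above (central finite differences vs. the closed forms; a sketch, not from the book):

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-4, 4, 9)
eps = 1e-5

fd_sigmoid = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(fd_sigmoid, sigmoid(x) * (1 - sigmoid(x)), atol=1e-6))  # True

fd_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(fd_tanh, 1 - np.tanh(x) ** 2, atol=1e-6))              # True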
4.2 Implementation of Multilayer Perceptron from Scratch
Implementation takes too much time; skipped.
4.3 Concise Implementation of Multilayer Perceptron
Implementation takes too much time; skipped.
4.4 Model Selection, Underfitting and Overfitting
As machine learning scientists, our goal is to discover patterns. But how can we be sure that we have truly discovered a general pattern and not simply memorized our data?
If you altered the model structure or the hyper-parameters during the experiment, you might have noticed that with enough nodes, layers, and training epochs, the model can eventually reach perfect accuracy on the training set, even as the accuracy on test data deteriorates.
4.4.1 Training Error and Generalization Error
training error: the model's error on the training data.
generalization error: ideally, the error over infinitely many samples drawn from the underlying distribution; in practice, it is estimated with a test set.
Statistical Learning Theory
In the standard supervised learning setting, which we have addressed up until now and will stick with throughout most of this book, we assume that both the training data and the test data are drawn independently from identical distributions (commonly called the i.i.d. assumption). This means that the process that samples our data has no memory. The 2nd example drawn and the 3rd drawn are no more correlated than the 2nd and the 2-millionth sample drawn.
Draws might be correlated in time. For example, if we are classifying the topics of Tweets, the news cycle creates temporal dependencies in the topics being discussed, violating any assumption of independence.
In real applications, we sometimes violate the i.i.d. assumption and yet the models still perform well.
In subsequent chapters and volumes, we will discuss problems arising from violations of the i.i.d. assumption.
Model Complexity
Many factors govern generalization. Models whose parameters can take a wider range of values are usually considered more complex.
For neural networks, more training iterations are taken to mean a more complex model, which motivates early stopping.
The author highlights a few factors that tend to influence the generalizability of a model class:
- The number of tunable parameters, sometimes called the degrees of freedom: when it is large, models tend to be more susceptible to overfitting.
- The values taken by the parameters: when weights can take a wider range of values, models can be more susceptible to overfitting.
- The number of training examples: it is trivially easy to overfit a dataset containing only one or two examples even if the model is simple, but overfitting a dataset with millions of examples requires an extremely flexible model.
4.4.2 Model Selection
To compare models with different hyperparameters — with MLPs, for example, different numbers of hidden layers, different numbers of hidden units, and various choices of activation function — we typically employ a validation set.
Validation Dataset
In principle, we should not touch our test set until after we have chosen all our hyperparameters.
I don't understand this sentence; waiting for the forum experts to answer:
https://discuss.mxnet.io/t/multilayer-perceptron/2338/5?u=ychuang
The result is a murky practice where the boundaries between validation and test data are worryingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book we are really working with what should rightly be called training data and validation data, with no true test sets. Therefore, the accuracy reported in each experiment is really the validation accuracy and not a true test set accuracy. The good news is that we do not need too much data in the validation set. The uncertainty in our estimates can be shown to be of the order of \(\mathcal{O}(n^{-\frac{1}{2}})\).
K-Fold Cross-Validation
The original training data is split into K non-overlapping subsets. Then model training and validation are executed K times, each time training on K − 1 subsets and validating on a different subset (the one not used for training in that round). Finally, the training and validation error rates are estimated by averaging over the results from the K experiments.
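A minimal sketch of the K-fold split described above (function names and sizes are mine, for illustration):

import numpy as np

def k_fold_indices(n, k):
    # Shuffle indices 0..n-1 and split them into k non-overlapping folds
    idx = np.random.permutation(n)
    return np.array_split(idx, k)

k = 5
folds = k_fold_indices(100, k)
for i in range(k):
    valid_idx = folds[i]                                               # held-out fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # train on train_idx, evaluate on valid_idx; average the k scores at the end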
4.4.3 Underfitting or Overfitting?
Note that overfitting is not always a bad thing. With deep learning especially, it is well known that the best predictive models often perform far better on training data than on holdout data. Ultimately, we usually care more about the validation error than about the gap between the training and validation errors.
Whether a model underfits or overfits depends on both the model's complexity and the size of the dataset.
Model Complexity
Dataset Size
The fewer samples in the training dataset, the more likely overfitting becomes; as the number of samples grows, the generalization error typically decreases.
Without enough data, deep learning may barely outperform a linear model.
Part of deep learning's current success is owed to the massive datasets produced by internet companies, cheap storage, connected devices, and the broad digitization of the economy.
4.4.4 Polynomial Regression
Skipping the implementation, but the author's way of building simulated data is worth recording!
Construct a polynomial of degree \(d\):
\(y=5+1.2x-3.4\frac{x^2}{2!}+5.6\frac{x^3}{3!}+\epsilon, \quad \epsilon \sim \mathcal{N}(0,0.1^{2})\)
from mxnet import np, npx  # assumed imports; the snippet uses MXNet's NumPy interface
npx.set_np()

maxdegree = 20  # Maximum degree of the polynomial
n_train, n_test = 100, 100  # Training and test dataset sizes
true_w = np.zeros(maxdegree)  # Allocate lots of empty space
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)  # shuffles in place and returns None, so do not assign
poly_features = np.power(features, np.arange(maxdegree).reshape(1, -1))
poly_features = poly_features / (
    npx.gamma(np.arange(maxdegree) + 1).reshape(1, -1))  # rescale x^i by 1/i!
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)
For optimization, we typically want to avoid very large values of gradients, losses, etc. This is why the monomials stored in poly_features are rescaled from \(x^{i}\) to \(\frac{1}{i!}x^{i}\). It allows us to avoid very large values for large exponents \(i\). Factorials are implemented in Gluon using the Gamma function, where \(n! = Γ(n + 1)\).
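For example, in plain Python:

import math
print(math.gamma(4))  # 6.0, since 3! = Γ(4)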
Summary
- A validation set can be used for model selection (provided that it is not used too liberally).
- Underfitting means that the model is not able to reduce the training error, while overfitting results from the training error being much lower than the error on the test dataset.
- We should choose an appropriately complex model and avoid using insufficient training samples.
4.5 Weight Decay
For now, we assume that we already have as much high-quality data as our resources permit, and focus on regularization techniques.
The author uses monomials as the example: simply increasing the degree \(d\) dramatically increases the model's complexity.
Note that the number of terms with degree \(d\) blows up rapidly as \(d\) grows larger. Given \(k\) variables, the number of monomials of degree \(d\) is \(\binom{k-1+d}{k-1}\). Even small changes in degree, say, from 2 to 3 dramatically increase the complexity of our model. Thus we often need a more fine-grained tool for adjusting function complexity.
4.5.1 Squared Norm Regularization
Weight decay is also known as L2 regularization.
The simplest way to measure a function's complexity is its distance from zero — but how precisely should we measure the distance between a function and zero? There is no single right answer.
Measuring complexity by a norm is the common practice: for a linear function \(f(x)=w^{T}x\), use some norm of its weight vector, e.g., \(||w||^{2}\).
We then minimize \(l(w,b)+\frac{\lambda}{2}||w||^{2}\), where \(\lambda\) controls the strength of the norm penalty.
The astute reader might wonder why we work with the squared norm and not the standard norm (i.e., the Euclidean distance). We do this for computational convenience. By squaring the L2 norm, we remove the square root, leaving the sum of squares of each component of the weight vector. This makes the derivative of the penalty easy to compute (the sum of derivatives equals the derivative of the sum).
An easy-to-remember formula: \(||w||^{p}_{p}:=\sum^{d}_{i=1}{|w_{i}|^{p}}\)
See this Zhihu post for reference: https://zhuanlan.zhihu.com/p/35897775
\(||w||_{p}=(\sum^{d}_{i=1}{|w_{i}|^{p}})^{\frac{1}{p}}\)
\(||A||_{F}=(\sum_{i}\sum_{j}{a^{2}_{ij}})^{\frac{1}{2}}\)
One reason to work with the L2 norm is that it places an outsize penalty on large components of the weight vector. This biases our learning algorithm towards models that distribute weight evenly across a larger number of features. In practice, this might make them more robust to measurement error in a single variable.
L1 penalties, in contrast, lead to models that concentrate weight on a small set of features, which may be desirable for other reasons.
The SGD parameter update below shows why this is called "weight decay":
L2 regularization first multiplies the weight by a factor smaller than 1 and then subtracts the penalty-free gradient; hence the name weight decay. Weight decay constrains the model by penalizing parameters with large absolute values, which can help against overfitting. In practice, the squared bias term is sometimes also added to the penalty.
\(w \leftarrow \left(1-\frac{\eta\lambda}{|B|}\right)w-\frac{\eta}{|B|}\sum_{i \in B}x^{(i)}\left(w^{\top}x^{(i)}+b-y^{(i)}\right)\)
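A minimal sketch of one minibatch SGD step with weight decay for linear regression, following the update rule above (the function name and shapes are mine):

import numpy as np

def sgd_weight_decay_step(w, b, X, y, eta, lam):
    # One minibatch step of the update rule above; X has shape (|B|, d)
    B = X.shape[0]
    err = X @ w + b - y                              # residuals w^T x + b - y
    w = (1 - eta * lam / B) * w - (eta / B) * (X.T @ err)
    b = b - (eta / B) * err.sum()                    # bias is typically not decayed
    return w, b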
4.5.2 High-Dimensional Linear Regression
Implementation; skipped.
4.5.3 Implementation from Scratch
Implementation; skipped.
Defining l2 Norm Penalty
An l2 norm penalty implementation worth recording. Dividing by 2 makes the post-differentiation parameter update cleaner: the gradient of the penalty is then simply \(w\).
def l2_penalty(w):
    return (w ** 2).sum() / 2
4.5.4 Concise Implementation
Implementation; skipped.
Summary
- Regularization is a common method for dealing with overfitting. It adds a penalty term to the loss function on the training set to reduce the complexity of the learned model.
- You can have different optimizers within the same training loop, e.g., for different sets of parameters. (The author does not show how in the text.)
4.6 Dropout
Recalling the L2 norm: in probabilistic terms, we could justify this technique by arguing that we have assumed a prior belief that weights take values from a Gaussian distribution with mean 0. More intuitively, we might argue that we encourage the model to spread out its weights among many features rather than depending too much on a small number of potentially spurious associations.
4.6.1 Overfitting Revisited
I don't follow this section.
4.6.2 Robustness through Perturbations
The authors argue that neural network overfitting is characterized by a state in which each layer relies on a specific pattern of activations in the previous layer, calling this condition co-adaptation.
For linear models, the analogous trick is adding zero-mean noise to the inputs, which keeps the expected value unchanged:
\(x'=x+\epsilon\)
\(\epsilon\sim\mathcal{N}(0,\sigma^{2})\)
\(E[x']=x\)
Inverted dropout (the author doesn't name the term; found online): replace each activation \(h\) by \(h'=0\) with probability \(p\) and by \(\frac{h}{1-p}\) otherwise, so that \(E[h']=h\).
NOTE
The input to dropout is the output of the activation function.
4.6.3 Dropout in Practice
Dropout can be understood as replacing the original hidden-layer activations with their dropped-out versions.
Typically, we disable dropout at test time. Given a trained model and a new example, we do not drop out any nodes (and thus do not need to normalize). However, there are some exceptions: some researchers use dropout at test time as a heuristic for estimating the uncertainty of neural network predictions: if the predictions agree across many different dropout masks, then we might say that the network is more confident. For now we will put off uncertainty estimation for subsequent chapters and volumes.
4.6.4 Implementation from Scratch
A NumPy-style implementation worth recording (the mask idea):
from mxnet import np  # assumed import, matching the MXNet NumPy interface used in the book

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # In this case, all elements are dropped out
    if dropout == 1:
        return np.zeros_like(X)
    # Keep each element with probability 1 - dropout, then rescale the survivors
    mask = np.random.uniform(0, 1, X.shape) > dropout
    return mask.astype(np.float32) * X / (1.0 - dropout)
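A quick sanity check (assuming the same MXNet np namespace as above):

X = np.arange(16).reshape(2, 8).astype(np.float32)
print(dropout_layer(X, 0.0))   # unchanged
print(dropout_layer(X, 0.5))   # roughly half zeroed, survivors scaled by 2
print(dropout_layer(X, 1.0))   # all zeros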
4.6.5 Concise Implementation
MXNet has a parameter for deciding whether dropout is applied at test time; I don't know whether TensorFlow has the same mechanism.
Summary
- Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet another tool to avoid overfitting. Often all three are used jointly.
- Dropout replaces an activation \(h\) with a random variable \(h′\) with expected value \(h\) and with variance given by the dropout probability \(p\).
- Dropout is only used during training.
4.7 Forward Propagation, Backward Propagation, and Computational Graphs
4.7.1 Forward Propagation
A quick review:
\(z=W^{(1)}x\), \(\text{where}~~~~x\in\mathbb{R}^{d}\), \(W^{(1)}\in\mathbb{R}^{h \times d}\), \(z\in\mathbb{R}^{h}\)
\(h=\phi(z)\)
\(o=W^{(2)}h\),\(\text{where}~~~~W^{(2)}\in\mathbb{R}^{q \times h}\)
Loss function is \(L=l(o,y)\)
Regularization term is \(s=\frac{\lambda}{2}(||W^{(1)}||^{2}_{F}+||W^{(2)}||^{2}_{F})\)
The author's quick gloss on the Frobenius norm:
where the Frobenius norm of the matrix is simply the L2 norm applied after flattening the matrix into a vector.
Objective function is \(J=L+s\)
NOTE
The author's definitions of loss function vs. objective function:
the discrepancy between \(y\) and \(\hat{y}\) is the loss function,
and loss function + regularization term = objective function.
4.7.2 Computational Graph of Forward Propagation
Understanding forward propagation by drawing its computational graph.
4.7.3 Backpropagation
Referring to Fig. 4.7.1, the parameters are \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\). The objective of backpropagation here is to compute the gradients \(\partial J/\partial \mathbf{W}^{(1)}\) and \(\partial J/\partial \mathbf{W}^{(2)}\).
NOTE
When following BP in Fig. 4.7.1, note which boxes lie on the path between the box of the denominator and the box of the numerator.
Compute gradients for the boxes nearest \(J\) first, then work outward step by step via the chain rule.
The first step is to calculate the gradients of the objective function \(J=L+s\) with respect to the loss term \(L\) and the regularization term \(s\).
\(\frac{\partial J}{\partial L} = 1 \; \text{and} \; \frac{\partial J}{\partial s} = 1.\)
Next, we compute the gradient of the objective function with respect to the variable of the output layer \(o\) according to the chain rule:
\(\frac{\partial J}{\partial \mathbf{o}} = \text{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q.\)
Next, we calculate the gradients of the regularization term with respect to both parameters.
\(\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)} \; \text{and} \; \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}.\)
I didn't get this at first: since \(s=\frac{\lambda}{2}(||\mathbf{W}^{(1)}||^{2}_{F}+||\mathbf{W}^{(2)}||^{2}_{F})\), differentiating \(\frac{\lambda}{2}||\mathbf{W}||^{2}_{F}\) elementwise gives \(\lambda \mathbf{W}\).
Now we are able to calculate the gradient \(\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}\) of the model parameters closest to the output layer. Using the chain rule yields:
\(\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}.\)
Where does \(h^{\top}\) come from? Since \(o=W^{(2)}h\), we have \(o_i=\sum_j W^{(2)}_{ij}h_j\), so \(\partial o_i/\partial W^{(2)}_{ij}=h_j\); collecting terms gives the outer product \(\frac{\partial J}{\partial \mathbf{o}}h^{\top}\).
To obtain the gradient with respect to \(\mathbf{W}^{(1)}\) we need to continue backpropagation along the output layer to the hidden layer. The gradient with respect to the hidden layer’s outputs \(\partial J/\partial \mathbf{h} \in \mathbb{R}^h\) is given by
\(\frac{\partial J}{\partial \mathbf{h}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}.\)
Where does \({\mathbf{W}^{(2)}}^{\top}\) come from? Since \(o=W^{(2)}h\), \(\partial o/\partial h=W^{(2)}\), and propagating a gradient backward through a linear map multiplies by its transpose.
Since the activation function \(\phi\) applies elementwise, calculating the gradient \(\partial J/\partial \mathbf{z} \in \mathbb{R}^h\) of the intermediate variable \(\mathbf{z}\) requires that we use the elementwise multiplication operator, which we denote by \(\odot\).
\(\frac{\partial J}{\partial \mathbf{z}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right) = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right).\)
Finally, we can obtain the gradient \(\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}\) of the model parameters closest to the input layer. According to the chain rule, we get
\(\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right) = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}.\)
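A NumPy sketch of the backward pass derived above (sizes, values, and the choice of \(\phi=\) ReLU are mine, for concreteness):

import numpy as np

d, h, q, lam = 4, 3, 2, 0.1
x = np.random.randn(d)
W1, W2 = np.random.randn(h, d), np.random.randn(q, h)

z = W1 @ x                      # forward pass
hid = np.maximum(z, 0)          # h = phi(z), with phi = ReLU here
o = W2 @ hid

dJ_do = np.random.randn(q)      # stand-in for dL/do from some loss l(o, y)
dJ_dW2 = np.outer(dJ_do, hid) + lam * W2   # dJ/do h^T + lambda W2
dJ_dh = W2.T @ dJ_do                       # W2^T dJ/do
dJ_dz = dJ_dh * (z > 0)                    # dJ/dh ⊙ phi'(z)
dJ_dW1 = np.outer(dJ_dz, x) + lam * W1     # dJ/dz x^T + lambda W1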
4.7.4 Training a Model
After the model parameters are initialized, we alternate forward propagation with backpropagation, updating the parameters using the gradients computed by backprop. Since backpropagation reuses the intermediate values computed during the forward pass to avoid redundant computation, those intermediates cannot be freed until backprop finishes — one important reason training uses more memory than prediction. Moreover, the number of such intermediate variables grows roughly linearly with the number of layers, and each one's size scales with the batch size and the input dimension, which is why deeper networks trained with larger batches run out of memory more easily.
Summary
- Forward propagation sequentially calculates and stores intermediate variables within the compute graph defined by the neural network. It proceeds from input to output layer.
- Back propagation sequentially calculates and stores the gradients of intermediate variables and parameters within the neural network in the reversed order.
- When training deep learning models, forward propagation and back propagation are inter-dependent.
- Training requires significantly more memory and storage.
4.8 Numerical Stability and Initialization
4.8.1 Vanishing and Exploding Gradients
Vanishing Gradients
The author illustrates vanishing gradients with the sigmoid: unless the input lies roughly within [-4, 4], its gradient is essentially zero.
This is why ReLU is the prevailing activation function today.
Exploding Gradients
The author builds random Gaussian matrices and multiplies 100 of them together to simulate BP through a deep network; the entries of the product blow up.
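A minimal sketch of that experiment (the 4×4 size is illustrative):

import numpy as np

M = np.eye(4)
for _ in range(100):                       # multiply 100 random Gaussian matrices
    M = M @ np.random.normal(size=(4, 4))
print(np.abs(M).max())                     # typically astronomically large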
Symmetry
I don't understand this!! (Gist from the book: if a layer's hidden units are initialized identically, they compute the same outputs and receive the same gradients, so they never differentiate — the layer behaves as if it had a single unit. Hence random initialization is needed to break symmetry.)
4.8.2 Parameter Initialization
Default Initialization
Parameters are sampled from a uniform distribution;
biases are set to 0.
Xavier Initialization
Suppose a fully connected layer has \(a\) inputs and \(b\) outputs. Xavier initialization samples each weight of the layer from the uniform distribution
\(U\left(-\sqrt{\frac{6}{a+b}}, \sqrt{\frac{6}{a+b}}\right).\)
Its design rationale: after initialization, the variance of each layer's outputs should not depend on the layer's number of inputs, and the variance of each layer's gradients should not depend on its number of outputs.
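A sketch of Xavier initialization per the formula above (function name and sizes are mine; the variance of \(U(-r,r)\) is \(r^2/3=\frac{2}{a+b}\)):

import numpy as np

def xavier_uniform(a, b):
    # a = number of inputs (fan_in), b = number of outputs (fan_out)
    bound = np.sqrt(6.0 / (a + b))
    return np.random.uniform(-bound, bound, size=(b, a))

W = xavier_uniform(256, 128)
print(W.std(), np.sqrt(2.0 / (256 + 128)))  # empirical std matches sqrt(2/(a+b))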
Beyond
Summary
- Vanishing and exploding gradients are common issues in very deep networks, unless great care is taken to ensure that gradients and parameters remain well controlled.
- Initialization heuristics are needed to ensure that at least the initial gradients are neither too large nor too small.
- The ReLU addresses one of the vanishing gradient problems, namely that gradients vanish for very large inputs. This can accelerate convergence significantly.
- Random initialization is key to ensure that symmetry is broken before optimization.
4.9 Considering the Environment
Sometimes a model appears to perform brilliantly as measured by test-set accuracy, yet fails catastrophically in deployment when the data distribution suddenly shifts.
Sometimes deploying a model itself perturbs the data distribution. Suppose, for example, that we trained a model to predict who will repay a loan versus default, and found that an applicant's choice of footwear correlates with default risk (Oxfords indicate repayment, sneakers indicate default). We might then be inclined to grant loans to all applicants wearing Oxfords and deny all applicants wearing sneakers.
Once we start making decisions based on footwear, customers will catch on and change their behavior. Before long, all applicants will be wearing Oxfords, with no accompanying improvement in creditworthiness. Take a moment to digest this, because similar problems lurk in many applications of machine learning: by introducing model-based decisions into the environment, we may break the model.
While we cannot give these topics a complete treatment in one section, the aim is to expose some common concerns and stimulate the critical thinking needed to detect such situations early, mitigate the damage, and use machine learning responsibly. Some solutions are simple (ask for the "right" data), some are technically difficult (implement a reinforcement learning system), and others require stepping outside statistical prediction entirely and grappling with ethical questions about the application of algorithms.
4.9.1 Distribution Shift
This section focuses on why a well-trained model can perform poorly in real applications.
Covariate Shift
We assume that although the distribution of inputs may change over time, the labeling function, i.e., the conditional distribution \(P(y|x)\) does not change.
Mathematically,we could say that \(P(x)\) changes but that \(P(y|x)\) remains unchanged.
Example: building an animal classifier trained on real photos but tested on cartoon images.
Label Shift
Label shift is a reasonable assumption to make when we believe that \(y\) causes \(x\).
Example: diseases cause symptoms.
Concept Shift
If we were to build a machine translation system, the distribution \(P(y|x)\) might be different depending on our location.
Example: the same drink goes by different names in different regions.
Examples
Before we go into further detail and discuss remedies, we can discuss a number of situations where covariate and concept shift may not be so obvious.
Medical Diagnostics
A disease mainly strikes the elderly, but the detector's training samples are drawn from young students.
Self Driving Cars
Self-driving training samples came from simulated data produced by a game's rendering engine, where the whole simulated world was rendered with the same textures — textures the model then latched onto.
The US Army wanted to detect tanks hidden in a forest: they photographed the forest without tanks, then drove tanks in and took a second set of photos. The classifier seemed to work perfectly — but it had merely learned to distinguish shadows, since the first set was taken in the morning and the second at noon.
Nonstationary distributions
The data distribution changes, but the model is not updated accordingly:
- an advertising model not updated when new products launch
- a spam filter confronted with content it has never seen before
- a recommender system pushing items out of season
Covariate Shift Correction
I don't follow this...
GANs are mentioned at the end.
Label Shift Correction
I don't follow this...
Concept Shift Correction
I don't follow this...
4.9.2 A Taxonomy of Learning Problems
Batch Learning
The problems discussed in this section stem mainly from this mode of learning...
Online Learning
Continually update the model as new samples arrive.
Bandits
in a bandit problem we only have a finite number of arms that we can pull (i.e., a finite number of actions that we can take).
Control (and nonadversarial Reinforcement Learning)
In many cases the environment remembers what we did.
What users do on a website depends on what we showed them.
Reinforcement Learning
In the more general case of an environment with memory, we may encounter situations where the environment is trying to cooperate with us (cooperative games, in particular for non-zero-sum games), or others where the environment will try to win.
4.9.3 Fairness, Accountability, and Transparency in Machine Learning
it is important to remember that when you deploy machine learning systems you are not simply minimizing negative log likelihood or maximizing accuracy—you are automating some kind of decision process.
"Accuracy" is seldom the right metric. When translating predictions into actions, we will often want to take into account the potential cost sensitivity of erring in various ways.
We also want to be careful about how prediction systems can lead to feedback loops. For example, if prediction systems are applied naively to predictive policing, allocating patrol officers accordingly, a vicious cycle might emerge.
Summary
- In many cases training and test set do not come from the same distribution. This is called covariate shift.
- Covariate shift can be detected and corrected if the shift is not too severe. Failure to do so leads to nasty surprises at test time.
- In some cases the environment remembers what we did and will respond in unexpected ways. We need to account for that when building models.
4.10 Predicting House Prices on Kaggle
Observing the author's preprocessing approach:
first concatenate the training and test data, standardize the features together, handle missing values, and one-hot encode, then re-split into training and test sets.
The author uses relative error instead of absolute error; see the original text for the rationale.
\(\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2}\)
Zhihu discussion of RMSLE: because of the logarithm, underestimating incurs a larger loss than overestimating by the same absolute amount: \(\left(\log 90 - \log 100 \right)^2 > \left(\log 110 - \log 100 \right)^2\)
def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set any
    # value less than 1 to 1. `loss` is assumed to be gluon.loss.L2Loss(),
    # which computes (x - y)^2 / 2 -- hence the factor of 2 below.
    clipped_preds = np.clip(net(features), 1, float('inf'))
    return np.sqrt(2 * loss(np.log(clipped_preds), np.log(labels)).mean())
NOTE
np.clip
Bounds the elements of an array; useful for avoiding errors when taking log or exp.
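For example:

import numpy as np
print(np.clip(np.array([-5.0, 0.5, 3.0]), 1, float('inf')))  # [1. 1. 3.]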
Summary
- Real data often contains a mix of different data types and needs to be preprocessed.
- Rescaling real-valued data to zero mean and unit variance is a good default. So is replacing missing values with their mean.
- Transforming categorical variables into indicator variables allows us to treat them like vectors.
- We can use k-fold cross validation to select the model and adjust the hyper-parameters.
- Logarithms are useful for relative loss.