
D2L-CH3-Linear Neural Networks

3. Linear Neural Networks

3.1.1 Basic Elements of Linear Regression

Assumption 1: the relationship between the features \(x\) and the targets \(y\) is linear

Assumption 2: any noise is well-behaved (following a Gaussian distribution)

The quantity we are trying to predict is called the target or label.

The variables the prediction is based on are called features or covariates.

Each input has an index \(i\): \(x^{(i)}=[x^{(i)}_{1}, x^{(i)}_{2}]\)

The corresponding label is \(y^{(i)}\).

Linear Model

price = \(w_{area} \cdot area + w_{age} \cdot age + b\)

\(w_{area}\) and \(w_{age}\) are called the weights

\(b\) is called the bias, offset, or intercept

Loss Function

The loss function quantifies the distance between the real and predicted value of the target.

squared error for example \(i\): \(l^{(i)}(w,b)=\frac{1}{2}(\hat{y}^{(i)}-y^{(i)})^{2}\)


Note

The factor \(\frac{1}{2}\) in front is there so that it cancels when taking the derivative.


Analytic Solution

When the loss function is a quadratic form, it is convex; moreover, when the features are linearly independent, it is strictly convex.

\(w^{*}=(X^{T}X)^{-1}X^{T}y\)
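As a rough illustration, here is a minimal numpy sketch of this normal-equation solution. The synthetic data, and the convention of appending a column of ones to X so that the bias b is folded into w, are assumptions made for the example, not part of the text.

import numpy as np

# Synthetic data: 100 samples, 2 features, plus a column of ones so that b is folded into w.
X = np.hstack([np.random.randn(100, 2), np.ones((100, 1))])
true_w = np.array([2.0, -3.4, 4.2])           # last entry plays the role of b
y = X @ true_w + 0.01 * np.random.randn(100)

# Normal equations: w* = (X^T X)^{-1} X^T y; solve() avoids forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)                                 # close to true_w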

Gradient descent

Even in cases where we cannot solve the models analytically, and even when the loss surfaces are high-dimensional and nonconvex, gradient descent can still train the models effectively in practice.

\((w,b) \leftarrow (w,b)-\frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}}{\partial_{(w,b)}\,l^{(i)}(w,b)}\)

To summarize, steps of the algorithm are the following: (i) we initialize the values of the model parameters, typically at random; (ii) we iteratively sample random batches from the data (many times), updating the parameters in the direction of the negative gradient.
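A minimal sketch of those two steps for linear regression with squared loss, in plain numpy; the hyperparameter names eta, batch_size, and num_epochs and their default values are illustrative, not taken from the text.

import numpy as np

def sgd_linear_regression(X, y, eta=0.03, batch_size=10, num_epochs=3):
    """Minibatch SGD for linear regression with squared loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0                      # (i) initialize the parameters
    for _ in range(num_epochs):
        indices = np.random.permutation(n)       # (ii) sample random minibatches
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            err = X[batch] @ w + b - y[batch]    # \hat{y} - y on the minibatch
            w -= eta / len(batch) * (X[batch].T @ err)   # step along the negative gradient
            b -= eta / len(batch) * err.sum()
    return w, b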

Making Predictions with the Learned Model

Making predictions for new inputs is usually called prediction or inference.

However, the author argues that in statistics, inference refers to estimating parameters from a dataset, so the term prediction should be preferred.

We will try to stick with prediction because calling this step inference, despite emerging as standard jargon in deep learning, is somewhat of a misnomer. In statistics, inference more often denotes estimating parameters based on a dataset. This misuse of terminology is a common source of confusion when deep learning practitioners talk to statisticians.

Vectorization for Speed

This part shows how to use vectorized computation instead of Python for-loops.
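For instance, a quick (unscientific) timing comparison along the lines of the book's benchmark; the vector size and the use of time.time() are just for illustration.

import time
import numpy as np

n = 100_000
a, b = np.ones(n), np.ones(n)

start = time.time()
c = np.zeros(n)
for i in range(n):          # element-by-element Python loop
    c[i] = a[i] + b[i]
loop_time = time.time() - start

start = time.time()
d = a + b                   # one vectorized numpy call
vec_time = time.time() - start
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")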

3.1.2 The Normal Distribution and Squared Loss

The probability density of a normal distribution with mean \(\mu\) and variance \(\sigma^{2}\) is given as follows:

\(p(z)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2\sigma^{2}}(z-\mu)^{2}\right)\)

In Python, the probability density of a normal distribution can be written as:

import math
import numpy as np

x = np.arange(-7, 7, 0.01)  # grid of points at which to evaluate the density
def normal(z, mu, sigma):
    # density of N(mu, sigma^2) evaluated at z
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (z - mu)**2)
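A small usage check for the function above: the standard normal density at \(z=0\) should be about \(1/\sqrt{2\pi}\approx 0.3989\).

p = normal(x, mu=0, sigma=1)   # densities of N(0, 1) on the grid x
print(normal(0, 0, 1))         # ≈ 0.3989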

The author points out that using mean squared error as the loss function for linear regression amounts to assuming that the observations are normally distributed (additive Gaussian noise):

\(y=\boldsymbol{w}^{T}\boldsymbol{x}+b+\varepsilon\), where \(\varepsilon\sim\mathcal{N}(0, \sigma^{2})\)

The likelihood of seeing a particular \(y\) for a given \(x\):

\(p(y|x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}exp(-\frac{1}{2\sigma^{2}}(y-\boldsymbol{w}^{T}\boldsymbol{x}-b)^{2})\)

According to the maximum likelihood principle, the best values of \(b\) and \(\boldsymbol{w}\) are those that maximize the likelihood of the entire dataset:

\(P(Y|X)=\prod^{n}_{i=1}{p(y^{(i)}|\boldsymbol{x}^{(i)})}\)

Estimators chosen according to the maximum likelihood principle are called Maximum Likelihood Estimators (MLE). While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the logarithm of the likelihood instead.

Equivalently, without changing anything, we can minimize the negative log-likelihood (NLL) \(−\log{p(\boldsymbol{y}|\boldsymbol{X})}\):

\(−\log{p(\boldsymbol{y}|\boldsymbol{X})}=\sum^{n}_{i=1}\left[\frac{1}{2}\log{(2\pi\sigma^{2})}+\frac{1}{2\sigma^{2}}\left(y^{(i)}-\boldsymbol{w}^{T}\boldsymbol{x}^{(i)}-b\right)^{2}\right]\)

Now we just need one more assumption: that \(\sigma\) is some fixed constant. Thus we can ignore the first term, because it does not depend on \(\boldsymbol{w}\) or \(b\). The second term is identical to the squared error objective introduced earlier, up to the multiplicative constant \(\frac{1}{\sigma^{2}}\). Fortunately, the solution does not depend on \(\sigma\). It follows that minimizing the squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
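A small numeric sketch (random data and an arbitrary fixed \(\sigma\), both made up for illustration) confirming that the NLL equals a constant plus the scaled squared-error objective, so both are minimized by the same \(w\) and \(b\).

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 2)), rng.normal(size=50)
w, b, sigma = np.array([1.0, -2.0]), 0.5, 1.3     # arbitrary parameter values

err = y - X @ w - b
nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + err**2 / (2 * sigma**2))
sse = 0.5 * np.sum(err**2)                        # squared-error objective

# NLL = n * constant + sse / sigma^2: the extra terms do not depend on w or b.
const = 0.5 * np.log(2 * np.pi * sigma**2)
print(np.isclose(nll, len(y) * const + sse / sigma**2))   # True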


NOTE

The point is that, once a few assumptions hold, maximizing the likelihood is the same as minimizing the squared error.


3.1.3 From Linear Regression to Deep Networks

Neural Network Diagram

Linear regression is a single-layer network architecture.

Why single-layer? Because it has only one (linear) transformation!

we can regard this transformation as a fully-connected layer, also commonly called a dense layer.

Biology

Mapping the diagram of a biological neuron onto the artificial neural network architecture is a bit of a stretch.

Summary

  • Key ingredients of a machine learning model:
    • training data
    • a loss function
    • an optimization algorithm
    • the model itself
  • Minimizing an objective function and performing maximum likelihood estimation can mean the same thing

3.2 Linear Regression Implementation from Scratch

Implemented with MXNet; skipped.

3.3 Concise Implementation of Linear Regression

Implemented with MXNet, using Gluon to build the neural network.

3.4 Softmax Regression

For classification problems, we want either hard assignments or soft assignments (a probability for each category).

This distinction tends to get blurred, in part because, even when we only care about hard assignments, we still use models that make soft assignments.

3.4.1 Classification Problems

Statisticians have long used one-hot encoding to represent categorical data:

\(y \in \{(1,0,0),(0,1,0),(0,0,1)\}\)
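A tiny numpy sketch of one-hot encoding integer class labels; the label values and the trick of indexing into an identity matrix are just one common way to do it.

import numpy as np

labels = np.array([0, 2, 1])            # integer class indices, e.g. cat, chicken, dog
num_classes = 3
one_hot = np.eye(num_classes)[labels]   # each row is a standard basis vector
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]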

Network Architecture

\(o=Wx+b\), where \(o\) is referred to as the logits

Like linear regression, softmax regression is a single-layer NN.

Softmax Operation

Terminology: calibration. To interpret our outputs as probabilities, we must guarantee that (even on new data) they will be nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model to estimate probabilities faithfully: of all instances when a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This property is called calibration.

To transform the logits so that they become nonnegative and sum to 1, while keeping the model differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):

\(\hat{y}=\text{softmax}(o)\) where \(\hat{y_{i}}=\frac{\text{exp}(o_{i})}{\sum_{j}{\text{exp}(o_{j})}}\)

\(o\) can also be described as the pre-softmax values that determine the class probabilities.

\(o^{(i)}=Wx^{(i)}+b,\) where \(\hat{y}^{(i)}=\text{softmax}(o^{(i)})\)

Vectorization for Minibatches

Given a minibatch \(X\) of samples with feature dimensionality \(d\) and batch size \(n\):

\(O=XW+b\)

\(\hat{Y}=\text{softmax}(O)\)
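A minimal numpy sketch of this minibatch forward pass; the shapes n, d, and q are made up, and this naive softmax ignores the overflow issue discussed later in 3.7.2.

import numpy as np

def softmax(O):
    # Row-wise softmax: exponentiate, then normalize each row to sum to 1.
    expO = np.exp(O)
    return expO / expO.sum(axis=1, keepdims=True)

n, d, q = 4, 5, 3                      # batch size, feature dim, number of classes
X = np.random.randn(n, d)
W, b = np.random.randn(d, q), np.zeros(q)

O = X @ W + b                          # logits, shape (n, q)
Y_hat = softmax(O)                     # each row is a probability distribution
print(Y_hat.sum(axis=1))               # every entry ≈ 1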

3.4.2 Loss Function

Log-Likelihood

The softmax function gives us a vector \(\hat{y}\), which we can interpret as estimated conditional probabilities of each class given the input \(x\), e.g., \(\hat{y}_{1}=\hat{P}(y=\text{cat}|x)\).

Taking the logarithm turns the product into a sum, which is convenient for differentiation; since the log of a probability is negative, we add a minus sign in front so that the loss is positive.

\(P(Y|X)=\prod_{i=1}^{n}{P(y^{(i)}|x^{(i)})}\) and thus \(-\text{log}P(Y|X)=\sum_{i=1}^{n}{-\text{log}P(y^{(i)}|x^{(i)})}\)

Maximizing \(P(Y|X)\) is the same as minimizing \(-\text{log}P(Y|X)\).

This loss function is also known as the cross-entropy loss.

\(\text{loss function}=l=-\text{log}P(y|x)=-\sum_{j}{y_{j}\text{log}\hat{y}_{j}}\)
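A minimal sketch of this loss for one-hot labels; Y_hat is assumed to hold valid row-wise probabilities (e.g., from the softmax sketch above), and averaging over the minibatch is a convention chosen for the example.

import numpy as np

def cross_entropy(Y_hat, Y):
    # -sum_j y_j * log(yhat_j), averaged over the minibatch; Y is one-hot.
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
print(cross_entropy(Y_hat, Y))   # = -(log 0.7 + log 0.8) / 2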

Softmax and Derivatives

Differentiating the cross-entropy loss with respect to the logits, as shown below, turns out to be as simple as in regression: the gradient is just the difference of two values.

\(\partial_{o_{j}}l=\frac{\text{exp}(o_{j})}{\sum_{k}{\text{exp}(o_{k})}}-y_{j}=\text{softmax}(o)_{j}-y_{j}=P(y=j|x)-y_{j}\)

In other words, the gradient is the difference between the probability assigned to the true class by our model, as expressed by the probability \(P(y|x)\), and what actually happened, as expressed by \(y\).
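A quick finite-difference check of this gradient; the softmax and loss definitions are restated so the snippet is self-contained, and the particular logits and label are made up.

import numpy as np

def softmax(o):
    e = np.exp(o)
    return e / e.sum()

def loss(o, y):
    # cross-entropy of softmax(o) against the one-hot label y
    return -np.sum(y * np.log(softmax(o)))

o = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])          # true class is index 1

analytic = softmax(o) - y              # softmax(o)_j - y_j
numeric = np.zeros_like(o)
eps = 1e-6
for j in range(len(o)):
    e_j = np.zeros_like(o)
    e_j[j] = eps
    numeric[j] = (loss(o + e_j, y) - loss(o - e_j, y)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))   # True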

Cross-Entropy Loss

\(l(y,\hat{y})=-\sum_{j}{y_{j}\text{log}\hat{y}_{j}}\)

3.4.3 Information Theory Basics

Entropy

Entropy can be viewed as a tool for quantifying the information content in data.

The central idea in information theory is to quantify the information content in data.

\(H[p]=\sum_{j}{-p(j)\text{log}p(j)}\)

The text mentions the nat, an information-theoretic unit: the amount of information measured using the natural logarithm, the base-\(e\) analogue of the bit.

Surprisal

Hard to follow at first. Roughly: the surprisal \(-\log p(j)\) measures how surprised we are to observe event \(j\); rarer events are more surprising, and entropy is the expected surprisal.

Cross-Entropy Revisited

Hard to follow at first; the definition is quoted below.

What is cross-entropy? The cross-entropy from \(p\) to \(q\), denoted \(H(p, q)\), is the expected surprisal of an observer with subjective probabilities \(q\) upon seeing data that was actually generated according to probabilities \(p\). The lowest possible cross-entropy is achieved when \(p=q\). In this case, the cross-entropy from \(p\) to \(q\) is \(H(p,p)=H(p)\).

Our loss is lower-bounded by the entropy given by the actual conditional distributions \(P(y|x)\).

Kullback Leibler Divergence

Hard to follow at first; the key relation is given below.

This is simply the difference between the cross-entropy and the entropy, i.e., the additional cross-entropy incurred over the irreducible minimum value it could take:

\(D(p||q)=H(p,q)-H[p]=\sum_{j}{p(j)\text{log}\frac{p(j)}{q(j)}}\)

Minimizing \(D(p||q)\) is therefore equivalent to minimizing the cross-entropy loss.

Note that in classification, we do not know the true \(p\), so we cannot compute the entropy directly. However, because the entropy is out of our control, minimizing \(D(p||q)\) with respect to \(q\) is equivalent to minimizing the cross-entropy loss.
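A small numeric sketch with two made-up discrete distributions, checking that \(D(p||q)=H(p,q)-H[p]\) and that the cross-entropy is bounded below by the entropy.

import numpy as np

p = np.array([0.5, 0.3, 0.2])             # "true" distribution
q = np.array([0.4, 0.4, 0.2])             # model's subjective probabilities

H_p = -np.sum(p * np.log(p))              # entropy of p (in nats)
H_pq = -np.sum(p * np.log(q))             # cross-entropy from p to q
D_pq = np.sum(p * np.log(p / q))          # KL divergence

print(np.isclose(D_pq, H_pq - H_p))       # True: D(p||q) = H(p,q) - H[p]
print(H_pq >= H_p)                        # True: cross-entropy is lower-bounded by the entropy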

Summary

  • We introduced the softmax operation, which takes a vector and maps it into probabilities.
  • Softmax regression applies to classification problems. It uses the probability distribution over output classes produced by the softmax operation.
  • Cross-entropy is a good measure of the difference between two probability distributions. It measures the number of bits needed to encode the data given our model.

3.5 The Image Classification Dataset (Fashion-MNIST)

MNIST

3.6 Implementation of Softmax Regression from Scratch

Work in progress.

Try implementing it with TensorFlow!

3.7 Concise Implementation of Softmax Regression

3.7.2 The Softmax

Mathematically, computing exp poses no problem, but from a computational perspective, exponentiation can be a source of numerical stability issues in the subsequent calculations (as discussed in Section 16.8).

If some of the \(z_{j}\) are very large (i.e., very positive), then \(e^{z_{j}}\) might be larger than the largest number representable in a given float type (i.e., overflow). This would make the denominator (and/or numerator) inf, and we end up with 0, inf, or nan for \(\hat{y}_{j}\).

Conversely, some \(z_{j}\) may have large negative values, so the corresponding \(e^{z_{j}}\) will be close to zero and might be rounded to zero due to finite precision (i.e., underflow), making \(\hat{y}_j\) zero and giving -inf for \(\text{log}(\hat{y}_j)\).

Fortunately, because we end up taking the log of the exp anyway, the following identity lets us avoid computing \(e^{z_{j}}\) directly in the numerator:

\(\text{log}(\hat{y}_{j})=\text{log}\left(\frac{e^{z_{j}}}{\sum_{i=1}^{n}{e^{z_{i}}}}\right)\)

\(=\text{log}(e^{z_{j}})-\text{log}\left(\sum_{i=1}^{n}{e^{z_{i}}}\right)\)

\(=z_{j}-\text{log}\left(\sum_{i=1}^{n}{e^{z_{i}}}\right)\)

For a similar approach, see the log-sum-exp trick: https://en.wikipedia.org/wiki/LogSumExp
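A minimal sketch of a numerically stable log-softmax using that trick (subtracting \(\max(z)\) before exponentiating); the extreme logits are chosen just to provoke overflow in the naive version.

import numpy as np

def log_softmax(z):
    # Subtract max(z): exp can no longer overflow and the log-sum-exp stays finite.
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([1000.0, 0.0, -1000.0])
naive = z - np.log(np.sum(np.exp(z)))     # np.exp(1000) overflows to inf, so every entry is -inf
print(naive)
print(log_softmax(z))                     # finite: approximately [0., -1000., -2000.]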