D2L-CH2-Preliminaries
2. Preliminaries
2.3. Linear Algebra
2.3.1. Scalars
How real numbers (scalars) are denoted in this book.
In this book, we adopt the mathematical notation where scalar variables are denoted by ordinary lower-cased letters (e.g., x, y, and z).
2.3.2. Vectors
You can think of a vector as simply a list of scalar values.
Notation for vectors.
In math notation, we will usually denote vectors as bold-faced, lower-cased letters (e.g., x, y, and z).
2.3.2.1. Length, Dimensionality, and Shape
The length of a vector is also called its "dimension".
The length of a vector is commonly called the dimension of the vector.
dimensionality
To clarify, we use the dimensionality of a vector or an axis to refer to its length, i.e., the number of elements of a vector or an axis. However, we use the dimensionality of an ndarray to refer to the number of axes that an ndarray has. In this sense, the dimensionality of some axis of an ndarray will be the length of that axis.
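A minimal numpy sketch of this distinction (the book uses MXNet ndarrays; plain numpy arrays expose the same length/shape/ndim attributes):
import numpy as np
v = np.arange(4)                 # a vector with 4 elements
A = np.arange(6).reshape(2, 3)   # an ndarray with 2 axes
len(v), v.shape                  # (4, (4,)) -> the dimensionality (length) of the vector is 4
A.ndim, A.shape                  # (2, (2, 3)) -> the ndarray has 2 axes, of lengths 2 and 3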
2.3.3. Matrices
Notation for matrices.
Matrices, which we will typically denote with bold-faced, capital letters (e.g., X, Y, and Z), are represented in code as ndarrays with 2 axes.
2.3.4. Tensors
Tensors are actually the more general concept: a vector is a first-order tensor, and a matrix is a second-order tensor.
Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. Tensors give us a generic way of describing ndarrays with an arbitrary number of axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors.
Notation for tensors.
Tensors are denoted with capital letters of a special font face (e.g., X, Y, and Z).
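For instance (a numpy sketch, not the book's MXNet code), an ndarray with three axes is a third-order tensor:
import numpy as np
X = np.arange(24).reshape(2, 3, 4)   # a third-order tensor
X.ndim, X.shape                      # (3, (2, 3, 4))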
2.3.5. Basic Properties of Tensor Arithmetic
Elementwise multiplication of two matrices is written A * B in Python.
Specifically, elementwise multiplication of two matrices is called their Hadamard product (math notation ⊙).
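A small numpy sketch contrasting the Hadamard product with matrix multiplication (A and B are just illustrative):
import numpy as np
A = np.arange(6).reshape(2, 3)
B = np.ones((2, 3))
A * B             # Hadamard product: elementwise, result has shape (2, 3)
np.dot(A, B.T)    # matrix multiplication: (2, 3) times (3, 2) gives shape (2, 2)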
2.3.6. Reduction
I initially thought functions with an axis argument were hard to understand, but the author's explanation is good: axis=0 can be understood as "dropping" axis 0!
By default, invoking the sum function reduces a tensor along all its axes to a scalar. We can also specify the axes along which the tensor is reduced via summation. Take matrices as an example. To reduce the row dimension (axis 0) by summing up elements of all the rows, we specify axis=0 when invoking sum. Since the input matrix reduces along axis 0 to generate the output vector, the dimension of axis 0 of the input is lost in the output shape.
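A numpy sketch of the "summing along axis 0 drops axis 0" intuition:
import numpy as np
A = np.arange(6).reshape(2, 3)   # shape (2, 3)
A.sum()                          # reduces along all axes -> scalar 15
A.sum(axis=0)                    # sums over the rows -> shape (3,), axis 0 disappears
A.sum(axis=1)                    # sums over the columns -> shape (2,)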
2.3.6.1. Non-Reduction Sum
However, numpy also provides the keepdims argument to keep the reduced dimension.
cumsum likewise does not drop the original dimension.
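For example (numpy sketch):
import numpy as np
A = np.arange(6).reshape(2, 3)
A.sum(axis=1, keepdims=True)        # shape (2, 1): the reduced axis is kept with length 1
A / A.sum(axis=1, keepdims=True)    # keepdims enables broadcasting, e.g. row-wise normalization
A.cumsum(axis=0)                    # cumulative sum along axis 0; the output keeps shape (2, 3)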
2.3.7. Dot Products
pass
2.3.8. Matrix-Vector Products
Multiplying an \(m \times n\) matrix by an \(n\)-dimensional vector can be viewed as a transformation that projects a vector from \(n\)-dimensional space onto \(m\)-dimensional space.
We can think of multiplication by a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as a transformation that projects vectors from \(\mathbb{R}^{n}\) to \(\mathbb{R}^{m}\).
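A numpy sketch of this view (the matrix and vector are illustrative):
import numpy as np
A = np.arange(6).reshape(2, 3)   # a 2 x 3 matrix: maps vectors from R^3 to R^2
x = np.array([1.0, 2.0, 3.0])    # a vector in R^3
np.dot(A, x)                     # the result lives in R^2, shape (2,)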
2.3.9. Matrix-Matrix Multiplication
The textbook has a helpful diagram illustrating matrix-matrix multiplication.
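In code, the shapes follow the same picture: multiplying an \(n \times k\) matrix with a \(k \times m\) matrix yields an \(n \times m\) matrix (numpy sketch):
import numpy as np
A = np.ones((2, 3))
B = np.ones((3, 4))
np.dot(A, B).shape   # (2, 4)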
2.3.10. Norms
Informally, the norm of a vector tells us how big a vector is.
Properties of a norm:
In linear algebra, a vector norm is a function \(f\) that maps a vector to a scalar, satisfying a handful of properties.
\(f(ax) = |a|f(x)\)
\(f(x + y) \leq f(x) + f(y)\)
\(f(x) \geq 0\)
\(\forall i, [x]_{i}=0 \Longleftrightarrow f(x)=0\)
\(||x||_{2}=\sqrt{\Sigma^{n}_{i=1}{x^{2}_{i}}}\)
\(||x||_{1}=\Sigma^{n}_{i=1}{|x_{i}|}\)
\(||x||_{p}=(\Sigma^{n}_{i=1}{|x_{i}|^{p}})^{1/p}\)
Frobenius norm:
Analogous to ℓ2 norms of vectors, the Frobenius norm of a matrix is the square root of the sum of the squares of the matrix elements:
\(||X||_{F}=\sqrt{\Sigma_{i=1}^{m}{\Sigma_{j=1}^{n}{x^{2}_{ij}}}}\)
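These norms can be checked with numpy; np.linalg.norm computes the ℓ2 norm for vectors and the Frobenius norm for matrices by default:
import numpy as np
x = np.array([3.0, -4.0])
np.linalg.norm(x)      # l2 norm: 5.0
np.abs(x).sum()        # l1 norm: 7.0
X = np.ones((4, 9))
np.linalg.norm(X)      # Frobenius norm: sqrt(4 * 9) = 6.0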
Summary
- Vectors generalize scalars, and matrices generalize vectors.
- Elementwise multiplication of two matrices is called their Hadamard product. It is different from matrix multiplication.
- In deep learning, we often work with norms such as the ℓ1 norm, the ℓ2 norm, and the Frobenius norm.
2.4 Calculus
Origin of integral calculus: 2,500 years ago, the Greeks computed the area of a polygon by splitting it into triangles. What about the area of a curved shape such as a circle? With an inscribed polygon: the more sides the polygon inscribed in the circle has, the better it approximates the circle. This procedure is called the method of exhaustion, and integral calculus originated from it.
Two thousand years later, differential calculus was also invented.
2.4.1 Derivatives and Differentiation
Notation for the input and output of the function being differentiated: \(f : \mathbb{R} \rightarrow \mathbb{R}\).
The derivative of \(f\) is defined as \(f'(x)=\lim_{h \rightarrow 0} \frac{f(x+h)-f(x)}{h}\) if this limit exists.
If \(f'(a)\) exists, \(f\) is said to be differentiable at \(a\). If \(f\) is differentiable at every number of an interval, then this function is differentiable on this interval.
The derivative can be interpreted as the instantaneous rate of change.
We can interpret the derivative \(f'(x)\) as the instantaneous rate of change of \(f(x)\) with respect to \(x\). The so-called instantaneous rate of change is based on the variation \(h\) in \(x\), which approaches 0.
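A numerical sketch of the limit definition, using \(f(x)=3x^{2}-4x\) as an illustrative function (its derivative at \(x=1\) is \(2\)): as \(h\) shrinks, the difference quotient approaches the derivative.
def f(x):
    return 3 * x ** 2 - 4 * x
for h in [0.1, 0.01, 0.001, 0.0001]:
    print(h, (f(1 + h) - f(1)) / h)   # approximately 2.3, 2.03, 2.003, 2.0003 -> approaches f'(1) = 2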
Various notations for the derivative:
\(f'(x)=y'=\frac{dy}{dx}=\frac{df}{dx}=\frac{d}{dx}f(x)=Df(x)=D_{x}f(x)\)
Common differentiation rules:
\(DC=0\) (\(C\) is a constant)
\(Dx^{n}=nx^{n-1}\) (the power rule, \(n\) is any real number)
\(De^{x}=e^{x}\)
\(D\ln(x)=1/x\)
constant multiple rule: \(\frac{d}{dx}[Cf(x)]=C\frac{d}{dx}f(x)\)
sum rule: \(\frac{d}{dx}[f(x)+g(x)]=\frac{d}{dx}f(x)+\frac{d}{dx}g(x)\)
product rule: \(\frac{d}{dx}[f(x)g(x)]=f(x)\frac{d}{dx}[g(x)] + g(x)\frac{d}{dx}[f(x)]\)
quotient rule: \(\frac{d}{dx}[\frac{f(x)}{g(x)}]=\frac{g(x)\frac{d}{dx}[f(x)]-f(x)\frac{d}{dx}[g(x)]}{[g(x)]^{2}}\)
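As a quick worked example combining the constant multiple, sum, and power rules (my own illustration): for \(f(x)=3x^{2}-4x\),
\(\frac{d}{dx}[3x^{2}-4x]=3\frac{d}{dx}[x^{2}]-4\frac{d}{dx}[x]=6x-4\)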
2.4.2 Partial Derivatives
Single-variable functions (derivatives) vs. multivariable functions (partial derivatives).
Let \(y=f(x_{1},x_{2},...,x_{n})\) be a function with \(n\) variables. The partial derivative of \(y\) with respect to its \(i^{th}\) parameter \(x_{i}\) is
\(\frac{\partial{y}}{\partial{x_{i}}}=\lim_{h \rightarrow 0}\frac{f(x_{1},...,x_{i-1},x_{i}+h,...,x_{n})-f(x_{1},...,x_{i},...,x_{n})}{h}\)
When computing the partial derivative with respect to \(x_{i}\), we can simply treat every variable other than \(x_{i}\) as a constant.
More worked examples: http://blog.ncue.edu.tw/sys/lib/read_attach.php?id=13234
To calculate \(\frac{\partial{y}}{\partial{x_{i}}}\), we can simply treat \(x_{1},...,x_{i-1},x_{i+1},...,x_{n}\) as constants and calculate the derivative of \(y\) with respect to \(x_{i}\). For notation of partial derivatives, the following are equivalent:
\(\frac{\partial{y}}{\partial{x_{i}}}=\frac{\partial{f}}{\partial{x_{i}}}=f_{x_{i}}=f_{i}=D_{i}f=D_{x_{i}}f\)
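A small worked example (my own, not from the book's notes above): for \(y=3x_{1}^{2}+5e^{x_{2}}\), treating the other variable as a constant in each case gives
\(\frac{\partial{y}}{\partial{x_{1}}}=6x_{1}, \quad \frac{\partial{y}}{\partial{x_{2}}}=5e^{x_{2}}\)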
2.4.3 Gradients
Let \(\textbf{x}\) be an \(n\)-dimensional vector. The gradient of the function \(f(\textbf{x})\) with respect to \(\textbf{x}\) is a vector of \(n\) partial derivatives:
\(\nabla_{x}f(\textbf{x})=[\frac{\partial{f(\textbf{x})}}{\partial{x_{1}}}, \frac{\partial{f(\textbf{x})}}{\partial{x_{2}}}, ..., \frac{\partial{f(\textbf{x})}}{\partial{x_{n}}}]^{\top}\)
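Continuing the worked example above, the gradient just stacks the partial derivatives into a vector: for \(f(\textbf{x})=3x_{1}^{2}+5e^{x_{2}}\),
\(\nabla_{\textbf{x}}f(\textbf{x})=[6x_{1}, 5e^{x_{2}}]^{\top}\)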
2.4.4 Chain Rule
Multivariate functions in deep learning are often composite.
Suppose that the functions \(y=f(u)\) and \(u=g(x)\) are both differentiable. Then the chain rule states that
\(\frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}\).
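A small worked example (my own): if \(y=u^{2}\) and \(u=3x+1\), then
\(\frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}=2u \cdot 3=6(3x+1)\)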
Summary
- A derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable. It is also the slope of the tangent line to the curve of the function.
- A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all its variables.
- The chain rule enables us to differentiate composite functions.
2.5 Automatic Differentiation
Introduces MXNet usage; skipping this section.
2.6 Probability
2.6.1 Basic Probability Theory
To verify the probability of each face of a die, the intuitive approach is to roll it over and over. The law of large numbers tells us that as the number of trials grows, the observed frequencies get closer and closer to the true probabilities.
import numpy as np
fair_prob = [1 / 6] * 6                # assume each face of the die has probability 1/6
np.random.multinomial(10, fair_prob)   # counts from 10 rolls
The book's figure shows that, over many repeated trials, the empirical distribution of the die approaches the probability distribution we specified.
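A sketch of that experiment with many more rolls (the book plots the running estimates; here we just inspect the final frequencies):
import numpy as np
fair_prob = [1 / 6] * 6
counts = np.random.multinomial(10000, fair_prob)   # counts from 10000 rolls
counts / 10000                                     # empirical frequencies, each close to 1/6 ≈ 0.167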
Axioms of Probability Theory
- An event is a subset of the sample space.
- Mutually exclusive events.
Random Variables
We can just denote P(X) as the distribution over the random variable X: the distribution tells us the probability that X takes any value.
The author also distinguishes discrete and continuous random variables.
Discrete: rolling a die.
Continuous: height, because we can never find two people whose heights are exactly the same.
2.6.2 Dealing with Multiple Random Variables
Example 1: diseases and symptoms; we want to know the probability that the two events occur and how they are related.
Example 2: an image; each pixel, the location, the time, the camera model, the label, and so on can all be treated as random variables.
All of these are random variables that occur jointly.
Joint Probability
\(P(A,B)\)
Given any values a and b, the joint probability lets us answer, what is the probability that A=a and B=b simultaneously?
Conditional Probability
\(P(B|A)\)
It is the probability of B=b, provided that A=a has occurred.
Bayes’ theorem
An extension of conditional probability.
multiplication rule : \(P(A,B)=P(B|A)P(A)\)
\(P(A,B)=P(A|B)P(B)\)
\(P(A|B)=\frac{P(B|A)P(A)}{P(B)}\)
Marginalization
sum rule : \(P(B)=\Sigma_{A}{P(A,B)}\)
The probability of B amounts to accounting for all possible choices of A and aggregating the joint probabilities over all of them.
Summing the joint probabilities over all possible values of A gives the probability that B occurs.
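A tiny numeric sketch combining Bayes' theorem with the sum rule; all numbers are hypothetical and chosen only for illustration (A = has the disease, B = shows the symptom):
p_a = 0.01                   # hypothetical P(A): prior probability of the disease
p_b_given_a = 0.99           # hypothetical P(B|A)
p_b_given_not_a = 0.05       # hypothetical P(B|not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # sum rule: P(B) = sum over A of P(A, B)
p_a_given_b = p_b_given_a * p_a / p_b                   # Bayes' theorem: P(A|B) ≈ 0.167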
Independence
dependence vs. independence.
Independence: event A reveals no information about event B, i.e., \(P(B|A)=P(B)\). In statistics this is written \(A \perp B\). From Bayes' theorem it also follows that \(P(A|B)=P(A)\).
Since \(P(A|B)=\frac{P(A,B)}{P(B)}=P(A)\) is equivalent to \(P(A,B)=P(A)P(B)\), two random variables are independent if and only if their joint distribution is the product of their individual distributions.
Likewise, two random variables A and B are conditionally independent given another random variable C if and only if \(P(A,B|C)=P(A|C)P(B|C)\). This is expressed as \(A⊥B|C\).
Application
pass
2.6.3 Expectation and Variance
To summarize key characteristics of probability distributions, we need some measures. The expectation (or average) of the random variable \(X\) is denoted as
\(E[X]=\Sigma_{x}{xP(X=x)}\)
When the input of a function \(f(x)\) is a random variable drawn from the distribution \(P\) with different values \(x\), the expectation of \(f(x)\) is computed as
\(E_{x \sim P}[f(x)]=\Sigma_{x}{f(x)P(x)}\)
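A small sketch computing the expectation of a fair die roll:
import numpy as np
faces = np.arange(1, 7)                # possible values x of X
probs = np.full(6, 1 / 6)              # P(X = x) for a fair die
expectation = (faces * probs).sum()    # E[X] = sum over x of x * P(X = x) = 3.5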
2.7 Documentation
Introduces MXNet documentation features; skipped.