Learning Map

Regression: Case Study

$y = b + w\cdot x_{cp}$

Training data: $(x_{feature}, \hat{y})$

Loss function L

Input: a function. Output: how bad it is.

$L(f)=L(w,b)=\displaystyle\sum_n\left(\hat{y}^n-(b+w\cdot x_{feature}^n)\right)^2$

Pick the "Best" Function

$f^*=\arg\displaystyle\min_f L(f)$

$w^*,b^*=\arg\displaystyle\min_{w,b}L(w,b)=\arg\displaystyle\min_{w,b}\displaystyle\sum_n\left(\hat{y}^n-(b+w\cdot x_{feature}^n)\right)^2$

Gradient Descent

  • Consider a loss function $L(w,b)$ with two parameters $w, b$:
    • (Randomly) pick initial values $w^0, b^0$
    • Compute $\frac{\partial L}{\partial w}\Big|_{w=w^0,b=b^0},\ \frac{\partial L}{\partial b}\Big|_{w=w^0,b=b^0}$
    • $w^1\gets w^0-\eta\frac{\partial L}{\partial w}\Big|_{w=w^0,b=b^0},\qquad b^1\gets b^0-\eta\frac{\partial L}{\partial b}\Big|_{w=w^0,b=b^0}$, where $\eta$ is the "learning rate"
    • ... repeat for many iterations

Gradient descent may end up at a local optimum, not the global optimum...

In linear regression, the loss function $L$ is convex, so there are no local optima other than the global one.

$L(w,b)=\displaystyle\sum_n\left(\hat{y}^n-(b+w\cdot x_{feature}^n)\right)^2$

$\frac{\partial L}{\partial w}=\displaystyle\sum_n 2\left(\hat{y}^n-(b+w\cdot x_{feature}^n)\right)(-x_{feature}^n)$

$\frac{\partial L}{\partial b}=\displaystyle\sum_n 2\left(\hat{y}^n-(b+w\cdot x_{feature}^n)\right)(-1)$
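A minimal sketch of this gradient descent loop in numpy; the toy arrays `x` and `y_hat` below are made up for illustration, and `eta` and the iteration count are arbitrary choices.

```python
import numpy as np

# Made-up training pairs (x_feature^n, y_hat^n); the true relation is y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

w, b = 0.0, 0.0         # initial values w^0, b^0
eta = 0.01              # learning rate
for _ in range(10000):  # "... repeat for many iterations"
    error = y_hat - (b + w * x)
    grad_w = np.sum(2 * error * (-x))   # dL/dw from the formula above
    grad_b = np.sum(2 * error * (-1))   # dL/db from the formula above
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)             # approaches w = 2, b = 1
```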

Regularization

$L=\displaystyle\sum_n\left(\hat{y}^n-(b+\sum_i w_i x_i^n)\right)^2+\lambda\sum_i (w_i)^2$

Functions with smaller $w_i$ are smoother and are preferred.

Training error: the larger $\lambda$ is, the less the training error is taken into account (so training error goes up).

We prefer smooth functions, but not too smooth: increasing $\lambda$ too far hurts performance.
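A sketch of how the regularization term changes the gradients: each weight picks up an extra $2\lambda w_i$, while the bias $b$ is left unregularized (as in the formula above). The function name and array shapes are illustrative assumptions.

```python
import numpy as np

def regularized_gradients(w, b, X, y_hat, lam):
    """Gradients of sum_n (y_hat^n - (b + w.x^n))^2 + lam * sum_i w_i^2."""
    error = y_hat - (X @ w + b)                # shape (n,)
    grad_w = -2 * (X.T @ error) + 2 * lam * w  # regularization adds 2*lam*w_i
    grad_b = -2 * error.sum()                  # the bias is not regularized
    return grad_w, grad_b
```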

Gradient Descent

$\theta^*=\arg\displaystyle\min_{\theta}L(\theta)$, where $L$ is the loss function and $\theta$ are the parameters.

Suppose that $\theta$ has two components $\{\theta_1, \theta_2\}$.

Randomly start at $\theta^0=\begin{bmatrix}\theta_1^0\\ \theta_2^0\end{bmatrix}$

$\begin{bmatrix}\theta_1^1\\ \theta_2^1\end{bmatrix}=\begin{bmatrix}\theta_1^0\\ \theta_2^0\end{bmatrix}-\eta\begin{bmatrix}\partial L(\theta^0)/\partial\theta_1\\ \partial L(\theta^0)/\partial\theta_2\end{bmatrix}\qquad \nabla L(\theta)=\begin{bmatrix}\partial L(\theta)/\partial\theta_1\\ \partial L(\theta)/\partial\theta_2\end{bmatrix}$

...

$\theta^i=\theta^{i-1}-\eta\nabla L(\theta^{i-1})$

(See the figure in the original slides.)

  • Popular & Simple Idea: Reduce the learning rate by some factor every few epochs
  • E.g. 1/t decay: $\eta^t = \eta/\sqrt{t+1}$

Adagrad

Divide the learning rate of each parameter by the root mean square of its previous derivatives

$w^{t+1}\gets w^t-\frac{\eta^t}{\sigma^t}g^t$

$\eta^t=\frac{\eta}{\sqrt{t+1}}\qquad g^t=\frac{\partial L(\theta^t)}{\partial w}\qquad \sigma^t$: root mean square of the previous derivatives of parameter $w$

$w^1\gets w^0-\frac{\eta^0}{\sigma^0}g^0\qquad \sigma^0=\sqrt{(g^0)^2}$

$w^2\gets w^1-\frac{\eta^1}{\sigma^1}g^1\qquad \sigma^1=\sqrt{\frac{1}{2}\left[(g^0)^2+(g^1)^2\right]}$

...

$w^{t+1}\gets w^t-\frac{\eta^t}{\sigma^t}g^t\qquad \sigma^t=\sqrt{\frac{1}{t+1}\displaystyle\sum_{i=0}^t (g^i)^2}$

$\therefore w^{t+1}\gets w^t-\frac{\eta}{\sqrt{\sum_{i=0}^t (g^i)^2}}g^t$

Intuition: the best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$; Adagrad's denominator acts as a cheap estimate of the second derivative.
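A minimal Adagrad sketch based on the simplified update above (the $\sqrt{t+1}$ factors cancel); `grad_fn`, `eta`, and the toy objective are hypothetical placeholders.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=1.0, steps=1000, eps=1e-8):
    """w^{t+1} <- w^t - eta / sqrt(sum_{i<=t} (g^i)^2) * g^t, element-wise."""
    w = np.array(w0, dtype=float)
    sum_sq = np.zeros_like(w)          # running sum of squared past gradients
    for _ in range(steps):
        g = grad_fn(w)                 # g^t
        sum_sq += g ** 2
        w -= eta / (np.sqrt(sum_sq) + eps) * g
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2(w - 3); purely illustrative.
print(adagrad(lambda w: 2 * (w - 3.0), w0=[0.0]))   # approaches [3.]
```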

Stochastic Gradient Descent

Faster than (full-batch) gradient descent.

Pick one example $x^n$.

$L^n = \left(\hat{y}^n-(b+\sum_i w_i x_i^n)\right)^2$ (the loss for only one example)

$\theta^i=\theta^{i-1}-\eta\nabla L^n(\theta^{i-1})$
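A sketch of stochastic gradient descent for the linear model: each step uses the gradient of $L^n$ for one randomly picked example. The synthetic data and step size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: y_hat = 2*x_1 + 3*x_2 + 1 plus a little noise.
X = rng.normal(size=(200, 2))
y_hat = X @ np.array([2.0, 3.0]) + 1.0 + 0.01 * rng.normal(size=200)

w, b = np.zeros(2), 0.0
eta = 0.05
for _ in range(20000):
    n = rng.integers(len(X))             # pick one example x^n
    err = y_hat[n] - (X[n] @ w + b)      # residual of L^n only
    w -= eta * (-2 * err * X[n])         # gradient of L^n w.r.t. w
    b -= eta * (-2 * err)                # gradient of L^n w.r.t. b
print(w, b)                              # roughly [2, 3] and 1
```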

Feature Scaling

Make different features have the same scale (e.g. standardize each feature to zero mean and unit variance), as sketched below.
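A sketch of standardization, one common way to do feature scaling: each feature dimension is shifted to zero mean and scaled to unit variance. The data matrix `X` is hypothetical.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each column (feature) of X to zero mean and unit variance."""
    mean = X.mean(axis=0)     # mean of feature i over the training examples
    std = X.std(axis=0)       # standard deviation of feature i
    return (X - mean) / (std + eps), mean, std

# Hypothetical data where the second feature has a much larger scale.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # ~[0, 0] and ~[1, 1]
```

At test time, the same `mean` and `std` computed on the training set would be reused.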

More limitations of gradient descent:

  • stuck at a local minimum
  • stuck at a saddle point
  • very slow on a plateau

Where does the error come from?

error due to "bias"

error due to "variance"

$f^*$ is an estimator of $\hat{f}$.

Variance: a simpler model has smaller variance, because it is less influenced by the sampled data.

Bias: how close the average of all the $f^*$ is to $\hat{f}$.

A simpler model has larger bias.

Large bias?

Redesign your model:

  • Add more features as input
  • A more complex model

Large variance

  • More data
  • Regularization

Cross Validation

N-fold Cross Validation
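A sketch of N-fold cross validation: split the training set into N folds, train on N−1 folds, validate on the held-out fold, and average the validation errors to compare models. `train_fn` and `error_fn` are placeholders for whatever model and error measure are being compared.

```python
import numpy as np

def n_fold_cv(X, y, n_folds, train_fn, error_fn, seed=0):
    """Average validation error over n_folds splits of the training set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]                                    # held-out fold
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_fn(X[train], y[train])              # train on the other folds
        errors.append(error_fn(model, X[val], y[val]))    # validate on fold k
    return float(np.mean(errors))
```

The model (or hyper-parameter setting) with the lowest average validation error is then retrained on the full training set.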

Classification: Probabilistic Generative Model

$P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$

Generative model: $P(x)=P(x|C_1)P(C_1)+P(x|C_2)P(C_2)$

Gaussian Distribution

$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$

$P(x|C_n)=f_{\mu^n,\Sigma^n}(x)$

Input: vector $x$; output: the probability density of sampling $x$.

The shape of the function is determined by the mean $\mu$ and the covariance matrix $\Sigma$.

Maximum Likelihood

We have $n$ examples $x^1, x^2, \cdots, x^n$, each with $m$ features.

$L(\mu, \Sigma) = \displaystyle\prod_n f_{\mu,\Sigma}(x^n)$

$\mu^*,\Sigma^*=\arg\displaystyle\max_{\mu,\Sigma}L(\mu,\Sigma)$

$\mu^*=\frac{1}{n}\displaystyle\sum_n x^n$ (the average, an $m\times 1$ vector) and $\Sigma^*=\frac{1}{n}\displaystyle\sum_n (x^n-\mu^*)(x^n-\mu^*)^T$ (an $m\times m$ matrix).
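A sketch of these maximum-likelihood estimates in numpy, assuming a hypothetical data matrix `X` with one example per row.

```python
import numpy as np

def gaussian_mle(X):
    """Maximum-likelihood mean (m,) and covariance (m, m) for the rows of X."""
    mu = X.mean(axis=0)             # mu* = (1/n) sum_n x^n
    diff = X - mu
    sigma = diff.T @ diff / len(X)  # Sigma* = (1/n) sum_n (x^n - mu*)(x^n - mu*)^T
    return mu, sigma

# Hypothetical 2-feature data.
X = np.random.default_rng(0).normal(loc=[1.0, -2.0], size=(500, 2))
mu, sigma = gaussian_mle(X)
print(mu)      # close to [1, -2]
print(sigma)   # close to the identity matrix
```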

Modifying the Model

Assume the two classes share the same covariance matrix $\Sigma$.

Class 1 has 79 examples; Class 2 has 61 examples.

Find $\mu^1,\mu^2,\Sigma$ maximizing the likelihood $L(\mu^1,\mu^2,\Sigma)$:

$L(\mu^1,\mu^2,\Sigma)=\displaystyle\prod_{n_1}f_{\mu^1,\Sigma}(x^{n_1})\times\displaystyle\prod_{n_2}f_{\mu^2,\Sigma}(x^{n_2})$

$\mu^1$ and $\mu^2$ are the same as before, while $\Sigma=\frac{79}{140}\Sigma^1+\frac{61}{140}\Sigma^2$ (the weighted average of the two class covariances).

With a shared covariance, the decision boundary becomes linear, i.e. we end up with a linear model.

Probability Distribution

Posterior Probability

$P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}=\frac{1}{1+\frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}}=\frac{1}{1+\exp(-z)}=\sigma(z)$, where $z=\ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$

$\sigma(z)$: the sigmoid function

After working through the algebra,

$z=\ln\frac{|\Sigma^2|^{1/2}}{|\Sigma^1|^{1/2}}-\frac{1}{2}x^T(\Sigma^1)^{-1}x+(\mu^1)^T(\Sigma^1)^{-1}x-\frac{1}{2}(\mu^1)^T(\Sigma^1)^{-1}\mu^1+\frac{1}{2}x^T(\Sigma^2)^{-1}x-(\mu^2)^T(\Sigma^2)^{-1}x+\frac{1}{2}(\mu^2)^T(\Sigma^2)^{-1}\mu^2+\ln\frac{N_1}{N_2}$

With the shared covariance $\Sigma^1=\Sigma^2=\Sigma$:

$z=(\mu^1-\mu^2)^T\Sigma^{-1}x-\frac{1}{2}(\mu^1)^T\Sigma^{-1}\mu^1+\frac{1}{2}(\mu^2)^T\Sigma^{-1}\mu^2+\ln\frac{N_1}{N_2}=w^Tx+b$

$P(C_1|x)=\sigma(w\cdot x+b)$
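A sketch that converts the shared-covariance generative model into this linear form: from $\mu^1$, $\mu^2$, the shared $\Sigma$, and the class counts $N_1, N_2$, compute $w$ and $b$ and evaluate $\sigma(w\cdot x+b)$. The numeric parameters below are made up for illustration.

```python
import numpy as np

def generative_to_linear(mu1, mu2, sigma, n1, n2):
    """w = Sigma^{-1}(mu1 - mu2); b = -1/2 mu1^T Sigma^{-1} mu1 + 1/2 mu2^T Sigma^{-1} mu2 + ln(N1/N2)."""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)
    b = (-0.5 * mu1 @ sigma_inv @ mu1
         + 0.5 * mu2 @ sigma_inv @ mu2
         + np.log(n1 / n2))
    return w, b

def posterior_c1(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))   # P(C1|x) = sigma(w.x + b)

# Hypothetical parameters (e.g. estimated by maximum likelihood as above).
mu1 = np.array([75.0, 70.0])
mu2 = np.array([55.0, 60.0])
sigma = np.array([[100.0, 20.0], [20.0, 80.0]])
w, b = generative_to_linear(mu1, mu2, sigma, n1=79, n2=61)
print(posterior_c1(np.array([70.0, 68.0]), w, b))
```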

Logistic Regression

$f_{w,b}(x)=\sigma\left(\displaystyle\sum_i w_i x_i+b\right)$

Training data: $(x^n,\hat{y}^n)$

$\hat{y}^n$: 1 for class 1, 0 for class 2

$L(f)=\displaystyle\sum_n C(f(x^n),\hat{y}^n)$

Cross entropy:

$C(f(x^n),\hat{y}^n)=-\left[\hat{y}^n\ln f(x^n)+(1-\hat{y}^n)\ln\left(1-f(x^n)\right)\right]$

Taking the partial derivatives and applying gradient descent,

$w_i\gets w_i-\eta\displaystyle\sum_n-\left(\hat{y}^n-f_{w,b}(x^n)\right)x_i^n$

The update rule has exactly the same form as in linear regression.
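A sketch of logistic regression trained with exactly this update rule, on made-up two-class data; the learning rate and iteration count are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

rng = np.random.default_rng(0)
# Made-up data: two Gaussian blobs; y_hat = 1 for class 1, 0 for class 2.
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(100, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(100, 2))])
y_hat = np.concatenate([np.ones(100), np.zeros(100)])

w, b = np.zeros(2), 0.0
eta = 0.01
for _ in range(2000):
    f = sigmoid(X @ w + b)            # f_{w,b}(x^n) for all examples at once
    w -= eta * (-(y_hat - f) @ X)     # w_i <- w_i - eta * sum_n -(y_hat^n - f(x^n)) x_i^n
    b -= eta * (-(y_hat - f)).sum()   # same form of update for the bias

accuracy = ((sigmoid(X @ w + b) > 0.5) == y_hat).mean()
print(accuracy)                       # close to 1.0 on this easy data
```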

Discriminative vs. Generative

Discriminative: logistic regression, which finds $w$ and $b$ directly.

Generative: the probabilistic model above (e.g. with a naive Bayes assumption), which estimates $\mu^1, \mu^2, \Sigma$ first and then derives $w$ and $b$.

The two approaches use the same function set but give different results.

The discriminative model usually performs better than the generative one.

Benefits of the generative model:

  • With less training data, the generative model can win.
  • When the data is noisy, the generative model is more robust.
  • Priors and class-dependent probabilities can be estimated from different sources.

Multi-class Classification

$\begin{matrix} C_1: w^1, b_1 \qquad z_1=w^1\cdot x+b_1 \\ C_2: w^2, b_2 \qquad z_2=w^2\cdot x+b_2 \\ C_3: w^3, b_3 \qquad z_3=w^3\cdot x+b_3 \end{matrix}$

Softmax: $y_1=e^{z_1}/\displaystyle\sum_{j=1}^3 e^{z_j},\quad y_2=e^{z_2}/\displaystyle\sum_{j=1}^3 e^{z_j},\quad y_3=e^{z_3}/\displaystyle\sum_{j=1}^3 e^{z_j}$

$y_i=P(C_i|x)$

(Note: the corresponding figure in the slides has an error.) Cross entropy:

$-\displaystyle\sum_{i=1}^3\hat{y}_i\ln y_i$
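A sketch of softmax plus the cross entropy above for a single three-class example; the scores $z$ and the one-hot target are hypothetical.

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j); max-subtraction for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_hat, y):
    """- sum_i y_hat_i * ln(y_i), where y_hat is the one-hot target."""
    return -np.sum(y_hat * np.log(y + 1e-12))

z = np.array([3.0, 1.0, -3.0])        # z_1, z_2, z_3 from the three linear functions
y = softmax(z)                        # y_i = P(C_i | x)
y_hat = np.array([1.0, 0.0, 0.0])     # target: class 1
print(y)                              # ~[0.88, 0.12, 0.002]
print(cross_entropy(y_hat, y))        # small, since the prediction matches the target
```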

Limitation of Logistic Regression

Logistic regression has a linear decision boundary, so it cannot separate data that is not linearly separable (e.g. XOR) without a feature transformation.

Brief Introduction of Deep Learning

Deep = many hidden layers.

Matrix Operation

Each layer computes $a^L=\sigma(w^L a^{L-1}+b^L)$.

$y=f(x)=\sigma\left(w^L\cdots\sigma\left(w^2\,\sigma\left(w^1x+b^1\right)+b^2\right)+\cdots+b^L\right)$
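A sketch of this forward pass: each layer is a matrix multiply plus a bias, followed by the sigmoid. The layer sizes and random (untrained) parameters are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute a^l = sigma(w^l a^{l-1} + b^l) layer by layer, with a^0 = x."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Hypothetical 2-16-16-1 network with random (untrained) parameters.
rng = np.random.default_rng(0)
sizes = [2, 16, 16, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

print(forward(np.array([1.0, -1.0]), weights, biases))
```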

Backpropagation

To compute the gradients efficiently, we use backpropagation.
