Machine Learning Course Notes 3

2017-03-12

Notes on Andrew Ng's Machine Learning course on Coursera, part 3

Original article. If you repost it, please credit Luozm's Blog as the source.

1. Logistic Regression

1.1 Classification

For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.)

For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise.

Hence, y∈{0,1}. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+.” Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.

1.2 Hypothesis Representation

Logistic Regression Model:

$h_\theta(x)=g(\theta^Tx)$

Here $g$ is the sigmoid (logistic) function:

$g(z)=\frac{1}{1+e^{-z}}$

[Figure: Sigmoid Function]

$h_\theta(x)$ is interpreted as the probability that the output is 1.

$\begin{align*}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \newline& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{align*}$
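As a minimal sketch of the hypothesis above (the parameter values and the helper name `hypothesis` are made up for illustration, and $x$ is assumed to already contain the bias entry $x_0 = 1$):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]          # illustrative parameters, not from the course
x = [1.0, 0.5]               # x_0 = 1, x_1 = 0.5, so theta^T x = 0
p1 = hypothesis(theta, x)    # estimated P(y = 1 | x; theta)
p0 = 1.0 - p1                # P(y = 0 | x; theta)
```

Because $\theta^T x = 0$ for this input, the example sits exactly on the point where $g(z) = 0.5$.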

1.3 Decision Boundary

$\begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}$

Non-linear Boundary

Again, the input to the sigmoid function $g(z)$ (e.g. $\theta^T x$) doesn't need to be linear in the original features. It could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 +\theta_2 x_2^2$) or any other shape that fits our data.
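A small sketch of the circular boundary, assuming the illustrative parameters $\theta = (-1, 1, 1)$ so that the rule predicts $y = 1$ exactly when $x_1^2 + x_2^2 \geq 1$ (`circle_features` is a hypothetical helper, not something from the course):

```python
def predict(theta, features):
    """Predict 1 when theta^T x >= 0, which is the same as h_theta(x) >= 0.5."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

def circle_features(x1, x2):
    """Map raw inputs to [1, x1^2, x2^2] (hypothetical helper)."""
    return [1.0, x1 ** 2, x2 ** 2]

theta = [-1.0, 1.0, 1.0]                             # boundary: x1^2 + x2^2 = 1
inside = predict(theta, circle_features(0.5, 0.5))   # 0.25 + 0.25 - 1 < 0
outside = predict(theta, circle_features(1.0, 1.0))  # 1 + 1 - 1 >= 0
```

The model is still linear in $\theta$; only the features fed into it are non-linear.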

1.4 Cost function

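If we reused the cost function from linear regression, $J(\theta)$ would be non-convex, with many local minima, so gradient descent would likely fail to converge to the global optimum.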

The cost function for logistic regression is therefore defined as:

$\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}$

When $y=1$, the cost as a function of $h_\theta(x)$ looks like this: [Figure: Cost Function]

That is:

$\begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}$
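To get a feel for these limits, here is a sketch of the per-example cost (the function name `cost` and the probabilities are chosen only for illustration):

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

confident_right = cost(0.99, 1)   # prediction close to the label: small cost
confident_wrong = cost(0.01, 1)   # prediction far from the label: large cost
```

The cost grows without bound as a confident prediction moves toward the wrong label, which is what penalizes confident mistakes so heavily.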

1.5 Gradient Descent

The two cases of the cost function can be combined into a single expression:

$\text{Cost}(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$

The vectorized version is:

$\begin{align*} & h = g(X\theta)\newline & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*}$

The gradient descent update is therefore:

$\begin{align*} & \text{Repeat} \; \lbrace \newline & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \newline & \rbrace \end{align*}$

The vectorized version is:

$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$
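The update rule can be sketched in plain Python as follows. The tiny dataset, the learning rate and the iteration count are assumptions made for illustration, and the helper names are not from the course:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[y*log(h) + (1-y)*log(1-h)]."""
    m = len(X)
    total = 0.0
    for row, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, row)))
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m

def gradient_step(theta, X, y, alpha):
    """Simultaneous update: theta_j -= (alpha/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(X)
    errors = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) - yi
              for row, yi in zip(X, y)]
    return [t - alpha / m * sum(e * row[j] for e, row in zip(errors, X))
            for j, t in enumerate(theta)]

# Toy data: each row is [x_0 = 1, x_1]; larger x_1 means y = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)      # log(2), since every h is 0.5
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1)
final_cost = cost(theta, X, y)
```

All errors are computed before any parameter changes, so every $\theta_j$ is updated simultaneously, as the formula requires.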

1.6 Advanced Optimization

Besides plain gradient descent, there are more advanced optimization methods, such as "Conjugate gradient", "BFGS", and "L-BFGS".

These methods are more complex, but they pick a suitable learning rate automatically and often converge faster.

1.7 Multiclass Classification

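To use logistic regression for multiclass classification, compute the probability of each class separately and pick the class with the highest probability, i.e.: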

$\begin{align*}& y \in \lbrace0, 1 ... n\rbrace \newline& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline& \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline& \mathrm{prediction} = \arg\max_i( h_\theta ^{(i)}(x) )\newline\end{align*}$

[Figure: Multiclass Classification]
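A sketch of the prediction step, assuming one already-trained parameter vector per class (the three vectors below are made-up values, and `predict_one_vs_all` is a hypothetical name):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class whose classifier reports the highest probability."""
    probs = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for theta in thetas]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs

# One parameter vector per class; x includes x_0 = 1.
thetas = [[1.0, -2.0], [0.0, 0.5], [-3.0, 2.0]]
label, probs = predict_one_vs_all(thetas, [1.0, 3.0])   # theta^T x = -5, 1.5, 3
```

The prediction is the index of the largest probability, not the probability itself.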

2. Overfitting

Overfitting: with too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples.

[Figure: Fitting]

From left to right: underfitting, a good fit, and overfitting.

2.1 Addressing overfitting

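Reduce the number of features
- Select manually which features to keep
- Use a model selection algorithm (covered later in the course)
Regularization
- Keep all the features, but reduce the magnitude of the parameters $\theta_j$
- Works well when there are many features and each of them contributes to predicting $y$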

2.2 Regularization

2.2.1 Cost Function

The idea of regularization: [Figure: Idea of Regularization]

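Keeping the parameter values as small as possible while still fitting the data well has two effects:

- A simpler hypothesis
- Less tendency to overfit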

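A regularization term $\lambda\sum_{j=1}^n\theta_j^2$ is added to the cost function. Note that the sum starts at $j=1$, not $0$: $\theta_0$ is the bias term, which is not tied to any feature, so it is not penalized.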

The regularized cost function:

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n\theta_j^2\right]$

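$\lambda$ is the regularization parameter. It controls the trade-off between fitting the training data well and keeping the parameter values small.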

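If $\lambda$ is too large, all the penalized parameters are pushed close to 0 and the hypothesis is left with little more than the bias term, which results in underfitting.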
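A minimal sketch of the regularized cost for linear regression (the data, the parameters and the name `regularized_cost` are illustrative assumptions):

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(X)
    sq_err = sum((sum(t * xj for t, xj in zip(theta, row)) - yi) ** 2
                 for row, yi in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])   # theta_0 is not penalized
    return (sq_err + penalty) / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0]]                         # rows are [x_0 = 1, x_1]
y = [1.0, 2.0]
theta = [0.0, 1.0]                                   # fits this data exactly
no_penalty = regularized_cost(theta, X, y, lam=0.0)
with_penalty = regularized_cost(theta, X, y, lam=4.0)
```

Even a hypothesis with zero training error pays a cost once $\lambda > 0$, and that pressure is what pulls the parameters toward smaller values.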

2.2.2 Regularized Linear Regression

Gradient descent:

$\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}$

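Rearranging the update for $j \geq 1$ gives $\theta_j := \theta_j(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$. Since $(1-\alpha\frac{\lambda}{m})<1$, each iteration first shrinks $\theta_j$ slightly and then applies the usual gradient step.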
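The shrinking effect is easy to see in a sketch where the data term of the gradient is zero (the data and the values of $\alpha$ and $\lambda$ are chosen purely for illustration):

```python
def regularized_step(theta, X, y, alpha, lam):
    """One regularized gradient descent step for linear regression."""
    m = len(X)
    errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
              for row, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad = sum(e * row[j] for e, row in zip(errors, X)) / m
        if j > 0:                      # theta_0 gets no regularization term
            grad += lam / m * t
        new_theta.append(t - alpha * grad)
    return new_theta

X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
theta = [0.0, 1.0]                     # zero prediction error on this data
shrunk = regularized_step(theta, X, y, alpha=0.1, lam=1.0)
```

With no prediction error, the only change is the factor $(1 - \alpha\frac{\lambda}{m}) = 0.95$ applied to $\theta_1$, while $\theta_0$ stays put.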

For the closed-form least-squares solution (the normal equation), the regularized formula is:

$\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}$

It can also be shown that for $\lambda > 0$ the regularized matrix $(X^TX+\lambda\cdot L)$ is always invertible.

Note: without regularization, if $m\leq n$ then $X^TX$ is not invertible, so the least-squares solution cannot be computed this way.
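A small worked example, with numbers picked only for illustration: take a single training example ($m = 1$) with one feature ($n = 1$), so $X = \begin{bmatrix} 1 & 2 \end{bmatrix}$, and set $\lambda = 1$.

$\begin{align*}& X^TX = \begin{bmatrix} 1 & 2 \newline 2 & 4 \end{bmatrix}, \quad \det(X^TX) = 1 \cdot 4 - 2 \cdot 2 = 0 \newline& X^TX + \lambda \cdot L = \begin{bmatrix} 1 & 2 \newline 2 & 4 \end{bmatrix} + \begin{bmatrix} 0 & 0 \newline 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 2 \newline 2 & 5 \end{bmatrix}, \quad \det = 1 \cdot 5 - 2 \cdot 2 = 1 \neq 0\end{align*}$

The unregularized matrix is singular, while the regularized one can be inverted.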

2.2.3 Regularized Logistic Regression

The cost function is:

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$
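A sketch of this cost in plain Python, showing that the penalty skips $\theta_0$ (the data, the parameters and the name `reg_logistic_cost` are assumptions made for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reg_logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum_{j>=1} theta_j^2."""
    m = len(X)
    data_term = 0.0
    for row, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, row)))
        data_term += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])   # skips theta_0
    return data_term / m + penalty

X = [[1.0, -1.0], [1.0, 1.0]]          # rows are [x_0 = 1, x_1]
y = [0, 1]
base = reg_logistic_cost([0.0, 0.0], X, y, lam=1.0)      # log(2), zero penalty
penalized = reg_logistic_cost([5.0, 2.0], X, y, lam=1.0)
unpenalized = reg_logistic_cost([5.0, 2.0], X, y, lam=0.0)
```

The difference between the last two values is $\frac{\lambda}{2m}\theta_1^2 = 1$: the large bias $\theta_0 = 5$ adds nothing to the penalty.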