2019-08-22

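Andrew Ng Deep Learning -- Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Regularization

Parameter penalty term

Andrew Ng's Machine Learning course covered a regularization method that adds a penalty on the parameters to the cost function in order to prevent overfitting. The regularized cost function is: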

${J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\theta\left( x^{(i)} \right) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta\left( x^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2}$
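As a rough illustration, the regularized cost can be computed as in the sketch below. It assumes logistic regression with a sigmoid hypothesis, treats `theta[0]` as the bias (so it is left out of the penalty, matching the sum that starts at j = 1), and adds the penalty once to the averaged cross-entropy. The function name and the toy data are made up for this example.

```python
import math

def regularized_cost(theta, X, y, lam):
    """Cross-entropy cost of logistic regression plus an L2 penalty."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * x for t, x in zip(theta, x_i))
        h = 1.0 / (1.0 + math.exp(-z))  # sigmoid hypothesis h_theta(x)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    # Penalty skips theta[0], which is assumed to be the bias term.
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return -total / m + penalty

# Each row of X starts with a constant 1 for the bias term.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
theta = [0.1, 0.8]
cost_plain = regularized_cost(theta, X, y, lam=0.0)
cost_reg = regularized_cost(theta, X, y, lam=1.0)
print(cost_plain, cost_reg)
```

With `lam=1.0` the cost grows by exactly λ/(2m) times the sum of squared weights, so larger weights are penalized more.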

dropout

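This course introduces dropout regularization, which randomly removes some of the network's neurons on each training iteration. A likely reason it prevents overfitting is that it limits the influence of any single input node: because any node can be removed at random, a unit cannot rely on one input and has to spread its weights out. In effect this shrinks the weights, much like the parameter penalty above.

A drawback of dropout is that the cost function J is no longer well defined, because a different random set of nodes is removed on every iteration. This makes it hard to check that J decreases as training proceeds.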
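The sketch below shows one common way to apply dropout to a single layer's activations. It assumes the "inverted dropout" variant, in which the kept activations are divided by the keep probability. The names `dropout_forward` and `keep_prob` are chosen for this example, not taken from any library.

```python
import random

def dropout_forward(activations, keep_prob, rng):
    """Inverted dropout on one layer's activations (training time only)."""
    # Each unit is kept with probability keep_prob and zeroed otherwise.
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Dividing by keep_prob keeps the expected activation unchanged,
    # so no rescaling is needed at test time.
    dropped = [a * m / keep_prob for a, m in zip(activations, mask)]
    return dropped, mask

rng = random.Random(0)
a = [0.2, 1.5, 0.7, 0.9, 1.1]
out, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
print(mask)
print(out)
```

A fresh mask is drawn on every call, which is why the cost being minimized changes from one iteration to the next.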

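Early stopping

Early stopping ends training early and keeps the model from the point where it performs best on the cross-validation set. In practice, save a model every fixed number of iterations and, at the end, use the saved model that performs best on the cross-validation set.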
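The checkpointing procedure can be sketched as follows. Everything here is a toy stand-in: the "model" is a dict with one weight, and `train_step` and `evaluate` are hypothetical callbacks that a real training loop would replace with its own update and validation code.

```python
import copy

def train_with_checkpoints(model, train_step, evaluate, num_iters, check_every):
    """Keep the snapshot that scores best on the validation set."""
    best_score = float("-inf")
    best_model = copy.deepcopy(model)
    for it in range(1, num_iters + 1):
        train_step(model)
        if it % check_every == 0:
            score = evaluate(model)
            if score > best_score:
                best_score, best_model = score, copy.deepcopy(model)
    return best_model, best_score

# Toy setup: the validation score peaks when the weight reaches 5
# and gets worse afterwards, imitating overfitting.
model = {"w": 0}

def train_step(m):
    m["w"] += 1

def evaluate(m):
    return -abs(m["w"] - 5)

best, score = train_with_checkpoints(model, train_step, evaluate,
                                     num_iters=10, check_every=1)
print(best, score)  # {'w': 5} 0
```

Training runs to the end (the live model reaches `w = 10`), but the returned snapshot is the one from iteration 5.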

Normalizing inputs

Transform the input data so that each feature has zero mean and unit variance:

${\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad x \leftarrow x - \mu}$

${\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \right)^2, \qquad x \leftarrow \frac{x}{\sigma}}$
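A minimal sketch of these two steps for a single feature is shown below. It divides by the standard deviation σ (the square root of the variance of the centered data), which is what yields unit variance. The function name is made up for this example, and a feature with zero variance is assumed not to occur.

```python
import math

def normalize_feature(values):
    """Shift one feature to zero mean and scale it to unit variance."""
    m = len(values)
    mu = sum(values) / m
    centered = [v - mu for v in values]
    var = sum(c * c for c in centered) / m  # variance of the centered data
    sigma = math.sqrt(var)
    return [c / sigma for c in centered], mu, sigma

x = [2.0, 4.0, 6.0, 8.0]
x_norm, mu, sigma = normalize_feature(x)
print(mu, sigma)
print(x_norm)
```

The same `mu` and `sigma` computed on the training set should be reused for the test set, so that both go through an identical transformation.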

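Vanishing and exploding gradients

In a deep neural network, each layer's computation is multiplicative. If the factors are larger than 1, the values grow as the network gets deeper and the gradients explode; conversely, if the factors are smaller than 1, the gradients vanish.
A partial remedy is to initialize the weights according to a rule that keeps each layer's scaling close to 1. The rule depends on the activation function: for example, a weight variance of 2/n for ReLU and 1/n for tanh, where n is the number of inputs to the layer.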
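One common rule of this kind scales the variance of the initial weights by the number of inputs to the layer. The sketch below assumes a variance of 2/n_in for ReLU (He initialization) and 1/n_in for tanh (Xavier initialization). The function and argument names are illustrative.

```python
import math
import random

def init_weights(n_in, n_out, activation, rng):
    """Draw an n_out x n_in weight matrix with variance scaled by fan-in."""
    # Variance 2/n_in for ReLU (He), 1/n_in otherwise (Xavier, e.g. tanh).
    scale = math.sqrt((2.0 if activation == "relu" else 1.0) / n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)
W = init_weights(n_in=400, n_out=50, activation="relu", rng=rng)
flat = [w for row in W for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(variance, 2.0 / 400)  # the empirical variance should be close to 2/n_in
```

With more inputs per unit, each weight is made smaller, so the weighted sum feeding the activation stays at a similar scale from layer to layer.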

Optimization algorithms

mini-batch

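Compute the loss and update the network parameters on a fixed-size group of training examples at a time, rather than on the whole training set at once. This speeds up training and avoids the high memory usage of a very large training set.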
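Splitting the training set into mini-batches might look like the sketch below. It assumes the examples are shuffled once before being cut into consecutive groups, and that the last group may be smaller than the rest. The helper name is made up for this example.

```python
import random

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the examples and split them into consecutive mini-batches."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    batches = []
    for start in range(0, len(idx), batch_size):
        chunk = idx[start:start + batch_size]
        # Use the same indices for X and Y so each example keeps its label.
        batches.append(([X[i] for i in chunk], [Y[i] for i in chunk]))
    return batches

X = [[float(i)] for i in range(10)]
Y = [i % 2 for i in range(10)]
batches = make_mini_batches(X, Y, batch_size=4, rng=random.Random(0))
print([len(bx) for bx, _ in batches])  # [4, 4, 2]
```

One pass over all the batches is one epoch, and the parameters are updated once per batch instead of once per epoch.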

momentum

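Gradient descent with momentum uses an exponentially weighted average of the gradients: when updating the trainable parameters, it takes the earlier gradients into account, as follows: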

${\begin{array}{l} V_{dW} = \beta V_{dW} + (1 - \beta)\, dW \\ W \leftarrow W - \alpha V_{dW} \end{array}}$

${\begin{array}{l} V_{db} = \beta V_{db} + (1 - \beta)\, db \\ b \leftarrow b - \alpha V_{db} \end{array}}$

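where α is the learning rate and β is the weight of the exponentially weighted average.

Because ${V_{dW}}$ and ${V_{db}}$ start at 0, the averages in the first iterations are far from the true values. Bias correction compensates for this: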

${V_{dW}^{\text{corrected}} = \frac{V_{dW}}{1 - \beta^{t}}}$

where t is the iteration number. This gives reasonably accurate values in the early iterations as well.
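The momentum update with bias correction can be sketched for a single scalar parameter as below. The toy objective f(w) = w², the learning rate of 0.1, and the function name are assumptions made for the example. The corrected value is used only in the parameter update, while the uncorrected average is carried to the next iteration.

```python
def momentum_step(w, v, grad, t, lr=0.1, beta=0.9, bias_correction=True):
    """One momentum update for a scalar parameter; t counts from 1."""
    v = beta * v + (1 - beta) * grad
    v_hat = v / (1 - beta ** t) if bias_correction else v
    return w - lr * v_hat, v

# Minimize f(w) = w**2, whose gradient is 2 * w.
w, v = 5.0, 0.0
for t in range(1, 201):
    w, v = momentum_step(w, v, grad=2 * w, t=t)
print(w)  # close to 0

# At t = 1 the raw average is only (1 - beta) * grad, but the corrected
# value equals the gradient itself, so the first step is lr * grad.
w1, v1 = momentum_step(0.0, 0.0, grad=3.0, t=1)
print(w1, v1)
```

As t grows, β^t goes to 0 and the correction factor approaches 1, so it only matters during the first iterations.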

RMSprop

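RMSprop slows down learning in the b direction and speeds it up in the W direction. In the course's illustration, b stands for the direction in which the gradient oscillates strongly and W for the direction of steady progress. Each gradient is divided by the square root of a running average of its square, so a direction with large gradients takes smaller steps and a direction with small gradients takes relatively larger ones.

The formulas are: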

${\begin{array}{l} S_{dW} = \beta S_{dW} + (1 - \beta)\, dW^{2} \\ W \leftarrow W - \alpha \frac{dW}{\sqrt{S_{dW}} + \varepsilon} \end{array}}$

${\begin{array}{l} S_{db} = \beta S_{db} + (1 - \beta)\, db^{2} \\ b \leftarrow b - \alpha \frac{db}{\sqrt{S_{db}} + \varepsilon} \end{array}}$

where ε is a very small number that prevents division by zero.
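To see the effect of dividing by the root of the running average, the sketch below applies one RMSprop step to two parameters whose gradients differ by a factor of 100. The toy gradients, the learning rate of 0.01, and the function name are assumptions for this example.

```python
import math

def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter."""
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (math.sqrt(s) + eps), s

# A steep direction (gradient 100) and a shallow one (gradient 1).
w_steep, s_steep = rmsprop_step(1.0, 0.0, grad=100.0)
w_flat, s_flat = rmsprop_step(1.0, 0.0, grad=1.0)
print(1.0 - w_steep, 1.0 - w_flat)  # the two step sizes are almost equal
```

Plain gradient descent would move the steep parameter 100 times further than the shallow one. Here both move by about the same amount, which damps the oscillating direction relative to the slow one.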

Adam

Adam (Adaptive Moment Estimation) combines momentum and RMSprop when updating the network parameters.

The formulas are:

${V_{dW} = \beta_{1} V_{dW} + (1 - \beta_{1})\, dW, \qquad V_{db} = \beta_{1} V_{db} + (1 - \beta_{1})\, db}$

${S_{dW} = \beta_{2} S_{dW} + (1 - \beta_{2})\, dW^{2}, \qquad S_{db} = \beta_{2} S_{db} + (1 - \beta_{2})\, db^{2}}$

${V_{dW}^{\text{corrected}} = \frac{V_{dW}}{1 - \beta_{1}^{t}}, \qquad V_{db}^{\text{corrected}} = \frac{V_{db}}{1 - \beta_{1}^{t}}}$

${S_{dW}^{\text{corrected}} = \frac{S_{dW}}{1 - \beta_{2}^{t}}, \qquad S_{db}^{\text{corrected}} = \frac{S_{db}}{1 - \beta_{2}^{t}}}$

${W \leftarrow W - \alpha \frac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}, \qquad b \leftarrow b - \alpha \frac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}}$

Typical values are ${\beta_{1} = 0.9}$, ${\beta_{2} = 0.999}$ and ${\varepsilon = 10^{-8}}$.
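Put together, one Adam update for a scalar parameter can be sketched as below, using the typical values above as defaults. The toy objective f(w) = w², the learning rate of 0.05, and the function name are assumptions for this example. The bias-corrected values are used only in the update, while the uncorrected averages are carried forward.

```python
import math

def adam_step(w, v, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t counts from 1."""
    v = beta1 * v + (1 - beta1) * grad        # momentum term
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSprop term
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

# On the first step the corrected ratio is grad / |grad|, so the
# parameter moves by about lr regardless of the gradient's size.
w_first, _, _ = adam_step(5.0, 0.0, 0.0, grad=10.0, t=1, lr=0.05)
print(w_first)

# Minimize f(w) = w**2, whose gradient is 2 * w.
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, v, s = adam_step(w, v, s, grad=2 * w, t=t, lr=0.05)
print(w)  # close to 0
```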

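Learning rate decay

Reducing the learning rate as the number of iterations grows, rather than keeping it fixed, helps training speed and helps the model settle closer to the optimum. There are many decay schedules; which one to use depends on the situation.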
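As one example of such a schedule, the sketch below uses inverse-time decay, α = α₀ / (1 + decay_rate × epoch). The function name and the values α₀ = 0.2 and decay_rate = 1.0 are picked only for illustration.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

schedule = [decayed_learning_rate(alpha0=0.2, decay_rate=1.0, epoch=e)
            for e in range(4)]
print(schedule)
```

The rate starts at α₀ and halves after the first epoch in this setting. Exponential decay and step-wise (staircase) decay are other common choices.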