2019-09-04

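Andrew Ng's Deep Learning: Sequence Models

Recurrent Sequence Models

Model Input

Each input x^(i) consists of time-ordered sequence data: x^(i,t) denotes the data of example i at time step t, and T_x^(i) denotes the total length of that sequence.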

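Network Structure

Each time step corresponds to one layer of the network. Each layer takes two inputs: the input data for that time step and the output of the previous layer. The previous-layer input to the first layer (that is, the value at t=0) is usually set to 0.
The parameters are shared across all layers.
The structure shows that a recurrent network predicts the output at the current time step from the inputs up to that time step, not from later inputs.

RNNs come in several architectures: one-to-many, many-to-one, and many-to-many (where the input and output sequences have either the same length or different lengths).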

Forward Propagation

In each layer, forward propagation is computed as follows:

$\begin{array}{l} a^{(t)} = g\left(W_{aa} a^{(t-1)} + W_{ax} x^{(t)} + b_a\right) \\ \hat{y}^{(t)} = g\left(W_{ya} a^{(t)} + b_y\right) \end{array}$

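The activation function g is usually tanh, and sometimes ReLU (for the output $\hat{y}^{(t)}$, sigmoid or softmax is the usual choice, depending on the task). The parameters are shared from layer to layer.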
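The two forward-propagation formulas can be sketched in NumPy roughly as follows. This is a minimal illustration, not the course's implementation: the function name `rnn_forward`, the layer sizes, the random weights, and the choice of softmax for the output activation are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the column-wise max for numerical stability.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_forward(x, a0, Waa, Wax, Wya, ba, by):
    """x: (n_x, T_x) inputs; a0: (n_a, 1) initial activation (usually zeros)."""
    a_prev, activations, outputs = a0, [], []
    for t in range(x.shape[1]):
        xt = x[:, t:t + 1]                              # column vector x^(t)
        a_prev = np.tanh(Waa @ a_prev + Wax @ xt + ba)  # a^(t)
        activations.append(a_prev)
        outputs.append(softmax(Wya @ a_prev + by))      # y_hat^(t), softmax assumed
    return np.hstack(activations), np.hstack(outputs)

rng = np.random.default_rng(0)
n_x, n_a, n_y, T_x = 3, 5, 2, 4          # hypothetical sizes
x = rng.standard_normal((n_x, T_x))
a, y_hat = rnn_forward(
    x, np.zeros((n_a, 1)),
    rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)),
    rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1)),
)
print(a.shape, y_hat.shape)
```

The same `Waa`, `Wax`, `Wya`, `ba`, and `by` are reused at every time step, which is what parameter sharing means here.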

Backpropagation

The loss function of the recurrent network is:

$L^{(t)}\left(\hat{y}^{(t)}, y^{(t)}\right) = -y^{(t)} \log \hat{y}^{(t)} - \left(1 - y^{(t)}\right) \log\left(1 - \hat{y}^{(t)}\right)$

$L\left(\hat{y}, y\right) = \sum_{t=1}^{T_y} L^{(t)}\left(\hat{y}^{(t)}, y^{(t)}\right)$
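As a small worked example of the loss above, the sketch below computes the per-step cross-entropy and sums it over the sequence. The function name `sequence_loss`, the clipping constant `eps`, and the toy values are assumptions for illustration only.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """y_hat, y: arrays of shape (T_y,), with y_hat in (0, 1) and y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0); eps is an assumption
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)   # L^(t)
    return per_step.sum(), per_step        # L is the sum over t = 1..T_y

y = np.array([1.0, 0.0, 1.0])              # toy labels
y_hat = np.array([0.9, 0.2, 0.6])          # toy predictions
total, per_step = sequence_loss(y_hat, y)
print(total, per_step)
```

For these toy values the three terms are -log(0.9), -log(0.8), and -log(0.6), so the total equals -log(0.432).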

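The overall flow of forward and backward propagation is shown in the figure below (for the case Tx=Ty):
[Figure: RNN]

(In the course assignment, backpropagation does not seem to update Wya and by.)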

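Vanishing Gradients

In an RNN, the vanishing gradient problem arises mainly as the network gets deeper (one layer per time step, so longer sequences mean deeper networks): the loss at later layers is hard to propagate back to earlier layers and so barely affects their weights. GRU and LSTM, introduced below, mitigate this problem to some extent.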
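The effect can be observed numerically. The sketch below is an illustration under stated assumptions, not part of the course material: it uses a tanh RNN with the inputs and bias left out, and a recurrent matrix deliberately rescaled to spectral norm 0.5, so that the gradient of the activation at step t with respect to the initial activation is guaranteed to shrink at least geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 4, 30                          # hypothetical size and sequence length
Waa = rng.standard_normal((n_a, n_a))
Waa *= 0.5 / np.linalg.norm(Waa, 2)     # rescale so the spectral norm is 0.5

a = rng.standard_normal((n_a, 1))
jac = np.eye(n_a)                       # accumulates d a^(t) / d a^(0)
norms = []
for t in range(T):
    a = np.tanh(Waa @ a)                # inputs and bias omitted for brevity
    # Jacobian of one step: diag(1 - a^(t)^2) @ Waa, chained onto earlier steps.
    jac = np.diag(1 - a.ravel() ** 2) @ Waa @ jac
    norms.append(np.linalg.norm(jac, 2))

print(norms[0], norms[-1])
```

Each step multiplies the gradient by a factor of norm at most 0.5, so after 30 steps almost nothing of the signal from the last step reaches the first one.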

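GRU

The GRU (Gated Recurrent Unit) captures long-range dependencies well and mitigates the vanishing gradient problem in RNNs. (To some extent it can be seen as a simplified version of LSTM.)

The gate functions in a GRU take values between 0 and 1, usually close to 0 or 1, and select whether the data at this time step is kept or forgotten.

Compared with a basic RNN, the GRU adds gate functions that act as memory or forgetting: if this time step needs to be remembered, the gate value is 1; if not, it is 0. The formulas for a single GRU layer are: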

$\tilde{c}^{(t)} = \tanh\left(W_c\left[\Gamma_r * c^{(t-1)}, x^{(t)}\right] + b_c\right)$

$\Gamma_u = \mathrm{sigmoid}\left(W_u\left[c^{(t-1)}, x^{(t)}\right] + b_u\right)$

$\Gamma_r = \mathrm{sigmoid}\left(W_r\left[c^{(t-1)}, x^{(t)}\right] + b_r\right)$

$c^{(t)} = \Gamma_u * \tilde{c}^{(t)} + \left(1 - \Gamma_u\right) * c^{(t-1)}$

where:

$\Gamma_u$ is the update gate, with values between 0 and 1. A value of 0 forgets the information from this time step (the previous memory $c^{(t-1)}$ is carried over unchanged); a value of 1 keeps the information from this time step.

$\Gamma_r$ is the relevance gate, which expresses how relevant $c^{(t-1)}$ is to $c^{(t)}$.
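The four GRU formulas translate almost line by line into NumPy. The following is a sketch under assumptions: the function name `gru_step`, the shapes, and the random weights are made up for the example, and the large negative update-gate bias is used only to show the "keep the old memory" behaviour.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, xt, Wc, Wu, Wr, bc, bu, br):
    """c_prev: (n_c, 1); xt: (n_x, 1); each W: (n_c, n_c + n_x); each b: (n_c, 1)."""
    concat = np.vstack([c_prev, xt])                    # [c^(t-1), x^(t)]
    gamma_u = sigmoid(Wu @ concat + bu)                 # update gate
    gamma_r = sigmoid(Wr @ concat + br)                 # relevance gate
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, xt]) + bc)
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev   # c^(t)

rng = np.random.default_rng(0)
n_c, n_x = 3, 2                                         # hypothetical sizes
c_prev = rng.standard_normal((n_c, 1))
xt = rng.standard_normal((n_x, 1))
Wc, Wu, Wr = (rng.standard_normal((n_c, n_c + n_x)) for _ in range(3))
zeros = np.zeros((n_c, 1))

c_new = gru_step(c_prev, xt, Wc, Wu, Wr, zeros, zeros, zeros)
# A strongly negative update-gate bias drives gamma_u towards 0,
# so the previous memory is carried over almost unchanged.
c_keep = gru_step(c_prev, xt, Wc, Wu, Wr, zeros, zeros - 100.0, zeros)
print(np.allclose(c_keep, c_prev))
```

When the update gate stays near 0 over many steps, the memory cell is copied forward nearly unchanged, which is how the GRU carries information across long spans.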

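LSTM

LSTM (Long Short-Term Memory) also captures long-range dependencies well. Each layer is somewhat more complex than in a GRU and has three gates: the update gate, the forget gate, and the output gate. The formulas for a single LSTM layer are: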

$\tilde{c}^{(t)} = \tanh\left(W_c\left[a^{(t-1)}, x^{(t)}\right] + b_c\right)$

$\Gamma_u = \mathrm{sigmoid}\left(W_u\left[a^{(t-1)}, x^{(t)}\right] + b_u\right)$

$\Gamma_f = \mathrm{sigmoid}\left(W_f\left[a^{(t-1)}, x^{(t)}\right] + b_f\right)$

$\Gamma_o = \mathrm{sigmoid}\left(W_o\left[a^{(t-1)}, x^{(t)}\right] + b_o\right)$

$c^{(t)} = \Gamma_u * \tilde{c}^{(t)} + \Gamma_f * c^{(t-1)}$

$a^{(t)} = \Gamma_o * \tanh\left(c^{(t)}\right)$

A diagram of a single LSTM layer is shown below:

[Figure: LSTM]
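The LSTM formulas for a single layer can be sketched as one NumPy function. This is only an illustration: the name `lstm_step`, the dictionary layout for the weights, and the sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, xt, W, b):
    """W and b are dicts keyed by "c", "u", "f", "o" (a hypothetical layout)."""
    concat = np.vstack([a_prev, xt])                 # [a^(t-1), x^(t)]
    c_tilde = np.tanh(W["c"] @ concat + b["c"])      # candidate memory
    gamma_u = sigmoid(W["u"] @ concat + b["u"])      # update gate
    gamma_f = sigmoid(W["f"] @ concat + b["f"])      # forget gate
    gamma_o = sigmoid(W["o"] @ concat + b["o"])      # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev         # c^(t)
    a = gamma_o * np.tanh(c)                         # a^(t)
    return a, c

rng = np.random.default_rng(0)
n_a, n_x = 4, 3                                      # hypothetical sizes
W = {k: rng.standard_normal((n_a, n_a + n_x)) for k in "cufo"}
b = {k: np.zeros((n_a, 1)) for k in "cufo"}
a, c = lstm_step(np.zeros((n_a, 1)), np.zeros((n_a, 1)),
                 rng.standard_normal((n_x, 1)), W, b)
print(a.shape, c.shape)
```

Unlike the GRU, the update and forget gates are separate, so the cell can add new information without being forced to discard the same proportion of the old memory.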

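Bidirectional RNN & Deep RNN

Compared with an ordinary RNN, a bidirectional RNN adds a module that computes in the reverse direction along the time sequence, so the output of a bidirectional RNN is: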

$\hat{y}^{(t)} = g\left(W_y\left[\overrightarrow{a}^{(t)}, \overleftarrow{a}^{(t)}\right] + b_y\right)$

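GRU and LSTM units can of course be made bidirectional as well. The drawback of bidirectional propagation is that it needs the complete input sequence.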
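A rough sketch of the bidirectional output formula is given below. It runs one tanh RNN left to right and a second one over the reversed sequence, then combines both activations at each time step. The helper names `run_rnn` and `birnn_outputs`, the sizes, and the choice of sigmoid for g are assumptions for illustration.

```python
import numpy as np

def run_rnn(x, Waa, Wax, ba):
    """Return tanh-RNN activations of shape (n_a, T_x) for x of shape (n_x, T_x)."""
    a, out = np.zeros((Waa.shape[0], 1)), []
    for t in range(x.shape[1]):
        a = np.tanh(Waa @ a + Wax @ x[:, t:t + 1] + ba)
        out.append(a)
    return np.hstack(out)

def birnn_outputs(x, fwd, bwd, Wy, by):
    a_fwd = run_rnn(x, *fwd)                     # left to right
    a_bwd = run_rnn(x[:, ::-1], *bwd)[:, ::-1]   # right to left, re-aligned to t
    z = Wy @ np.vstack([a_fwd, a_bwd]) + by      # W_y [a_fwd^(t), a_bwd^(t)] + b_y
    return 1.0 / (1.0 + np.exp(-z))              # g = sigmoid is an assumption

rng = np.random.default_rng(0)
n_x, n_a, T_x = 3, 4, 5                          # hypothetical sizes

def make_params():
    return (rng.standard_normal((n_a, n_a)),
            rng.standard_normal((n_a, n_x)),
            np.zeros((n_a, 1)))

x = rng.standard_normal((n_x, T_x))
y_hat = birnn_outputs(x, make_params(), make_params(),
                      rng.standard_normal((1, 2 * n_a)), np.zeros((1, 1)))
print(y_hat.shape)
```

Because the backward pass starts from the last input, even the first output depends on the whole sequence, which is why the full input has to be available before any prediction is made.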

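A deep RNN increases the vertical depth of the network at each time step (the stacked units can also be GRU or LSTM units, and the structure can also be made bidirectional). The added units may be connected horizontally to the units at the same depth in adjacent time steps, or left unconnected. It feels like having a deep fully connected network at every time step.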
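One way to picture the stacking is the sketch below, where every layer is a tanh RNN that keeps its horizontal connection and reads the activations of the layer beneath it. The function name `deep_rnn_forward`, the layer sizes, and the random weights are assumptions for the example.

```python
import numpy as np

def deep_rnn_forward(x, layers):
    """layers: list of (Waa, Wax, ba); layer l reads the activations of layer l-1."""
    inputs = x                                   # (n_x, T_x) for the first layer
    for Waa, Wax, ba in layers:
        a, out = np.zeros((Waa.shape[0], 1)), []
        for t in range(inputs.shape[1]):
            # Horizontal link: a (same depth, previous time step).
            # Vertical link: inputs[:, t] (previous depth, same time step).
            a = np.tanh(Waa @ a + Wax @ inputs[:, t:t + 1] + ba)
            out.append(a)
        inputs = np.hstack(out)                  # feeds the next layer up
    return inputs

rng = np.random.default_rng(0)
n_x, T_x, sizes = 3, 6, [5, 4]                   # hypothetical sizes
layers, n_in = [], n_x
for n_a in sizes:
    layers.append((rng.standard_normal((n_a, n_a)),
                   rng.standard_normal((n_a, n_in)),
                   np.zeros((n_a, 1))))
    n_in = n_a
top = deep_rnn_forward(rng.standard_normal((n_x, T_x)), layers)
print(top.shape)
```

Dropping the `Waa @ a` term in an upper layer would give the variant without horizontal connections mentioned above.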

Attention Mechanism

The attention mechanism provides a weight matrix that indicates which input time steps the output at a given time step should attend to. Here:

$a^{<t,t'>}$ denotes how much attention the output at time step $t$ pays to the input at time step $t'$.

The network structure is shown in the figures below:

[Figure: att]

[Figure: attn]
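To make the weight matrix concrete, the sketch below normalizes raw scores with a softmax over the input steps, so that the weights for each output step sum to 1, and then forms a weighted sum of the input features. The dot-product scoring, the array names, and the sizes are assumptions for illustration; the notes do not specify how the scores are computed.

```python
import numpy as np

def attention_weights(scores):
    """scores: (T_y, T_x) raw scores; returns weights a^{<t,t'>} with rows summing to 1."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))   # stable softmax over t'
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T_x, T_y, n_h = 5, 3, 4                    # hypothetical sizes
enc = rng.standard_normal((T_x, n_h))      # one feature vector per input step t'
dec = rng.standard_normal((T_y, n_h))      # one state per output step t
scores = dec @ enc.T                       # dot-product scoring: an assumption
alpha = attention_weights(scores)          # (T_y, T_x) attention weight matrix
context = alpha @ enc                      # weighted sum of inputs per output step
print(alpha.sum(axis=1))
```

Row t of `alpha` is the distribution of attention that output step t spreads over all input steps.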

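Word Embeddings

A word embedding is a matrix used to measure the similarity between words; each word corresponds to one high-dimensional vector in the matrix. If two words are closely related, the distance between their vectors is small.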
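A common way to compare two embedding vectors is cosine similarity. The sketch below uses a tiny embedding matrix whose values are invented purely for illustration (real embeddings are learned and have many more dimensions); the vocabulary and the function name `cosine_similarity` are likewise assumptions.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors invented for illustration, one row per word.
vocab = ["king", "queen", "apple"]
E = np.array([[0.9, 0.8, 0.1],
              [0.8, 0.9, 0.1],
              [0.1, 0.1, 0.9]])
index = {w: i for i, w in enumerate(vocab)}

sim_kq = cosine_similarity(E[index["king"]], E[index["queen"]])
sim_ka = cosine_similarity(E[index["king"]], E[index["apple"]])
print(sim_kq, sim_ka)
```

With these toy values the related pair scores close to 1 while the unrelated pair scores much lower, matching the idea that related words sit close together.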