变分自编码器（VAEs）

知乎专栏处女作，写得不好请轻拍。在此之前已在微信公众号写了一些文章，立个flag，后续会逐步转过来。这个专栏将记录我看过的一些paper，也将监督我多看paper少水手机。好了，进入正题。

VAEs简介

变分自编码器（Variational auto-encoder，VAE）是一类重要的生成模型（generative model），它于2013年由Diederik P.Kingma和Max Welling提出[1]。2016年Carl Doersch写了一篇VAEs的tutorial[2]，对VAEs做了更详细的介绍，比文献[1]更易懂。这篇读书笔记基于文献[1]。

除了VAEs，还有一类重要的生成模型GANs（对GANs感兴趣可以去我的微信公众号看介绍文章：学术兴趣小组）。

我们来看一下VAE是怎样设计的。

上图是VAE的图模型。我们能观测到的数据是\text{x}，而\text{x}由隐变量\text{z}产生，由\text{z}\rightarrow \text{x}是生成模型p_{\theta}(\text{x}|\text{z})，从自编码器（auto-encoder）的角度来看，就是解码器；而由\text{x}\rightarrow \text{z}是识别模型（recognition model）q_{\phi}(\text{z}|\text{x})，类似于自编码器的编码器。

VAEs现在广泛地用于生成图像，当生成模型p_{\theta}(\text{x}|\text{z})训练好了以后，我们就可以用它来生成图像了。与GANs不同的是，我们是知道图像的密度函数（PDF）的（或者说，是我们设定的），而GANs我们并不知道图像的分布。

VAEs模型的理论推导

以下的推导参考了文献[1]和[3]，文献[3]是变分推理的课件。

首先，假定所有的数据都是独立同分布的（i.i.d），两个观测不会相互影响。我们要对生成模型p_{\theta}(\text{x}|\text{z})做参数估计，利用对数最大似然法，就是要最大化下面的对数似然函数：

\log p_{\theta}(\text{x}^{(1)},\text{x}^{(2)},\cdots,\text{x}^{(N)})=\sum_{i=1}^N \log p_{\theta}(\text{x}^{(i)})

VAEs用识别模型q_{\phi}(\text{z}|\text{x}^{(i)})去逼近真实的后验概率p_{\theta}(\text{z}|\text{x}^{(i)})，衡量两个分布的相似程度，我们一般采用KL散度，即

\begin{align} KL(q_{\phi}(\text{z}|\text{x}^{(i)})||p_{\theta}(\text{z}|\text{x}^{(i)}))&=\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log \frac{q_{\phi}(\text{z}|\text{x}^{(i)})}{p_{\theta}(\text{z}|\text{x}^{(i)})}\\ &=\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log \frac{q_{\phi}(\text{z}|\text{x}^{(i)})p_{\theta}(\text{x}^{(i)})}{p_{\theta}(\text{z}|\text{x}^{(i)})p_{\theta}(\text{x}^{(i)})}\\ &=\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log \frac{q_{\phi}(\text{z}|\text{x}^{(i)})}{p_{\theta}(\text{z},\text{x}^{(i)})}+\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log p_{\theta}(\text{x}^{(i)})\\ &=\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log \frac{q_{\phi}(\text{z}|\text{x}^{(i)})}{p_{\theta}(\text{z},\text{x}^{(i)})}+\log p_{\theta}(\text{x}^{(i)}) \end{align}

于是

\log p_{\theta}(\text{x}^{(i)})=KL(q_{\phi}(\text{z}|\text{x}^{(i)}), p_{\theta}(\text{z}|\text{x}^{(i)}))+\mathcal{L}(\theta,\phi;\text{x}^{(i)})

其中，

\begin{align} \mathcal{L}(\theta,\phi;\text{x}^{(i)})& = -\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log \frac{q_{\phi}(\text{z}|\text{x}^{(i)})}{p_{\theta}(\text{z},\text{x}^{(i)})}\\ &=\mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log p_{\theta}(\text{z}, \text{x}^{(i)}) - \mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log q_{\phi}(\text{z}|\text{x}^{(i)}) \end{align}

由于KL散度非负，当两个分布一致时（允许在一个零测集上不一致），KL散度为0。于是\log p_{\theta}(\text{x}^{(i)}) \geq \mathcal{L}(\theta,\phi;\text{x}^{(i)})。 \mathcal{L}(\theta,\phi;\text{x}^{(i)})称为对数似然函数的变分下界。

直接优化\log p_{\theta}(\text{x}^{(i)})是不可行的，因此一般转而优化它的下界 \mathcal{L}(\theta,\phi;\text{x}^{(i)})。对应的，优化对数似然函数转化为优化 \mathcal{L}(\theta,\phi;\text{X})=\sum_{i=1}^N \mathcal{L}(\theta,\phi;\text{x}^{(i)})。

作者指出， \mathcal{L}(\theta,\phi;\text{x}^{(i)})对\phi的梯度方差很大，不适于用于数值计算。为了解决这个问题，假定识别模型q_{\phi}(\text{z}|\text{x})可以写成可微函数g_{\phi}(\epsilon, \text{x})，其中，\epsilon为噪声，\epsilon \sim p(\epsilon)。于是， \mathcal{L}(\theta,\phi;\text{x}^{(i)})可以做如下估计（利用蒙特卡罗方法估计期望）：

\mathcal{\tilde{L}}^A(\theta,\phi;\text{x}^{(i)})=\frac{1}{L}\sum_{l=1}^L [\log p_{\theta}(\text{x}^{(i)}, \text{z}^{(i,l)}) - \log q_{\phi}(\text{z}^{(i,l)}|\text{x}^{(i)})]

其中，\text{z}^{(i,l)}=g_{\phi}(\epsilon^{(i,l)}, \text{x}^{(i)}), \quad \epsilon^{(i,l)} \sim p(\epsilon)。

此外， \mathcal{L}(\theta,\phi;\text{x}^{(i)})还可以改写为

\mathcal{L}(\theta,\phi;\text{x}^{(i)})=-KL(q_{\phi}(\text{z}|\text{x}^{(i)})||p_{\theta}(\text{z})) + \mathbb{E}_{q_{\phi}(\text{z}|\text{x}^{(i)})} \log p_{\theta}(\text{x}^{(i)}|\text{z})

由此可以得到另外一个估计

\mathcal{\tilde{L}}^B(\theta, \phi; \text{x}^{(i)})=-KL(q_{\phi}(\text{z}|\text{x}^{(i)})||p_{\theta}(\text{z})) +\frac{1}{L} \sum_{l=1}^L \log p_{\theta}(\text{x}^{(i)}|\text{z}^{(i,l)})

其中，\text{z}^{(i,l)}=g_{\phi}(\epsilon^{(i,l)}, \text{x}^{(i)}), \quad \epsilon^{(i,l)} \sim p(\epsilon)。

实际试验时，如果样本量N很大，我们一般采用minibatch的方法进行学习，对数似然函数的下界可以通过minibatch来估计：

\mathcal{L}(\theta,\phi;\text{X})\simeq \mathcal{\tilde{L}}^M (\theta,\phi;\text{X}^M)=\frac{N}{M}\sum_{i=1}^M \mathcal{\tilde{L}}(\theta,\phi;\text{x}^{(i)})

可以看到，为了计算\mathcal{L}(\theta,\phi;\text{X})，我们用了两层估计。当M较大时，内层估计可以由外层估计来完成，也就是说，取L=1即可。实际计算中，作者取M=100,L=1。由上述推导得到AEVB算法：

VAEs模型

上面给的AEVB算法是一个算法框架，只有给定了\epsilon, p_{\theta}(\text{x}|\text{z}), q_{\phi}(\text{z}|\text{x}), p_{\theta}(\text{z})分布的形式以及g_{\phi}(\epsilon, \text{x})，我们才能启动算法。实际应用中，作者取

\begin{align} p(\epsilon) &= \mathcal{N}(\epsilon; 0,\text{I})\\ q_{\phi}(\text{z}|\text{x}^{(i)}) &= \mathcal{N}(\text{z}; {\mu}^{(i)}, {\sigma}^{2(i)}\text{I})\\ p_{\theta}(\text{z})&=\mathcal{N}(\text{z}; 0,\text{I})\\ g_{\phi}(\epsilon^{(l)}, \text{x}^{(i)}) &= {\mu}^{(i)}+{\sigma}^{(i)}\odot \epsilon^{(l)} \end{align}

而p_{\theta}(\text{x}|\text{z})根据样本是实值还是二元数据进行选择，若样本为二元数据，则选择

p_{\theta}(x_i|\text{z})=\mathcal{B}(x_i;1,y_i)=y_i^{x_i}\cdot (1-y_i)^{1-x_i}, \quad i=1,2,\cdots,D_{\text x}(D_{\text x}=\dim(\text{x}))

若样本是实值数据，则选择

p_{\theta}(\text{x}^{(i)}|\text{z})=\mathcal{N}(\text{x}^{(i)}; \mu'^{(i)},\sigma'^{2(i)}\text{I})

实验中，作者选择多层感知器（MLP）对p_{\theta}(\text{x}|\text{z}), q_{\phi}(\text{z}|\text{x})进行拟合，具体来说，

对p_{\theta}(\text{x}|\text{z})，参数为\theta=(\mu', \sigma')，若样本为二

编辑于 2017-11-07 00:28

深度学习（Deep Learning）

机器学习

变分推断

VAEs简介

VAEs模型的理论推导

VAEs模型

文章被以下专栏收录

学术兴趣小组