So far, we have talked about supervised learning and self-supervised learning, where the goal is to encode an object $x$
into a meaningful feature representation $E(x)$ such that we can distinguish different objects.
The goal of generative models is different:
Given samples $x_1, \ldots, x_N$ from the unknown probability distribution $p(x)$, we want to learn the probability distribution itself.
At a minimum, we want to learn a model that allows us to sample from the distribution.
In some cases, we can also learn the density $p(x)$, so that we can evaluate the log-likelihood itself.
We have encountered generative models when we talked about transformer models.
GPT is a generative autoregressive model.
Today, there are quite a lot of generative models.
Historically, one of the first families of generative models was autoregressive models, which focus on predicting future values of a sequence.
Other important directions which we will cover in this course include:
They are very different concepts, but they all solve the same problem: learning a (parametric form of a) high-dimensional probability distribution from samples.
Autoregressive models are based on the following representation of the probability density (for continuous variables) or probability mass (for discrete symbols):
$$p(x_1, \ldots, x_T) = \prod_{t=1}^T p(x_t \vert x_{t-1}, \ldots, x_1).$$
The first neural autoregressive distribution estimator was proposed in The Neural Autoregressive Distribution Estimator in 2011.
The architecture is not up-to-date, but it is quite instructive. It was motivated by the fully visible sigmoid belief network (FVSBN), proposed as early as 1992! In FVSBN, each value is conditioned on the previous ones using a one-layer network.
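To make the idea concrete, here is a minimal sketch of an FVSBN-style model in PyTorch (the class and method names are my own): each binary value $x_t$ is predicted from $x_{<t}$ by a single masked linear map followed by a sigmoid, and the log-likelihood factorizes over $t$.

```python
import torch
import torch.nn as nn

class FVSBN(nn.Module):
    """Fully visible sigmoid belief network (sketch):
    p(x) = prod_t sigmoid(w_t^T x_{<t} + b_t) for binary x."""
    def __init__(self, dim):
        super().__init__()
        # One weight matrix; a strictly lower-triangular mask enforces causality.
        self.weight = nn.Parameter(torch.zeros(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.register_buffer("mask", torch.tril(torch.ones(dim, dim), diagonal=-1))

    def forward(self, x):
        # x: (batch, dim) with entries in {0, 1}
        logits = x @ (self.weight * self.mask).T + self.bias
        # Bernoulli log-likelihood of each x_t given x_{<t}
        log_p = -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none")
        return log_p.sum(dim=1)  # log p(x) per sample
```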
The idea of a likelihood-based autoregressive model for images was proposed in two papers:
Conditional image generation with PixelCNN decoders
Pixel Recurrent Neural Networks
Formally, an image is not a sequence, so we need to define a certain ordering of the pixels.
In PixelCNN, the conditional distribution is modelled using a convolutional neural network.
There are several variants to model such conditional distributions.
The PixelRNN paper uses an LSTM to build such conditional distributions.
In Conditional image generation with PixelCNN decoders, the ReLU units are replaced with masked gated convolutions:
$$y = \tanh(W_{k, f} * x) \odot \sigma(W_{k, g} * x).$$
It uses the raster-scan ordering with the following mask on the filter!
It also uses two stacks to increase the receptive field: a vertical and a horizontal one. One conditions only on the current row, the other conditions on all the previous rows.
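Below is a minimal sketch of the masked convolution used by PixelCNN-style models; mask type "A" (first layer) also hides the centre pixel, type "B" keeps it. The class name and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed for 'future' pixels in raster-scan order."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                      # (out, in, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0   # centre row: right of (or incl.) centre
        mask[:, :, kH // 2 + 1:, :] = 0                          # all rows below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask    # zero out the 'future' weights
        return super().forward(x)

# Usage: the first layer masks the centre pixel itself.
layer = MaskedConv2d("A", in_channels=3, out_channels=64, kernel_size=7, padding=3)
```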
Certain improvements have been proposed in the paper: Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An Improved Autoregressive Generative Model.
The available options for modelling such conditional distributions are:
Autoregressive generative models were first proposed for images, but their biggest usage and success were achieved not for images but for time series (which is natural).
For audio signals and tasks, such models held the state of the art for quite a long time and are still being used.
Namely, this is the WaveNet architecture, proposed in the paper Generative model for raw audio.
Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value $x_t$ with a softmax layer and it is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyperparameters on a validation set and can easily measure if the model is overfitting or underfitting.
The model cannot depend on future timesteps, i.e. it must be causal.
For images, the analogue of the causal convolution is the masked convolution, where the 'wrong' pixels in the convolution are multiplied by zero.
At training time, such convolutions can be applied at all positions simultaneously, whereas at inference we have to make predictions one step at a time.
Q: What is the difference from recurrent neural networks (RNNs)?
In a dilated convolution, the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient.
Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.
The dilation is doubled at every layer (1, 2, 4, 8, \ldots).
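Here is a sketch of a stack of causal dilated 1-D convolutions with the dilation doubling at every layer; left padding makes each convolution causal. The class name and channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that only looks at the past (left padding)."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))       # pad only on the left
        return self.conv(x)

# Dilation doubles every layer: the receptive field grows exponentially.
layers = nn.Sequential(*[
    CausalDilatedConv1d(channels=32, kernel_size=2, dilation=2 ** i)
    for i in range(8)                     # dilations 1, 2, 4, ..., 128
])
x = torch.randn(1, 32, 1000)
y = layers(x)                             # same time length as the input
```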
As the non-linearity, the model uses gated units of the form
$$\mathbf{z}=\tanh \left(W_{f, k} * \mathbf{x}\right) \odot \sigma\left(W_{g, k} * \mathbf{x}\right).$$
The WaveNet architecture has been used for text-to-speech (TTS) generation.
In order to condition the generation on text, a conditioning signal has been introduced into the architecture.
It is achieved by upsampling the local conditioning features to the audio resolution and using them inside the gating mechanism.
We are given a second time series $h_t$, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a text-to-speech (TTS) model.
This time series is first transformed using a transposed convolutional network (learned upsampling) that maps it to a new time series $\mathbf{y}=f(\mathbf{h})$ with the same resolution as the audio signal, which is then used in the activation unit as follows:
$$ \mathbf{z}=\tanh \left(W_{f, k} * \mathbf{x}+V_{f, k} * \mathbf{y}\right) \odot \sigma\left(W_{g, k} * \mathbf{x}+V_{g, k} * \mathbf{y}\right) $$
Note the gating mechanism again (actually, a quite powerful idea).
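A sketch of this conditional gated activation, assuming the conditioning $\mathbf{y}$ has already been upsampled to the audio resolution; here the $V_{f,k}$, $V_{g,k}$ terms are implemented as $1\times1$ convolutions, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalGatedActivation(nn.Module):
    """z = tanh(W_f * x + V_f * y) * sigmoid(W_g * x + V_g * y)."""
    def __init__(self, channels, cond_channels, kernel_size=2, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.w_f = nn.Conv1d(channels, channels, kernel_size, dilation=dilation, padding=pad)
        self.w_g = nn.Conv1d(channels, channels, kernel_size, dilation=dilation, padding=pad)
        self.v_f = nn.Conv1d(cond_channels, channels, kernel_size=1)
        self.v_g = nn.Conv1d(cond_channels, channels, kernel_size=1)

    def forward(self, x, y):
        # x: (batch, channels, T) audio features; y: (batch, cond_channels, T) upsampled conditioning
        T = x.shape[-1]
        # Cropping the right side of the padded convolution keeps it causal.
        f = self.w_f(x)[..., :T] + self.v_f(y)
        g = self.w_g(x)[..., :T] + self.v_g(y)
        return torch.tanh(f) * torch.sigmoid(g)
```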
Autoregressive models represent the probability in a tractable manner.
Variational autoencoders (VAE) represent the probability in a way such that the likelihood cannot be computed exactly, but it can be bounded and optimized using variational inference and samples.
But first, we will discuss models based on the transformations of the random variable.
The idea of RealNVP is to parametrize the distribution as
$$z \sim p(z), \quad x = f^{-1}(z) \quad \mbox{(generation)},$$
where $f$ is a parametrized invertible transformation.
The change of variables formula gives
$$p_X(x)=p_Z(z)\left|\operatorname{det}\left(\frac{\partial g(z)}{\partial z^T}\right)\right|^{-1}, \quad g = f^{-1}.$$
In RealNVP model, $f$ is parametrized as a superposition of affine coupling layers of the form $$ \begin{aligned} y_{1: d} & =x_{1: d} \\ y_{d+1: D} & =x_{d+1: D} \odot \exp \left(s\left(x_{1: d}\right)\right)+t\left(x_{1: d}\right) \end{aligned} $$
The Jacobian of such kind of mapping is triangular, and the determinant (and its logarithm) is easily computed.
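A minimal sketch of one affine coupling layer together with its log-determinant; $s$ and $t$ are small MLPs here, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling: the first d dims pass through unchanged,
    the rest are scaled and shifted by functions of the first d dims."""
    def __init__(self, dim, d, hidden=128):
        super().__init__()
        self.d = d
        self.s = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, dim - d))
        self.t = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, dim - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s = self.s(x1)
        y2 = x2 * torch.exp(s) + self.t(x1)
        log_det = s.sum(dim=1)            # log|det J| is just the sum of the scales
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        x2 = (y2 - self.t(y1)) * torch.exp(-self.s(y1))
        return torch.cat([y1, x2], dim=1)
```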
Now, let's get to the variational autoencoders!
We will discuss the VAE, which is one of the most interesting and powerful classes of generative models.
The VAE was introduced in the paper Auto-Encoding Variational Bayes and has become one of the most influential approaches in generative modelling.
Several reviews are available, for example this one.
We did not discuss the classical autoencoder much: we want to map an input $x$ to a latent space and then map back in order to reconstruct the initial image.
Problems:
In a VAE, we learn not the exact features, but distributions over them (intervals, roughly speaking).
That is, the encoder maps the input into a certain distribution (which can, of course, be close to a delta function).
Now, let's come up with a mathematical formalism.
Suppose there exists a hidden variable $z$ that generates the output variable $x$.
We know the distribution of $z$, $p(z)$, and we know the distribution of $x$ conditioned on $z$, $p(x \vert z)$.
If we want to encode, we need to know the distribution $p(z \vert x)$. This can be done using Bayes' formula:
$$p(z \vert x) = \frac{p(x \vert z) p(z)}{p(x)}, \quad p(x) = \int p(x \vert z) p(z)\, dz.$$
$p(x)$ is very difficult to evaluate in the general case!
We can use variational inference!
Let's approximate $p(z \vert x)$ by another distribution $q(z \vert x)$.
Now we can minimize the distance $\min_q \mathrm{KL}\left( q(z \vert x) \,\|\, p(z \vert x) \right)$, where $\mathrm{KL}$ is the Kullback–Leibler divergence between two distributions.
The KL divergence between two continuous distributions is defined as
$$\mathrm{KL}(p \,\|\, q) = E_p \log \frac{p}{q} = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$
The KL divergence is a measure of dissimilarity between two distributions (it is not a metric: it is not symmetric); it has the following properties:
Recall that we are minimizing $\mathrm{KL}\left( q(z \vert x) \,\|\, p(z \vert x) \right)$ over $q$.
$$\begin{aligned} \mathrm{KL}(q(z) \,\|\, p(z \mid x)) & =\int_z q(z) \log \frac{q(z)}{p(z \mid x)} \\ & =-\int_z q(z) \log \frac{p(z \mid x)}{q(z)} \\ & =-\left(\int_z q(z) \log \frac{p(x, z)}{q(z)}-\int_z q(z) \log p(x)\right) \\ & =-\int_z q(z) \log \frac{p(x, z)}{q(z)}+\log p(x) \int_z q(z) \\ & =-L+\log p(x) \end{aligned}$$
The value $L$ is called the variational lower bound.
We get
$$L = \log p(x) - \mathrm{KL}(q(z) \,\|\, p(z \mid x)).$$
Since the KL divergence is non-negative, we get
$$\log p(x) \geq L.$$
Also, since $\log p(x)$ is a constant for a fixed $x$, minimizing the original KL divergence over $q = q(z \vert x)$ is equivalent to maximizing $L$.
This bound is typically referred to as ELBO (Evidence Lower Bound).
We have (recall that $p(x, z) = p(x \vert z) p(z)$):
$$L = \int_z q(z) \log \frac{p(x, z)}{q(z)} = E_{q(z \vert x)}\log p(x \vert z) - \mathrm{KL}\left( q(z \vert x) \,\|\, p(z) \right).$$
These two terms have a clear interpretation.
The first term is a reconstruction term of the output from the latent vector.
The second term is the similarity of the distribution $q(z \vert x)$ to a prior distribution $p(z)$.
We can use $q(z \vert x)$ as the encoder in the architecture to infer possible latent variables $z$ for a given $x$.
We can parametrize $q(z \vert x)$ by a neural network.
But what will this neural network actually output?
The classical VAE parametrizes the latent distribution $q(z \vert x)$ as a Gaussian with mean $\mu$ and diagonal covariance matrix $\operatorname{diag}(\sigma^2)$.
The neural network will predict $\mu_j$ and $\sigma_j$ for each latent variable.
$\sigma_j$ has to be non-negative, so typically $\log \sigma_j$ is predicted rather than $\sigma_j$ itself.
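A sketch of such an encoder head (names and sizes are illustrative): a shared body followed by two linear heads, one for $\mu$ and one for $\log \sigma^2$.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder head that outputs mu and log(sigma^2) of q(z | x)."""
    def __init__(self, in_dim, latent_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.log_var_head = nn.Linear(hidden, latent_dim)   # predict log sigma^2, so sigma > 0

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.log_var_head(h)
```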
But the computational graph is stochastic.
How do we differentiate through it?
Before that, we need to mention that the KL divergence between two Gaussians can be computed analytically.
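In particular, for $q(z \vert x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$ the divergence reduces to
$$\mathrm{KL}\left(q(z \vert x) \,\|\, p(z)\right) = \frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right).$$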
The decoder network $p(x \vert z)$ will also produce a mean value and a variance for each pixel!
So:
The ELBO has the following form:
$$L = E_{q(z \vert x)}\log p(x \vert z) - \mathrm{KL}\left( q(z \vert x) \,\|\, p(z) \right).$$
We need to compute the gradient with respect to the parameters of $p(x \vert z)$ and the parameters of the distribution $q$ (also called variational parameters).
The main challenge is the first term, since for Gaussians the second term is given by an analytic formula.
The gradient of $E_{q(z \vert x)}\log p(x \vert z)$ with respect to the parameters of $p$ can be computed as an expectation:
$$ \nabla_p E_{q(z \vert x)}\log p(x \vert z) = E_{q(z \vert x)} \nabla_p \log p(x \vert z).$$
The gradient with respect to the variational parameters of $q$ is much trickier.
Fortunately, for basic distributions the expectation over a parameter-dependent distribution can be estimated using the reparametrization trick, a special technique proposed for VAEs.
For Gaussians, the trick is simple: we write $z = \mu + \sigma \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$.
The randomness is now in $\varepsilon$, and we can differentiate with respect to $\mu$ and $\sigma$.
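A minimal sketch of the reparametrized sampling step and the resulting negative-ELBO loss, assuming a diagonal Gaussian posterior and a Bernoulli decoder; the decoder output `x_recon_logits` and the function names are hypothetical.

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def elbo_loss(x, x_recon_logits, mu, log_var):
    # Reconstruction term E_q log p(x|z), here for a Bernoulli decoder (x in [0, 1]).
    recon = -torch.nn.functional.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum")
    # Analytic KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * torch.sum(mu ** 2 + torch.exp(log_var) - 1.0 - log_var)
    return -(recon - kl)   # minimize the negative ELBO
```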
In general, when we have losses of the form
$E_{z \sim q} h(z)$, where $q$ depends on parameters, it is not so easy to compute their gradient efficiently.
You can do quite
Once the model is trained, we can:
Taking random samples from $q_\phi(\mathbf{z} \mid \mathbf{x})$, a Monte Carlo estimator of this is: $$ \log p_{\boldsymbol{\theta}}(\mathbf{x}) \approx \log \frac{1}{L} \sum_{l=1}^L p_{\boldsymbol{\theta}}\left(\mathbf{x}, \mathbf{z}^{(l)}\right) / q_\phi\left(\mathbf{z}^{(l)} \mid \mathbf{x}\right) $$
Note that $L=1$ corresponds to the VAE objective (noisy).
Better estimates are obtained by taking more samples $L$ (importance-weighted estimates).
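A sketch of the estimator above, computed in log-space with log-sum-exp for numerical stability; the arguments `log_p_xz` and `log_q_zx` are assumed to hold the log-densities at $L$ samples $z^{(l)} \sim q_\phi(z \mid x)$.

```python
import math
import torch

def log_likelihood_estimate(log_p_xz, log_q_zx):
    """Estimate log p(x) from L samples z_l ~ q(z | x).

    log_p_xz, log_q_zx: tensors of shape (L,) with log p(x, z_l) and log q(z_l | x)."""
    L = log_p_xz.shape[0]
    log_w = log_p_xz - log_q_zx                    # log importance weights
    # log (1/L) sum_l exp(log_w_l), computed stably with logsumexp
    return torch.logsumexp(log_w, dim=0) - math.log(L)
```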
We have seen in several places that discrete latent codes can be preferable to Gaussian-type latent vectors.
The original idea (VQ-VAE) was proposed in the paper Neural Discrete Representation Learning.
For example, for I-BERT we need visual tokens (i.e. a discrete representation of each patch).
Thus, we need to specify the architecture and the learning process.
In VQ-VAE (vector quantization with VAE), the latent space is defined as $\mathbb{R}^{K \times D}$, where
$K$ is the size of the discrete latent space (i.e., $K$-way categorical) and $D$ is the dimensionality of each latent embedding vector $e_i$.
The posterior categorical distribution $q(z \mid x)$ is defined as one-hot as follows: $$ q(z=k \mid x)= \begin{cases}1 & \text { for } k=\operatorname{argmin}_j\left\|z_e(x)-e_j\right\|_2, \\ 0 & \text { otherwise, }\end{cases} $$ where $z_e(x)$ is the output of the encoder network. This model can be viewed as a VAE in which we bound $\log p(x)$ with the ELBO. The proposal distribution $q(z=k \mid x)$ is deterministic, and by defining a simple uniform prior over $z$ we obtain a KL divergence that is constant and equal to $\log K$.
The representation $z_e(x)$ is passed through the discretisation bottleneck by mapping it onto the nearest element of the embedding $e$: $$ z_q(x)=e_k, \quad \text {where} \quad k=\operatorname{argmin}_j\left\|z_e(x)-e_j\right\|_2. $$
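A sketch of this nearest-neighbour lookup over the codebook; the function name and the flattened $(N, D)$ layout of the encoder output are assumptions.

```python
import torch

def vector_quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e: (N, D) encoder outputs; codebook: (K, D) embedding vectors e_j."""
    # Squared Euclidean distances between every z_e and every codebook entry.
    distances = torch.cdist(z_e, codebook) ** 2       # (N, K)
    indices = distances.argmin(dim=1)                 # k = argmin_j ||z_e - e_j||
    z_q = codebook[indices]                           # quantized vectors, shape (N, D)
    return z_q, indices
```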
We will be training the encoder, the decoder, and the embeddings simultaneously!
The distribution over categories is assumed to be uniform in the original paper.
Thus, the KL loss is constant and is not involved in the optimization, so other losses are needed!
Instead, the following loss is used:
Thus, the total training objective becomes: $$ L=\log p\left(x \mid z_q(x)\right)+\left\|\operatorname{sg}\left[z_e(x)\right]-e\right\|_2^2+\beta\left\|z_e(x)-\operatorname{sg}[e]\right\|_2^2, $$
where $\operatorname{sg}$ is a stop-gradient operation (i.e. it is identity in the forward pass, and has zero partial derivatives in the backward pass).
The last term, the commitment loss, which only applies to the encoder weights, encourages the output of the encoder to stay close to the chosen codebook vector, to prevent it from fluctuating too frequently from one code vector to another.
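A sketch of this objective with the stop-gradient implemented via `.detach()`; the Gaussian-decoder reconstruction term (MSE) and all names are assumptions, and the straight-through trick for feeding the decoder is noted in a comment.

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    """VQ-VAE objective: reconstruction + codebook loss + commitment loss.

    z_e: encoder output, z_q: nearest codebook vectors (same shape)."""
    # Straight-through estimator: the decoder should be fed
    # `z_e + (z_q - z_e).detach()`, so its gradients flow back into the encoder.
    recon_loss = F.mse_loss(x_recon, x)            # -log p(x | z_q) up to constants, Gaussian decoder assumed
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # ||sg[z_e(x)] - e||^2, moves the codebook
    commitment_loss = F.mse_loss(z_e, z_q.detach())  # beta * ||z_e(x) - sg[e]||^2, moves the encoder
    return recon_loss + codebook_loss + beta * commitment_loss
```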
An improvement of the VQ-VAE scheme was proposed in Generating Diverse High-Fidelity Images with VQ-VAE-2.
It uses the same loss, but tries to learn hierarchical codes.
In the first stage, hierarchical latent codes are used.
For example, for 256 × 256 images, a two level latent hierarchy is used. The encoder network first transforms and downsamples the image by a factor of 4 to a 64 × 64 representation which is quantized to our bottom level latent map. Another stack of residual blocks then further scales down the representations by a factor of two, yielding a top-level 32 × 32 latent map after quantization.
The decoder is similarly a feed-forward network that takes as input all levels of the quantized latent hierarchy. It consists of a few residual blocks followed by a number of strided transposed convolutions to upsample the representations back to the original image size.
Once we have the encoder, we can model the distribution of the discrete latents
with a PixelCNN autoregressive architecture (in fact, the PixelSNAIL approach with self-attention).
This prior is trained independently of the encoder and decoder.
Of course, one can use other discrete distribution models over those latents.
VQ-VAE architecture:
One can easily use GPT to model the dependencies over the resulting latent variables, for example as in the VideoGPT paper, where GPT is used to predict the discrete latent codes.
A significant improvement of VQ-VAE has been achieved by combining the ideas of VQ-VAE with the ideas of generative adversarial networks (GANs), resulting in VQ-GAN.
In short (we will cover this in the lecture on GANs), the idea is to add an additional loss on the quality of the generated images, and this loss is borrowed from GANs.