lecture-14

What is GANs¶

We are considering the problem of learning a distribution from samples.

We parametrize the distribution as

$$G(z), z \sim p(z).$$

So, we have access to real samples and fake samples (generated from the generator).

How we distinguish between them?

The original idea of Generative Adversarial Nets

has the following form: we treat the problem of distinguishing between real and fake samples as binary classification task.

Optimization formulation¶

We have a min-max problem: we want the discriminator to detect the fakes, whereas we want the generator to fool the discriminator

$$\min_G \max_D L(D, G), \quad L(D, G) = E_{x \sim p_{r}} \log D(x) + E_{z \sim p_z} \log(1-D(G(z))$$

What the solution of this problem?

Exact solution of the min-max problem¶

Suppose that real data has density $\rho_r$ and synthetic data has density $\rho_g$.

Then, we have

$$\int \rho_r(x) \log D(x) + \int \rho_g(x) \log (1 - D(x)).$$

By optimizing over $D(x)$ we get

$$\frac{\rho_r}{D} - \frac{\rho_g}{1-D} = 0,$$$$\rho_g(1-D) - \rho_rD = 0, \quad D = \frac{\rho_r}{\rho_g + \rho_r}$$

This is the optimal discriminator.

If we put the optimal discriminator into the formula, we get

$$L(G, D_*) = \int \rho_r \log \frac{\rho_r}{\rho_g + \rho_r} + \int \rho \log \frac{\rho_g}{\rho_g + \rho_r}$$

Optimal formula¶

For the optimal discriminator defined before, we have

$$L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2, $$

where $D_{JS}$ is the Jensen-Shannon divergence (symmetrized version of the KL-divergence)

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}(p \| \frac{p + q}{2}) + \frac{1}{2} D_{KL}(q \| \frac{p + q}{2})$$

Thus, minimization over $G$ will lead to $p_r = p_g$ if the generator is expressive enough,

and $D_* = \frac{1}{2}$.

Training a basic GAN¶

In training a basic GAN, we:

Make a few optimizer steps in $G$
Make a few optimizer step in $D$.

Sometimes, in early training $G$ is quite poor and $\log(1 - D(G(z))$ saturates and it has been proposed to train $G$ to maximize $\log D(G(z))$ rather than minimize $\log(1 - D(G(z))$.

The gradients are much larger.

The problem with this 'non-saturating' loss is that it is heuristic and is not well-described by theory.

Note on non-saturating loss¶

With the optimal $D_*$ in this case we have $$ \begin{array}{r} E_{x \sim p_g}\left[-\log \left(D_G^*(x)\right)\right]+E_{x \sim p_g}\left[\log \left(1-D_G^*(x)\right)\right] \\ =E_{x \sim p_g}\left[\log \frac{\left(1-D_G^*(x)\right)}{D_G^*(x)}\right]=E_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text {data }}(x)}\right] \\ =K L\left(p_g \| p_{\text {data }}\right) . \end{array} $$

After putting it all together, we will get the following loss for $G$ as $$ \begin{aligned} & E_{x \sim p_g}\left[-\log \left(D_G^*(x)\right)\right] \\ & =K L\left(p_g \| p_{\text {data }}\right)-2 J S\left(p_{\text {data }} \| p_g\right) \\ & +E_{x \sim p_{\text {data }}}\left[\log D_G^*(x)\right]+2 \log 2 . \\ & \end{aligned} $$ The first loss minimizes the difference, while the second maximizes. This may result in unstable training for $G$.

Two extreme cases:

If $p_{\text {data }}(x) \rightarrow 0$ and $p_g(x) \rightarrow 1$, we have $K L\left(p_g\right.$ $\left.\| p_{\text {data }}\right) \rightarrow+\infty$.
If $p_{\text {data }}(x) \rightarrow 1$ and $p_g(x) \rightarrow 0$, we have $K L\left(p_g\right.$ $\left.\| p_{\text {data }}\right) \rightarrow 0$

First error: $G$ produces implausible examples. The second error: generator produces insufficiently diverse ('safe' example). This problem is known as mode collapse.

DCGAN¶

Original GAN model involved MLP and is not scalable.

DeepConvolutional GAN was the first working GAN for large-scale image generation. It provided several recipes for building architectures and stable training,

such as

Using BatchNorm
Using fully-convolutional layers
Remove pooling layers

Wasserstein GAN (WGAN)¶

Wasserstein GAN was based on the idea of using another type of metric between distribution to minimize, namely the Wasserstein distance.

The Earth-Mover (EM) distance or Wasserstein-1 $$ W\left(\mathbb{P}_r, \mathbb{P}_g\right)=\inf _{\gamma \in \Pi\left(\mathbb{P}_r, \mathbb{P}_g\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|] $$ where $\Pi\left(\mathbb{P}_r, \mathbb{P}_g\right)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_g$. Intuitively, $\gamma(x, y)$ indicates how much "mass" must be transported from $x$ to $y$ in order to transform the distributions $\mathbb{P}_r$ into the distribution $\mathbb{P}_g$. The EM distance then is the "cost" of the optimal transport plan.

Example of when Wasserstein can be better¶

Example 1 (Learning parallel lines). Let $Z \sim U[0,1]$ the uniform distribution on the unit interval. Let $\mathbb{P}_0$ be the distribution of $(0, Z) \in \mathbb{R}^2$ (a 0 on the x-axis and the random variable $Z$ on the y-axis), uniform on a straight vertical line passing through the origin. Now let $g_\theta(z)=(\theta, z)$ with $\theta$ a single real parameter. It is easy to see that in this case,

$W\left(\mathbb{P}_0, \mathbb{P}_\theta\right)=|\theta|$
$J S\left(\mathbb{P}_0, \mathbb{P}_\theta\right)= \begin{cases}\log 2 & \text { if } \theta \neq 0 \\ 0 & \text { if } \theta=0\end{cases}$
$K L\left(\mathbb{P}_\theta \| \mathbb{P}_0\right)=K L\left(\mathbb{P}_0 \| \mathbb{P}_\theta\right)= \begin{cases}+\infty & \text { if } \theta \neq 0, \\ 0 & \text { if } \theta=0,\end{cases}$
and $\delta\left(\mathbb{P}_0, \mathbb{P}_\theta\right)= \begin{cases}1 & \text { if } \theta \neq 0 \\ 0 & \text { if } \theta=0\end{cases}$,

where $\delta$ is a Total Variation distance, defined as

$$ \delta\left(\mathbb{P}_r, \mathbb{P}_g\right)=\sup _{A \in \Sigma}\left|\mathbb{P}_r(A)-\mathbb{P}_g(A)\right| . $$

As we see, the only 'smooth' is the Wasserstein distance.

Training objective of WGAN¶

Computing infimum in

$$ W\left(\mathbb{P}_r, \mathbb{P}_g\right)=\inf _{\gamma \in \Pi\left(\mathbb{P}_r, \mathbb{P}_g\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|] $$

is intractable (or not easy at least). There is a Kantorovich duality representation of this distance as

$$ W\left(\mathbb{P}_r, \mathbb{P}_\theta\right)=\sup _{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)]-\mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)] $$

where the supremum is over all the 1-Lipschitz functions $f: \mathcal{X} \rightarrow \mathbb{R}$.

It is also sufficient to optimize over bounded-Lipschitz functions

The objective is similar to the GAN, but with the Lipschitz constraint.

Enforcing Lipschitz constraint¶

The original WGAN paper proposed to use weight clipping to $[-c, c]$ (and confirmed it is a terrible idea).

A subsequent paper Improved Wasserstein GAN training uses gradient penalty instead

The resulting loss function has the form

$$L=\underbrace{\underset{\tilde{\boldsymbol{x}} \sim \mathbb{P}_g}{\mathbb{E}}[D(\tilde{\boldsymbol{x}})]-\underset{\boldsymbol{x} \sim \mathbb{P}_r}{\mathbb{E}}[D(\boldsymbol{x})]}_{\text {Original critic loss }}+\underbrace{\lambda \underset{\hat{\boldsymbol{x}} \sim \mathbb{P}_{\hat{\boldsymbol{x}}}}{\mathbb{E}}\left[\left(\left\|\nabla_{\hat{\boldsymbol{x}}} D(\hat{\boldsymbol{x}})\right\|_2-1\right)^2\right] .}_{\text {Our gradient penalty }}$$

and $\hat{x}$ is sampled from straight lines between original samples and fake samples.

Spectral normalization¶

Another way of stabilizing GAN training is Spectral normalization

The idea is to introduce spectral norm bound on the weights (by doing power iteration)

Algorithm 1 SGD with spectral normalization

Initialize $\tilde{\boldsymbol{u}}_l \in \mathcal{R}^{d_l}$ for $l=1, \ldots, L$ with a random vector (sampled from isotropic distribution).
For each update and each layer $l$ :

Apply power iteration method to a unnormalized weight $W^l$ :

$$ \begin{aligned} & \tilde{\boldsymbol{v}}_l \leftarrow\left(W^l\right)^{\mathrm{T}} \tilde{\boldsymbol{u}}_l /\left\|\left(W^l\right)^{\mathrm{T}} \tilde{\boldsymbol{u}}_l\right\|_2 \\ & \tilde{\boldsymbol{u}}_l \leftarrow W^l \tilde{\boldsymbol{v}}_l /\left\|W^l \tilde{\boldsymbol{v}}_l\right\|_2 \end{aligned} $$

Calculate $\bar{W}_{\mathrm{SN}}$ with the spectral norm:

$$ \bar{W}_{\mathrm{SN}}^l\left(W^l\right)=W^l / \sigma\left(W^l\right), \text { where } \sigma\left(W^l\right)=\tilde{\boldsymbol{u}}_l^{\mathrm{T}} W^l \tilde{\boldsymbol{v}}_l $$

Update $W^l$ with SGD on mini-batch dataset $\mathcal{D}_M$ with a learning rate $\alpha$ :

$$ W^l \leftarrow W^l-\alpha \nabla_{W^l} \ell\left(\bar{W}_{\mathrm{SN}}^l\left(W^l\right), \mathcal{D}_M\right) $$

Spectral normalization is present in many GAN architectures nowdays, but it is not present in StyleGAN.

LSGAN¶

The cross-entropy is not a panacea. One can actually use different loss functions, for example Least Squares GAN

with the min-max problem of the form $$ \begin{aligned} \min _D V_{\mathrm{LSGAN}}(D) & =\frac{1}{2} \mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}\left[(D(\boldsymbol{x})-b)^2\right] \\ & +\frac{1}{2} \mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}\left[(D(G(\boldsymbol{z}))-a)^2\right] \\ \min _G V_{\mathrm{LSGAN}}(G) & =\frac{1}{2} \mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}\left[(D(G(\boldsymbol{z}))-c)^2\right] \end{aligned} $$

The resulting objective is smoother.

Convergence of the training process¶

GANs are much more difficult to train, than standard networks due to the adversarial loss.

It can be shown, that min-max problems solved with gradient descent/ascent are prone to oscillatory behaviour

due to the presence of complex eigenvalues in the linearization of the model.

Many methods have been proposed to 'monotonize' the convergence, but most of them are not very functional.

Adaptive discriminator augmentation¶

If we train on a small dataset, the discriminator tends to overfit.

In Training Generative Adversarial Networks with Limited Data, by T. Karras et. al

it has been proposed to make problem more difficult for the discriminator by data augmentation.

Note, we need to use differentiable data augmentation.

The strength of augmentation also has to be determined.

It significantly improves on small and medium datasets!

Metrics for evaluating the quality of generated images¶

The standard metrics for evaluating the qualities of generated images are:

Inception Score
Frechet Inception distance (FID)
Precision and Recall

Inception Score¶

Inception score has been proposed in Improved Techniques for Training GANs

The idea is to compute conditional label distribution $p(y \mid x)$ with an Inception model. Good images are assumed to have low-entropy, quantized by the formula

$$\exp \left(\mathbb{E}_{\boldsymbol{x}} \operatorname{KL}(p(y \mid \boldsymbol{x}) \| p(y))\right)$$

Frechet Inception Distance¶

By far the most popular metric, proposed in GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

The images are mapped by the Inception v2. network to a latent space.
In the latent space we fit two Gaussians, one for real images, another for fake images
We compute the Wasserstein-2 distance between those two Gaussians by using the formula

$$d^2\left((\boldsymbol{m}, \boldsymbol{C}),\left(\boldsymbol{m}_w, \boldsymbol{C}_w\right)\right)=\left\|\boldsymbol{m}-\boldsymbol{m}_w\right\|_2^2+\operatorname{Tr}\left(\boldsymbol{C}+\boldsymbol{C}_w-2\left(\boldsymbol{C} \boldsymbol{C}_w\right)^{1 / 2}\right)$$

Disadvantage: it only compares two momemts of the distributions, thus ignoring higher-order topological representations.

Generating higher resolution images¶

In order to generate higher-resolution images, several approaches have been proposed.

In SAGAN (Self-attention GAN) a self-attention is built into the CNN architecture.

SAGAN was later scaled to BigGAN by using several engineering tricks.

The next step has been progressive growing of GANS, which we will discuss.

Progressive growing of GANS¶

In Progressive Growing of GANs for Improved Quality, Stability, and Variation a new methodology for training GANs has been proposed.

The idea is to train on lower resolutions and add layers to $G$ and $D$.

Other techniques include minibatch discrimination

StyleGAN¶

A Style-Based Generator Architecture for Generative Adversarial Networks by T. Karras et. al.

the StyleGAN architecture has been proposed.

The loss function and discriminator are the same, only generator (+few tricks) is redesigned to put noise in different resolutions.

One of the new block is Adaptive Normalization given as

$$\operatorname{AdaIN}\left(\mathbf{x}_i, \mathbf{y}\right)=\mathbf{y}_{s, i} \frac{\mathbf{x}_i-\mu\left(\mathbf{x}_i\right)}{\sigma\left(\mathbf{x}_i\right)}+\mathbf{y}_{b, i}$$

StyleGAN version 2¶

In Analyzing and Improving the Image Quality of StyleGAN the quality of generated images has improved even further.

The AdaIN operations has been removed, replaced by the weight demodulation, which is just normalization of convolutional weights.
Regularization term in the loss is used less frequently
The new architecture for the generator does not use progressive growing and uses skip connection

StyleGAN version 2: results¶

At a single $\mathbf{w} \in \mathcal{W}$ the local metric scaling properties of the generator mapping $g(\mathbf{w}): \mathcal{W} \rightarrow \mathcal{Y}$ are captured by the Jacobian matrix $\mathbf{J}_{\mathbf{w}}=\delta g(\mathbf{w}) / \delta \mathbf{w}$. Motivated by the desire to preserve the expected lengths of vectors regardless of the direction, we formulate the regularizer as: $$ \mathbb{E}_{\mathbf{w}, \mathbf{y} \sim \mathcal{N}(0, \mathbf{I})}\left(\left\|\mathbf{J}_{\mathbf{w}}^{\mathbf{T}} \mathbf{y}\right\|_2-a\right)^2 $$ where $y$ are random images with normally distributed pixel intensities, and $w \sim f(z)$, where $z$ are normally distributed.

Quality of generation of images¶

Generative image2image transformations¶

Conditional adversarial networks showed that it is quite easy to modify the generation subject to some conditioning information.

Instead of $p(x)$ we just model $p(x \mid y)$.

The original paper conditioned on the class, but you can also do image2image transformations!

Image2image transformations with conditional GANs¶

Image-to-Image Translation with Conditional Adversarial Networks

shows how to modify GAN architecture for image-to-image translation.

A conditional GAN learns a mapping from a pair $\{x, z\}$ to the output $y$.

The objective of a conditional GAN can be expressed as $$ \begin{aligned} \mathcal{L}_{c G A N}(G, D)= & \mathbb{E}_{x, y}[\log D(x, y)]+ \\ & \mathbb{E}_{x, z}[\log (1-D(x, G(x, z))] \end{aligned} $$

Since we have to output something close to $y$, it is beneficial to add reconstruction loss to the model:

$$ \mathcal{L}_{L 1}(G)=\mathbb{E}_{x, y, z}\left[\|y-G(x, z)\|_1\right] $$

Our final objective is $$ G^*=\arg \min _G \max _D \mathcal{L}_{c G A N}(G, D)+\lambda \mathcal{L}_{L 1}(G) $$

Unpair image translation¶

The approach described before requires a parallel corpus. GAN, however, learn distributions.

Can we learn a mapping without having paired data?

One of the seminal methods has been cycleGAN proposed in Unpaired image-to-image translation using cycle-consistent adversarial networks.

CycleGAN¶

GAN-loss has the standard form: \begin{aligned} \mathcal{L}_{\mathrm{GAN}}\left(G, D_Y, X, Y\right) & =\mathbb{E}_{y \sim p_{\text {data }}(y)}\left[\log D_Y(y)\right] \\ & +\mathbb{E}_{x \sim p_{\text {data }}(x)}\left[\log \left(1-D_Y(G(x))\right]\right. \end{aligned}

Cycle consistency loss: \begin{aligned} \mathcal{L}_{\mathrm{cyc}}(G, F) & =\mathbb{E}_{x \sim p_{\text {data }}(x)}\left[\|F(G(x))-x\|_1\right] \\ & +\mathbb{E}_{y \sim p_{\text {data }}(y)}\left[\|G(F(y))-y\|_1\right] \end{aligned}

Full formulation:

\begin{aligned} \mathcal{L}\left(G, F, D_X, D_Y\right)= & \mathcal{L}_{\mathrm{GAN}}\left(G, D_Y, X, Y\right) \\ & +\mathcal{L}_{\mathrm{GAN}}\left(F, D_X, Y, X\right) \\ & +\lambda \mathcal{L}_{\mathrm{cyc}}(G, F) \end{aligned}

Multiple domains¶

CycleGAN and cGAN with paired data can only learn for two domains.

The approach of StarGAN conditions the generation on the domain type with a single model.

The loss has 3 components: the adversarial loss, the reconstruction loss and the domain classification loss (obtain from the discriminator).

StarGAN also has 2 versions.

Applications of GANs¶

Image Synthesis (images that do not exist, deepfakes, avatars)
Image2Image translation
Data Augmentation

GAN inversion¶

One of the problem of a GAN model is that it is not invertible easily.

There is a big review on the subject.

Given a real image $x$, it is not always possible to find $z$ such that $G(z) \approx x$.

For standard GAN, we can look for optimal $z$.

For StyleGAN, additionally, we can optimize over

A typical solution to this problem now for StyleGAN is to optimize over $W$ or even every style vector (named $W^{+}$).

Interpretable directions¶

The latent space of GANs has interesting structure.

One can find linear direction in it which corresponds to interpretable directions.

There are many papers on this and many way to disentangle such directions.

One of the has been proposed by Voynov and Babenko

Finding interpretable directions¶

We need to find directions in the latent space that are interpretable.

We introduce an additional network $R$ (called reconstructor)

Reconstructor receives a pair $G(z), G(z + \varepsilon a_k)$ and tries to predict the shift.

$$\min _{A, R} \underset{z, k, \varepsilon}{\mathbb{E}} L(A, R)=\min _{A, R} \underset{z, k, \varepsilon}{\mathbb{E}}\left[L_{c l}(k, \widehat{k})+\lambda L_r(\varepsilon, \widehat{\varepsilon})\right]$$

Lecture 14: Generative models II (Generative adversarial models)¶

Previous lecture: Generative models I¶

Current lecture:¶