Understanding Deep Generative Models from the Ground Up



The lecture has a 40-minute time limit and we have lengthy slides to cover. We may skip some pages if we have to :3.

Mathematical Foundations

In this section, we will give a brief introduction to information theory. Then we will dive into the following concepts, which are fundamental to deep generative models.

  • KL-divergence

  • Jensen–Shannon divergence

Information Theory - Recap

  • The amount of information can be understood as the surprise (surprisal) of an event.

  • That is, rare events give us more information: we need more bits to encode such an event in a binary encoding.

  • Suppose we have a fair die, represented by a rv \(X\) with outcomes \(\{1, 2, 3, 4, 5, 6\}\). The amount of information of the event \(X = 3\) is measured by \(I(X = 3) = \log_2 \frac{1}{P(X = 3)} = \log_2 6 \approx 2.58\) bits.
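A minimal sketch of this computation in plain Python (the function name is only for illustration):

```python
import math

def information_content(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    return math.log2(1.0 / p)

# Fair die: every face has probability 1/6.
print(information_content(1 / 6))   # I(X = 3) = log2(6) ≈ 2.585 bits
```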

Information Theory - Information Content

  • The amount of information can also be intuitively understood as the number of perfect yes-no queries needed to locate that event.

  • Take our \(X = 3\) example. We can ask three yes-no queries to find it out:

    • Is it greater than or equal to 4? No

    • Is it greater than or equal to 2? Yes

    • Is it 3? Yes

  • With perfect yes-no queries, we expect to ask about \(\log_2 \frac{1}{P(X = 3)} \approx 2.58\) questions.

Information Theory - Information Content

  • In fact, making this idea rigorous requires the law of large numbers and further theory, which is out of scope for this lecture.

  • The information content is additive over independent events. Suppose we roll the die twice. The information content of the joint event \(X_1 = 3, X_2 = 2\) is

\[\begin{aligned} \log_2 \frac{1}{P(X_1 = 3, X_2 = 2)} &= \log_2 \frac{1}{P(X_1 = 3)} + \log_2 \frac{1}{P(X_2 = 2)} \\ &= I(X_1 = 3) + I(X_2 = 2) \end{aligned}\]
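A quick numerical check of this additivity (plain Python):

```python
import math

p = 1 / 6                                    # probability of any single face of a fair die
print(math.log2(1 / (p * p)))                # I(X1 = 3, X2 = 2) ≈ 5.17 bits
print(math.log2(1 / p) + math.log2(1 / p))   # I(X1 = 3) + I(X2 = 2): the same value
```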

Information Theory - Entropy

  • The (Shannon) entropy of a rv \(X\), denoted \(H(X)\), can be understood as the expected amount of information.

  • Suppose a rv \(X\) with pmf/pdf \(P(x)\). Then

\[\begin{aligned} H(X) &= E_{X \sim P}[I(X)] \\ &= \sum_x P(x) \log_b \frac{1}{P(x)} \quad \text{(discrete)} \\ &= \int P(x) \log_b \frac{1}{P(x)} \, dx \quad \text{(continuous)} \end{aligned}\]
  • The entropy is usually interpreted as the randomness of a random variable.
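A short numerical sketch, assuming NumPy is available; the biased die is a made-up example for contrast:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (zero-probability terms contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

fair   = np.ones(6) / 6                              # fair die
biased = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])    # loaded die

print(entropy(fair))     # log2(6) ≈ 2.585 bits: maximal randomness over 6 outcomes
print(entropy(biased))   # ≈ 2.161 bits: more predictable, hence lower entropy
```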

Information Theory - Information Gain

  • What if we encode a true distribution \(P\) using an approximating distribution \(Q\)?

  • Encoding based on \(Q\) can be thought of as asking imperfect queries about \(P\).

  • The difference \(E_{x \sim P(x)}[\log \frac{1}{Q(x)}] - E_{x \sim P(x)}[\log \frac{1}{P(x)}]\) is always non-negative due to Gibbs' inequality. This difference is exactly the KL divergence introduced below.

Information Theory - Conditional Entropy

  • Suppose you have knowledge of the rv \(Y\) and would like to know the randomness of the rv \(X\). This can be modeled as the conditional entropy \(H(X|Y)\).

  • The rationale goes as follows:

    • Given a value \(y\), \(H(X | Y = y) = \sum_x P(X = x | Y = y) \log \frac{1}{P(X = x | Y = y)}\)

    • \(H(X|Y) = \sum_y P(y) H(X | Y = y)\), the expected value over \(y\).
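A small sketch of this two-step computation on a toy joint pmf (the numbers are made up for illustration; NumPy assumed):

```python
import numpy as np

def H(p):
    """Entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy joint pmf P(X = x, Y = y): rows index x, columns index y.
P_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
P_y = P_xy.sum(axis=0)                      # marginal P(Y = y)

H_X_given_Y = 0.0
for j, p_y in enumerate(P_y):
    P_x_given_y = P_xy[:, j] / p_y          # P(X = x | Y = y)
    H_X_given_Y += p_y * H(P_x_given_y)     # weight H(X | Y = y) by P(Y = y)

print(H_X_given_Y)          # H(X | Y) ≈ 0.715 bits
print(H(P_xy.sum(axis=1)))  # H(X) ≈ 0.971 bits, so H(X | Y) <= H(X)
```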

Information Theory - Conditional Entropy

  • \(H(X | Y)\) is smaller than or equal to \(H(X)\): conditioning on \(Y\) can only reduce the expected uncertainty about \(X\).

  • For example, \(X\) indicates whether the teacher is inside R442, and \(Y\) indicates whether you observed the teacher walk into R442.

  • \(H(X) \gt 0\) because you cannot be sure of his or her presence. \(H(X|Y)\) is strictly less than \(H(X)\) because you know the outcome of \(X\) if \(Y = \text{true}\).

  • \(H(X) = H(X|Y)\) if \(X\) and \(Y\) are independent.

Information Theory - Mutual Information

  • The dependence of rvs \(X, Y\) is modeled as mutual information \(I(X; Y)\).

  • In fact,

\[\begin{aligned} I(X; Y) &= H(X) - H(X | Y) \\ &= H(Y) - H(Y | X) \\ &= H(X) + H(Y) - H(X, Y) \\ &= H(X, Y) - H(X|Y) - H(Y|X) \end{aligned}\]
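A numerical check of the first and third identities, reusing a toy joint pmf (illustrative numbers; NumPy assumed):

```python
import numpy as np

def H(p):
    """Entropy in bits; flattening lets this handle joint distributions too."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

P_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])                 # toy joint pmf P(X, Y)
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

H_X_given_Y = sum(p_y * H(P_xy[:, j] / p_y) for j, p_y in enumerate(P_y))

print(H(P_x) - H_X_given_Y)         # I(X; Y) = H(X) - H(X | Y) ≈ 0.256 bits
print(H(P_x) + H(P_y) - H(P_xy))    # I(X; Y) = H(X) + H(Y) - H(X, Y): same value
```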


Kullback–Leibler Divergence

  • Suppose two probability distributions \(P\) and \(Q\). The Kullback–Leibler divergence of \(P\) with respect to \(Q\) is formulated as:

\[D_{\text{KL}} (P \parallel Q) = E_{x \sim P}[\log \frac{1}{Q(x)}] - E_{x \sim P}[\log \frac{1}{P(x)}]\]
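A minimal sketch of this formula for discrete distributions (NumPy assumed; the two example distributions are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: expected extra code length when encoding P with a code for Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                     # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.4, 0.4, 0.2])   # "true" distribution P
q = np.array([1/3, 1/3, 1/3])   # model distribution Q

print(kl_divergence(p, q))      # >= 0 by Gibbs' inequality; 0 only when P == Q
print(kl_divergence(q, p))      # generally a different value: KL is not symmetric
```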

Kullback–Leibler Divergence - Interpretations

  • \(P\) is regarded as the true distribution, and \(Q\) is regarded as the theory (model).

  • In coding theory, it measures the expected number of extra bits needed to encode samples from \(P\) using a code optimized for \(Q\).

  • In Bayesian inference, \(Q\) is the prior and \(P\) is the posterior. In detail,

    • \(Q = p(\theta)\), the distribution over our model parameters \(\theta\).

    • \(P = p(\theta | x)\), the distribution given a new observation \(x\).

    • The divergence is the information gained by revising the belief from \(Q\) to \(P\).

Kullback–Leibler Divergence & Jensen–Shannon Divergence

  • KL-div is not symmetric! \(D_{\text{KL}}(P \| Q)\) may not equal \(D_{\text{KL}}(Q \| P)\).

  • Jensen–Shannon divergence fixes this by

\[D_{\text{JS}} (P \| Q) = \frac{1}{2} D_{\text{KL}} (P \| M) + \frac{1}{2} D_{\text{KL}} (Q \| M)\]

where \(M = \frac{1}{2}(P + Q)\).
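A short sketch that builds the JS divergence directly from this definition (NumPy assumed; the example distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.8])

print(kl(p, q), kl(q, p))   # the two KL values generally differ
print(js(p, q), js(q, p))   # the two JS values are always equal
```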

Kullback–Leibler Divergence & Jensen–Shannon Divergence


  • KL-divergence should never be misnamed the KL-metric or KL-distance: since it is not symmetric, it does not satisfy the mathematical definition of a metric.

  • \(D_{\text{KL}}(P \| Q)\) could be infinite if \(P(x) \gt 0\) and \(Q(x) = 0\) for some \(x\).

  • Moreover, neither KL-div nor JS-div respects the underlying metric on the support, which can lead to vanishing gradients when the two distributions barely overlap.

  • We will introduce the Wasserstein metric to resolve these issues.

Conclusion

  • Understand the basics of information theory and the intuition behind KL-divergence.

  • Know the Jensen–Shannon divergence.

Variational Autoencoder (VAE)

  • Vanilla autoencoder and noise autoencoder

  • Gaussian mixture model

  • Evidence lower bound (ELBO)

Vanilla Autoencoder
  • Intuition: Given data sample \(x\), encode it into latent space code \(m\). Then decode it into reconstruction \(x'\).

  • Trained by minimizing the difference between the input and the reconstruction, \(L(x, x')\), usually an L2 or cross-entropy loss.

  • We can then represent \(x\) by the lower-dimensional code \(m\).
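A minimal sketch of this encode/decode/reconstruct loop, assuming PyTorch; the dimensions, architecture, and dummy batch are illustrative rather than taken from the slides:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: data sample x -> lower-dimensional latent code m.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder: latent code m -> reconstruction x'.
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        m = self.encoder(x)
        return self.decoder(m)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)                   # dummy batch standing in for real data
x_hat = model(x)                          # reconstruction x'
loss = nn.functional.mse_loss(x_hat, x)   # L2 reconstruction loss L(x, x')
opt.zero_grad()
loss.backward()
opt.step()
```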


Noisy Autoencoder

  • The autoencoder may not be robust to slight changes in the input \(x\).

  • One solution is to add some noise \(z\) to the input \(x\).

[Figure: noisy autoencoder]

Deriving VAE

  • Can we slightly change the code \(m\) to \(m'\), and generate a new, reasonable sample by decoding \(m'\)?

  • In practice, this does not work as expected, because both the encoder and the decoder are non-linear; we cannot expect the latent space to have such nice structure.

  • Solution: add some noise \(z\) to the latent code \(m\).

Deriving VAE