Understanding Deep Generative Models from the Ground Up

林祥瑞

Preface

The lecture has a 40-minute time limit and we have lengthy slides to cover. We may skip some pages if we have to :3.

Mathematical Foundations

In this section, we give a brief introduction to information theory. Then we dive into the following concepts, which are fundamental to deep generative models.

  • KL-divergence

  • Jensen–Shannon divergence

Information Theory - Recap

  • The amount of information can be understood as the surprise (surprisal) of an event.

  • That is, rare events give us more information: we need more bits to encode such an event in a binary encoding.

  • Suppose we have a fair die, represented by a rv \(X\) with outcomes \(\{1, 2, 3, 4, 5, 6\}\). The amount of information of the event \(X = 3\) is measured by \(I(X = 3) = \log_2 \frac{1}{P(X = 3)} = \log_2 6 \approx 2.58\).

Information Theory - Information Content

  • The amount of information can be intuitively understood as the number of perfect yes-no queries needed to locate that event.

  • Take our \(X = 3\) example. We can find it out with three yes-no queries.

    • Is it greater than or equal to 4? No

    • Is it greater than or equal to 2? Yes

    • Is it 3? Yes

  • With perfect yes-no queries, the expected number of questions is \(\log_2 \frac{1}{P(X = 3)}\).

Information Theory - Information Content

  • In fact, making this idea rigorous requires the law of large numbers and more theory, which is out of scope for this lecture.

  • The information content is additive over independent events (a numerical check follows the equation below). Suppose we roll the die twice. The information content of the joint event \(X_1 = 3, X_2 = 2\) is

\[\begin{aligned} \log_2 \frac{1}{P(X_1 = 3, X_2 = 2)} &= \log_2 \frac{1}{P(X_1 = 3)} + \log_2 \frac{1}{P(X_2 = 2)} \\ &= I(X_1 = 3) + I(X_2 = 2) \end{aligned}\]
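A minimal NumPy sketch (my own illustration, not part of the original slides) of these two points: it computes \(I(X = 3)\) for a fair die and checks that information content adds up over independent rolls. The function name `information_content` is just an illustrative choice.

```python
import numpy as np

def information_content(p, base=2):
    """I(event) = log_b(1 / p) for an event that occurs with probability p."""
    return np.log(1.0 / p) / np.log(base)

p_single = 1 / 6                      # P(X = 3) for a fair die
print(information_content(p_single))  # ~2.585 bits

# Additivity over independent events: I(X1 = 3, X2 = 2) = I(X1 = 3) + I(X2 = 2)
p_joint = p_single * p_single
assert np.isclose(information_content(p_joint),
                  2 * information_content(p_single))
```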

Information Theory - Entropy

  • The (Shannon) entropy of a rv \(X\), denoted \(H(X)\), can be understood as the expected amount of information.

  • Suppose a rv \(X\) with pmf (discrete) or pdf (continuous) \(P(x)\).

\[\begin{aligned} H(X) &= E_{X \sim P}[I(X)] \\ &= \sum_x P(x) \log_b \frac{1}{P(x)} && \text{(discrete)} \\ &= \int_x P(x) \log_b \frac{1}{P(x)} \, dx && \text{(continuous)} \end{aligned}\]
  • The entropy is usually interpreted as the randomness of a random variable.
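The definition translates directly into code. Below is a small NumPy sketch (my own, not from the slides) comparing a fair die with a hypothetical loaded die whose probabilities are made up for illustration.

```python
import numpy as np

def entropy(pmf, base=2):
    """H(X) = sum_x P(x) * log_b(1 / P(x)); terms with P(x) = 0 contribute 0."""
    pmf = np.asarray(pmf, dtype=float)
    nz = pmf > 0
    return np.sum(pmf[nz] * np.log(1.0 / pmf[nz])) / np.log(base)

fair_die   = np.full(6, 1 / 6)
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # hypothetical bias

print(entropy(fair_die))    # log2(6) ~ 2.585 bits: maximal randomness
print(entropy(loaded_die))  # ~2.16 bits: less random, lower entropy
```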

Information Theory - Information Gain

  • What if we encode a true distribution \(P\) with an artificial (model) distribution \(Q\)?

  • Encoding based on \(Q\) can be thought of as an imperfect query for \(P\).

  • The difference \(E_{x \sim P}[\log \frac{1}{Q(x)}] - E_{x \sim P}[\log \frac{1}{P(x)}]\) is always non-negative due to Gibbs' inequality.
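As a quick numeric check (my own sketch, not from the slides), the snippet below draws random distributions \(Q\) and verifies that the gap between the cross-entropy \(E_{x \sim P}[\log \frac{1}{Q(x)}]\) and the entropy \(E_{x \sim P}[\log \frac{1}{P(x)}]\) never goes negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p, base=2):
    return np.sum(p * np.log(1.0 / p)) / np.log(base)

def cross_entropy(p, q, base=2):
    """Expected code length when samples from P are encoded with a code optimized for Q."""
    return np.sum(p * np.log(1.0 / q)) / np.log(base)

p = np.full(6, 1 / 6)  # true distribution: fair die
for _ in range(5):
    q = rng.dirichlet(np.ones(6))           # random "artificial" encoding distribution
    gap = cross_entropy(p, q) - entropy(p)  # always >= 0 by Gibbs' inequality
    print(round(gap, 4))
```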

Information Theory - Conditional Entropy

  • Suppose you have knowledge of rv \(Y\) and would like to know the remaining randomness of rv \(X\). This is modeled by the conditional entropy \(H(X|Y)\).

  • The rationale goes as follows:

    • Given value \(y\), \(H(X | Y = y) = \sum_x P(X = x | Y = y) \log \frac{1}{P(X = x | Y = y)} \)

    • \(H(X|Y) = \sum_y P(y) H(X | Y = y)\) is the expected value over \(y\).
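The two-step rationale maps directly onto code. The sketch below is my own illustration with a hypothetical joint pmf; it computes \(H(X|Y)\) from a small joint table.

```python
import numpy as np

def entropy(p, base=2):
    p = p[p > 0]
    return np.sum(p * np.log(1.0 / p)) / np.log(base)

# Hypothetical joint pmf P(X, Y): rows index x, columns index y.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.25],
                  [0.15, 0.15]])

p_y = joint.sum(axis=0)  # marginal P(Y)

# H(X | Y) = sum_y P(y) * H(X | Y = y), where H(X | Y = y) uses P(x | y) = P(x, y) / P(y)
h_x_given_y = sum(p_y[j] * entropy(joint[:, j] / p_y[j]) for j in range(joint.shape[1]))
print(h_x_given_y)
```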

Information Theory - Conditional Entropy

  • \(H(X | Y)\) is smaller than or equal to \(H(X)\): on average, knowing \(Y\) cannot increase the uncertainty about \(X\).

  • For example, \(X\) indicates whether the teacher is inside R442, and \(Y\) indicates whether you observe the teacher walking into R442.

  • \(H(X) \gt 0\) because you cannot be sure whether the teacher is inside. \(H(X|Y)\) is strictly less than \(H(X)\) because you know the outcome of \(X\) whenever \(Y = \text{true}\).

  • \(H(X) = H(X|Y)\) if \(X\) and \(Y\) are independent.

Information Theory - Mutual Information

  • The dependence between rvs \(X\) and \(Y\) is modeled by the mutual information \(I(X; Y)\).

  • In fact,

\[\begin{aligned} I(X; Y) &= H(X) - H(X | Y) \\ &= H(Y) - H(Y | X) \\ &= H(X) + H(Y) - H(X, Y) \\ &= H(X, Y) - H(X|Y) - H(Y|X) \end{aligned}\]
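These identities are easy to verify numerically. The sketch below (my own illustration with a hypothetical joint pmf) checks all four expressions against each other.

```python
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return np.sum(p * np.log(1.0 / p)) / np.log(base)

# Hypothetical joint pmf P(X, Y); rows are x, columns are y.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.25],
                  [0.15, 0.15]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
h_x_given_y = h_xy - h_y   # chain rule: H(X, Y) = H(Y) + H(X | Y)
h_y_given_x = h_xy - h_x

mi = h_x + h_y - h_xy
assert np.isclose(mi, h_x - h_x_given_y)
assert np.isclose(mi, h_y - h_y_given_x)
assert np.isclose(mi, h_xy - h_x_given_y - h_y_given_x)
print(mi)
```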


Kullback–Leibler Divergence

  • Suppose two probability distributions \(P\) and \(Q\). The Kullback–Leibler divergence of \(P\) with respect to \(Q\) is formulated as:

\[D_{\text{KL}} (P \parallel Q) = E_{x \sim P}[\log \frac{1}{Q(x)}] - E_{x \sim P}[\log \frac{1}{P(x)}]\]
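A direct NumPy implementation of this formula (my own sketch; the example distributions are made up). Note that the difference of the two expectations simplifies to \(E_{x \sim P}[\log \frac{P(x)}{Q(x)}]\).

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = E_{x~P}[log 1/Q(x)] - E_{x~P}[log 1/P(x)] = E_{x~P}[log P(x)/Q(x)]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz])) / np.log(base)

p = np.array([0.5, 0.3, 0.2])   # hypothetical "true" distribution
q = np.array([0.4, 0.4, 0.2])   # hypothetical model distribution

print(kl_divergence(p, q))  # >= 0, and zero iff P == Q
print(kl_divergence(q, p))  # generally a different value: KL is not symmetric
```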

Kullback–Leibler Divergence - Interpretations

  • \(P\) is regarded as the true distribution, and \(Q\) as the theory (model).

  • In coding theory, it measures the expected number of extra bits needed to encode samples from \(P\) using a code optimized for \(Q\).

  • In Bayesian inference, \(Q\) is the prior and \(P\) is the posterior. In detail,

    • \(Q = p(\theta)\) is the prior distribution over our model parameters \(\theta\).

    • \(P = p(\theta | x)\) is the posterior given a new observation \(x\).

    • The divergence is the information gained by revising the belief from \(Q\) to \(P\).
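A toy example of this Bayesian reading (my own sketch; the three-point grid over \(\theta\) and the coin-flip setup are hypothetical): compute the posterior after one observation and measure the information gained as \(D_{\text{KL}}(P \| Q)\).

```python
import numpy as np

def kl_divergence(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# Hypothetical setup: theta is a coin's heads-probability, restricted to a small grid.
thetas = np.array([0.2, 0.5, 0.8])
prior  = np.array([1 / 3, 1 / 3, 1 / 3])      # Q = p(theta)

# Observe x = heads; posterior P = p(theta | x) by Bayes' rule.
likelihood = thetas                           # p(heads | theta) = theta
posterior  = prior * likelihood
posterior /= posterior.sum()                  # ~[0.133, 0.333, 0.533]

# Information gained by revising the belief from the prior to the posterior.
print(kl_divergence(posterior, prior))
```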

Kullback–Leibler Divergence & Jensen–Shannon Divergence

  • KL-div is not symmetric! \(D_{\text{KL}}(P \| Q)\) is generally not equal to \(D_{\text{KL}}(Q \| P)\).

  • Jensen–Shannon divergence fixes this by

\[D_{\text{JS}} (P \| Q) = \frac{1}{2} D_{\text{KL}} (P \| M) + \frac{1}{2} D_{\text{KL}} (Q \| M)\]

where \(M = \frac{1}{2}(P + Q)\)
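A small sketch (mine, not from the slides) that builds \(D_{\text{JS}}\) on top of \(D_{\text{KL}}\) and checks the symmetry; the example distributions are arbitrary.

```python
import numpy as np

def kl_divergence(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def js_divergence(p, q):
    """D_JS(P || Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M), with M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.5])

assert np.isclose(js_divergence(p, q), js_divergence(q, p))  # symmetric by construction
print(js_divergence(p, q))                                   # bounded above by log2(2) = 1 bit
```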


Remarks

  • KL-divergence should never be misnamed as KL-metric or KL-distance. Since it is not symmetric, it does not fit the mathematical definition of a metric.

  • \(D_{\text{KL}}(P \| Q)\) is infinite if \(P(x) \gt 0\) and \(Q(x) = 0\) for some \(x\).

  • Neither KL-div nor JS-div respects the metric on the support, which leads to vanishing gradients when the two distributions have (nearly) disjoint supports (demonstrated in the sketch after this list).

  • We will introduce the Wasserstein metric to resolve these issues.
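To make the last two remarks concrete, here is a small sketch (my own illustration) with two distributions on disjoint supports: the KL-divergence blows up to infinity, while the JS-divergence sits at its maximum of 1 bit no matter how the supports are arranged, which is the vanishing-gradient problem in a nutshell.

```python
import numpy as np

def kl_divergence(p, q):
    # Convention: a term contributes +inf whenever P(x) > 0 and Q(x) = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / q), 0.0)
    return terms.sum()

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two hypothetical distributions with disjoint supports.
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])

print(kl_divergence(p, q))  # inf: P(x) > 0 where Q(x) = 0
print(js_divergence(p, q))  # exactly 1 bit, regardless of how "far apart" the supports are
```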

Conclusion

  • Understand the basics of information theory and the intuition behind KL-divergence.

  • Know the Jensen–Shannon divergence.

Variational Autoencoder (VAE)

  • Vanilla autoencoder and noisy autoencoder

  • Gaussian mixture model

  • Evidence lower bound (ELBO)

Autoencoder

  • Intuition: given a data sample \(x\), encode it into a latent-space code \(m\), then decode it into a reconstruction \(x'\).

  • Trained by minimizing the difference between the input and the reconstruction, \(L(x, x')\), usually an L2 or cross-entropy loss.

  • We can represent \(x\) by the lower-dimensional code \(m\).

[Figure: autoencoder.png]
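A minimal PyTorch sketch of the idea (my own illustration, not the slides' reference implementation; the 784/32 dimensions, layer sizes, and single training step are hypothetical choices):

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        m = self.encoder(x)      # latent code m
        return self.decoder(m)   # reconstruction x'

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()           # L2 reconstruction loss; cross-entropy is another common choice

x = torch.rand(64, 784)          # stand-in batch of data
optimizer.zero_grad()
x_recon = model(x)
loss = loss_fn(x_recon, x)       # L(x, x')
loss.backward()
optimizer.step()
```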

Noisy Autoencoder

  • The autoencoder may not be robust to slight changes in the input \(x\).

  • One solution is to add some noise \(z\) to the input \(x\).

[Figure: noisy autoencoder.png]
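The noisy (denoising) variant changes only the input to the encoder: corrupt \(x\) with noise \(z\) but still reconstruct the clean \(x\). A short sketch, reusing the hypothetical `model` and `loss_fn` from the previous snippet; the noise scale 0.1 is an arbitrary choice.

```python
import torch

x = torch.rand(64, 784)              # stand-in batch of data
z = 0.1 * torch.randn_like(x)        # Gaussian noise z
x_noisy = (x + z).clamp(0.0, 1.0)    # corrupted input

x_recon = model(x_noisy)             # encode/decode the corrupted input
loss = loss_fn(x_recon, x)           # compare against the *clean* target x
```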

Deriving VAE

  • Can we slightly change the code \(m\) to \(m'\) and generate a new, reasonable sample by decoding \(m'\)?

  • In fact, this does not work as expected, because both the encoder and the decoder are non-linear; we cannot expect the latent space to have such a nice property.

  • Solution: add some noise \(z\) to the latent code \(m\) (see the sketch below).
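The idea can be sketched as follows, again reusing the hypothetical autoencoder from before (the noise scale is arbitrary). A plain autoencoder gives no guarantee that \(m'\) decodes to a plausible sample; the VAE, derived next, makes this perturbation principled.

```python
import torch

with torch.no_grad():
    m = model.encoder(x)                     # latent code m of a real sample x
    m_prime = m + 0.1 * torch.randn_like(m)  # m' = m + noise z
    x_new = model.decoder(m_prime)           # decode m' and hope for a reasonable sample
```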

Deriving VAE