林祥瑞

The lecture has a 40-minute time limit and we have lengthy slides to cover. We may skip some pages if we have to :3.

In this section, we give a brief introduction to information theory. Then we dive into these concepts, which are fundamental to deep generative models:

KL-divergence

Jensen–Shannon divergence

The amount of information can be understood as the *surprisal* of an event. That is, rare events give us more information: we need more bits to encode such an event in a binary encoding.

Suppose we have a fair die, represented by a rv \(X\) with events \(\{1, 2, 3, 4, 5, 6\}\). The amount of information of the event \(X = 3\) is measured by \(I(X = 3) = \log_2 \frac{1}{P(X = 3)} \approx 2.58\).
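As a quick sanity check, this surprisal can be computed in a few lines of Python (a minimal illustration; the function name `surprisal` is my own, not from the lecture):

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Information content of an event with probability p (bits when base = 2)."""
    return math.log(1.0 / p, base)

# A fair die: each face has probability 1/6.
print(surprisal(1 / 6))  # ~2.585 bits, matching I(X = 3) above
```

Note that surprisal is additive: `surprisal(1/36)` for two independent rolls equals `2 * surprisal(1/6)`.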

The amount of information can be intuitively understood as the number of perfect queries needed to locate that event.

Take our \(X = 3\) example. We ask three yes-no queries to find it out:

Is it greater than or equal to 4? No

Is it greater than or equal to 2? Yes

Is it 3? Yes

With perfect yes-no queries, we expect to ask \(\log_2 \frac{1}{P(X = 3)}\) questions on average.

Making this idea precise requires the law of large numbers and more theory, which is out of scope for this lecture.

The information content is additive over independent events. Suppose we roll the die twice. The information content of the joint event \(X_1 = 3, X_2 = 2\) is

\[\begin{aligned}
\log_2 \frac{1}{P(X_1 = 3, X_2 = 2)}
&= \log_2 \frac{1}{P(X_1 = 3)} + \log_2 \frac{1}{P(X_2 = 2)} \\
&= I(X_1 = 3) + I(X_2 = 2)
\end{aligned}\]

The (Shannon) entropy of a rv \(X\), denoted \(H(X)\), can be understood as the expected amount of information.

Suppose a rv \(X\) and its pdf/pmf \(P(x)\).

\[\begin{aligned}
H(X) &= E_{X \sim P}[I(X)] \\
&= \sum_x P(x) \log_b \frac{1}{P(x)} && \text{(discrete)} \\
&= \int_x P(x) \log_b \frac{1}{P(x)} \, dx && \text{(continuous)}
\end{aligned}\]

The entropy is usually interpreted as the randomness of a random variable.
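The discrete formula above can be sketched directly in Python (a toy illustration, not part of the lecture; the pmfs are made-up examples):

```python
import math

def entropy(pmf, base: float = 2.0) -> float:
    """Shannon entropy H(X) = sum_x P(x) * log_b(1 / P(x)) of a discrete pmf."""
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

fair_die = [1 / 6] * 6
biased_coin = [0.9, 0.1]

print(entropy(fair_die))     # log2(6) ~ 2.585: uniform is maximally random
print(entropy(biased_coin))  # ~0.469: less random than a fair coin (1 bit)
```

The comparison matches the "randomness" interpretation: the more predictable the outcome, the lower the entropy.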

What if we encode a true distribution \(P\) with an artificial distribution \(Q\)?

Encoding based on \(Q\) can be thought of as an imperfect query for \(P\).

The difference \(E_{x \sim P(x)}[\log \frac{1}{Q(x)}] - E_{x \sim P(x)}[\log \frac{1}{P(x)}]\) is always non-negative due to Gibbs' inequality.

Suppose you have knowledge of a rv \(Y\) and would like to know the remaining randomness of a rv \(X\). This is modeled by the conditional entropy \(H(X|Y)\).

The rationale goes as follows:

Given value \(y\), \(H(X | Y = y) = \sum_x P(X = x | Y = y) \log \frac{1}{P(X = x | Y = y)} \)

\(H(X|Y) = \sum_y P(y) H (X | Y = y)\) is the expected value over \(y\).

\(H(X | Y)\) is less than or equal to \(H(X)\) because you have the prior knowledge of \(Y\).

For example, let \(X\) indicate whether the teacher is inside R442, and \(Y\) whether you observe the teacher walk into R442.

\(H(X) \gt 0\) because you cannot be sure of the teacher's presence. \(H(X|Y)\) is strictly less than \(H(X)\) because you know the outcome of \(X\) if \(Y = \text{true}\).

\(H(X) = H(X|Y)\) if \(X\) and \(Y\) are independent.
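The two steps above can be sketched numerically. The snippet below assumes a hypothetical 2-by-2 joint pmf `joint[y][x]` (one row per value of \(Y\)); the helper names are my own:

```python
import math

def entropy(pmf, base=2.0):
    """Shannon entropy of a discrete pmf."""
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

def conditional_entropy(joint, base=2.0):
    """H(X|Y) = sum_y P(y) * H(X | Y = y), for a joint pmf joint[y][x]."""
    h = 0.0
    for row in joint:  # one row per value of Y
        p_y = sum(row)
        if p_y > 0:
            h += p_y * entropy([p / p_y for p in row], base)
    return h

# A dependent pair: observing Y usually reveals X.
joint = [[0.4, 0.1],   # Y = 0
         [0.1, 0.4]]   # Y = 1
p_x = [0.5, 0.5]       # marginal of X

print(entropy(p_x))                # H(X) = 1.0 bit
print(conditional_entropy(joint))  # H(X|Y) ~ 0.722 < H(X)
```

With an independent joint pmf such as `[[0.25, 0.25], [0.25, 0.25]]`, the two values coincide, matching the statement above.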

The dependence of rvs \(X, Y\) is modeled as mutual information \(I(X; Y)\).

In fact,

\[\begin{aligned}
I(X; Y) &= H(X) - H(X | Y) \\
&= H(Y) - H(Y | X) \\
&= H(X) + H(Y) - H(X, Y) \\
&= H(X, Y) - H(X|Y) - H(Y|X)
\end{aligned}\]

Reference: Wikipedia: Conditional Entropy
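These four identities can be checked numerically. The sketch below uses an assumed toy joint pmf and derives the conditional entropies via the chain rule \(H(X, Y) = H(Y) + H(X|Y)\):

```python
import math

def entropy(pmf, base=2.0):
    """Shannon entropy of a discrete pmf."""
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

# Joint pmf joint[y][x] of two dependent binary rvs (a made-up example).
joint = [[0.4, 0.1],
         [0.1, 0.4]]

flat = [p for row in joint for p in row]        # joint as one pmf -> H(X, Y)
p_y = [sum(row) for row in joint]               # marginal of Y
p_x = [sum(col) for col in zip(*joint)]         # marginal of X

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(flat)
h_x_given_y = h_xy - h_y   # chain rule: H(X, Y) = H(Y) + H(X|Y)
h_y_given_x = h_xy - h_x

# All four formulas give the same mutual information.
print(h_x - h_x_given_y)
print(h_y - h_y_given_x)
print(h_x + h_y - h_xy)
print(h_xy - h_x_given_y - h_y_given_x)
```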

Suppose two probability distributions \(P\) and \(Q\). The Kullback–Leibler divergence of \(P\) with respect to \(Q\) is formulated as:

\[D_{\text{KL}} (P \parallel Q) = E_{x \sim P}[\log \frac{1}{Q(x)}] - E_{x \sim P}[\log \frac{1}{P(x)}]\]

The divergence is non-negative due to Gibbs' inequality.

\(P\) is regarded as the *true* probability, and \(Q\) is regarded as the *theory*. In coding theory, the divergence measures the expected number of extra bits needed to encode samples from \(P\) using a code optimized for \(Q\).

In Bayesian inference, \(Q\) is the prior and \(P\) is the posterior. In detail,

\(Q = p(\theta)\), the probability over our model parameters \(\theta\).

\(P = p(\theta | x)\), given a new observation \(x\).

The divergence is the information gain from revising the belief \(Q\) to \(P\).

KL-div is not symmetric! \(D_{\text{KL}}(P \| Q)\) may not be the same as \(D_{\text{KL}}(Q \| P)\).

Jensen–Shannon divergence fixes this by

\[D_{\text{JS}} (P \| Q) = \frac{1}{2} D_{\text{KL}} (P \| M) + \frac{1}{2} D_{\text{KL}} (Q \| M)\]

where \(M = \frac{1}{2}(P + Q)\).
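A minimal Python sketch of both divergences for discrete pmfs (the function names are my own, and the example pmfs are made up):

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for discrete pmfs; +inf when P(x) > 0 but Q(x) = 0."""
    if any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, base=2.0):
    """D_JS(P || Q): average KL of P and Q against the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # != kl_divergence(q, p): KL is asymmetric
print(kl_divergence(q, p))
print(js_divergence(p, q))  # == js_divergence(q, p): JS is symmetric
```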

Reference: GAN — Wasserstein GAN & WGAN-GP

KL-*divergence* should never be misnamed as KL-*metric* or KL-*distance*. Since it is not symmetric, it does not satisfy the mathematical definition of a metric. Moreover, \(D_{\text{KL}}(P \| Q)\) can be infinite if \(P(x) \gt 0\) and \(Q(x) = 0\) for some \(x\).

Neither KL-div nor JS-div respects the metric on the support, which leads to vanishing gradients.

We will introduce the Wasserstein metric to resolve these issues.

=== Conclusion

Understand the basics of information theory and the intuition behind KL-divergence.

Know the Jensen–Shannon divergence.

Vanilla autoencoder and denoising autoencoder

Gaussian mixture model

Evidence lower bound (ELBO)

Intuition: given a data sample \(x\), encode it into a latent space code \(m\). Then decode it into a reconstruction \(x'\).

Trained by minimizing the difference between the input and the reconstruction, \(L(x, x')\), usually an L2 or cross-entropy loss.

We can then represent \(x\) by the lower-dimensional code \(m\).
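The encode-reconstruct loop can be sketched with a tiny *linear* autoencoder in numpy, trained by gradient descent on the L2 loss. This is a toy illustration under assumed shapes and data (real autoencoders use non-linear deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^4 that lie near a 2-D subspace, plus slight noise.
basis = rng.normal(size=(2, 4))
x = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 4))

# Linear autoencoder: encoder R^4 -> R^2 produces the code m, decoder maps back.
w_enc = 0.1 * rng.normal(size=(4, 2))
w_dec = 0.1 * rng.normal(size=(2, 4))
lr, n = 0.02, len(x)

for _ in range(3000):
    m = x @ w_enc              # latent codes
    x_rec = m @ w_dec          # reconstructions x'
    err = x_rec - x            # drives the L2 loss L(x, x') = ||x' - x||^2
    grad_dec = m.T @ err / n
    grad_enc = x.T @ (err @ w_dec.T) / n
    w_dec -= lr * grad_dec
    w_enc -= lr * grad_enc

loss = float(np.mean((x @ w_enc @ w_dec - x) ** 2))
print(loss)  # should end up far below np.mean(x**2)
```

Because the data truly lives near a 2-D subspace, the 2-D code \(m\) suffices to reconstruct it; a linear autoencoder like this recovers the same subspace as PCA.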

The autoencoder may not be robust to slight changes in the input \(x\).

One solution is to add some noise \(z\) to the input \(x\).

Can we slightly change the code \(m\) to \(m'\), and generate a new, reasonable sample by decoding \(m'\)?

In fact, it does not work as expected, because both the encoder and the decoder are non-linear. We cannot expect the latent space to have such a nice property.

Solution: add some noise \(z\) to the latent code \(m\).