A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - YouTube

Motivation. Suppose a weather station is sending you information about the current weather. There is a 50% chance of sun and a 50% chance of rain. Then the weather station only needs to send you one bit to convey the current weather: 1 if sunny, 0 if rainy.

def. Shannon Information.[^1] Given a random variable $X$, the information content of a particular realization $x$ of $X$ is:

$$I(x) = -\log_b p(x)$$

where $b$, the base, determines the units (bits when $b = 2$, nats when $b = e$).
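
Sketch. A minimal Python illustration of this definition (not from the video; the probabilities are illustrative):

```python
import math

def information_content(p: float, base: float = 2) -> float:
    """Shannon information I(x) = -log_b p(x) of an outcome with probability p."""
    return -math.log(p, base)

print(information_content(0.5))     # 1.0 bit   -- a 50%-likely outcome
print(information_content(0.125))   # ~3.0 bits -- rarer outcomes carry more information
```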

def. Shannon Entropy.[^2] Given a random variable $X$, the entropy of this random variable is:

$$H(X) = \mathbb{E}[I(X)] = -\sum_x p(x) \log_b p(x)$$
Intuition. Entropy is the average amount of information transmitted per message.

Example. In our weather station example, $X$ is the random variable with:

$$p(\text{sun}) = 0.5, \quad p(\text{rain}) = 0.5$$

When $x = \text{sun}$ is transmitted, $I(\text{sun}) = -\log_2 0.5 = 1$ bit of information is conveyed; same when $x = \text{rain}$ is transmitted. The entropy of $X$ is:

$$H(X) = 0.5 \cdot 1 + 0.5 \cdot 1 = 1 \text{ bit}$$

which matches our intuition that entropy is the average amount of information transmitted.
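
Sketch. A small Python check of this calculation (not from the video; the `entropy` helper and the second distribution are illustrative):

```python
import math

def entropy(probs, base: float = 2) -> float:
    """Shannon entropy H(X) = -sum_x p(x) log_b p(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit    -- the fair weather station
print(entropy([0.9, 0.1]))   # ~0.469 bits -- a more predictable station transmits less on average
```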

def. Cross Entropy. The cross-entropy of the distribution $q$ relative to the distribution $p$ is: ^bic1ka

$$H(p, q) = -\sum_x p(x) \log_b q(x)$$
Intuition. Cross-entropy is a measure of the "distance" between two distributions. Remark. Cross-entropy is often used as a loss function in deep learning. Its use there has little to do with information theory; it is simply a convenient way to measure the "distance" between two distributions (the predicted one and the true one).
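
Sketch. A small Python version of the definition; the one-hot true/predicted distributions below are a hypothetical classification example, not from the video:

```python
import math

def cross_entropy(p, q, base: float = 2) -> float:
    """Cross-entropy H(p, q) = -sum_x p(x) log_b q(x)."""
    return sum(-pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

true_dist      = [1.0, 0.0, 0.0]   # one-hot "true" label
predicted_dist = [0.7, 0.2, 0.1]   # model's predicted probabilities

print(cross_entropy(true_dist, predicted_dist))   # ~0.515 bits
print(cross_entropy(true_dist, true_dist))        # 0.0 -- a perfect prediction costs nothing
```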

def. Kullback-Leibler Divergence (KL-Divergence). Another measure of "distance" between two distributions:

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log_b \frac{p(x)}{q(x)}$$
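
Sketch. A direct Python translation of the formula; `p` and `q` below are illustrative, not from the source:

```python
import math

def kl_divergence(p, q, base: float = 2) -> float:
    """D_KL(p || q) = sum_x p(x) log_b (p(x) / q(x)), i.e. H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # the fair weather station
q = [0.9, 0.1]   # a skewed model of it

print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits -- KL divergence is not symmetric
```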

Example. The KL divergence between two multivariate Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$, both of dimension $k$, can be derived[^3] to be:

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2} \left[ \log \frac{|\Sigma_2|}{|\Sigma_1|} - k + \operatorname{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right]$$
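
Sketch. A NumPy implementation of this closed-form expression (it uses the natural log, so the result is in nats); the means and covariances below are made-up examples:

```python
import numpy as np

def kl_mvn(mu1, sigma1, mu2, sigma2) -> float:
    """Closed-form D_KL( N(mu1, sigma1) || N(mu2, sigma2) ) for k-dimensional Gaussians, in nats."""
    k = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    return 0.5 * (
        np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1))
        - k
        + np.trace(sigma2_inv @ sigma1)
        + diff @ sigma2_inv @ diff
    )

mu1, sigma1 = np.zeros(2), np.eye(2)          # standard 2-D Gaussian
mu2, sigma2 = np.ones(2), 2.0 * np.eye(2)     # shifted, wider Gaussian

print(kl_mvn(mu1, sigma1, mu2, sigma2))   # ~0.693 nats
print(kl_mvn(mu1, sigma1, mu1, sigma1))   # 0.0 -- identical distributions
```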

Footnotes

[^1]: Information content - Wikipedia

[^2]: Entropy (information theory) - Wikipedia

[^3]: KL Divergence between 2 Gaussian Distributions