A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - YouTube

Motivation. Suppose a weather station is sending you information about the current weather. There is a 50% chance of sun and a 50% chance of rain. Then the weather station only needs to send you one bit to convey the current weather: 1 if sunny, 0 if rainy.

def. Shannon Information.[^1] Given a random variable $X$, the information content of a particular realization $x$ of $X$ is:

$$I(x) = -\log_b p(x)$$

where $b$, the base, determines the units (bits when $b = 2$, nats when $b = e$).
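
Sketch. A minimal Python illustration of this definition (not from the video; the probabilities are illustrative):

```python
import math

def information_content(p: float, base: float = 2) -> float:
    """Shannon information I(x) = -log_b p(x) of an outcome with probability p."""
    return -math.log(p, base)

print(information_content(0.5))     # 1.0 bit   -- a 50%-likely outcome
print(information_content(0.125))   # ~3.0 bits -- rarer outcomes carry more information
```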

def. Shannon Entropy.[^2] Given a random variable $X$, the entropy of this random variable is:

$$H(X) = \mathbb{E}[I(X)] = -\sum_x p(x) \log_b p(x)$$
Intuition. Entropy is the average amount of information transmitted per message.

Example. In our weather station example, $X$ is the random variable with:

$$p(\text{sun}) = 0.5, \quad p(\text{rain}) = 0.5$$

When $x = \text{sun}$ is transmitted, $I(\text{sun}) = -\log_2 0.5 = 1$ bit of information is conveyed; same when $x = \text{rain}$ is transmitted. The entropy of $X$ is:

$$H(X) = 0.5 \cdot 1 + 0.5 \cdot 1 = 1 \text{ bit}$$

which matches our intuition that entropy is the average amount of information transmitted.
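
Sketch. A small Python check of this calculation (not from the video; the `entropy` helper and the second distribution are illustrative):

```python
import math

def entropy(probs, base: float = 2) -> float:
    """Shannon entropy H(X) = -sum_x p(x) log_b p(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit    -- the fair weather station
print(entropy([0.9, 0.1]))   # ~0.469 bits -- a more predictable station transmits less on average
```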

def. Cross Entropy. The cross-entropy of the distribution $q$ relative to the distribution $p$ is: ^bic1ka

$$H(p, q) = -\sum_x p(x) \log_b q(x)$$
Intuition. Cross-entropy is a measure of the "distance" between two distributions. Remark. Cross-entropy is often used as a loss function in deep learning. Its use there has little to do with information theory; it is simply a convenient way to measure the "distance" between two distributions (the predicted one and the true one).
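
Sketch. A small Python version of the definition; the one-hot true/predicted distributions below are a hypothetical classification example, not from the video:

```python
import math

def cross_entropy(p, q, base: float = 2) -> float:
    """Cross-entropy H(p, q) = -sum_x p(x) log_b q(x)."""
    return sum(-pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

true_dist      = [1.0, 0.0, 0.0]   # one-hot "true" label
predicted_dist = [0.7, 0.2, 0.1]   # model's predicted probabilities

print(cross_entropy(true_dist, predicted_dist))   # ~0.515 bits
print(cross_entropy(true_dist, true_dist))        # 0.0 -- a perfect prediction costs nothing
```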

def. Kullback-Leibler Divergence (KL-Divergence). Another measure of "distance" between two distributions:

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log_b \frac{p(x)}{q(x)}$$
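
Sketch. A direct Python translation of the formula; `p` and `q` below are illustrative, not from the source:

```python
import math

def kl_divergence(p, q, base: float = 2) -> float:
    """D_KL(p || q) = sum_x p(x) log_b (p(x) / q(x)), i.e. H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # the fair weather station
q = [0.9, 0.1]   # a skewed model of it

print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits -- KL divergence is not symmetric
```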

Example. The KL divergence between two multivariate Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$, both of dimension $k$, can be derived[^3] to be:

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2} \left[ \log \frac{|\Sigma_2|}{|\Sigma_1|} - k + \operatorname{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right]$$
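
Sketch. A NumPy implementation of this closed-form expression (it uses the natural log, so the result is in nats); the means and covariances below are made-up examples:

```python
import numpy as np

def kl_mvn(mu1, sigma1, mu2, sigma2) -> float:
    """Closed-form D_KL( N(mu1, sigma1) || N(mu2, sigma2) ) for k-dimensional Gaussians, in nats."""
    k = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    return 0.5 * (
        np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1))
        - k
        + np.trace(sigma2_inv @ sigma1)
        + diff @ sigma2_inv @ diff
    )

mu1, sigma1 = np.zeros(2), np.eye(2)          # standard 2-D Gaussian
mu2, sigma2 = np.ones(2), 2.0 * np.eye(2)     # shifted, wider Gaussian

print(kl_mvn(mu1, sigma1, mu2, sigma2))   # ~0.693 nats
print(kl_mvn(mu1, sigma1, mu1, sigma1))   # 0.0 -- identical distributions
```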

Footnotes

[^1]: Information content - Wikipedia

[^2]: Entropy (information theory) - Wikipedia

[^3]: KL Divergence between 2 Gaussian Distributions