Actually this post is mainly to test the MathJax installation I put into WordPress via this plugin. But information theory is great, why not?

The probability of a symbol is \(p\).

It takes \(\log \frac{1}{p} = -\log p\) bits to encode one symbol — a quantity sometimes called its “surprisal”. Surprisal is 0 for a 100% probable symbol, and grows toward \(\infty\) as the probability approaches 0. This works because the coding scheme encodes common symbols as very short strings and rarer symbols as longer ones (e.g. Huffman or arithmetic coding). Logarithms here are base 2, so information is measured in bits.\(^*\)
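A quick sketch of this in Python (the function name `surprisal` is just my label, not standard library code):

```python
import math

def surprisal(p):
    """Bits needed to encode one symbol of probability p: log2(1/p)."""
    return math.log2(1 / p)

surprisal(1.0)   # 0 bits: a certain symbol carries no information
surprisal(0.5)   # 1 bit
surprisal(0.25)  # 2 bits: rarer symbols get longer codewords
```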

If you have a stream of such symbols and a probability distribution \(\vec{p}\) for them, where a symbol \(i\) comes at probability \(p_i\), then the average message size is the expected surprisal:

\[ H(\vec{p}) = \sum_i p_i \log \frac{1}{p_i} \]

this is the Shannon **entropy** of the probability distribution \( \vec{p} \), which is a measure of its uncertainty. In fact, if you start with a few pretty reasonable axioms for how to design a measurement of uncertainty of a discrete probability distribution, you end up with the above equation as the only possible measure. (I think. This is all in Shannon’s original paper.)
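To make the averaging concrete, here is a minimal entropy function in Python (taking a distribution as a list of probabilities; the `pi > 0` guard just skips zero-probability symbols, which contribute nothing):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = sum_i p_i * log2(1/p_i), in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

entropy([0.5, 0.5])   # 1.0: a fair coin is maximally uncertain
entropy([0.9, 0.1])   # ~0.469: a biased coin is more predictable
entropy([0.25] * 4)   # 2.0: four equally likely symbols
```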

Now, what if you have symbols at a distribution \( \vec{p} \) but you encode them with the wrong distribution \( \vec{q} \)? You pay \(\log\frac{1}{q_i}\) bits per symbol, but the expectation is under the true distribution \(\vec{p}\). Then the average message size is called the **cross-entropy** between the distributions:

\[ H(\vec{p},\vec{q}) = \sum_i p_i \log \frac{1}{q_i} \]
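A sketch in Python (distributions as aligned lists of probabilities; these particular numbers are just an illustration):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = sum_i p_i * log2(1/q_i): average code length when symbols
    arrive with distribution p but are coded for distribution q."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
cross_entropy(p, p)  # 1.5: coding with the true distribution gives H(p)
cross_entropy(p, q)  # 1.75: the mismatched code costs extra bits
```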

How much worse is this coding compared to the optimal one? (I.e., how much of a cost do you pay for encoding with the wrong distribution?) The optimal code has average size \( \sum_i -p_i \log p_i \), so the difference is just

\[ \begin{align}
KL(\vec{p} \,||\, \vec{q}) &= \sum_i -p_i \log q_i + p_i \log p_i \\
&= \sum_i p_i \log \frac{p_i}{q_i}
\end{align} \]

which is called the **relative entropy** or Kullback-Leibler divergence, and it’s a measure of the dissimilarity of the distributions \(\vec{p}\) and \(\vec{q}\). You can see it’s about dissimilarity because if \(\vec{p}\) and \(\vec{q}\) were the same, the inner term \(\log\frac{p_i}{q_i}\) would always be 0, so the whole sum comes out to 0.
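And a last Python sketch to check the “extra bits” interpretation numerically (same illustrative distributions as before):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log2(p_i / q_i): the extra bits paid for
    coding p-distributed symbols with a code built for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
kl_divergence(p, p)  # 0.0: identical distributions
kl_divergence(p, q)  # 0.25: exactly cross-entropy minus entropy
```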

For more, I rather like the early chapters of the free online textbook by David MacKay: “Information Theory, Inference, and Learning Algorithms”. That’s where I picked up the habit of saying surprisal is \( \log \frac{1}{p} \) instead of \(-\log p\); the former seems more intuitive to me, and then you don’t have a pesky negative sign in the entropy and cross-entropy equations. In general the book is great at making things intuitive. Its main weakness is you can’t trust the insane negative things he says about frequentist statistics, but that’s another discussion.

\(^*\) You can use natural logs or whatever and it’s just different sized units: “nats”, as you can see in the fascinating Chapter 18 of MacKay on codebreaking, which features Bletchley Park, Alan Turing, and Nazis.

Unfortunately I cannot seem to use MathJax in my comments. I was hoping to nerd out about some of my favorite things about the KL divergence.

MathJax works great when I come directly to your web site, but when I read your blog via Google Reader I only see TeX commands, not processed images. This usually happens the first time anyone uses MathJax on a blog. There’s some way to fix it, but I haven’t used MathJax myself so I can’t tell you the solution.

Hm. test: [latex]x^2[/latex]

test again: \(x^2\)

Aha! You have to use the backslash-paren notation, or backslash-bracket notation.

this is inline: B( x^2 B)

this is display mode:

B[ x^2 B]

Actually replace “B” with a single backslash. These then become:

this is inline: \(x^2\)

this is display mode:

\[ x^2 \]