Bayes update view of pointwise mutual information


This is fun. Pointwise Mutual Information (e.g. Church and Hanks 1990) between outcomes \(x\) and \(y\) of two random variables is

\[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \]

It’s called “pointwise” because Mutual Information, between two (discrete) variables X and Y, is the expectation of PMI over possible outcomes of X and Y: \( MI(X,Y) = \sum_{x,y} p(x,y) PMI(x,y) \).
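
To make this concrete, here is a minimal sketch (in Python, with a made-up 2×2 joint distribution): compute PMI for each outcome pair, then take the expectation under the joint to get MI.

```python
# A minimal sketch of PMI and MI for a small discrete joint distribution.
# The 2x2 joint table p_xy is made up for illustration.
import math

# Joint distribution p(x, y) over outcomes x in {0, 1}, y in {0, 1}
p_xy = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

# Marginals p(x) and p(y)
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

def pmi(x, y):
    """log p(x,y) / (p(x) p(y)): pointwise mutual information."""
    return math.log(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# MI is the expectation of PMI under the joint distribution
mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())

for x, y in p_xy:
    print(f"PMI({x},{y}) = {pmi(x, y):+.4f}")
print(f"MI(X,Y) = {mi:.4f}")
```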

One interpretation of PMI is that it measures deviation from independence: if X and Y were independent we’d have \(p(x,y)=p(x)p(y)\), so the ratio measures how far this particular pair of outcomes is from independence.

You can get another interpretation of this quantity if you rewrite it with conditional probabilities. Looking just at the ratio, apply the definition of conditional probability:

\[ \frac{p(x,y)}{p(x)p(y)} = \frac{p(x|y)\,p(y)}{p(x)\,p(y)} = \frac{p(x|y)}{p(x)} \]

Think about doing a Bayes update for your belief about \(x\). Start with the prior \(p(x)\), then learn \(y\) and you update to the posterior belief \(p(x|y)\). How much your belief changes is measured by that ratio; the log-scaled ratio is PMI. (Positive PMI = increase belief, negative PMI = decrease belief. Positive vs. negative associations.)
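
Continuing the toy example, here is a check that the log of the prior-to-posterior ratio is exactly PMI (same made-up 2×2 joint as above):

```python
# Bayes-update reading of PMI on the made-up 2x2 joint from above.
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}   # marginal of X
p_y = {0: 0.6, 1: 0.4}   # marginal of Y

x, y = 0, 0
prior = p_x[x]                     # belief about x before seeing y
posterior = p_xy[(x, y)] / p_y[y]  # p(x|y): belief about x after seeing y
pmi = math.log(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# The log of the belief-change ratio equals PMI(x, y)
assert math.isclose(math.log(posterior / prior), pmi)
print(prior, posterior, pmi)
```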

Interestingly, it’s symmetric (obvious from the original definition of PMI, sure):
\[ \frac{p(x|y)}{p(x)} = \frac{p(y|x)}{p(y)} \]

So under this measurement of “amount of information you learn,” the amount you learn about \(x\) from \(y\) is actually the same as how much you learn about \(y\) from \(x\).
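
A quick numeric check of that symmetry on the same toy joint:

```python
# For every outcome pair, the log belief-change ratio for x given y
# matches the one for y given x (and both equal the PMI).
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

for (x, y), p in p_xy.items():
    gain_x_from_y = math.log((p / p_y[y]) / p_x[x])  # log p(x|y)/p(x)
    gain_y_from_x = math.log((p / p_x[x]) / p_y[y])  # log p(y|x)/p(y)
    assert math.isclose(gain_x_from_y, gain_y_from_x)
    print(x, y, round(gain_x_from_y, 4))
```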

This is closer to the information-gain view of mutual information, where you decompose it into relative and conditional entropies; the current Wikipedia page has some of these derivations worked out both ways.
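
For reference, the standard chain of identities (these all follow from the definitions above):

\[ MI(X,Y) = \mathbb{E}_{p(x,y)}\left[ PMI(x,y) \right] = D_{KL}\big( p(x,y) \,\|\, p(x)\,p(y) \big) = H(X) - H(X|Y) = H(Y) - H(Y|X) \]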

Lots more about this stuff on the MI and KL Divergence Wikipedia pages, and in the early chapters of the (free) MacKay 2003 textbook. There seems to be lots of recent work using PMI for association scores between words or concepts and such (I did this with Facebook “Like” data at my internship there; it is quite fun). It’s nice because with MLE or fixed-Dirichlet-MAP estimation it only requires simple counts and no optimization or sampling, so you can use it on very large datasets, and it seems to give good pairwise association results in many circumstances.
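
To sketch how simple the counting version is: here’s a toy example with a made-up corpus, ordered (word, context-word) co-occurrence counts within each document, and an add-\(\alpha\) pseudocount. With \(\alpha = 0\) this is the plain MLE; with \(\alpha > 0\) it’s the fixed-Dirichlet-MAP flavor, which keeps never-seen pairs away from \(\log 0\).

```python
# Toy count-based PMI for word association scores.
# Corpus and alpha are made up for illustration.
import math
from collections import Counter
from itertools import permutations

docs = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["los", "angeles", "times"],
    ["new", "jersey", "city"],
]

# Count ordered (word, context-word) pairs within each document.
pair = Counter()
for doc in docs:
    pair.update(permutations(doc, 2))

vocab = sorted({w for doc in docs for w in doc})
alpha = 0.01  # made-up pseudocount; 0 gives plain MLE

# Smoothed cell counts over the full vocab x vocab table.
def cell(w, c):
    return pair[(w, c)] + alpha

total = sum(cell(w, c) for w in vocab for c in vocab)
row = {w: sum(cell(w, c) for c in vocab) for w in vocab}
col = {c: sum(cell(w, c) for w in vocab) for c in vocab}

def pmi(w, c):
    # log p(w,c) / (p(w) p(c)), with all counts normalized by the same total
    return math.log((cell(w, c) / total) / ((row[w] / total) * (col[c] / total)))

print(f"{pmi('new', 'york'):+.3f}")      # co-occur often -> positive
print(f"{pmi('york', 'angeles'):+.3f}")  # never co-occur -> negative
```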
