A language model’s perplexity is exponentiated negative average log-likelihood,

$$\exp( -\frac{1}{N} \log(p(x)))$$

where the inner term usually decomposes into a sum over individual items, for example \(\sum_i \log p(x_i | x_1..x_{i-1})\) or \(\sum_i \log p(x_i)\) depending on independence assumptions. For language modeling, word tokens are usually taken as the individual units. (In that case perplexity is the inverse of the geometric mean of the per-token likelihoods.) It’s equivalent to exponentiated cross-entropy between the empirical data distribution and the model, since \(-\frac{1}{N} \sum_{i=1}^N \log p(x_i) = -\sum_{k=1}^K \hat{p}_k \log p_k = H(\hat{p};p)\), where \(N\) is the number of items, \(K\) is the number of discrete classes (e.g. word types for language modeling), \(\hat{p}_k\) is the proportion of data having class \(k\), and \(p_k\) is the probability the model assigns to class \(k\).
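To make the equivalence concrete, here is a small sketch that computes perplexity three ways: from the average log-likelihood, from the cross-entropy against the empirical distribution, and as an inverse geometric mean of per-token likelihoods. The unigram `model` and the `data` sequence are made-up toy values, and it assumes Python 3.8+ for `math.prod`.

```python
import math
from collections import Counter

def perplexity(log_probs):
    """Exponentiated negative average (natural) log-likelihood."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical unigram model and toy data, purely for illustration.
model = {"a": 0.5, "b": 0.25, "c": 0.25}
data = ["a", "b", "a", "c", "a", "a", "b", "c"]
N = len(data)

# Route 1: average per-token log-likelihood.
ppl = perplexity([math.log(model[x]) for x in data])

# Route 2: cross-entropy between empirical proportions and the model.
counts = Counter(data)
cross_entropy = -sum((n / N) * math.log(model[k]) for k, n in counts.items())
ppl_from_ce = math.exp(cross_entropy)

# Route 3: inverse geometric mean of the per-token likelihoods.
inv_geo_mean = math.prod(model[x] for x in data) ** (-1 / N)

print(ppl, ppl_from_ce, inv_geo_mean)
```

All three numbers agree up to floating-point error. For real corpora you would stay in log space (routes 1 and 2), since the raw product in route 3 underflows quickly.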

A nice interpretation of any exponentiated entropy measure is as a branching factor: entropy measures uncertainty in bits or nats, but in exponentiated form it’s measured as the size of an equally weighted distribution with equivalent uncertainty. That is, \(\exp(H(p))\) is how many sides you need on a fair die to get the same uncertainty as the distribution \(p\).
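A quick sketch of the fair-die reading, taking \(\exp(H(p))\) with \(H\) in nats. The `loaded` distribution is an arbitrary example chosen so the answer comes out in closed form.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def perplexity(p):
    """Exponentiated entropy: the 'effective number of sides' of p."""
    return math.exp(entropy(p))

fair_die = [1 / 6] * 6
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # arbitrary example distribution

print(perplexity(fair_die))  # approximately 6
print(perplexity(loaded))    # approximately 4.47, i.e. sqrt(20)
```

The loaded die still has six faces, but it is only about as uncertain as a fair die with four and a half sides.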

Entropy differs by a constant factor depending on whether you measure it with base-2 or natural logarithms (giving units of bits or nats, respectively). But perplexity comes out the same in either base, as long as you exponentiate with the same base you took logarithms in. The following derivation works equally well with base 2 in place of \(e\):

\[ \exp(-\sum_k p_k \log p_k) = \exp(\sum_k \log p_k^{-p_k}) = \prod_k p_k^{-p_k} \]
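A numerical check of the same identity, as a sketch: entropy is computed once in bits and once in nats, each is exponentiated with its matching base, and both are compared against the product form. The distribution `p` is an arbitrary example, and `math.prod` assumes Python 3.8+.

```python
import math

p = [0.5, 0.25, 0.125, 0.125]  # arbitrary example distribution

h_bits = -sum(pk * math.log2(pk) for pk in p)  # entropy in bits
h_nats = -sum(pk * math.log(pk) for pk in p)   # entropy in nats

ppl_from_bits = 2 ** h_bits        # exponentiate with base 2
ppl_from_nats = math.exp(h_nats)   # exponentiate with base e
ppl_product = math.prod(pk ** -pk for pk in p)  # base-free product form

print(h_bits, h_nats)
print(ppl_from_bits, ppl_from_nats, ppl_product)
```

The two entropies differ by the factor \(\ln 2\), while the three perplexity values coincide.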

Neat Wikipedia discovery: ecology and economics developed their own diversity index measures, and these turn out to be exponentiated entropy, just like perplexity. More precisely, the different diversity indexes correspond to exponentiated Rényi entropies. The term “diversity” is a nice alternative to “uncertainty”: less epistemologically loaded, just a description of the distribution.
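As a rough sketch of that correspondence: the exponentiated Rényi entropy of order \(q\) is \((\sum_k p_k^q)^{1/(1-q)}\), often called a Hill number in the ecology literature, and its limit as \(q \to 1\) is ordinary perplexity. The function name `diversity` and the example distribution are made up for illustration.

```python
import math

def diversity(p, q):
    """Exponentiated Renyi entropy of order q (a Hill number)."""
    p = [pk for pk in p if pk > 0]
    if math.isclose(q, 1.0):
        # q -> 1 limit: exponentiated Shannon entropy, i.e. perplexity.
        return math.exp(-sum(pk * math.log(pk) for pk in p))
    return sum(pk ** q for pk in p) ** (1.0 / (1.0 - q))

p = [0.5, 0.25, 0.25]  # arbitrary example distribution

richness = diversity(p, 0)         # q = 0: number of types with nonzero mass
shannon = diversity(p, 1)          # q = 1: perplexity
inverse_simpson = diversity(p, 2)  # q = 2: 1 / sum of squared proportions

print(richness, shannon, inverse_simpson)
```

On a uniform distribution every order returns the number of classes. On a skewed one the value shrinks as \(q\) grows, because higher orders put more weight on the common classes.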