## Confusion matrix diagrams

I wrote a little note and diagrams on confusion matrix metrics: Precision, Recall, F, Sensitivity, Specificity, ROC, AUC, PR Curves, etc.

brenocon.com/confusion_matrix_diagrams.pdf

also, graffle source.

## Movie summary corpus and learning character personas

Here is one of our exciting just-finished ACL papers.  David and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc — that are used in movies.

To do this we collected a brand new dataset: 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre.  We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them.  Did you see that NYT article on quantitative analysis of film scripts?  This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money?  We have released the data here.

Our focus, though, is on narrative analysis.  We investigate character personas: familiar character types that are repeated over and over in stories, like “Hero” or “Villain”; maybe grand mythical archetypes like “Trickster” or “Wise Old Man”; or much more specific ones, like “Sassy Best Friend” or “Obstructionist Bureaucrat” or “Guy in Red Shirt Who Always Gets Killed”.  They are defined in part by what they do and who they are — which we can glean from their actions and descriptions in plot summaries.

Our model clusters movie characters, learning posteriors like this:

Each box is one automatically learned persona cluster, along with actions and attribute words that pertain to it.  For example, characters like Dracula and The Joker are always “hatching” things (hatching plans, presumably).

One of our models takes the metadata features, like movie genre and gender and age of an actor, and associates them with different personas.  For example, we learn the types of characters in romantic comedies versus action movies.  Here are a few examples of my favorite learned personas:

One of the best things I learned about during this project was the website TVTropes (which we use to compare our model against).

We’ll be at ACL this summer to present the paper.  We’ve posted it online too:

## What inputs do Monte Carlo algorithms need?

Monte Carlo sampling algorithms (MCMC or otherwise) aim to draw samples from a target distribution.  They can be organized by what inputs or prior knowledge about the distribution they require.  This ranges from very little knowledge, as in slice sampling (just give it an unnormalized density function), to quite a lot, as in Gibbs sampling (you have to decompose your distribution into individual conditionals).

Typical inputs include $$f(x)$$, an unnormalized density or probability function for the target distribution, which returns a real number for a variable value.  $$g()$$ and $$g(x)$$ represent sample generation procedures (that output a variable value); some generators require an input, some do not.

Here are the required inputs for a few algorithms.  (For an overview, see e.g. Ch 29 of MacKay.)  There are many more out there of course.  I’m leaving off tuning parameters.

Black-box samplers: slice sampling, affine-invariant ensemble
- unnorm density $$f(x)$$

Metropolis (symmetric proposal)
- unnorm density/pmf $$f(x)$$
- proposal generator $$g(x)$$
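
To make the input contract concrete, here is a minimal sketch (hypothetical code, not from the post) of random-walk Metropolis in R.  The only distribution-specific inputs are the unnormalized density f and a symmetric proposal generator g:

```r
# Minimal sketch: random-walk Metropolis.  Inputs: unnormalized density f,
# symmetric proposal generator g.  Function and argument names are made up.
metropolis <- function(f, g, x0, iters) {
  x <- x0
  out <- numeric(iters)
  for (t in 1:iters) {
    xprop <- g(x)                     # propose a move
    if (runif(1) < f(xprop) / f(x))   # accept with prob min(1, f(x')/f(x))
      x <- xprop
    out[t] <- x
  }
  out
}

# Usage: sample from an unnormalized standard normal density.
samples <- metropolis(f = function(x) exp(-x^2 / 2),
                      g = function(x) x + rnorm(1, sd = 0.5),
                      x0 = 0, iters = 10000)
```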

Hastings (asymmetric proposal)
- unnorm density/pmf $$f(x)$$
- proposal generator $$g(x)$$
- proposal unnorm density/pmf $$q(x'; x)$$
… [For the proposal generator at state $$x$$, probability it generates $$x'$$]

Importance sampling, rejection sampling
- unnorm density/pmf $$f(x)$$
- proposal generator $$g()$$
- proposal unnorm density/pmf $$q(x)$$
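
For comparison, a self-normalized importance sampling sketch (again hypothetical code), needing the unnormalized target density f, a proposal generator g(), and the proposal density q — and no chain state at all:

```r
# Minimal sketch: self-normalized importance sampling for E[x] under the target.
# Inputs: unnormalized target density f, proposal generator g(), proposal density q.
importance_mean <- function(f, g, q, n) {
  x <- replicate(n, g())    # independent draws from the proposal
  w <- f(x) / q(x)          # unnormalized importance weights
  sum(w * x) / sum(w)       # self-normalized estimate of E[x]
}

# Usage: mean of an unnormalized N(2, 1) target, using a wide N(0, 3) proposal.
importance_mean(f = function(x) exp(-(x - 2)^2 / 2),
                g = function() rnorm(1, 0, 3),
                q = function(x) dnorm(x, 0, 3),
                n = 1e5)   # should come out close to 2
```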

Independent Metropolis-Hastings: the proposal doesn’t depend on the current state, but you still need its density for the asymmetry correction
- unnorm density/pmf $$f(x)$$
- proposal generator $$g()$$
- proposal unnorm density/pmf $$q(x)$$

Hamiltonian Monte Carlo
- unnorm density $$f(x)$$
- unnorm density gradient $$\nabla f(x)$$
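
Here is a sketch of a single HMC transition (hypothetical code; in practice you work with $$\log f$$ and its gradient, which follow from $$f$$ and $$\nabla f$$ since $$\nabla \log f = \nabla f / f$$):

```r
# Minimal sketch: one HMC transition via leapfrog integration.
# eps (step size) and L (number of leapfrog steps) are tuning parameters.
hmc_step <- function(log_f, grad_log_f, x, eps = 0.1, L = 20) {
  p <- rnorm(length(x))                    # resample momentum
  xn <- x; pn <- p
  pn <- pn + 0.5 * eps * grad_log_f(xn)    # initial half step for momentum
  for (l in 1:L) {
    xn <- xn + eps * pn                    # full step for position
    if (l < L) pn <- pn + eps * grad_log_f(xn)
  }
  pn <- pn + 0.5 * eps * grad_log_f(xn)    # final half step for momentum
  # Metropolis accept/reject on the joint (position, momentum) energy
  log_accept <- (log_f(xn) - 0.5 * sum(pn^2)) - (log_f(x) - 0.5 * sum(p^2))
  if (log(runif(1)) < log_accept) xn else x
}

# Usage: one transition targeting a standard 2-d normal.
hmc_step(log_f = function(x) -sum(x^2) / 2,
         grad_log_f = function(x) -x,
         x = c(0, 0))
```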

Gibbs Sampling
- local conditional generators $$g_i(x_{-i})$$
… [which have to give samples from $$p(x_i | x_{-i})$$]
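
A toy Gibbs sketch (hypothetical code) for a standard bivariate normal with correlation rho; notice that the only distribution-specific inputs are the two conditional samplers, and the joint density is never evaluated:

```r
# Minimal sketch: Gibbs sampling for a standard bivariate normal with correlation rho.
gibbs_bvn <- function(rho, iters, x = c(0, 0)) {
  out <- matrix(0, iters, 2)
  s <- sqrt(1 - rho^2)
  for (t in 1:iters) {
    x[1] <- rnorm(1, mean = rho * x[2], sd = s)   # draw x1 | x2
    x[2] <- rnorm(1, mean = rho * x[1], sd = s)   # draw x2 | x1
    out[t, ] <- x
  }
  out
}

samples <- gibbs_bvn(rho = 0.8, iters = 5000)
cor(samples)   # off-diagonal should be near 0.8
```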

Note importance/rejection sampling are stateless, but the MCMC algorithms are stateful.

I’m distinguishing a sampling procedure $$g$$ from a density evaluation function $$f$$ because having the latter doesn’t necessarily give you the former.  (For the one-dimensional case, having an inverse CDF indeed gives you a sampler, but the multidimensional case gets harder — part of why all these techniques were invented in the first place!)  Shay points out their relationship is analogous to 3-SAT: it’s easy to evaluate a full variable setting, but hard to generate one.  (Specifically, think of a PMF $$p(x) \propto 1\{x \text{ satisfies a given 3-SAT formula}\}$$, which puts probability only on satisfying assignments; PMF evaluation is easy, but the best known sampler takes exponential time.)

And of course there’s a related organization of optimization algorithms.  Here’s a rough look at a few unconstrained optimizers:

Black-box optimizers: Grid search, Nelder-Mead, evolutionary, …
- objective $$f(x)$$

BFGS, CG
- objective $$f(x)$$
- gradient $$\nabla f(x)$$

Newton-Raphson
- objective $$f(x)$$
- gradient $$\nabla f(x)$$
- Hessian $$\nabla^2 f(x)$$

Simulated annealing
- objective $$f(x)$$
- proposal generator $$g(x)$$

SGD
- one-example gradient $$\nabla f_i(x)$$

I think it’s neat that Gibbs sampling and SGD don’t always require you to implement a likelihood/objective function.  It’s nice to have one, to check that you’re actually optimizing or exploring the posterior, but strictly speaking the algorithms don’t require it.  For example, the sketch below never computes the objective at all.
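
A hypothetical SGD sketch for least-squares linear regression, touching only per-example gradients:

```r
# Minimal sketch: SGD for least-squares regression.  Only per-example gradients
# are ever computed; the full objective never appears.
sgd_lm <- function(X, y, lr = 0.01, epochs = 20) {
  w <- rep(0, ncol(X))
  for (e in 1:epochs) {
    for (i in sample(nrow(X))) {                 # shuffle examples each epoch
      g_i <- (sum(X[i, ] * w) - y[i]) * X[i, ]   # gradient of 0.5 * (x_i'w - y_i)^2
      w <- w - lr * g_i
    }
  }
  w
}

# Usage on made-up data; the estimate should land near (2, -1).
X <- cbind(1, rnorm(500))
y <- as.vector(X %*% c(2, -1)) + rnorm(500, sd = 0.1)
sgd_lm(X, y)
```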

## Rise and fall of Dirichlet process clusters

Here’s Gibbs sampling for a Dirichlet process 1-d mixture of Gaussians, run on 1000 data points that look like this.

I gave it a fixed variance and concentration parameter, ran it over MCMC iterations, and it looks like this.

The top panel shows the number of points in each cluster; the bottom panel shows the cluster means.  Every cluster has a unique color.  During MCMC, clusters are created and destroyed; when a cluster dies, its color is never reused.

I’m showing clusters every 100 iterations.  If there is a single point, that cluster existed at that iteration but not 100 iterations before or after.  If there is a line, the cluster lived for at least 100 iterations.  Some clusters live long, some live short, but all eventually die.

Usually the model likes to think there are about two clusters, occupying positions at the two modes in the data distribution.  It also entertains the existence of several much more minor ones.  Usually these are short-lived clusters that die away.  But sometimes, they rise up and kick out one of the dominant clusters, and take over its space.  This is evocative at least to me: for example, around iteration 2500 is a crisis of the two-mode regime, the fall of green and the rise of blue.  (Maybe there are analogies to ideal points and coalitions or something, but call that future work…)

In fact the real story is a little more chaotic. Here’s the same run, but at finer resolution (every 10 iterations).

Around iteration 2500 you can see blue suddenly appear in green’s territory, where it’s bouncing around trying to get data points to convert to its cause. The clusters struggle and blue eventually wins out. Not all challenges are successful, though; e.g. at 1500 or 3600.

Ultimately, the dynamism is fake; looking at the broad sweep of history, it’s all part of a globally unchanging, static steady state of MCMC. The name of the cluster at mean -2 might change from time to time, but really, it occupies a position in the system analogous to the old regime.

Actually not just “analogous” but mathematically the same, as implied by CRP exchangeability; the cluster IDs are just an auxiliary variable for the DP.  And the point of MCMC is to kill the dynamism by averaging over it for useful inference.  This nicely illustrates that you can’t directly average per-cluster quantities across samples in an MCMC mixture model, since new clusters might slide into the place of old ones.  (You might average over smaller spans, maybe; or perhaps look at statistics that are invariant to cluster relabeling, like the probability that two datapoints belong to the same cluster.  Or only use a single sample, which is at least guaranteed to be internally consistent?)

Technical details: this is actually a maybe-lame uncollapsed Gibbs sampler; I think it’s Algorithm 2 from Neal 2000 or something close.  Everyone asks about the plots but they are easy to make; given a logfile with tuples like (iternum, cluster ID, n, mu), first melt() it, then ggplot with something like qplot(iternum, value, colour=factor(clusterID), data=x, geom=c('line','point')) + facet_grid(~variable, scales='free').
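
For anyone who wants to reproduce the plots, here is a slightly fuller hypothetical sketch of that recipe; the filename and column names are made up, assuming the sampler wrote whitespace-separated (iternum, clusterID, n, mu) rows to a logfile:

```r
# Hypothetical plotting sketch for the cluster-trajectory figures.
library(reshape2)
library(ggplot2)
logdf <- read.table("dpmm.log", col.names = c("iternum", "clusterID", "n", "mu"))
x <- melt(logdf, id.vars = c("iternum", "clusterID"))  # long format: one row per (iter, cluster, measure)
qplot(iternum, value, colour = factor(clusterID), data = x,
      geom = c("line", "point")) +
  facet_grid(variable ~ ., scales = "free")            # stack the n and mu panels vertically
```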

## Correlation picture

Paul Moore posted a comment pointing out this great discussion of the correlation coefficient:

• Joseph Lee Rodgers and W. Alan Nicewander. “Thirteen Ways to Look at the Correlation Coefficient.” The American Statistician, Vol. 42, No. 1. (Feb., 1988), pp. 59-66. Link

It’s related to the post on cosine similarity, correlation and OLS.  Anyway, I was just struck by the following diagram.  It almost has a pop-art feel.

## R scan() for quick-and-dirty checks

One of my favorite R tricks is scan().  I was using it recently to check a sampler I wrote, which was supposed to output numbers uniformly between 1 and 100 into a logfile; this one-liner loads the logfile, counts the different outcomes, and plots them.

plot(table(scan("log")))

As the logfile was growing, I kept replotting it and found it oddly compelling.

This was useful: in fact, an early version had an off-by-one bug, immediately obvious from the plot.  And of course, chisq.test(table(scan("log"))) does a null-hypothesis test for uniformity.
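
To see the idea end-to-end, here is a hypothetical reconstruction of that scenario (not the original code): a sampler with an off-by-one bug that emits 0 through 99 instead of 1 through 100, which the plot makes obvious at a glance:

```r
# Hypothetical: a sampler that should be uniform on 1..100, with an off-by-one bug.
buggy_sampler <- function(n) floor(runif(n, 0, 100))   # bug: should be sample(1:100, n, replace = TRUE)
writeLines(as.character(buggy_sampler(10000)), "log")

counts <- table(scan("log"))
plot(counts)        # the stray bar at 0 (and nothing at 100) jumps out immediately
chisq.test(counts)  # null-hypothesis test of uniformity over the observed outcomes
```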

Really liking whoever made @SottedReviewer, e.g.

There’s an entire great story here about ICWSM and interdisciplinarity, or the lack thereof.  (Somehow I’ve never reviewed for ICWSM.  Maybe I’m missing out.)

## Wasserman on Stats vs ML, and previous comparisons

Larry Wasserman has a new position paper (forthcoming 2013) with a great comparison of the Statistics and Machine Learning research cultures, “Rise of the Machines”.  He has a very conciliatory view in terms of intellectual content, and a very pro-ML take on the research cultures.  Central to his argument is that ML has recently adopted rigorous statistical concepts, and that its fast-moving conference culture (and heavy publishing by its grad students) has helped with this and other good innovations.  (I agree with a comment from Sinead that he’s going a little easy on ML, but it’s certainly worth a read.)

There’s now a little history of “Statistics vs Machine Learning” position papers that this can be compared to. A classic is Leo Breiman (2001), “Statistical Modeling: The Two Cultures”, which isn’t exactly about stats vs. ML, but is about the focus on modeling vs algorithms, and maybe about description vs. prediction.

It’s been a while since I’ve looked at it, but I’ve also enjoyed Jerome Friedman (1998)’s “Data Mining and Statistics: What’s the Connection?” which looks at what I’d now consider a kind of dark ages, the heuristic data mining approaches I associate with the ’90s.

I think Larry in 2013 would argue that NIPS or ICML-style mainstream ML research has substantially moved away from the crazy heuristic world of data mining and into something that looks more like statistical theory. Larry’s obviously focusing on areas he’s interested in. For example, current work in deep belief networks feels much more heuristicky — some of us (like me) think this area is very important and are sympathetic to the approach, though plenty aren’t — but of course the combination of theory and creative algorithmic innovation is useful to have even when they haven’t yet been reconciled. That’s what you get if you want to follow bleeding-edge research.

It’s always interesting to read this sort of thing. Before going to grad school I wrote a blog rant about stats and ML; it’s nice to actually know something about the topic now. Or conversely, maybe my worldview has been sufficiently shaped by all of Larry’s classes that I think the same way. Whatever.

## Perplexity as branching factor; as Shannon diversity index

A language model’s perplexity is exponentiated negative average log-likelihood,

$$\exp( -\frac{1}{N} \log(p(x)))$$

where the inner term usually decomposes into a sum over individual items; for example, as $$\sum_i \log p(x_i | x_1..x_{i-1})$$ or $$\sum_i \log p(x_i)$$ depending on independence assumptions, where for language modeling word tokens are usually taken as the individual units.  (In which case perplexity is the geometric mean of the inverse per-token probabilities.)  It’s equivalent to exponentiated cross-entropy between the model and the empirical data distribution, since $$-\frac{1}{N} \sum_i^N \log p(x_i) = -\sum_k^K \hat{p}_k \log p_k = H(\hat{p};p)$$ where $$N$$ is the number of items, $$K$$ is the number of discrete classes (e.g. word types for language modeling), and $$\hat{p}_k$$ is the proportion of data having class $$k$$.

A nice interpretation of any exponentiated entropy measure is as a branching factor: entropy measures uncertainty in bits or nats, but in exponentiated form it’s measured as the size of an equally weighted distribution with equivalent uncertainty.  That is, $$\exp(H(p))$$ is how many sides you need on a fair die to get the same uncertainty as the distribution $$p$$.

Entropy differs by a constant factor depending on whether you measure it with base-2 or natural logarithms (then your units are bits vs. nats, respectively).  But perplexity is the same with whichever base you want, since the base cancels:

$$\exp\Big(-\sum_k p_k \ln p_k\Big) = \prod_k p_k^{-p_k} = 2^{-\sum_k p_k \log_2 p_k}$$
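
As a quick numerical check of both points — the fair-die interpretation and the base invariance — here is a small R sketch with a made-up distribution:

```r
# Exponentiated entropy as a branching factor, and its invariance to log base.
p <- c(0.5, 0.25, 0.125, 0.125)
exp(-sum(p * log(p)))      # perplexity via natural logs: about 3.36 "effective sides"
2^(-sum(p * log2(p)))      # the same number via base-2 logs
q <- rep(1/6, 6)
exp(-sum(q * log(q)))      # a fair six-sided die has perplexity exactly 6
```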

Neat Wikipedia discovery: in ecology and economics, people developed “diversity index” measures that turn out to be exponentiated entropy, just like perplexity.  In fact, the different diversity indices correspond to exponentiated Rényi entropies.  The term “diversity” is a nice alternative to “uncertainty” — less epistemologically loaded, just a description of the distribution.

## Graphs for SANCL-2012 web parsing results

I was just looking at some papers from the SANCL-2012 workshop on web parsing, held in June this year, which are very interesting to those of us who wish we had good parsers for non-newspaper text.  The shared task focused on domain adaptation, in a setting with lots of annotated Wall Street Journal data and very little in-domain training data.  (Previous discussion here; see Ryan McDonald’s detailed comment.)  Here are some graphs of the results (last page in the Petrov & McDonald overview).

I was most interested in whether parsing accuracy on the WSJ correlates with accuracy on web text.  Fortunately, it does.  They evaluated all systems on four evaluation sets: (1) text from a question/answer site, (2) newsgroups, (3) reviews, and (4) the Wall Street Journal PTB.  Here is a graph across system entries, with the x-axis being the labeled dependency parsing accuracy on WSJ PTB, and the y-axis the average accuracy on the three web evaluation sets.  Note the axis scales are different: web accuracies are much worse than WSJ.

There are two types of systems here: direct dependency parsers, and constituent parsers whose output is converted to dependencies and then evaluated.  There’s an interesting argument in the overview paper that the latter type seems to perform better out-of-domain, perhaps because they learn more about latent characteristics of English grammatical structure that transfer between domains.  To check this, I grouped on this factor: blue triangles are constituent-then-convert parsers, and red circles are direct dependency parsers.  (Boxes are the two non-domain-adapted baselines.)  The relationship between WSJ and web accuracy can be checked with linear regression on the subgroups.  The constituent parsers are higher up on the graph; so indeed, the constituent parsers have higher web accuracy relative to their WSJ accuracy compared to the direct dependency parsers.  A difference in the slopes might also be interesting, but the difference between them is mostly driven by a single direct-dependency outlier (top left, “DCU-Paris13”); excluding that, the slopes are quite similar.  Actually, this system probably shouldn’t count as a direct-dependency system, because it’s actually an ensemble including constituency conversion as a component (if it is “DCU-Paris13-Dep” here).  In any case, lines with and without it are both shown.

Since the slopes are similar, we shouldn’t need a varying-slopes hierarchical regression to analyze the differences, so just throw it all into one regression (webacc ~ wsjacc + ConstitIndicator): constituent parsers get an absolute 1.6% better out-of-domain accuracy compared to dependency parsers with the same WSJ parsing accuracy.  (This excludes DCU-Paris13.)  It’s not clear if this is due to better grammar learning, or to an issue with the Stanford Dependencies (SD) conversion giving bad output on web text.
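
Concretely, the regression looks something like this hypothetical R sketch; the data frame d and its columns (webacc = mean web accuracy, wsjacc = WSJ accuracy, ConstitIndicator = TRUE for constituent-then-convert systems) are made up names, with DCU-Paris13 assumed to be dropped already:

```r
# Pooled regression: one slope, plus a fixed offset for the constituent group.
fit <- lm(webacc ~ wsjacc + ConstitIndicator, data = d)
summary(fit)   # the ConstitIndicator coefficient is the offset between the two groups

# The varying-slopes question: does letting the groups have different slopes help?
fit2 <- lm(webacc ~ wsjacc * ConstitIndicator, data = d)
anova(fit, fit2)
```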

To see the individual systems’ names, here are both sets of numbers. These include all systems; for the scatterplot above I excluded the ones that performed worse than the non-domain-adapted baselines. (Looking at those papers, they seemed to be less elaborately engineering-heavy efforts; for example, the very interesting Pitler paper focuses on issues in how the representation handles conjunctions.)

CIs can’t be seen on those skinny graphs, but they’re there; they were computed via binomial normal approximations using the number of tokens (i.e. assuming correctness is independent at the token level — though it should probably have sentence-level or document-level grouping, of course, so the intervals are tighter than they should be).

Finally, this is less useful, but here’s the scatterplot matrix of all four evaluation sets (excluding worse-than-baseline systems again).

All R code here.