What the ACL-2014 review scores mean

I’ve had several people ask me what the numbers in ACL reviews mean — and I can’t find anywhere online where they’re described. (Can anyone point this out if it is somewhere?)

So here’s the review form, below. The scores all go from 1 to 5, with 5 the best. I think the review emails to authors only include a subset of the below; for example, “Overall Recommendation” is not included?

The CFP said that they have different types of review forms for different types of papers. I think this one is for a standard full paper. I guess what people really want to know is what scores tend to correspond to acceptances. I really have no idea and I get the impression this can change year to year. I have no involvement with the ACL conference besides being one of many, many reviewers.



Scatterplot of KN/PYP language model results

I should make a blog where all I do is scatterplot results tables from papers. I do this once in a while to make them easier to understand…

I think the following results are from Yee Whye Teh’s paper on hierarchical Pitman-Yor language models, and in particular compare them to Kneser-Ney and hierarchical Dirichlets. They’re specifically from these slides by Yee Whye Teh (page 25), which show model perplexities. Every dot is one experimental condition, which has four results, one from each of the models. So a pair of models can be compared in one scatterplot.



  • ikn = interpolated kneser-ney
  • mkn = modified kneser-ney
  • hdlm = hierarchical dirichlet
  • hpylm = hierarchical pitman-yor

My reading: the KN’s and HPYLM are incredibly similar (as Teh argues should be the case on theoretical grounds). MKN and HPYLM edge out IKN. HDLM is markedly worse (this is perplexity, so lower is better). While HDLM is a lot worse, it does best, relatively speaking, on shorter contexts — that’s the green dot, the only bigram model that was tested, where there’s only one previous word of context. The other models have longer contexts, so I guess the hierarchical summing of pseudocounts screws up the Dirichlet more than the PYP, maybe.

The scatterplot matrix is from this table (colored by N-1, meaning the n-gram size):

[Image: the results table from the slides]


tanh is a rescaled logistic sigmoid function

This confused me for a while when I first learned it, so in case it helps anyone else:

The logistic sigmoid function, a.k.a. the inverse logit function, is

\[ g(x) = \frac{ e^x }{1 + e^x} \]

Its outputs range from 0 to 1, and are often interpreted as probabilities (in, say, logistic regression).

The tanh function, a.k.a. hyperbolic tangent function, is a rescaling of the logistic sigmoid, such that its outputs range from -1 to 1. (There’s horizontal stretching as well.)

\[ tanh(x) = 2 g(2x) - 1 \]

It’s easy to show the above leads to the standard definition \( tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \). The (-1,+1) output range tends to be more convenient for neural networks, so tanh functions show up there a lot.
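The step from the rescaling identity to the standard definition can be written out as follows, with \(g\) the logistic sigmoid defined above: substitute, put everything over a common denominator, then multiply the numerator and denominator by \(e^{-x}\).

\[ \begin{aligned} 2 g(2x) - 1 &= \frac{2 e^{2x}}{1 + e^{2x}} - 1 \\ &= \frac{e^{2x} - 1}{e^{2x} + 1} \\ &= \frac{e^x - e^{-x}}{e^x + e^{-x}} \end{aligned} \]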

The two functions are plotted below. Blue is the logistic function, and red is tanh.

[Figure: the logistic function (blue) and tanh (red)]
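For anyone who would rather check the identity numerically than algebraically, here is a minimal Python sketch. The function names are made up for illustration, and the logistic function is written exactly as defined above (which would overflow for very large x, so this is only meant for moderate inputs).

```python
import math

def logistic(x):
    # Logistic sigmoid in the same form as the definition above.
    return math.exp(x) / (1.0 + math.exp(x))

def tanh_via_logistic(x):
    # Rescale and shift the logistic sigmoid: 2 g(2x) - 1.
    return 2.0 * logistic(2.0 * x) - 1.0

# Largest disagreement with the library tanh over a few test points.
test_points = [-5.0, -1.0, 0.0, 0.3, 4.0]
max_error = max(abs(tanh_via_logistic(x) - math.tanh(x)) for x in test_points)
```

On these points `max_error` should be at the level of floating-point rounding.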


Response on our movie personas paper

Update (2013-09-17): See David Bamman’s great guest post on Language Log on our latent personas paper, and the big picture of interdisciplinary collaboration.

I’ve been informed that an interesting critique of my, David Bamman’s and Noah Smith’s ACL paper on movie personas has appeared on the Language Log, a guest post by Hannah Alpert-Abrams and Dan Garrette. I posted the following as a comment on LL.

Thanks everyone for the interesting comments. Scholarship is an ongoing conversation, and we hope our work might contribute to it. Responding to the concerns about our paper,

We did not try to make a contribution to contemporary literary theory. Rather, we focus on developing a computational linguistic research method of analyzing characters in stories. We hope there is a place for both the development of new research methods, as well as actual new substantive findings. If you think about the tremendous possibilities for computer science and humanities collaboration, there is far too much to do and we have to tackle pieces of the puzzle to move forward. Clearly, our work falls more into the first category — it was published at a computational linguistics conference, and we did a lot of work focusing on linguistic, statistical, and computational issues like:

  • how to derive useful semantic relations from current syntactic parsing and coreference technologies,
  • how to design an appropriate probabilistic model on top of this,
  • how to design a Bayesian inference algorithm for the model,

and of course, all the amazing work that David did in assembling a large and novel dataset — which we have released freely for anyone else to conduct research on, as noted in the paper. All the comments above show there are a wealth of interesting questions to further investigate. Please do!

We find that, in these multidisciplinary projects, it’s most useful to publish part of the work early and get scholarly feedback, instead of waiting for years before trying to write a “perfect” paper. Our colleagues Noah Smith, Tae Yano, and John Wilkerson did this in their research on Congressional voting; Brendan did this with Noah and Brandon Stewart on international relations events analysis; there’s great forthcoming work from Yanchuan Sim, Noah, Brice Acree and Justin Gross on analyzing political candidates’ ideologies; and at the Digital Humanities conference earlier this year, David presented his joint work with the Assyriologist Adam Anderson on analyzing social networks induced from Old Assyrian cuneiform texts. (And David’s co-teaching a cool digital humanities seminar with Christopher Warren in the English department this semester — I’m sure there will be great cross-fertilization of ideas coming out of there!)

For example, we’ve had useful feedback here already: besides comments from the computational linguistics community through the ACL paper, just in the discussion on LL there have been many interesting theories and references presented. We’ve also been in conversation with other humanists, as we stated in our acknowledgments (noted by one commenter), though apparently not the same humanists that Alpert-Abrams and Garrette would rather we had talked to. This is why it’s better to publish early and participate in the scholarly conversation.

For what it’s worth, some of these high-level debates on whether it’s appropriate to focus on progress in quantitative methods, versus directly on substantive findings, have been playing out for decades in the social sciences. (I’m thinking specifically about economics and political science, both of which are far more quantitative today than they were just 50 years ago.) And as several commenters have noted, and as we tried to in our references, there’s certainly been plenty of computational work in literary/cultural analysis before. But I do think the quantitative approach still tends to be seen as novel in the humanities, and as the original response notes, there have been some problematic proclamations in this area recently. I just hope there’s room to try to advance things without being everyone’s punching bag for whether or not they liked the latest Steven Pinker essay.


Probabilistic interpretation of the B3 coreference resolution metric

Here is an intuitive justification for the B3 evaluation metric often used in coreference resolution, based on whether mention pairs are coreferent. If a mention from the document is chosen at random,

  • B3-Recall is the (expected) proportion of its actual coreferents that the system thinks are coreferent with it.
  • B3-Precision is the (expected) proportion of its system-hypothesized coreferents that are actually coreferent with it.

Does this look correct to people?
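To make the two bullet points concrete, here is a small Python sketch of per-mention B3 averaging. It assumes the usual convention that a mention counts as a member of its own cluster, and that the gold and system clusterings cover exactly the same mentions; the function name and the toy clusters are invented for illustration.

```python
def b_cubed(gold_clusters, system_clusters):
    """Return (precision, recall), averaged over mentions chosen uniformly."""
    # Map each mention to the cluster that contains it.
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    sys_of = {m: frozenset(c) for c in system_clusters for m in c}
    precision = recall = 0.0
    for m in gold_of:
        overlap = len(gold_of[m] & sys_of[m])
        recall += overlap / len(gold_of[m])    # share of true coreferents found
        precision += overlap / len(sys_of[m])  # share of hypothesized ones that are right
    n = len(gold_of)
    return precision / n, recall / n

# Toy example: five mentions, with the system putting "c" in the wrong cluster.
gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]
b3_precision, b3_recall = b_cubed(gold, system)
```

In this toy case both numbers work out to 11/15: the misplaced mention "c" scores 1/3 on both, and it also costs its neighbors some recall or precision.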


Some analysis of tweet shares and “predicting” election outcomes

Everyone recently seems to be talking about this newish paper by DiGrazia, McKelvey, Bollen, and Rojas (pdf here) that examines the correlation of Congressional candidate name mentions on Twitter against whether the candidate won the race.  One of the coauthors also wrote a Washington Post Op-Ed about it.  I read the paper and I think it’s reasonable, but their op-ed overstates their results.  It claims:

“In the 2010 data, our Twitter data predicted the winner in 404 out of 435 competitive races”

But this analysis is nowhere in their paper.  Fabio Rojas has now posted errata/rebuttals about the op-ed and described this analysis they did here.  There are several major issues off the bat:

  1. They didn’t ever predict 404/435 races; they only analyzed 406 races they call “competitive,” getting 92.5% (in-sample) accuracy, then extrapolated to all races to get the 435 number.
  2. They’re reporting about in-sample predictions, which is really misleading to a non-scientific audience; more notes on this further below.
  3. These aren’t predictions from just Twitter data, but a linear model that includes incumbency status and a bunch of other variables.  (Noted by Jonathan Nagler, who guessed this even before Rojas posted the errata/rebuttal.)

Given that the op-ed uses their results to proclaim that social media “will undermine the polling industry,” this sort of scrutiny is entirely fair.  Let’s take #3.  If you look at their Figure 1, as Nagler reproduces, it’s obvious that tweet share alone gives much less than that much accuracy. I’ve reproduced it again and added a few annotations:

Their original figure is nice and clear.  “Tweet share” is: out of the name mentions of the two candidates in the race, the percentage that are of the Republican candidate. “Vote margin” is: how many more votes the Republican candidate got.  One dot per race.  Thus, if you say “predict the winner to be whoever got more tweet mentions,” then the number of correct predictions would be the number of dots in the shaded yellow areas, and the accuracy rate is that number divided by the total number of dots.  This looks like much less than 93% accuracy.  [1]
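The “count the dots in the yellow areas” rule is simple enough to write down. Here is a rough Python sketch; the function name and the six example races are entirely made up (they are not the paper’s data), and skipping exact ties is my own assumption.

```python
def tweet_share_accuracy(tweet_shares, vote_margins):
    """Accuracy of predicting the winner as whoever has more tweet mentions.

    tweet_shares: Republican percentage of name mentions, one value per race.
    vote_margins: Republican votes minus Democratic votes, one value per race.
    """
    correct = 0
    counted = 0
    for share, margin in zip(tweet_shares, vote_margins):
        if share == 50.0 or margin == 0:
            continue  # assumption: ignore exact ties rather than guess
        counted += 1
        if (share > 50.0) == (margin > 0):
            correct += 1  # this dot falls in a shaded quadrant
    return correct / counted

# Hypothetical toy races, for illustration only.
toy_shares = [80.0, 30.0, 55.0, 45.0, 90.0, 20.0]
toy_margins = [10000, -5000, -2000, 3000, 40000, -15000]
toy_accuracy = tweet_share_accuracy(toy_shares, toy_margins)
```

Here four of the six toy races land in the “correct” quadrants, so `toy_accuracy` is 2/3.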

It’s also been pointed out that incumbency alone predicts most House races; are tweets really adding anything here?  The main contribution of the paper is to test tweets alongside many controlling variables, including incumbency status.  The most convincing analysis the authors could have done would be to add an ablation test: use the model with the tweet share variable, and a model without it, and see how different the accuracies are.  This isn’t in the paper.  However, we can look at the regression coefficients to get an idea of relative variable importance, and the authors do a nice job reporting this.  I took their coefficient numbers from their “Table 1” in the paper, and plotted them, below:

The effect sizes and their standard errors are on the right.  Being the incumbent is worth, on average, 49,000 votes, and it is much more important than all the other variables.  One additional percentage point of tweet share is worth 155 votes.  [2]  The predictive effect of tweet share is significant, but small.  In the paper they point out that a standard deviation worth of tweet share margin comes out to around 5000 votes — so roughly speaking, tweet shares are 10% as important as incumbency? [3]  In the op-ed Rojas calls this a “strong correlation”; another co-author Johan Bollen called it a “strong relation.” I guess it’s a matter of opinion whether you call Figure 1 a “strong” correlation.
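The back-of-envelope comparison above, and the standardized version in footnote [3], can be reproduced in a few lines of Python. The inputs are just the rounded numbers quoted in this post (49,000, 155, “around 5000”, and 165 of 406), so the outputs are rough; the variable names are mine.

```python
import math

incumbency_votes = 49000.0       # effect of being the incumbent
tweet_votes_per_point = 155.0    # effect of one percentage point of tweet share
tweet_votes_per_sd = 5000.0      # "around 5000" votes per standard deviation

# Raw comparison: one SD of tweet share versus the full incumbency effect.
raw_ratio = tweet_votes_per_sd / incumbency_votes

# Implied SD of tweet share in percentage points, given the rounded inputs.
implied_sd_points = tweet_votes_per_sd / tweet_votes_per_point

# Standardized comparison: incumbency is a 0-1 indicator with
# 165 Republican incumbents out of 406 races (footnote [3]).
p = 165.0 / 406.0
incumbency_sd = math.sqrt(p * (1.0 - p))
standardized_ratio = tweet_votes_per_sd / (incumbency_votes * incumbency_sd)
```

This gives roughly 0.10 for the raw ratio and roughly 0.21 for the standardized one, matching the “10%” and “20%” figures in the text.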

On the other hand, tweet share is telling something that those greyed-out, non-significant demographic variables aren’t, so something interesting might be happening.  The paper also has some analysis of the outliers where the model fails.  Despite being clearly oversold, this is hardly the worst study of Twitter and elections; I learned something from reading it.

As always, I recommend Daniel Gayo-Avello’s 2012 review of papers on Twitter and election prediction:

… (Update: there’s also a newer review) and also see Metaxas and Mustafaraj (2012) for a broader and higher level overview of social media and elections. (Gayo-Avello also wrote a lengthy but sensible post on this paper.)

Next, point #2 — this “prediction” analysis shares a sadly often-repeated flaw, that the so-called “predictions” are evaluated on the training data (in ML-speak), i.e. they’re in-sample predictions (in socialscience-speak).  This is cheating: it’s awfully easy to predict what you’ve already seen!  XKCD has a great explanation of election model overfitting.  As we should know by now, the right thing to do is report accuracy on an out-of-sample, held-out test set; and the best test is to make forecasts about the future and wait to see if they turn out true.
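To see why in-sample accuracy can mean so little, here is a deliberately silly Python sketch. The “races” are synthetic: one random feature each and a coin-flip outcome, so there is nothing real to learn. The “model” is a nearest-neighbor lookup that memorizes its training data. None of this resembles the paper’s actual model; all names and numbers are hypothetical.

```python
import random

def nearest_neighbor_predict(train, x):
    """Return the outcome of the training race whose feature is closest to x."""
    _, outcome = min(train, key=lambda pair: abs(pair[0] - x))
    return outcome

def accuracy(train, races):
    """Fraction of races whose outcome the memorizing model gets right."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in races)
    return hits / len(races)

rng = random.Random(0)
# Synthetic data: (feature, outcome) pairs with outcomes that are pure noise.
races = [(rng.random(), rng.random() < 0.5) for _ in range(400)]
train, held_out = races[:200], races[200:]

in_sample_accuracy = accuracy(train, train)      # evaluated on what it memorized
held_out_accuracy = accuracy(train, held_out)    # evaluated on unseen races
```

The in-sample accuracy is perfect because every training race is its own nearest neighbor, while the held-out accuracy should hover around chance, since the outcomes are coin flips.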

It’s scientifically irresponsible to take the in-sample predictions and say “we predicted N number of races correctly” in the popular press.  It sounds like you mean predicting on new data.  Subsequent press articles that Rojas links to use verbs like “foretell” and “predict elections” — it’s pretty clear what people actually care about, and how they’re going to interpret a researcher using the term “prediction.”  In-sample predictions are a pretty technical concept and I think it’s misleading to call them “predictions.” [4]

Finally, somewhere in this whole kerfuffle hopefully there’s a lesson about cool social science and press coverage.  I feel a little bad for the coauthors given how many hostile messages I’ve seen about their paper on Twitter and various blogs; presumably this motivates what Rojas says at the end of their errata/rebuttal:

The original paper is a non-peer reviewed draft. It is in the process of being corrected, updated, and revised for publication. Many of these criticisms have already been incorporated into the current draft of the paper, which will be published within the next few months.

That sounds great and I look forward to seeing the final and improved version of the paper.  But, I feel like in the area of Twitter research, you have to be really cautious about your claims; they will get overblown by readers and the media otherwise.  Here, the actual paper is reasonable if limited; the problem is they wrote an op-ed in a major newspaper with incredibly expansive and misleading claims about this preliminary research!  This is going to bring out some justifiable criticism from the scientific community, I’m afraid.

[1] Also weird: many of the races have a 100% tweet share to one candidate.  Are the counts super low, like 3-vs-0? Does it need smoothing or priors? Are these from astroturfing or spamming efforts?  Do they create burstiness/overdispersion?  Name mention frequency is an interesting but quite odd sort of variable that needs more analysis in the future.

[2] These aren’t literal vote counts, but number of votes normalized by district size; I think it might be interpretable as, expected number of votes in an average-sized city.  Some blog posts have complained they don’t model vote share as a percentage, but I think their normalization preprocessing actually kind of handles that, albeit in a confusing/non-transparent way.

[3] I guess we could compare the variables’ standardized coefficients.  Incumbency as a 0-1 indicator, for 165 Republican incumbents out of 406 total in their dataset, is stdev ~ 0.5; so I guess that’s more like, a standardized unit of tweet share is worth 20% of standardized impact of incumbency?  I’m not really sure what’s the right way to compare here…  I still think difference in held-out accuracy on an ablation test is the best way to tell what’s going on with one variable, if you really care about it (which is the case here).

[4] I wish we had a different word for “in-sample predictions,” so we can stop calling them “predictions” to make everything clearer.  They’re still an important technical concept since they’re very important to the math and intuitions of how these models are defined.  I guess you could say “yhat” or “in-sample response variable point estimate”?  Um, need something better… Update: Duh, how about “fitted value” or “in-sample fits” or “model matched the outcome P% of the time”… (h/t Cosma)

[5] Numbers and graphics stuff I did are here.


Confusion matrix diagrams

I wrote a little note and diagrams on confusion matrix metrics: Precision, Recall, F, Sensitivity, Specificity, ROC, AUC, PR Curves, etc.


also, graffle source.
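As a quick companion to the diagrams, here is a minimal Python sketch of the point metrics computed from the four confusion matrix counts. The function name and the example counts are invented for illustration; returning NaN for an empty denominator is one possible convention, not the only one.

```python
def _ratio(numerator, denominator):
    # Undefined ratios (empty denominator) come back as NaN.
    return numerator / denominator if denominator else float("nan")

def confusion_metrics(tp, fp, fn, tn):
    """Point metrics from true/false positive and negative counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)        # also called sensitivity
    specificity = _ratio(tn, tn + fp)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Hypothetical counts, for illustration only.
example_metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, precision is 0.8, recall is 2/3, specificity is 0.75, and F1 is 8/11.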


Movie summary corpus and learning character personas

Here is one of our exciting just-finished ACL papers.  David and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc — that are used in movies.

To do this we collected a brand new dataset: 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre.  We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them.  Did you see that NYT article on quantitative analysis of film scripts?  This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money?  We have released the data here.

Our focus, though, is on narrative analysis.  We investigate character personas: familiar character types that are repeated over and over in stories, like “Hero” or “Villain”; maybe grand mythical archetypes like “Trickster” or “Wise Old Man”; or much more specific ones, like “Sassy Best Friend” or “Obstructionist Bureaucrat” or “Guy in Red Shirt Who Always Gets Killed”.  They are defined in part by what they do and who they are, which we can glean from their actions and descriptions in plot summaries.

Our model clusters movie characters, learning posteriors like this:

[Figure: learned persona clusters]


Each box is one automatically learned persona cluster, along with actions and attribute words that pertain to it.  For example, characters like Dracula and The Joker are always “hatching” things (hatching plans, presumably).

One of our models takes the metadata features, like movie genre and gender and age of an actor, and associates them with different personas.  For example, we learn the types of characters in romantic comedies versus action movies.  Here are a few examples of my favorite learned personas:

[Figure: examples of learned personas]

One of the best things I learned about during this project was the website TVTropes (which we use to compare our model against).

We’ll be at ACL this summer to present the paper.  We’ve posted it online too:


What inputs do Monte Carlo algorithms need?

Monte Carlo sampling algorithms (either MCMC or not) aim to obtain samples from a distribution.  They can be organized by what inputs or prior knowledge about the distribution they require.  This ranges from a low amount of knowledge, as in slice sampling (just give it an unnormalized density function), to a high amount, as in Gibbs sampling (you have to decompose your distribution into individual conditionals).

Typical inputs include \(f(x)\), an unnormalized density or probability function for the target distribution, which returns a real number for a variable value.  \(g()\) and \(g(x)\) represent sample generation procedures (that output a variable value); some generators require an input, some do not.

Here are the required inputs for a few algorithms.  (For an overview, see e.g. Ch 29 of MacKay.)  There are many more out there of course.  I’m leaving off tuning parameters.

Black-box samplers: Slice sampling, Affine-invariant ensemble
- unnorm density \(f(x)\)

Metropolis (symmetric proposal)
- unnorm density/pmf \(f(x)\)
- proposal generator \(g(x)\)

Hastings (asymmetric proposal)
- unnorm density/pmf \(f(x)\)
- proposal generator \(g(x)\)
- proposal unnorm density/pmf \(q(x'; x)\)
… [For the proposal generator at state \(x\), probability it generates \(x'\)]

Importance sampling, rejection sampling
- unnorm density/pmf \(f(x)\)
- proposal generator \(g()\)
- proposal unnorm density/pmf \(q(x)\)

Independent Metropolis-Hastings: the proposal is always the same, but you still have to worry about asymmetric corrections
- unnorm density/pmf \(f(x)\)
- proposal generator \(g()\)
- proposal unnorm density/pmf \(q(x'; x)\)

Hamiltonian Monte Carlo
- unnorm density \(f(x)\)
- unnorm density gradient \(gf(x)\)

Gibbs Sampling
- local conditional generators \(g_i(x_{-i})\)
… [which have to give samples from \( p(x_i | x_{-i}) \)]

Note importance/rejection sampling are stateless, but the MCMC algorithms are stateful.
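To show how these input lists translate into function signatures, here is a rough Python sketch of two of the entries: symmetric-proposal Metropolis (needs \(f(x)\) and \(g(x)\)) and self-normalized importance sampling (needs \(f(x)\), \(g()\), and \(q(x)\)). The function names, the standard-normal target, and the particular proposals are my own choices for illustration; passing the random number generator into the generators is just a convenience.

```python
import math
import random

def metropolis(f, g, x0, n_samples, rng):
    """Stateful: needs unnormalized density f(x) and symmetric proposal g(x)."""
    x, fx = x0, f(x0)
    samples = []
    for _ in range(n_samples):
        x_new = g(x, rng)
        f_new = f(x_new)
        if rng.random() < f_new / fx:  # accept with probability min(1, ratio)
            x, fx = x_new, f_new
        samples.append(x)
    return samples

def importance_expectation(f, g, q, h, n_samples, rng):
    """Stateless: needs f(x), a proposal generator g(), and its density q(x)."""
    weighted_sum = weight_total = 0.0
    for _ in range(n_samples):
        x = g(rng)
        w = f(x) / q(x)                # unnormalized importance weight
        weighted_sum += w * h(x)
        weight_total += w
    return weighted_sum / weight_total

def f(x):
    return math.exp(-0.5 * x * x)      # unnormalized standard normal target

def g_step(x, rng):
    return x + rng.uniform(-1.0, 1.0)  # symmetric random-walk proposal

def g_wide(rng):
    return rng.gauss(0.0, 2.0)         # independent draw from N(0, 2^2)

def q_wide(x):
    return math.exp(-x * x / 8.0)      # unnormalized density of N(0, 2^2)

rng = random.Random(0)
chain = metropolis(f, g_step, 0.0, 20000, rng)
chain_mean = sum(chain) / len(chain)
chain_var = sum((x - chain_mean) ** 2 for x in chain) / len(chain)
second_moment = importance_expectation(f, g_wide, q_wide, lambda x: x * x, 20000, rng)
```

Both estimators only ever see the unnormalized target through `f`; the Metropolis chain should come out with mean near 0 and variance near 1, and the importance sampling estimate of the second moment should be near 1.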

I’m distinguishing a sampling procedure \(g\) from a density evaluation function \(f\) because having the latter doesn’t necessarily give you the former.  (For the one-dimensional case, having an inverse CDF indeed gives you a sampler, but the multidimensional case gets harder, which is part of why all these techniques were invented in the first place!)  Shay points out their relationship is analogous to 3-SAT: it’s easy to evaluate a full variable setting, but hard to generate a satisfying one.  (Or specifically, think about a 3-SAT PMF \(p(x) = 1\{\text{\(x\) is boolean satisfiable}\}\) where only one \(x\) has non-zero probability; PMF evaluation is easy but the best known sampler is exponential time.)
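The one-dimensional inverse-CDF case mentioned above is easy to sketch in Python. I’m using the exponential distribution as the example because its CDF, \(1 - e^{-\lambda x}\), inverts in closed form; the function name and the rate value are arbitrary choices for illustration.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from Exponential(rate) by pushing a uniform through the inverse CDF."""
    u = rng.random()                   # uniform on [0, 1)
    return -math.log(1.0 - u) / rate   # solves 1 - exp(-rate * x) = u for x

rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(20000)]
draws_mean = sum(draws) / len(draws)   # should be close to 1 / rate = 0.5
```

In more than one dimension there is generally no single CDF to invert like this, which is where the samplers listed above come in.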

And of course there’s a related organization of optimization algorithms.  Here’s a rough look at a few unconstrained optimizers:

Black-box optimizers: Grid search, Nelder-Mead, evolutionary, …
- objective \(f(x)\)

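Gradient-based optimizers: gradient descent, BFGS, conjugate gradient, …
- objective \(f(x)\)
- gradient \(gf(x)\)

Newton’s method
- objective \(f(x)\)
- gradient \(gf(x)\)
- hessian \(hf(x)\)

Simulated annealing
- objective \(f(x)\)
- proposal generator \(g(x)\)

Stochastic gradient descent (SGD)
- one-example gradient \(gf_i(x)\)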

I think it’s neat that Gibbs Sampling and SGD don’t always require you to implement a likelihood/objective function.  It’s nice to do to ensure you’re actually optimizing or exploring the posterior, but strictly speaking the algorithms don’t require it.
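Here is a small Python sketch of that last point: an SGD loop that only ever calls a one-example gradient, and a Gibbs sampler that only ever calls conditional generators. Neither function evaluates an objective or a joint density. The toy problems (estimating a mean by least squares, and a standard bivariate normal with correlation rho) and the 1/t step size are my own choices for illustration.

```python
import math
import random

def sgd_mean(data, x0=0.0):
    """One pass of SGD on sum_i 0.5 * (x - y_i)^2, using only per-example gradients."""
    x = x0
    for t, y in enumerate(data, start=1):
        grad_i = x - y            # gradient of 0.5 * (x - y)^2 for this example
        x -= (1.0 / t) * grad_i   # with a 1/t step this tracks the running mean
    return x

def gibbs_bivariate_normal(rho, n_samples, rng):
    """Gibbs sampling using only the two conditional generators."""
    cond_sd = math.sqrt(1.0 - rho * rho)
    x1 = x2 = 0.0
    samples = []
    for _ in range(n_samples):
        x1 = rng.gauss(rho * x2, cond_sd)  # draw from p(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_sd)  # draw from p(x2 | x1)
        samples.append((x1, x2))
    return samples

sgd_estimate = sgd_mean([2.0, 4.0, 9.0])   # the mean of the data is 5

rng = random.Random(0)
pairs = gibbs_bivariate_normal(0.8, 20000, rng)
# Both coordinates have unit variance, so E[x1 * x2] estimates the correlation.
cross_moment = sum(a * b for a, b in pairs) / len(pairs)
```

The catch noted above still applies: without an objective or likelihood to monitor, you have to trust (or separately test) that the gradients and conditionals are correct.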


Rise and fall of Dirichlet process clusters

Here’s Gibbs sampling for a Dirichlet process 1-d mixture of Gaussians, on 1000 data points that look like this.

I gave it a fixed variance and a concentration parameter, and over MCMC iterations it looks like this.

The top is the number of points in a cluster. The bottom shows the cluster means. During MCMC, clusters are created and destroyed. Every cluster has a unique color; when a cluster dies, its color is never reused.

I’m showing clusters every 100 iterations. If there is a single point, that cluster existed at that iteration but not at the snapshot before or after. If there is a line, the cluster lived for at least 100 iterations. Some clusters live long, some live short, but all eventually die.

Usually the model likes to think there are about two clusters, occupying positions at the two modes in the data distribution. It also entertains the existence of several much more minor ones. Usually these are short-lived clusters that die away. But sometimes, they rise up and kick out one of the dominant clusters, and take over its space. This is evocative at least to me: for example, around iteration 2500 is a crisis of the two-mode regime, the fall of green and the rise of blue. (Maybe there are analogies to ideal points and coalitions or something, but call that future work…)

In fact the real story is a little more chaotic. Here’s the same run, but at finer resolution (every 10 iterations).

Around iteration 2500 you can see blue suddenly appear in green’s territory, where it’s bouncing around trying to get data points to convert to its cause. The clusters struggle and blue eventually wins out. Not all challenges are successful, though; e.g. at 1500 or 3600.

Ultimately, the dynamism is fake; looking at the broad sweep of history, it’s all part of a globally unchanging, static steady state of MCMC. The name of the cluster at mean -2 might change from time to time, but really, it occupies a position in the system analogous to the old regime.

Actually not just “analogous” but mathematically the same, as implied by CRP exchangeability; the cluster IDs are just an auxiliary variable for the DP. And the point of MCMC is to kill the dynamism by averaging over it for useful inference. This nicely illustrates you can’t directly use the actual clusters for averaging for an MCMC mixture model, since new clusters might slide into the place of old ones. (You might average over smaller spans, maybe; or perhaps look at statistics that are invariant to changing clusters, like the probability two datapoints belong to the same cluster. Or only use a single sample, which is at least guaranteed to be consistent?)
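One of the label-invariant statistics mentioned above, the probability that two datapoints share a cluster, is straightforward to estimate from saved samples. Here is a minimal Python sketch; it assumes each MCMC sample is stored as a list of cluster IDs, one per datapoint, and the function name and toy samples are hypothetical.

```python
def coclustering_probability(assignment_samples):
    """Fraction of samples in which each pair of datapoints shares a cluster."""
    n = len(assignment_samples[0])
    counts = [[0] * n for _ in range(n)]
    for z in assignment_samples:
        for i in range(n):
            for j in range(n):
                if z[i] == z[j]:   # only equality of IDs matters, not their values
                    counts[i][j] += 1
    total = len(assignment_samples)
    return [[c / total for c in row] for row in counts]

# Toy samples over four datapoints. The second sample is the first one with
# its clusters renamed, as happens when a new cluster takes over an old one's place.
toy_samples = [
    [0, 0, 1, 1],
    [5, 5, 7, 7],
    [2, 2, 2, 3],
]
coclustering = coclustering_probability(toy_samples)
```

Because only equality of IDs is checked, renaming clusters between iterations leaves the matrix unchanged, which is what makes it safe to average across the whole chain.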

Technical details: this is actually a maybe-lame uncollapsed Gibbs sampler; I think it’s Algorithm 2 from Neal 2000 or something close. Everyone asks about the plots but they are easy to make; given a logfile with tuples like (iternum, cluster ID, n, mu), first melt() it, then ggplot with something like qplot(iternum, value, colour=factor(clusterID), data=x, geom=c('line','point')) + facet_grid(~variable, scales='free').