I haven’t done a paper review on this blog for a while, so here we go.
Coreference resolution is an interesting NLP problem: figuring out which expressions in a text refer to the same entity, e.g. that “the company” and “AOL” can pick out the same thing. It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1]. I haven’t read the whole literature, but I’ve always been puzzled by the crop of papers on it I’ve seen in the last year or two. There’s a big focus on fancy graph/probabilistic/constrained-optimization algorithms, but these papers often gloss over the linguistic features — the core information they actually make their decisions with [2]. I never understood why the features aren’t treated as the most important issue. Therefore, it was a joy to read
- Aria Haghighi and Dan Klein, EMNLP-2009. “Simple Coreference Resolution with Rich Syntactic and Semantic Features.”
They describe a simple, essentially non-statistical system that outperforms previous unsupervised systems, and compares favorably to supervised ones, by using smart features. It has two-ish modular components:
- Syntactic constraints: entity type agreement, appositives, and a few other things that get at syntactic salience.
- Semantic filter: non-pronoun mentions must be corpus-pattern-compatible with their antecedents; described below.
For each mention, these constraints filter the preceding mentions down to a few possible antecedents. If more than one survives, the system picks the closest. Entity clusters are then formed in the simplest (dumbest) way possible: by taking the transitive closure of these pairwise mention-antecedent decisions.
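Here’s a minimal sketch of that decision procedure as I read it (the mention representation, the two compatibility predicates, and all the names are my own stand-ins, not anything from the paper):

```python
class UnionFind:
    """Transitive closure of pairwise links via union-find."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def resolve(mentions, syntactically_compatible, semantically_compatible):
    """mentions: hashable mention objects in document order.
    The two predicates are assumed helpers standing in for the syntactic
    constraints and the semantic (word-pair) filter."""
    clusters = UnionFind()
    for i, mention in enumerate(mentions):
        candidates = [a for a in mentions[:i]
                      if syntactically_compatible(mention, a)
                      and semantically_compatible(mention, a)]
        if candidates:
            # link to the closest surviving antecedent
            clusters.union(mention, candidates[-1])
    # Group mentions into entities by their cluster root.
    entities = {}
    for mention in mentions:
        entities.setdefault(clusters.find(mention), []).append(mention)
    return list(entities.values())
```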
The lexical semantic filter is interesting. They found that syntactic cues have recall issues for non-pronoun references, e.g. the company referring to AOL. You need to know that these two words tend to be compatible with each other. They build a very specific lexical resource — word pairs that are compatible as coreferents — by finding coreferent expressions via bootstrapping in a large unlabelled text corpus (Wikipedia abstracts and newswire articles; but they say only 25k Wiki abstracts? There are >10 million total; how were they selected?).
Using a parsed corpus, they seeded with appositive and predicate-nominative patterns: I’m guessing, something like “Al Gore, vice-president…” and “Al Gore was vice-president”. Then they extracted the connecting paths between those pairs; e.g., the text “Al Gore served as the vice-president” yields the path-pattern “X served as Y”. Then there’s one more iteration over the corpus to extract more word pairs — keeping pairs that appear a large number of times.
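A rough sketch of that bootstrapping loop, under my own reading (the parsed-corpus interface, the seed extractors, the path extractor, and the count threshold are all hypothetical placeholders, not the authors’ code):

```python
from collections import Counter

def bootstrap_pairs(parsed_sentences, seed_extractors, connecting_paths, min_count=10):
    """parsed_sentences: a list of parsed sentences (iterated several times).
    seed_extractors:  functions pulling (X, Y) head-word pairs out of appositive
                      and predicate-nominative constructions.
    connecting_paths: function yielding (X, Y, path) triples for head pairs
                      connected by a short intra-sentence syntactic path.
    min_count:        arbitrary stand-in for 'appears a large number of times'."""
    # Step 1: seed pairs from the high-precision patterns,
    # e.g. "Al Gore, vice-president ..." or "Al Gore was vice-president".
    seeds = set()
    for sent in parsed_sentences:
        for extract in seed_extractors:
            seeds.update(extract(sent))

    # Step 2: collect the connecting paths observed between seed pairs,
    # e.g. "Al Gore served as the vice-president" -> "X served as Y".
    path_counts = Counter()
    for sent in parsed_sentences:
        for x, y, path in connecting_paths(sent):
            if (x, y) in seeds:
                path_counts[path] += 1
    good_paths = {p for p, c in path_counts.items() if c >= min_count}

    # Step 3: one more pass, keeping new word pairs that the harvested
    # paths match many times.
    pair_counts = Counter()
    for sent in parsed_sentences:
        for x, y, path in connecting_paths(sent):
            if path in good_paths:
                pair_counts[(x, y)] += 1
    return {pair for pair, count in pair_counts.items() if count >= min_count}
```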
They cite Hearst 1992, Snow et al. 2005, and Phillips and Riloff 2007. But note, they don’t describe their technique as trying to be a general hypernym finder; rather, its only goal is to find pairs of words (noun phrase heads) that might corefer in text that the final system encounters. In fact, they describe the bootstrapping system as merely trying to find instances of noun phrase pairs that exhibit coreference. I wonder if it’s fair to think of the bootstrapping system as a miniature coreference detector itself, but tuned for high precision by only considering very short syntactic paths (no longer than one sentence). I also wonder if there are instances of non-pronoun coreference that aren’t hyponym-hypernym pairs; if not, my analysis here is silly.
Coverage seems to be good, or at least useful: many non-pronoun coreference recall errors are solved by using this data. (I can’t tell whether it’s two-thirds of all recall errors after the syntactic system, or two-thirds of the errors in Table 1.) And they claim word pairs are usually correct, with a few interesting types of errors (that look quite solvable). As for coverage, I wonder if they tried WordNet hypernym/synonym information, and whether it was useful. My bet is that WordNet’s coverage here would be significantly poorer than a bootstrapped system.
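(If I were checking the WordNet question myself, NLTK makes it a quick probe; this is my own sketch, not anything from the paper:)

```python
from nltk.corpus import wordnet as wn

def wordnet_compatible(head1, head2):
    """True if some noun synset of head1 is a synonym or a transitive
    hypernym/hyponym of some noun synset of head2."""
    for s1 in wn.synsets(head1, pos=wn.NOUN):
        up1 = set(s1.closure(lambda s: s.hypernyms())) | {s1}
        for s2 in wn.synsets(head2, pos=wn.NOUN):
            up2 = set(s2.closure(lambda s: s.hypernyms())) | {s2}
            if s1 == s2 or s2 in up1 or s1 in up2:
                return True
    return False

# e.g. wordnet_compatible("company", "AOL") -- presumably False, since "AOL"
# isn't in WordNet, which is exactly the coverage worry.
```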
This paper was fun to read because it’s written very differently from the usual NLP paper. Instead of presenting a slew of modelling, it cuts to the chase, using very simple algorithms and clear argumentation to illustrate why a particular set of approaches is effective. There’s lots of error analysis motivating the design decisions, as well as suggesting conclusions for future work. In particular, they think discourse, pragmatics, and salience aren’t the most important issues; instead, better syntactic and semantic modelling would give the biggest gains.
There’s also something very nice about reading a paper that doesn’t have a single equation yet makes a point, and is easy to implement yourself to boot. I think the machine learning approach to NLP research can really hurt insight. Every paper is obsessed with held-out predictive accuracy. If you’re lucky, a paper will list out all the features they used, and then (only sometimes!) make a cursory attempt at finding which features were important. A simple hand-coded system lends itself to describing and motivating every feature on its own — better narrative explanations and insight. Which type of research is more useful as science?
Final note: it’s not totally fair to consider this system a non-statistical one, because its syntactic and semantic subsystems rest on complicated statistical components that required boatloads of labelled training data — the Stanford parser, Stanford NER, and the Charniak parser. (I wonder how sensitive performance is to these components. Could rule-based parsing and NER work as well?) Further, as they point out, more complicated structured approaches to the problem of forming entity partitions from features should improve performance. (But how much?)
[1] As opposed to, say, the rarified activity of treebanking, a linguistic behavior so complex its guidelines run to 318 pages, and one that maybe several dozen people on Earth are capable of executing. There’s a whole other rant here, on the topic of the behavioral reality of various linguistic constructs. (Treebank parsers were extensively used in this paper, so maybe I shouldn’t hate too much…)
[2] There are certainly exceptions to this, like Bengtson and Roth, or maybe Denis and Baldridge (both EMNLP-2008). I should emphasize that my impression of the literature comes from a small subsample.
I also have the peeve that I can’t reproduce people’s papers, because they go into great detail about CRFs (which are standard fare) but then gloss over the details of their features. I really liked Jenny Finkel’s bakeoff paper on CRFs for BioCreative, which described their features, and the recent Ratinov and Roth paper on named entity recognition, for the same reason. But even those fall short of a reproducible system description, which I still hold out as one of the hallmarks of science.
Papers that squeeze half a percent improvement on no-longer-held-out data seem useless (one more paper evaluating on section 23 of the Treebank or the ModApte split of Reuters and I’ll scream).
I also like papers that introduce statistical techniques or applications, and those papers are important, too. For instance, I loved Haghighi and Klein’s previous approach to coref, based on the Dirichlet process, because it seems such a natural fit (other than computationally).
Of course there are non-hyponym/hypernym coref cases! It’d be a stretch to call “Ronald Reagan, the U.S. president, …” a hypernym case, because “president” isn’t in any sense more generic than “Ronald Reagan” (and it’s temporally dependent; the relation isn’t permanent). But even eliminating names, you get things like “the president, a former actor”, where there’s clearly no hypernym/hyponym relation.
The earliest work I know of for bootstrapping hypernyms/hyponyms was Ann Copestake’s work on LKB from 1990/1991.
We’ve probably both been thinking about natural tagging examples (your footnote [1]), because those are things you can get done with Mechanical Turk! A deeper problem is that annotation schemes like the Penn Treebank (even just the POS part) are built on theories of underlying linguistic structure which may be nothing like “the truth”, and may not even be useful in real-world tasks. For instance, distributional clustering can be much more effective than POS tags as features in CRFs or parsers, and requires no labeled data. (Though you can also use both in a discriminative model like CRFs.) But even named entities and coref have a large number of corner cases that need to be documented or just punted (e.g. “erstwhile Phineas Phoggs” from MUC-6 or “the various Bob Dylans” from recent film reviews).
Yikes! That system architecture sure looks familiar, with the exception of the semantic filtering bootstrapped from Wikipedia; only it was 1995 and I was calling Michael Collins “boy”. It was the UPenn MUC-6 entry that “unofficially” won coref, done with Jeff Reynar, Michael Collins, Jason Eisner, Adwait Ratnaparkhi, Joseph Rosenzweig, Anoop Sarkar, and Srinivas: “University of Pennsylvania: Description of the University of Pennsylvania System Used for MUC-6”.
The performance improvement is stunning: 87.2% precision / 77.3% recall, over our performance of 72% precision / 63% recall. But we had to find our own mentions. Lemme tell ya, it was uphill both ways to the LINC lab, and we thought a megabyte of memory was a pretty huge thing.
But seriously, when did coreference papers get to report performance without finding their mentions? I know ACE provided that option, but it somehow just seems wrong to have gold-standard mentions that you only have to do coref over, since there is no natural data out there that contains perfect noun phrases known to corefer to something.
Amit Bagga and I did experiments with the MUC-6 data, like scoring the case where all gold-standard mentions were made coreferent: you get 100% recall with around 35% precision, as I recall. Those results led us to consider other scoring metrics, which then led to B-cubed, which additionally involved Alan Biermann.
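For readers who haven’t seen B-cubed: it scores each mention by the overlap between the gold cluster and the response cluster containing it, averaged over mentions. A from-memory sketch (not code from any of the papers mentioned):

```python
def b_cubed(gold_clusters, response_clusters):
    """Each argument is a list of sets of mention ids; both clusterings
    are assumed to cover the same mentions."""
    gold_of = {m: c for c in gold_clusters for m in c}
    resp_of = {m: c for c in response_clusters for m in c}
    mentions = list(gold_of)
    precision = sum(len(gold_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & resp_of[m]) / len(gold_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall

# Lumping every mention into one big response cluster gives recall 1.0 but
# low precision -- the degenerate case described above.
```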
A request: Run the same system with a mention detection pass. Those numbers should be exciting.