I haven’t done a paper review on this blog for a while, so here we go.
Coreference resolution is an interesting NLP problem. (Examples.) It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1]. I haven’t read the whole literature, but I’ve always been puzzled by the crop of papers on it I’ve seen in the last year or two. There’s a big focus on fancy graph/probabilistic/constrained-optimization algorithms, but these papers often gloss over the linguistic features — the core information they actually make their decisions with [2]. I never understood why the latter isn’t the most important issue. Therefore, it was a joy to read this paper.
They describe a simple, essentially non-statistical system that outperforms previous unsupervised systems, and compares favorably to supervised ones, by using smart features. It has two-ish modular components:
- Syntactic constraints: entity type agreement, appositives, and a few other things that get at syntactic salience.
- Semantic filter: non-pronoun mentions must be corpus-pattern-compatible with their antecedents; described below.
For each mention, these constraints filter the preceding mentions down to a small set of possible antecedents. If there’s more than one, the system picks the closest. Entity clusters are then formed in the simplest (dumbest) way possible, by taking the transitive closure of these pairwise mention-antecedent decisions.
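Just to make the shape of that procedure concrete, here’s a minimal sketch in Python. Everything in it is my own invention for illustration: the mention list and the compatible() predicate (standing in for the syntactic constraints plus the semantic filter) are assumptions, not the paper’s actual code.

```python
def resolve_and_cluster(mentions, compatible):
    """Illustrative sketch: pick the closest compatible antecedent for each
    mention, then cluster by transitive closure (union-find).
    `mentions` is an ordered list; `compatible(m, a)` stands in for the
    paper's syntactic constraints plus the semantic filter."""
    parent = list(range(len(mentions)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, m in enumerate(mentions):
        # candidate antecedents = earlier mentions passing all constraints
        candidates = [j for j in range(i) if compatible(m, mentions[j])]
        if candidates:
            union(i, max(candidates))  # closest = most recent candidate

    # transitive closure of the pairwise links gives the entity clusters
    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(mentions[i])
    return list(clusters.values())
```

With a trivial compatible() that, say, only checks head-word identity, this already gives you exact-string-match clusters; all the interesting work is in what compatible() lets through.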
The lexical semantic filter is interesting. They found that syntactic cues have recall issues for non-pronoun references, e.g. the company referring to AOL. You need to know that these two words tend to be compatible with each other. They create a very specific lexical resource — word pairs that are compatible for coreference — by finding coreferent expressions via bootstrapping over a large unlabelled text corpus (Wikipedia abstracts and newswire articles; though they say only 25k Wikipedia abstracts? There are >10 million total; how were they selected?).
Using a parsed corpus, they seeded with appositive and predicate-nominative patterns: I’m guessing something like “Al Gore, vice-president…” and “Al Gore was vice-president”. Then they extracted connecting paths for those pairs; e.g., the text “Al Gore served as the vice-president” yields the path-pattern “X served as Y”. Then there’s one more iteration to extract more word pairs — pairs that appear a large number of times.
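As I read it, the bootstrapping loop is roughly the following. This is a sketch under my own assumptions about the interfaces: seed_pairs, pair_candidates(), and extract_path() are hypothetical stand-ins for the appositive/predicate-nominative seed matching and the parse-path extraction, not anything the authors specify.

```python
from collections import Counter

def bootstrap_pairs(sentences, seed_pairs, pair_candidates, extract_path,
                    min_count=10):
    """Sketch of the path-bootstrapping idea (my reading, invented interfaces):
      - seed_pairs: head-word pairs from appositive / predicate-nominative
        seeds, e.g. ("Gore", "vice-president")
      - pair_candidates(sent): yields head-word pairs co-occurring in one
        parsed sentence (so paths are never longer than a sentence)
      - extract_path(sent, x, y): the syntactic path connecting x and y,
        e.g. "X served as Y"
    """
    pairs = set(seed_pairs)

    # Step 1: find connecting paths that reliably link the known pairs.
    path_counts = Counter()
    for sent in sentences:
        for x, y in pair_candidates(sent):
            if (x, y) in pairs:
                path_counts[extract_path(sent, x, y)] += 1
    good_paths = {p for p, c in path_counts.items() if c >= min_count}

    # Step 2: one more pass to harvest new pairs that occur many times
    # with those reliable paths.
    pair_counts = Counter()
    for sent in sentences:
        for x, y in pair_candidates(sent):
            if extract_path(sent, x, y) in good_paths:
                pair_counts[(x, y)] += 1
    pairs.update(p for p, c in pair_counts.items() if c >= min_count)

    return pairs
```

The min_count threshold is my guess at how “appear a large number of times” gets operationalized; the paper presumably has its own cutoffs.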
They cite Hearst 1992, Snow et al. 2005, and Phillips and Riloff 2007. But note that they don’t describe their technique as trying to be a general hypernym finder; rather, its only goal is to find pairs of words (noun phrase heads) that might corefer in text that the final system encounters. In fact, they describe the bootstrapping system as merely trying to find instances of noun phrase pairs that exhibit coreference. I wonder if it’s fair to think of the bootstrapping system as a miniature coreference detector itself, tuned for high precision by only considering very short syntactic paths (no longer than one sentence). I also wonder if there are instances of non-pronoun coreference that aren’t hyponym-hypernym pairs; if not, my analysis here is silly.
Coverage seems to be good, or at least useful: many non-pronoun coreference recall errors are solved by using this data. (I can’t tell whether it’s two-thirds of all recall errors after the syntactic system, or two-thirds of the errors in Table 1.) And they claim word pairs are usually correct, with a few interesting types of errors (that look quite solvable). As for coverage, I wonder if they tried WordNet hypernym/synonym information, and whether it was useful. My bet is that WordNet’s coverage here would be significantly poorer than a bootstrapped system.
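If someone did want to run that WordNet comparison, a crude version of the check is only a few lines with NLTK. This is my own quick sketch (nothing from the paper): call two head words compatible if they share a synset or one lies in the other’s hypernym closure.

```python
# requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wordnet_compatible(head1, head2):
    """Crude stand-in for the bootstrapped pair list: are the two heads
    synonyms, or is one a hypernym of the other, according to WordNet?"""
    syns1 = wn.synsets(head1, pos=wn.NOUN)
    syns2 = wn.synsets(head2, pos=wn.NOUN)
    for s1 in syns1:
        for s2 in syns2:
            if s1 == s2:
                return True  # share a synset => synonyms
            # hypernym closure in either direction
            if s1 in s2.closure(lambda s: s.hypernyms()):
                return True
            if s2 in s1.closure(lambda s: s.hypernyms()):
                return True
    return False

# Usage: wordnet_compatible("company", "firm"). A proper name like "AOL"
# likely has no noun synset at all, so it would always fail this check --
# which is exactly the coverage problem I'd expect from WordNet here.
```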
This paper was fun to read because it’s written very differently from the usual NLP paper. Instead of presenting a slew of modelling, it cuts to the chase, using very simple algorithms and clear argumentation to illustrate why a particular set of approaches is effective. There’s lots of error analysis motivating the design decisions, as well as suggesting conclusions for future work. In particular, they think discourse, pragmatics, and salience aren’t the most important issues; instead, better syntactic and semantic modelling would give the biggest gains.
There’s also something very nice about reading a paper that doesn’t have a single equation yet still makes a point, and is easy to implement yourself to boot. I think the machine learning approach to NLP research can really hurt insight. Every paper is obsessed with held-out predictive accuracy. If you’re lucky, a paper will list out all the features it used, and then (only sometimes!) make a cursory attempt at finding out which features were important. A simple hand-coded system lends itself to describing and motivating every feature on its own — better narrative explanations and insight. Which type of research is more useful as science?
Final note: it’s not totally fair to consider this a non-statistical system, because its syntactic and semantic subsystems rest on complicated statistical components that required boatloads of labelled training data — the Stanford parser, Stanford NER, and the Charniak parser. (I wonder how sensitive performance is to these components. Could rule-based parsing and NER work as well?) Further, as they point out, more sophisticated structured approaches to the problem of forming entity partitions from these features should improve performance. (But how much?)
[1] As opposed to, say, the rarified activity of treebanking, a 318-page-complex linguistic behavior that maybe several dozen people on Earth are capable of executing. There’s a whole other rant here, on the topic of the behavioral reality of various linguistic constructs. (Treebank parsers were extensively used in this paper, so maybe I shouldn’t hate too much…)
[2] There are certainly exceptions to this, like Bengtson and Roth, or maybe Denis and Baldridge (both EMNLP-2008). I should emphasize that my impression of the literature comes from a small subsample.