Probabilistic interpretation of the B3 coreference resolution metric

Here is an intuitive justification for the B3 evaluation metric often used in coreference resolution, based on whether mention pairs are coreferent. If a mention from the document is chosen at random,

  • B3-Recall is the (expected) proportion of its actual coreferents that the system thinks are coreferent with it.
  • B3-Precision is the (expected) proportion of its system-hypothesized coreferents that are actually coreferent with it.

Does this look correct to people? Details below:

In its basic form, B3 is a clustering evaluation metric: it scores a system-produced clustering of mentions against a gold-standard clustering of the same mentions.

Let \(G\) mean a gold-standard entity and \(S\) mean a system-predicted entity, where an entity is a set of mentions. \(i\) refers to a mention; there are \(n\) mentions in the document. \(G_i\) means the gold entity that contains mention \(i\); and \(S_i\) means the system entity that has \(i\).

The B3 precision and recall for a document are usually defined as (though it doesn’t seem to be exactly this in the original paper…):
\begin{align}
B3Prec &= \frac{1}{n} \sum_i \frac{|G_i \cap S_i|}{|S_i|} \\
B3Rec &= \frac{1}{n} \sum_i \frac{|G_i \cap S_i|}{|G_i|}
\end{align}
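
To make the definition concrete, here is a minimal Python sketch of those two sums. It assumes the gold and system clusterings cover exactly the same mentions; the function name `b_cubed` and the toy mentions `a` through `e` are made up for illustration.

```python
def b_cubed(gold, system):
    """B3 precision and recall for two clusterings of the same mentions.

    gold, system: lists of sets of mention ids (assumed to be partitions
    of one shared, non-empty mention set).
    """
    gold_of = {m: frozenset(e) for e in gold for m in e}    # G_i
    sys_of = {m: frozenset(e) for e in system for m in e}   # S_i
    mentions = list(gold_of)
    n = len(mentions)
    prec = sum(len(gold_of[i] & sys_of[i]) / len(sys_of[i]) for i in mentions) / n
    rec = sum(len(gold_of[i] & sys_of[i]) / len(gold_of[i]) for i in mentions) / n
    return prec, rec


# Toy document with five mentions (hypothetical data).
gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d"}, {"e"}]
prec, rec = b_cubed(gold, system)
print(prec, rec)  # precision is 4/5, recall is 8/15
```

Here the system splits both gold entities and wrongly links `c` with `d`, so both numbers fall below 1.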

Consider B3Prec. Think about it like,

\begin{align}
B3Prec &= E_{ment}\left[ \frac{ |G_i \cap S_i| }{ |S_i| } \right] \\
&= E_{ment}\left[ P(G_j = G_i \mid j \in S_i) \right]
\end{align}

The first step is the expectation under the distribution “pick a mention \(i\) at random from the document”. The second step restates \(|G_i \cap S_i|\) as: out of the system-hypothesized coreferents of \(i\), how many are in the same gold cluster as \(i\)? Thus \(|G_i \cap S_i|/|S_i|\) answers: if you choose a mention \(j\) uniformly at random out of \(S_i\), how often does it have the same gold cluster as \(i\)? (I think that last line might be collapsible via the law of total expectation, but tracking the two nested random variables confuses me.)

Similarly, \(B3Rec = E_{ment}[ P(S_j = S_i \mid j \in G_i) ]\).
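
One way to sanity-check the nested expectation is to enumerate the two-stage draw exactly: pick \(i\) uniformly from the document, then pick \(j\) uniformly from \(i\)'s cluster on one side, and ask whether \(j\) shares \(i\)'s cluster on the other side. The sketch below does this with exact fractions. The helper `pair_agreement` and the toy clusterings are assumptions for illustration. Note that \(S_i\) contains \(i\) itself, so \(j\) may equal \(i\); under the formula, a mention counts as one of its own coreferents.

```python
from fractions import Fraction


def pair_agreement(draw_from, check_in):
    """P(j is in i's `check_in` cluster), where i is uniform over mentions
    and j is uniform over i's `draw_from` cluster (j may equal i)."""
    draw_of = {m: frozenset(e) for e in draw_from for m in e}
    check_of = {m: frozenset(e) for e in check_in for m in e}
    n = len(draw_of)
    total = Fraction(0)
    for i in draw_of:                      # outer draw: mention i
        for j in draw_of[i]:               # inner draw: mention j from i's cluster
            if check_of[j] == check_of[i]:
                total += Fraction(1, n * len(draw_of[i]))
    return total


gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d"}, {"e"}]
b3_prec = pair_agreement(system, gold)   # draw j from S_i, check G_j == G_i
b3_rec = pair_agreement(gold, system)    # draw j from G_i, check S_j == S_i
print(b3_prec, b3_rec)  # 4/5 8/15
```

On this example the enumeration gives exactly the values that the summation definition gives. The total is just the joint probability of the event \(G_j = G_i\) under the two-stage sampling, which is at least consistent with the guess that the nested expectation collapses.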

Does the above look correct? I hadn’t seen this intuitive justification given anywhere before, but it’s how I’m used to thinking about B3, so I’m curious what other people think. This is why I like B3: I can explain it in terms of mention pairs.

I think this also gives an additional justification to Cai and Strube (2010)’s proposal to handle divergent gold versus system mentions. Say the system produces a spurious mention \(i\) that isn’t part of the gold standard’s mentions (a “twinless” mention). If you assume that mentions not in the gold standard should be considered to have no coreferents, then all of \(i\)’s system-hypothesized coreferents are false positives. Therefore, to think about precision under this assumption, the system’s non-gold mentions should be added to the gold as singleton entities before computing precision. And analogously for recall: add gold-only mentions as system-side singletons, since the system has failed to find any coreference links to them. I think this is what they call \(B^3_{sys}\) (section 2.2.2).
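
As a rough sketch of the singleton-padding idea as described in the paragraph above (not a reimplementation of the paper's \(B^3_{sys}\), which may differ in details), the following Python pads each side before averaging. It assumes precision is averaged over the system's mentions and recall over the gold mentions; the function name and toy data are hypothetical.

```python
def b3_with_twinless(gold, system):
    """B3 with twinless mentions padded in as singletons.

    Assumption (from the prose above, not checked against the paper):
    precision averages over system mentions, recall over gold mentions.
    Both clusterings are assumed non-empty.
    """
    gold_of = {m: frozenset(e) for e in gold for m in e}
    sys_of = {m: frozenset(e) for e in system for m in e}
    gold_mentions, sys_mentions = set(gold_of), set(sys_of)

    # Precision side: spurious system mentions become gold singletons.
    gold_p = dict(gold_of)
    for m in sys_mentions - gold_mentions:
        gold_p[m] = frozenset([m])
    prec = sum(len(gold_p[i] & sys_of[i]) / len(sys_of[i])
               for i in sys_mentions) / len(sys_mentions)

    # Recall side: missed gold mentions become system singletons.
    sys_r = dict(sys_of)
    for m in gold_mentions - sys_mentions:
        sys_r[m] = frozenset([m])
    rec = sum(len(gold_of[i] & sys_r[i]) / len(gold_of[i])
              for i in gold_mentions) / len(gold_mentions)
    return prec, rec


# "x" and "y" are spurious system mentions; "c" is missed by the system.
gold = [{"a", "b", "c"}]
system = [{"a", "b", "x", "y"}]
prec, rec = b3_with_twinless(gold, system)
print(prec, rec)  # precision is 3/8, recall is 5/9
```

The spurious mentions drag precision down both through their own terms and through the terms of the gold mentions they are clustered with, which matches the "all of its hypothesized coreferents are false positives" reading.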

I also like the pairwise linking metric since it’s defined only in terms of mentions; to be analogous to the presentation of B3 here,

  • Pairwise-Prec: choose a pair of mentions the system thinks are coreferent. How often are they actually coreferent?
  • Pairwise-Rec: choose a pair of coreferent mentions. How often does the system think they’re coreferent?

Or algorithmically: take all entities to be fully connected mention graphs and compute link recovery precision/recall. But pairwise Prec/Rec/F1 doesn’t seem as popular as B3 Prec/Rec/F1; in particular, it’s not part of the CoNLL-2011 scoring script everyone seems to use now (or, I guess, its bugfixed version that I kept hearing mutterings about at conferences this summer; I don’t see any information about this online, but I was told a series of bugs were discovered in it recently…). (Unlike pairwise, B3 does not have quadratic scaling effects in cluster and document size, though I never understood why that’s an important consideration a priori.) It is apparent, though, that the Cai and Strube method can be adapted to pairwise metrics, maybe including BLANC, under the same justification given here for why it should apply to B3.
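
The "fully connected mention graphs" description translates almost directly into code. Here is a minimal sketch, with hypothetical function names and the same kind of toy data as before; it treats links as unordered pairs and returns 0 for an undefined ratio, which is a convention chosen here rather than anything standard.

```python
from itertools import combinations


def links(entities):
    """All unordered mention pairs inside each entity (fully connected graphs)."""
    return {frozenset(pair) for e in entities for pair in combinations(sorted(e), 2)}


def pairwise_prf(gold, system):
    """Link-recovery precision, recall and F1 between two clusterings."""
    g, s = links(gold), links(system)
    tp = len(g & s)                       # links present on both sides
    prec = tp / len(s) if s else 0.0      # 0.0 when the system proposes no links
    rec = tp / len(g) if g else 0.0       # 0.0 when the gold has no links
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d"}, {"e"}]
prec, rec, f1 = pairwise_prf(gold, system)
print(prec, rec, f1)  # precision 1/2, recall 1/4, F1 1/3
```

Singleton entities contribute no links at all, so unlike B3 this metric gives no credit for correctly leaving a mention alone.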

(As far as I know, B3 hasn’t been proposed before as a pure clustering metric … you could actually think of it in comparison to the Rand index, VI (variation of information), etc. I think it has some sort of relationship to VI if you think about Rényi entropies and precision/recall averaging: VI is a kind of precision/recall average, except with log-loss variants of precision/recall, i.e. conditional Shannon entropies… that’s another long story though.)


3 Responses to Probabilistic interpretation of the B3 coreference resolution metric

  1. Interesting post! One quick note, the changes to the scorer are on the 2012 shared task site:

    Hopefully we’ll be seeing a new version with the other metrics fixed soon too.

  2. Pairwise metrics are like micro-F measures (equally weighted by instance); B3 is like macro-F measure (equally weighted by category).

    I like the pairwise metrics for evaluation because they’re interpretable as estimates of future performance — what’s the probability that you recover a link between two mentions? But that may not be what matters in an application. If I’m clustering news items to display in Google News, I probably have different cost/benefit for linking/overlinking than I do with electronic health records. And this is only a problem for clustering/coref, not for linkage to a database of entities; for the latter, I think individual scores make sense.

    The quadratic nature is important because it tells you how breaking 200 mentions down into subgroups gets scored. With 200 mentions, if I recover two clusters of 100, I recover 19,800 of the 39,800 links. One cluster of 150 with two more clusters of size 25 is better, scoring 23,550. That’s not much better than 150 with 50 singletons, which scores 22,350.
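
    The link counts above can be reproduced with a few lines of Python. This sketch assumes that all 200 mentions belong to a single gold entity and that links are counted as ordered pairs (each pair in both directions), since that is the reading under which 200 mentions give 39,800 links; the function name is made up for illustration.

```python
def ordered_links(cluster_sizes):
    """Number of ordered within-cluster mention pairs: sum of k * (k - 1)."""
    return sum(k * (k - 1) for k in cluster_sizes)


total = ordered_links([200])                       # 39800
two_halves = ordered_links([100, 100])             # 19800
big_plus_two = ordered_links([150, 25, 25])        # 23550
big_plus_singletons = ordered_links([150] + [1] * 50)  # 22350
print(total, two_halves, big_plus_two, big_plus_singletons)
```

    The 150-mention cluster alone accounts for 22,350 links, which is why what happens to the remaining 50 mentions barely moves the score.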

    Probabilistically, precision = TP/(TP + FP) is tricky because the denominator depends on how many positive results the classifier returns (i.e., TP + FP). Recall (aka sensitivity) = TP / (TP + FN) is different — the denominator depends only on the number of positive instances in the test data. So if you do a system evaluation, all systems have the same denominator for recall, but they vary in denominators for precision. It’s easier to work probabilistically with specificity = TN / (TN + FP), which is like recall for negative cases. Along with prevalence = (TP + FN) / (TP + FN + TN + FP) it lets you predict precision.
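
    To spell out the last sentence: by Bayes' rule, precision = sensitivity × prevalence / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). A small Python sketch follows, with a made-up confusion matrix used only to check the identity.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity, specificity and prevalence (Bayes' rule)."""
    true_pos_rate = sensitivity * prevalence                   # P(positive and flagged)
    false_pos_rate = (1.0 - specificity) * (1.0 - prevalence)  # P(negative and flagged)
    return true_pos_rate / (true_pos_rate + false_pos_rate)


# Hypothetical counts, chosen only for illustration.
tp, fn, fp, tn = 40, 10, 20, 130
sensitivity = tp / (tp + fn)                  # recall on positive cases
specificity = tn / (tn + fp)                  # recall on negative cases
prevalence = (tp + fn) / (tp + fn + tn + fp)
predicted = precision_from_rates(sensitivity, specificity, prevalence)
direct = tp / (tp + fp)
print(predicted, direct)  # both are 2/3 up to floating-point rounding
```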
