<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Pairwise comparisons for relevance evaluation</title>
	<atom:link href="http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/feed/" rel="self" type="application/rss+xml" />
	<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Tue, 25 Nov 2025 13:11:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: Brendan</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-101</link>
		<dc:creator>Brendan</dc:creator>
		<pubDate>Fri, 11 Jul 2008 23:59:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-101</guid>
		<description><![CDATA[Certainly sounds like it could work...]]></description>
		<content:encoded><![CDATA[<p>Certainly sounds like it could work&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-100</link>
		<dc:creator>John</dc:creator>
		<pubDate>Wed, 02 Jul 2008 13:11:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-100</guid>
		<description><![CDATA[Sorry, let me clarify this further.  What I mean is that I find, for some tasks, annotators have a kind of burn-in period, where the results aren&#039;t consistent with their later work.  For example, figuring out how many partial credit points to give on an exam is often strongly informed by the answers on other tests.  However, there are also cases where the annotators get bored and just start being arbitrary, or reach the point of wearing out.  My idea is to use pairwise comparison across randomized orderings to attempt to detect these changes in quality within a given annotator, using the judgment of similar annotators.]]></description>
		<content:encoded><![CDATA[<p>Sorry, let me clarify this further.  What I mean is that I find, for some tasks, annotators have a kind of burn-in period, where the results aren&#8217;t consistent with their later work.  For example, figuring out how many partial credit points to give on an exam is often strongly informed by the answers on other tests.  However, there are also cases where the annotators get bored and just start being arbitrary, or reach the point of wearing out.  My idea is to use pairwise comparison across randomized orderings to attempt to detect these changes in quality within a given annotator, using the judgment of similar annotators.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brendan</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-99</link>
		<dc:creator>Brendan</dc:creator>
		<pubDate>Tue, 01 Jul 2008 20:12:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-99</guid>
		<description><![CDATA[Hey no problem, it&#039;s off-the-cuff for me too :)  I guess I&#039;m a little confused about what you&#039;re trying to do: are you identifying annotators who disagree a lot or who are &quot;bad&quot;?  And you want a judging metric that&#039;s invariant to the internal rescaling they do over time?&lt;br/&gt;&lt;br/&gt;That sounds reasonable I guess, as long as you don&#039;t need comparisons over a large amount of data to be annotated.  Pairwise comparisons break down if you&#039;re interested in comparing across *results*.]]></description>
		<content:encoded><![CDATA[<p>Hey no problem, it&#8217;s off-the-cuff for me too :)  I guess I&#8217;m a little confused about what you&#8217;re trying to do: are you identifying annotators who disagree a lot or who are &#8220;bad&#8221;?  And you want a judging metric that&#8217;s invariant to the internal rescaling they do over time?</p>
<p>That sounds reasonable I guess, as long as you don&#8217;t need comparisons over a large amount of data to be annotated.  Pairwise comparisons break down if you&#8217;re interested in comparing across *results*.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-98</link>
		<dc:creator>John</dc:creator>
		<pubDate>Thu, 19 Jun 2008 13:15:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-98</guid>
		<description><![CDATA[My apologies if I&#039;m revisiting a well-known topic, since this is just off-the-cuff.  What&#039;s your intuition on using limited pairwise judgments for reevaluation?  Oftentimes, annotators just starting out will make very different judgments than after they&#039;ve been going for a while.  For example, suppose we randomize the order of the annotation set between different annotators.  Then annotators might disagree on items judged at different points in the training, but generally have good pairwise agreement during similar phases of the training.  Is that any better, or is it just as good as absolute score comparisons after a linear rescaling?]]></description>
		<content:encoded><![CDATA[<p>My apologies if I&#8217;m revisiting a well-known topic, since this is just off-the-cuff.  What&#8217;s your intuition on using limited pairwise judgments for reevaluation?  Oftentimes, annotators just starting out will make very different judgments than after they&#8217;ve been going for a while.  For example, suppose we randomize the order of the annotation set between different annotators.  Then annotators might disagree on items judged at different points in the training, but generally have good pairwise agreement during similar phases of the training.  Is that any better, or is it just as good as absolute score comparisons after a linear rescaling?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
