Not much on this blog lately, so I’ll repost a comment I just wrote on whether to use pairwise vs. absolute judgments for relevance quality evaluation. (A fun one I know!)
From this post on the Dolores Labs blog.
The paper being talked about is Here or There: Preference Judgments for Relevance by Carterette et al.
I skimmed through the Carterette paper and it’s interesting. My concern with the pairwise setup is that, in order to get comparability among query-result pairs, you need annotators to do O(N^2) work. (Unless you do something horribly complicated with partial orders.) The absolute judgment task scales linearly, of course. Given the AMT environment and a fixed budget, if I stay with the smaller-volume task, instead of spending a lot on a quadratic taskload, I can simply put more workers on each result and boil out more noise. Of course, if it’s true that the pairwise judgment task is easier, as the paper claims, that might make my spending more efficient. But since the pairwise workload grows quadratically, no matter the cost/benefit ratios, there has to be a tipping point in data set size beyond which you’d always want to switch back to absolute judgments.
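To make that tipping point concrete, here’s a quick back-of-the-envelope sketch. The per-judgment costs and worker counts are numbers I made up purely for illustration:

```python
# Rough cost comparison: absolute vs. exhaustive pairwise judging.
# All costs and replication factors here are invented placeholders.

def absolute_cost(n_items, workers_per_item=5, cost_per_judgment=0.02):
    """Linear: each item judged independently by several workers."""
    return n_items * workers_per_item * cost_per_judgment

def pairwise_cost(n_items, workers_per_pair=3, cost_per_judgment=0.01):
    """Quadratic: every unordered pair of items gets a comparison."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * workers_per_pair * cost_per_judgment

# Even with cheaper, less-replicated pairwise judgments, the quadratic
# pair count eventually dominates the linear absolute-judgment cost.
for n in (10, 50, 100, 500, 1000):
    print(n, absolute_cost(n), pairwise_cost(n))
```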
Absolute judgments are just so much easier to compute with, both for analysis and for use as machine learning training data. I really don’t want to need fancy utility inference or stopping-rule schemes just to know the relative ranking of my data. (And I think real-valued scores will always become a necessity. Theoretical microeconomists have proved boatloads of theorems about representing preferences by pairwise comparisons. It turns out that when you add enough rationality assumptions, the sort that are demanded of search engine ranking tasks anyway, your fancy ordering can always be mapped back to a real-valued utility function.)
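For what it’s worth, getting real-valued scores back out of pairwise data is exactly what a Bradley-Terry-style model does. A minimal sketch, using a toy win-count matrix I made up and a simple fixed-point fit:

```python
# Minimal Bradley-Terry fit: recover real-valued scores from pairwise wins.
# The win counts below are fabricated toy data.
import numpy as np

# wins[i, j] = number of times item i was preferred over item j
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)

n = wins.shape[0]
scores = np.ones(n)
for _ in range(200):                       # MM / fixed-point updates
    total = wins + wins.T                  # comparisons per pair of items
    denom = total / (scores[:, None] + scores[None, :])
    np.fill_diagonal(denom, 0.0)
    scores = wins.sum(axis=1) / denom.sum(axis=1)
    scores /= scores.sum()                 # normalize for identifiability

print(np.log(scores))                      # log-scores act like utilities
```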
I’d be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task, versus absolute judgments, and is mindful of the cost tradeoffs in service of an actual goal, like ranking algorithm training.
My apologies if I’m revisiting a well-known topic, since this is just off-the-cuff. What’s your intuition on using a limited number of pairwise judgments for reevaluation? Oftentimes, annotators just starting out will make very different judgments than they do after they’ve been going for a while. For example, suppose we randomize the order of the annotation set between different annotators. We might then find that annotators disagree on items judged at different points in the training, but generally have good pairwise agreement during similar phases of the training. Is pairwise comparison any better for detecting that, or are absolute score comparisons just as good after a linear rescaling?
Hey no problem, it’s off-the-cuff for me too :) I guess I’m a little confused what you’re trying to do: are you identifying annotators who disagree a lot or who are “bad”? And you want a judging metric that’s invariant to the internal rescaling they do over time?
That sounds reasonable, I guess, as long as you don’t need pairwise comparisons annotated over a large amount of the data. Pairwise comparisons break down if you’re interested in comparing across *results*.
Sorry, let me clarify this further. What I mean is that I find, for some tasks, annotators have a kind of burn-in period where their results aren’t consistent with their later work. For example, figuring out how many partial credit points to give on an exam is often strongly informed by the answers on other exams. However, there are also cases where annotators get bored and just start being arbitrary, or simply wear out. My idea is to use pairwise comparisons across randomized orderings to try to detect these changes in quality within a given annotator, using the judgments of similar annotators.
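Roughly, I’m picturing something like the sketch below: each annotator sees the items in a different random order, and we check whether annotators agree more on pairwise preferences when they judged those items at similar points in their respective orderings. The data and the phase threshold are hypothetical:

```python
# Sketch: bucket inter-annotator pairwise-preference agreement by how far
# apart in "training time" the two annotators judged the items.
# Data structures and thresholds here are hypothetical.
from itertools import combinations

# judgments[annotator][item] = (position in that annotator's ordering, score)
judgments = {
    "ann1": {"q1": (0, 3), "q2": (1, 2), "q3": (8, 4), "q4": (9, 1)},
    "ann2": {"q1": (1, 3), "q2": (9, 4), "q3": (7, 4), "q4": (0, 2)},
}

def pref(ann, i, j):
    """+1 / 0 / -1: which of the two items this annotator scored higher."""
    si, sj = judgments[ann][i][1], judgments[ann][j][1]
    return (si > sj) - (si < sj)

similar_phase, different_phase = [], []
for a1, a2 in combinations(judgments, 2):
    shared = sorted(judgments[a1].keys() & judgments[a2].keys())
    for i, j in combinations(shared, 2):
        # how far apart in the two orderings each item was judged
        phase_gap = max(abs(judgments[a1][x][0] - judgments[a2][x][0])
                        for x in (i, j))
        agree = pref(a1, i, j) == pref(a2, i, j)
        (similar_phase if phase_gap <= 2 else different_phase).append(agree)

# Noticeably lower agreement in the different-phase bucket would suggest
# burn-in or fatigue effects that a single linear rescaling of absolute
# scores wouldn't explain.
print(sum(similar_phase) / max(len(similar_phase), 1),
      sum(different_phase) / max(len(different_phase), 1))
```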
Certainly sounds like it could work…