Not much on this blog lately, so I’ll repost a comment I just wrote on whether to use pairwise vs. absolute judgments for relevance quality evaluation. (A fun one, I know!)
The paper under discussion is “Here or There: Preference Judgments for Relevance” by Carterette et al.
I skimmed through the Carterette paper and it’s interesting. My concern with the pairwise setup is that, to get comparability among query-result pairs, you need annotators to do an O(N^2) amount of work. (Unless you do something horribly complicated with partial orders.) The absolute judgment task scales linearly, of course. Given the AMT environment and a fixed budget, if I stay with the smaller-volume absolute task instead of spending a lot on a quadratic workload, I can simply put more workers on each result and average out more of the noise. Of course, if it’s true that the pairwise judgment task is easier, as the paper claims, that might make my spending more efficient. But since the pairwise workload grows quadratically while the absolute one grows linearly, no matter the cost/benefit ratios, there has to be a tipping point: a data set size beyond which you’d always want to switch back to absolute judgments.
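To make the tipping-point argument concrete, here is a rough sketch. It assumes the worst case where every unordered pair of results gets judged, and all the prices and worker counts are made-up numbers (in cents), not figures from the paper. The function names are mine.

```python
def pairwise_cost(n, cost_per_pair, workers_per_pair=1):
    """Cost of judging every unordered pair among n results (exhaustive, no partial-order tricks)."""
    return workers_per_pair * cost_per_pair * (n * (n - 1) // 2)


def absolute_cost(n, cost_per_label, workers_per_result):
    """Cost of collecting workers_per_result absolute labels for each of n results."""
    return workers_per_result * cost_per_label * n


def tipping_point(cost_per_pair, cost_per_label, workers_per_result, workers_per_pair=1):
    """Smallest n at which exhaustive pairwise judging costs strictly more than absolute judging.

    Assumes all costs and worker counts are positive, so the loop terminates.
    """
    n = 2
    while (pairwise_cost(n, cost_per_pair, workers_per_pair)
           <= absolute_cost(n, cost_per_label, workers_per_result)):
        n += 1
    return n


# Hypothetical prices: a pairwise judgment is 5x cheaper than an absolute one (1 cent vs. 5 cents),
# and each result gets 5 redundant absolute labels to average out noise.
n_star = tipping_point(cost_per_pair=1, cost_per_label=5, workers_per_result=5)
```

Even with pairwise judgments priced at a fifth of an absolute one and five-fold redundancy on the absolute side, the crossover under these assumptions comes at only a few dozen results. Changing the ratios moves the crossover but cannot remove it.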
Absolute judgments are just so much easier to compute with, both for analysis and as machine learning training data. I really don’t want to need fancy utility inference or stopping-rule schemes just to know the relative ranking of my data. (And I think real-valued scores will always become a necessity. Theoretical microeconomists have proved boatloads of theorems about representing preferences given as pairwise comparisons. It turns out that when you add enough rationality assumptions, e.g. the sort that are demanded of search engine ranking tasks anyway, your fancy ordering can always be mapped back to a real-valued utility function.)
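As a toy illustration of that last point: for a finite set of results, if the preferences are strict, complete, and transitive, then simply counting how many other results each one beats gives a real-valued score that reproduces the ordering. The sketch below assumes exactly those idealized conditions (real annotator data would be noisy and would violate them), and the document names and `prefers` function are hypothetical.

```python
def utility_from_preferences(items, prefers):
    """Map each item to the number of other items it is preferred to.

    Assumes prefers(a, b) is a strict, complete, transitive relation on items;
    under that assumption, u[a] > u[b] exactly when a is preferred to b.
    """
    return {a: sum(1 for b in items if b != a and prefers(a, b)) for a in items}


# Hypothetical ground-truth ordering, best to worst, standing in for a perfectly consistent judge.
ranking = ["doc_b", "doc_a", "doc_c"]


def prefers(a, b):
    """True if a ranks strictly above b in the hypothetical ordering."""
    return ranking.index(a) < ranking.index(b)


scores = utility_from_preferences(ranking, prefers)
```

Note that this needs the full set of pairwise answers, which is the quadratic workload again.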
I’d be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task, versus absolute judgments, and is mindful of the cost tradeoffs in service of an actual goal, like ranking algorithm training.