<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Pairwise comparisons for relevance evaluation</title>
	<atom:link href="http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/feed/" rel="self" type="application/rss+xml" />
	<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Tue, 25 Nov 2025 13:11:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: Brendan</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-101</link>
		<dc:creator>Brendan</dc:creator>
		<pubDate>Fri, 11 Jul 2008 23:59:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-101</guid>
		<description><![CDATA[Certainly sounds like it could work...]]></description>
		<content:encoded><![CDATA[<p>Certainly sounds like it could work&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-100</link>
		<dc:creator>John</dc:creator>
		<pubDate>Wed, 02 Jul 2008 13:11:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-100</guid>
		<description><![CDATA[Sorry, let me clarify this further.  What I mean is that I find, for some tasks, annotators have a kind of burn-in period, where the results aren&#039;t consistent with their later work.  For example, figuring out how many partial credit points to give on an exam is often strongly informed by the answers on other tests.  However, there are also cases where the annotators get bored and just start being arbitrary, or reach the point of wearing out.  My idea is to use pairwise comparison across randomized orderings to attempt to detect these changes in quality within a given annotator, using the judgment of similar annotators.]]></description>
		<content:encoded><![CDATA[<p>Sorry, let me clarify this further.  What I mean is that I find, for some tasks, annotators have a kind of burn-in period, where the results aren&#8217;t consistent with their later work.  For example, figuring out how many partial credit points to give on an exam is often strongly informed by the answers on other tests.  However, there are also cases where the annotators get bored and just start being arbitrary, or reach the point of wearing out.  My idea is to use pairwise comparison across randomized orderings to attempt to detect these changes in quality within a given annotator, using the judgment of similar annotators.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brendan</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-99</link>
		<dc:creator>Brendan</dc:creator>
		<pubDate>Tue, 01 Jul 2008 20:12:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-99</guid>
		<description><![CDATA[Hey no problem, it&#039;s off-the-cuff for me too :)  I guess I&#039;m a little confused about what you&#039;re trying to do: are you identifying annotators who disagree a lot or who are &quot;bad&quot;?  And you want a judging metric that&#039;s invariant to the internal rescaling they do over time?&lt;br/&gt;&lt;br/&gt;That sounds reasonable I guess, as long as you don&#039;t need comparisons over a large amount of data to be annotated.  Pairwise comparisons break down if you&#039;re interested in comparing across *results*.]]></description>
		<content:encoded><![CDATA[<p>Hey no problem, it&#8217;s off-the-cuff for me too :)  I guess I&#8217;m a little confused about what you&#8217;re trying to do: are you identifying annotators who disagree a lot or who are &#8220;bad&#8221;?  And you want a judging metric that&#8217;s invariant to the internal rescaling they do over time?</p>
<p>That sounds reasonable I guess, as long as you don&#8217;t need comparisons over a large amount of data to be annotated.  Pairwise comparisons break down if you&#8217;re interested in comparing across *results*.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://brenocon.com/blog/2008/06/pairwise-comparisons-for-relevance-evaluation/#comment-98</link>
		<dc:creator>John</dc:creator>
		<pubDate>Thu, 19 Jun 2008 13:15:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.anyall.org/?p=121#comment-98</guid>
		<description><![CDATA[My apologies if I&#039;m revisiting a well-known topic, since this is just off-the-cuff.  What&#039;s your intuition on using limited pairwise judgments for reevaluation?  Oftentimes, annotators just starting out will make very different judgments than after they&#039;ve been going for a while.  For example, suppose we randomize the order of the annotation set between different annotators.  Then annotators might disagree on items judged at different points in the training, but generally have good pairwise agreement during similar phases of the training.  Is that any better, or is it just as good as absolute score comparisons after a linear rescaling?]]></description>
		<content:encoded><![CDATA[<p>My apologies if I&#8217;m revisiting a well-known topic, since this is just off-the-cuff.  What&#8217;s your intuition on using limited pairwise judgments for reevaluation?  Oftentimes, annotators just starting out will make very different judgments than after they&#8217;ve been going for a while.  For example, suppose we randomize the order of the annotation set between different annotators.  Then annotators might disagree on items judged at different points in the training, but generally have good pairwise agreement during similar phases of the training.  Is that any better, or is it just as good as absolute score comparisons after a linear rescaling?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
