<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI and Social Science - Brendan O&#039;Connor</title>
	<atom:link href="http://brenocon.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://brenocon.com/blog</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Wed, 11 Apr 2012 15:04:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>F-scores, Dice, and Jaccard set similarity</title>
		<link>http://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/</link>
		<comments>http://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/#comments</comments>
		<pubDate>Wed, 11 Apr 2012 15:00:33 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1323</guid>
		<description><![CDATA[The Dice similarity is the same as F1-score; and they are monotonic in Jaccard similarity. I worked this out recently but couldn&#8217;t find anything about it online so here&#8217;s a writeup. Let \(A\) be the set of found items, and &#8230; <a href="http://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[
<p>The <a href="http://en.wikipedia.org/wiki/Dice's_coefficient">Dice similarity</a> is the same as <a href="http://en.wikipedia.org/wiki/F1_score">F1-score</a>; and they are monotonic in <a href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a>.  I worked this out recently but couldn&#8217;t find anything about it online so here&#8217;s a writeup.</p>
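<p>Let \(A\) be the set of found items, and \(B\) the set of wanted items, and write \(AB\) for the intersection \(A \cap B\).  Then precision is \(Prec=|AB|/|A|\) and recall is \(Rec=|AB|/|B|\) (abbreviated \(P\) and \(R\) below).  Their harmonic mean, the \(F1\)-measure, is the same as the Dice coefficient:<br />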
\begin{align*}<br />
F1(A,B)<br />
&#038;= \frac{2}{1/P+ 1/R}<br />
 = \frac{2}{|A|/|AB| + |B|/|AB|} \\<br />
Dice(A,B)<br />
&#038;= \frac{2|AB|}{ |A| + |B| } \\<br />
&#038;= \frac{2 |AB|}{ (|AB| + |A \setminus B|) + (|AB| + |B \setminus A|)} \\<br />
&#038;= \frac{|AB|}{|AB| + \frac{1}{2}|A \setminus B| + \frac{1}{2} |B \setminus A|}<br />
\end{align*}</p>
<p>It&#8217;s nice to decompose the set comparison into three mutually exclusive parts: \(AB\), \(A \setminus B\), and \(B \setminus A\).  This illustrates Dice&#8217;s close relationship to the Jaccard index,<br />
\begin{align*}<br />
Jacc(A,B)<br />
&#038;= \frac{|AB|}{|A \cup B|} \\<br />
&#038;= \frac{|AB|}{|AB| + |A \setminus B| + |B \setminus A|}<br />
\end{align*}<br />
And in fact \(J = D/(2-D)\) and \(D=2J/(1+J)\) for any input, so they are monotonic in one another.<br />
The <a href="http://en.wikipedia.org/wiki/Tversky_index">Tversky index (1977)</a> generalizes them both,<br />
\begin{align*}<br />
Tversky(A,B;\alpha,\beta)<br />
&#038;= \frac{|AB|}{|AB| + \alpha|A\setminus B| + \beta|B \setminus A|}<br />
\end{align*}<br />
where \(\alpha\) and \(\beta\) control the magnitude of penalties of false positive versus false negative errors.  It&#8217;s easy to work out that all weighted F-measures correspond to when \(\alpha+\beta=1\).  The Tversky index just gives a spectrum of ways to normalize the size of a two-way set intersection.  (I always thought Tversky&#8217;s more mathematical earlier work (before the famous <a href="http://en.wikipedia.org/wiki/Amos_Tversky">T</a>&amp;<a href="http://en.wikipedia.org/wiki/Daniel_Kahneman">K</a> heuristics-and-biases stuff) was pretty cool.  In <a href="http://homepage.psy.utexas.edu/homepage/group/loveLAB/love/classes/concepts/Tversky1977.pdf">the 1977 paper</a> he actually does an axiomatic derivation of set similarity measures, though as far as I can tell this index doesn&#8217;t strictly derive from them.  Then there&#8217;s a whole debate in cognitive psych whether similarity is a good way to characterize reasoning about objects but that&#8217;s another story.)</p>
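<p>To spell out the \(\alpha+\beta=1\) claim, here is a sketch of the algebra.  It assumes the standard weighted F-measure \(F_b = (1+b^2)PR/(b^2 P + R)\), which is not defined above; the weight is written \(b\) only to avoid clashing with the Tversky \(\beta\).  Substituting \(P=|AB|/|A|\) and \(R=|AB|/|B|\):</p>
<p>\begin{align*}
F_b(A,B)
&#038;= \frac{(1+b^2)|AB|}{|A| + b^2|B|} \\
&#038;= \frac{(1+b^2)|AB|}{(1+b^2)|AB| + |A \setminus B| + b^2 |B \setminus A|} \\
&#038;= \frac{|AB|}{|AB| + \frac{1}{1+b^2}|A \setminus B| + \frac{b^2}{1+b^2}|B \setminus A|}
\end{align*}</p>
<p>So \(F_b\) is a Tversky index with \(\alpha = 1/(1+b^2)\) and \(\beta = b^2/(1+b^2)\), which sum to 1.  Setting \(b=1\) gives \(\alpha=\beta=1/2\), the Dice/F1 case, while Jaccard is \(\alpha=\beta=1\).</p>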
<p>So you could use either Jaccard or Dice/F1 to measure retrieval/classifier performance, since they&#8217;re strictly monotonic in one another and therefore rank systems identically.  Jaccard might be a little unintuitive though, because it&#8217;s always less than or equal to min(Prec,Rec); Dice/F is always in between min(Prec,Rec) and max(Prec,Rec).</p>
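<p>The identities above are easy to check numerically.  The following is a minimal sketch in plain Python; the two sets and the function name <em>set_scores</em> are made up for illustration, and it assumes the sets overlap so that precision and recall are nonzero.</p>

```python
def set_scores(found, wanted):
    """Return (precision, recall, F1, Dice, Jaccard) for two overlapping sets."""
    overlap = len(found.intersection(wanted))
    prec = overlap / len(found)
    rec = overlap / len(wanted)
    f1 = 2 / (1 / prec + 1 / rec)          # harmonic mean of precision and recall
    dice = 2 * overlap / (len(found) + len(wanted))
    jacc = overlap / len(found.union(wanted))
    return prec, rec, f1, dice, jacc


# Hypothetical example: 6 found items, 4 wanted items, 3 in common.
A = {1, 2, 3, 4, 5, 6}
B = {4, 5, 6, 7}

prec, rec, f1, dice, jacc = set_scores(A, B)
print("Prec, Rec:", prec, rec)
print("F1, Dice: ", f1, dice)
print("Jaccard:  ", jacc)
print("D/(2-D):  ", dice / (2 - dice))     # should match Jaccard
print("2J/(1+J): ", 2 * jacc / (1 + jacc)) # should match Dice
```

<p>For these sets F1 and Dice coincide, Jaccard falls below both precision and recall, and the two conversion formulas round-trip up to floating point error.</p>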
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cosine similarity, Pearson correlation, and OLS coefficients</title>
		<link>http://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/</link>
		<comments>http://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comments</comments>
		<pubDate>Tue, 13 Mar 2012 18:01:41 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1199</guid>
		<description><![CDATA[Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product &#8212; tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors \(x\) &#8230; <a href="http://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[
<p>Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product &#8212; tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that).</p>
<p>Details:</p>
<p>You have two vectors \(x\) and \(y\) and want to measure similarity between them.  A basic similarity function is the <b><a href=http://en.wikipedia.org/wiki/Dot_product>inner product</a></b></p>
<p>\[ Inner(x,y) = \sum_i x_i y_i = \langle x, y \rangle \]</p>
<p>If x tends to be high where y is also high, and low where y is low, the inner product will be high &#8212; the vectors are more similar.</p>
<p>The inner product is unbounded.  One way to make it bounded between -1 and 1 is to divide by the vectors&#8217; L2 norms, giving the <b><a href=http://en.wikipedia.org/wiki/Cosine_similarity>cosine similarity</a></b></p>
<p>\[ CosSim(x,y) = \frac{\sum_i x_i y_i}{ \sqrt{ \sum_i x_i^2} \sqrt{ \sum_i y_i^2 } }<br />
= \frac{ \langle x,y \rangle }{ ||x||\ ||y|| }<br />
\]</p>
<p>This is actually bounded between 0 and 1 if x and y are non-negative.  Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \(\mathbb{R}^2\) (e.g. <a href="http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html">here</a>).  </p>
<p>Cosine similarity is not invariant to shifts.  If x were shifted to x+1, the cosine similarity would change.  What is invariant, though, is the <b><a href=http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient>Pearson correlation</a></b>.  Let \(\bar{x}\) and \(\bar{y}\) be the respective means:</p>
<p>\begin{align}<br />
Corr(x,y) &#038;= \frac{ \sum_i (x_i-\bar{x}) (y_i-\bar{y}) }{<br />
\sqrt{\sum (x_i-\bar{x})^2} \sqrt{ \sum (y_i-\bar{y})^2 } }<br />
\\<br />
&#038; = \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{<br />
||x-\bar{x}||\ ||y-\bar{y}||}  \\<br />
&#038; = CosSim(x-\bar{x}, y-\bar{y})<br />
\end{align}</p>
<p>Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1.  People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples.  The cosine is invariant to positive rescaling of x and y but not to shifts; the correlation is invariant to both (positive) scale and location changes.</p>
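<p>A small numerical sketch of these invariances, in plain Python.  The vectors and helper names are made up for illustration; <em>corr</em> is implemented literally as the cosine of the centered vectors, as in the derivation above.</p>

```python
import math


def inner(x, y):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def cos_sim(x, y):
    """Inner product divided by both L2 norms."""
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))


def center(x):
    """Subtract the mean from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def corr(x, y):
    """Pearson correlation as cosine similarity of the centered vectors."""
    return cos_sim(center(x), center(y))


# Hypothetical data.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 6.0]
shifted = [a + 10.0 for a in x]   # location change
scaled = [3.0 * a for a in x]     # positive scale change

print("cosine:", cos_sim(x, y), cos_sim(shifted, y), cos_sim(scaled, y))
print("corr:  ", corr(x, y), corr(shifted, y), corr(scaled, y))
```

<p>The cosine moves when x is shifted but not when it is rescaled; the correlation stays put under both.</p>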
<p>This isn&#8217;t the usual way to derive the Pearson correlation; usually it&#8217;s presented as a normalized form of the <b><a href="http://en.wikipedia.org/wiki/Covariance">covariance</a></b>, which is a centered average inner product (no normalization)</p>
<p>\[ Cov(x,y) = \frac{\sum (x_i-\bar{x})(y_i-\bar{y}) }{n}<br />
= \frac{ \langle x-\bar{x},\ y-\bar{y} \rangle }{n} \]</p>
<p>Finally, these are all related to the coefficient in a <b><a href="http://www.edwardtufte.com/tufte/dapp/chapter3.html">one-variable linear regression</a></b>.  For the OLS model \(y_i \approx ax_i\) with Gaussian noise, whose MLE is the least-squares problem \(\arg\min_a \sum (y_i - ax_i)^2\), a few lines of calculus show \(a\) is</p>
<p>\begin{align}<br />
 OLSCoef(x,y) &#038;= \frac{ \sum x_i y_i }{ \sum x_i^2 }<br />
= \frac{ \langle x, y \rangle}{ ||x||^2 }<br />
\end{align}</p>
<p>This looks like another normalized inner product.  But unlike cosine similarity, we aren&#8217;t normalizing by \(y\)&#8217;s norm &#8212; instead we only use \(x\)&#8217;s norm (and use it twice): denominator of \(||x||\ ||y||\) versus \(||x||^2\).</p>
<p>Not normalizing for \(y\) is what you want for the linear regression: if \(y\) was stretched to span a larger range, you would need to increase \(a\) to match, to get your predictions spread out too.</p>
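<p>To make the one-sided normalization concrete, here is a short sketch with made-up vectors: stretching y by a factor of 2 doubles the no-intercept OLS coefficient, and swapping the arguments gives a different answer.</p>

```python
def inner(x, y):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def ols_coef(x, y):
    """Least-squares slope for y ~ a*x with no intercept."""
    return inner(x, y) / inner(x, x)


# Hypothetical data.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 6.0]
stretched = [2.0 * b for b in y]

print("OLSCoef(x, y):  ", ols_coef(x, y))
print("OLSCoef(x, 2y): ", ols_coef(x, stretched))  # twice the line above
print("OLSCoef(y, x):  ", ols_coef(y, x))          # not symmetric
```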
<p>Often it&#8217;s desirable to do the OLS model with an intercept term: \(\min_{a,b} \sum (y_i - ax_i - b)^2\).  Then \(a\) is</p>
<p>\begin{align}<br />
OLSCoefWithIntercept(x,y) &#038;= \frac<br />
{ \sum (x_i - \bar{x}) y_i }<br />
{ \sum (x_i - \bar{x})^2 }<br />
= \frac{\langle x-\bar{x},\ y \rangle}{||x-\bar{x}||^2}<br />
 \\<br />
&#038;= OLSCoef(x-\bar{x}, y)<br />
\end{align}</p>
<p>It&#8217;s different because the intercept term picks up the slack associated with where x&#8217;s center is.  So OLSCoefWithIntercept is invariant to shifts of x.  It&#8217;s still different than cosine similarity since it&#8217;s still not normalizing at all for y.  Though, subtly, it does actually control for shifts of y. This isn&#8217;t obvious in the equation, but with a little arithmetic it&#8217;s easy to derive that \(<br />
\langle x-\bar{x},\ y \rangle = \langle x-\bar{x},\ y+c \rangle \) for any constant \(c\).  (There must be a nice geometric interpretation of this.)</p>
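<p>A quick check of both shift invariances, again with made-up vectors.  One way to read the identity: the entries of \(x-\bar{x}\) sum to zero, so the centered vector is orthogonal to any constant vector, and adding \(c\) to every entry of y cannot change the inner product.</p>

```python
def inner(x, y):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def center(x):
    """Subtract the mean from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def ols_coef_with_intercept(x, y):
    """Least-squares slope for y ~ a*x + b; only x needs centering."""
    xc = center(x)
    return inner(xc, y) / inner(xc, xc)


# Hypothetical data.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 6.0]
x_shifted = [a + 10.0 for a in x]
y_shifted = [b + 100.0 for b in y]

print("original: ", ols_coef_with_intercept(x, y))
print("x shifted:", ols_coef_with_intercept(x_shifted, y))
print("y shifted:", ols_coef_with_intercept(x, y_shifted))
print("sum of centered x:", sum(center(x)))  # zero, hence the y invariance
```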
<p>Finally, what if x and y are standardized: both centered and normalized to unit standard deviation?  The OLS coefficient for that is the same as the Pearson correlation between the original vectors.  I&#8217;m not sure what this means or if it&#8217;s a useful fact, but:</p>
<p>\[ OLSCoef\left(<br />
\sqrt{n}\frac{x-\bar{x}}{||x-\bar{x}||},<br />
\sqrt{n}\frac{y-\bar{y}}{||y-\bar{y}||} \right) = Corr(x,y) \]</p>
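<p>The same identity as a numerical sketch.  The data and helper names are illustrative; <em>standardize</em> centers a vector and rescales it to unit standard deviation, using the divide-by-\(n\) convention of the covariance formula above.</p>

```python
import math


def inner(x, y):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def center(x):
    """Subtract the mean from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def standardize(x):
    """Center x and scale it so its (divide-by-n) standard deviation is 1."""
    xc = center(x)
    norm = math.sqrt(inner(xc, xc))
    return [math.sqrt(len(x)) * a / norm for a in xc]


def ols_coef(x, y):
    """Least-squares slope for y ~ a*x with no intercept."""
    return inner(x, y) / inner(x, x)


def corr(x, y):
    """Pearson correlation as a centered, normalized inner product."""
    xc, yc = center(x), center(y)
    return inner(xc, yc) / math.sqrt(inner(xc, xc) * inner(yc, yc))


# Hypothetical data.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 6.0]

print("OLSCoef on standardized inputs:", ols_coef(standardize(x), standardize(y)))
print("Corr on original inputs:       ", corr(x, y))
```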
<p>Summarizing: Cosine similarity is normalized inner product.  Pearson correlation is centered cosine similarity.  A one-variable OLS coefficient is like cosine but with one-sided normalization.  With an intercept, it&#8217;s centered.</p>
<p>Of course we need a summary table.  &#8220;Symmetric&#8221; means, if you swap the inputs, do you get the same answer.  &#8220;Invariant to shift in input&#8221; means, if you add an arbitrary constant to either input, do you get the same answer.</p>
<table cellpadding=3 border=1 cellspacing=0 align=center>
<tr>
<td>Function</td>
<td>Equation</td>
<td>Symmetric?</td>
<td>Output range</td>
<td>Invariant to shift in input?</td>
<td>Pithy explanation in terms of something else</td>
</tr>
<tr>
<td>Inner(x,y)</td>
<td>\[ \langle x, y\rangle \]</td>
<td>Yes</td>
<td>\(\mathbb{R}\)</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>CosSim(x,y)</td>
<td>\[ \frac{\langle x,y \rangle}{||x||\ ||y||} \]</td>
<td>Yes</td>
<td>[-1,1]<br /> or [0,1] if inputs non-neg</td>
<td>No</td>
<td>normalized inner product</td>
</tr>
<tr>
<td>Corr(x,y)</td>
<td>\[ \frac{\langle x-\bar{x},\ y-\bar{y} \rangle }{||x-\bar{x}||\ ||y-\bar{y}||} \]</td>
<td>Yes</td>
<td>[-1,1]</td>
<td>Yes</td>
<td>centered cosine; <i>or</i> normalized covariance</td>
</tr>
<tr>
<td>Cov(x,y)</td>
<td>\[ \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{n} \]</td>
<td>Yes</td>
<td>\(\mathbb{R}\)</td>
<td>Yes</td>
<td>centered inner product</td>
</tr>
<tr>
<td>OLSCoefNoIntcpt(x,y)</td>
<td>\[ \frac{ \langle x, y \rangle}{ ||x||^2 } \]</td>
<td>No</td>
<td>\(\mathbb{R}\)</td>
<td>No</td>
<td>(compare to CosSim)</td>
</tr>
<tr>
<td>OLSCoefWithIntcpt(x,y)</td>
<td>\[ \frac{\langle x-\bar{x},\ y \rangle}{||x-\bar{x}||^2} \]</td>
<td>No</td>
<td>\(\mathbb{R}\)</td>
<td>Yes</td>
<td></td>
</tr>
</table>
<p>Are there any implications?  I&#8217;ve been wondering for a while why cosine similarity tends to be so useful for natural language processing applications.  Maybe this has something to do with it.  Or not.  One implication of all the inner product stuff is computational strategies to make it faster when there&#8217;s high-dimensional sparse data &#8212; the <a href="http://www.jstatsoft.org/v33/i01">Friedman et al. 2010 glmnet</a> paper talks about this in the context of coordinate descent text regression.  I&#8217;ve heard <a href="http://www.cs.utexas.edu/~pradeepr/paperz/coord_nips.pdf">Dhillon et al., NIPS 2011</a> applies <a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing">LSH</a> in a similar setting (but haven&#8217;t read it yet).  And there&#8217;s lots of work using LSH for cosine similarity; e.g. <a href="http://cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf">van Durme and Lall 2010 [slides]</a>.</p>
<p>Any other cool identities?  Any corrections to the above?</p>
<p>References: I use <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">Hastie et al 2009, chapter 3</a> to look up linear regression, but it&#8217;s covered in zillions of other places.  I linked to a nice chapter in <a href="http://www.edwardtufte.com/tufte/dapp/chapter3.html">Tufte&#8217;s little 1974 book</a> that he wrote before he went off and did all that visualization stuff.  (He calls it &#8220;two-variable regression&#8221;, but I think &#8220;one-variable regression&#8221; is a better term.  &#8220;one-feature&#8221; or &#8220;one-covariate&#8221; might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>I don&#8217;t get this web parsing shared task</title>
		<link>http://brenocon.com/blog/2012/03/i-dont-get-this-web-parsing-shared-task/</link>
		<comments>http://brenocon.com/blog/2012/03/i-dont-get-this-web-parsing-shared-task/#comments</comments>
		<pubDate>Fri, 09 Mar 2012 02:07:08 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1195</guid>
		<description><![CDATA[The idea for a shared task on web parsing is really cool. But I don&#8217;t get this one: Shared Task &#8211; SANCL 2012 (First Workshop on Syntactic Analysis of Non-Canonical Language) They&#8217;re explicitly banning Manually annotating in-domain (web) sentences Creating &#8230; <a href="http://brenocon.com/blog/2012/03/i-dont-get-this-web-parsing-shared-task/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The idea for a shared task on web parsing is really cool.  But I don&#8217;t get this one:</p>
<p><a href=https://sites.google.com/site/sancl2012/home/shared-task>Shared Task &#8211; SANCL 2012 (First Workshop on Syntactic Analysis of Non-Canonical Language)</a></p>
<p>They&#8217;re explicitly banning</p>
<ul>
<li>Manually annotating in-domain (web) sentences
<li>Creating new word clusters, or anything, from as much text data as possible
</ul>
<p>&#8230; instead restricting participants to the data sets they release.</p>
<p>Isn&#8217;t a cycle of annotation, error analysis, and new annotations (a self-training + active-learning loop, with smarter decisions through error analysis) the hands-down best way to make an NLP tool for a new domain? Are people scared of this reality?  Am I off-base?</p>
<p>I am, of course, just advocating for our <a href="http://www.ark.cs.cmu.edu/TweetNLP/">Twitter POS tagger</a> approach, where we annotated some data, made a supervised tagger, and iterated on features.  The biggest weakness in that paper is we didn&#8217;t have additional iterations of error analysis.  Our lack of semi-supervised learning was <i>not</i> a weakness.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2012/03/i-dont-get-this-web-parsing-shared-task/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Save Zipf&#8217;s Law (new anti-credulous-power-law article)</title>
		<link>http://brenocon.com/blog/2012/02/save-zipfs-law/</link>
		<comments>http://brenocon.com/blog/2012/02/save-zipfs-law/#comments</comments>
		<pubDate>Tue, 14 Feb 2012 04:10:23 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1151</guid>
		<description><![CDATA[To the delight of those of us enjoying the ride on the anti-power-law bandwagon (bandwagons are ok if it&#8217;s a backlash to another bandwagon), Cosma links to a new article in Science, &#8220;Critical Truths About Power Laws,&#8221; by Stumpf and &#8230; <a href="http://brenocon.com/blog/2012/02/save-zipfs-law/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>To the delight of those of us enjoying the ride on the anti-power-law bandwagon (bandwagons are ok if it&#8217;s a backlash to another bandwagon), Cosma <a href="http://cscs.umich.edu/~crshalizi/weblog/873.html">links</a> to a new article in Science, <a href="http://www.sciencemag.org/content/335/6069/665">&#8220;Critical Truths About Power Laws,&#8221; by Stumpf and Porter</a>.  Since it&#8217;s behind a paywall you might as well go read the Clauset/Shalizi/Newman paper on the topic, and since you won&#8217;t be bothered to read the paper, see the blogpost entitled <a href=http://cscs.umich.edu/~crshalizi/weblog/491.html>&#8220;So You Think You Have a Power Law — Well Isn&#8217;t That Special?&#8221;</a></p>
<p>Anyway, the Science article is nice &#8212; it amusingly refers to certain statistical tests as &#8220;epically fail[ing]&#8221; &#8212; and it&#8217;s on the side of truth and goodness so it should be supported, BUT, it has one horrendous figure.  I just love that, in this of all articles that should be harping on deeply flawed uses of (log-log) plots, they use one of those MBA-style bozo plots with unlabeled axes, one of which is viciously, unapologetically subjective:</p>
<p><a href="http://brenocon.com/blog/2012/02/save-zipfs-law/screen-shot-2012-02-13-at-10-44-38-pm/" rel="attachment wp-att-1152"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-13-at-10.44.38-PM.png" alt="" title="Screen shot 2012-02-13 at 10.44.38 PM" width="834" height="528" class="aligncenter size-full wp-image-1152" /></a></p>
<p>If there is one power law I may single out for mercy in this delightful but verging-on-scary witch hunt, it would be <a href="http://en.wikipedia.org/wiki/Zipf's_law">Zipf&#8217;s law</a>, cruelly put a bit low on that &#8220;mechanistic sophistication&#8221; axis.  Zipf&#8217;s law has <a href="http://jmlr.csail.mit.edu/papers/volume12/goldwater11a/goldwater11a.pdf">a wonderful explanation as the outcome of a Pitman-Yor process</a> (going back to <a href="http://www.unc.edu/~fbaum/teaching/PLSC541_Fall06/Simon%201955%20Biometrika.pdf">Simon 1955</a>!), and <a href="http://arxiv.org/abs/0706.1062">Clauset/Shalizi/Newman</a> found it was the only purported power law that actually checked out:</p>
<blockquote><p>There is only one case—the distribution of the frequencies of occurrence of words<br />
in English text—in which the power law appears to be truly convincing, in the sense<br />
that it is an excellent fit to the data and none of the alternatives carries any weight.</p></blockquote>
<p>Now, it is the case that the CRP/PYP/Yule-Simon stuff is still more of a statistical generative explanation than a deeper mechanistic one; but no one knows how cognition works, there are no satisfying causal stories for linguistic production, and it&#8217;s probably fundamentally unknowable anyways, so that&#8217;s the best science you can get.  yay.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2012/02/save-zipfs-law/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Histograms &#8212; matplotlib vs. R</title>
		<link>http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/</link>
		<comments>http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 20:57:10 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1112</guid>
		<description><![CDATA[When possible, I like to use R for its really, really good statistical visualization capabilities. I&#8217;m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison &#8230; <a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When possible, I like to use R for its really, really good statistical visualization capabilities.  I&#8217;m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison to base R, the matplotlib library is just painful.  I wrote a toy <a href="http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm">Metropolis</a> sampler for a <a href="http://en.wikipedia.org/wiki/Triangular_distribution">triangle distribution</a> and all I want to see is whether it looks like it&#8217;s working.  For the same dataset, here are histograms with default settings.  (Python: <em>pylab.hist(d)</em>, R: <em>hist(d)</em>)</p>
<p><a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-3-30-30-pm/" rel="attachment wp-att-1113"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-3.30.30-PM.png" alt="" title="Screen shot 2012-02-02 at 3.30.30 PM" width="983" height="467" class="aligncenter size-full wp-image-1113" /></a></p>
<p>I want to know whether my Metropolis sampler is working; those two plots give a very different idea.  Of course, you could say this is an unfair comparison, since matplotlib is only using 10 bins, while R is using 18 here &#8212; and it&#8217;s always important to vary the bin size a few times when looking at histograms.  But R&#8217;s defaults really are better: it actually uses an adaptive bin size, and the heuristic worked, choosing a reasonable number for the data.  The <a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/hist.html">hist()</a> manual says it&#8217;s from Sturges (1926).  It&#8217;s hard to find other computer software that cites 100 year old papers for its design decisions &#8212; and where it matters.  (Old versions of R used to yell at you when you made a pie chart, citing perceptual studies that humans are really bad at interpreting them (<a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/pie.html">here</a>).  This is what originally made me love R.)</p>
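<p>For reference, the rule usually stated as Sturges&#8217; formula suggests ceil(log2(n) + 1) bins for n observations.  The sketch below is just that formula in Python, not a port of R&#8217;s code.  As far as I can tell, R treats the result only as a suggestion and then picks round breakpoints, so the number of bins actually drawn can differ from it.</p>

```python
import math


def sturges_bins(n):
    """Suggested bin count under Sturges' rule: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)


# The suggestion grows slowly (logarithmically) with the sample size.
for n in (100, 1000, 10000):
    print(n, sturges_bins(n))
```

<p>The point is only that the suggested count adapts to the sample size instead of being fixed at 10.</p>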
<p>Second, R is much smarter about breakpoints.  In the following plots, I&#8217;ve manually set the  number of bins to 10, and then 30 for each.</p>
<p><a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-3-39-45-pm/" rel="attachment wp-att-1114"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-3.39.45-PM.png" alt="" title="Screen shot 2012-02-02 at 3.39.45 PM" width="672" height="250" class="aligncenter size-full wp-image-1114" /></a></p>
<p><a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-3-40-48-pm/" rel="attachment wp-att-1115"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-3.40.48-PM.png" alt="" title="Screen shot 2012-02-02 at 3.40.48 PM" width="642" height="243" class="aligncenter size-full wp-image-1115" /></a></p>
<p>The second one is now OK for matplotlib &#8212; it&#8217;s good enough to figure out what&#8217;s going on &#8212; though still a little lame.  Why the gaps?</p>
<p>The problem is that my data are discrete (they&#8217;re all integers from 1 through 19), and I think matplotlib is naively carving up that range into equal-width bins, so some bins lump together two integers and some get none.  I understand this is the simple naive implementation, and you could say it&#8217;s my fault for using the pylab histogram function on this type of data, but it&#8217;s really not as good as whatever R is doing, which works rather well here, and I didn&#8217;t have to waste time thinking about the internals of the algorithm.  For reference, here is the correct visualization of the data (R: <em>plot(table(d))</em>).  Note that R&#8217;s original Sturges breakpoints did make one error: the first two values got combined into one bin.<br />
<a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-4-06-28-pm/" rel="attachment wp-att-1144"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-4.06.28-PM.png" alt="" title="Screen shot 2012-02-02 at 4.06.28 PM" width="294" height="206" class="aligncenter size-full wp-image-1144" /></a></p>
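<p>The lumping and the gaps can be reproduced without any plotting library.  This is a minimal sketch that assumes the naive strategy described above (equal-width bins spanning the minimum to the maximum) and uses a stand-in dataset with exactly one observation of each integer from 1 through 19, not the actual sampler output.</p>

```python
def equal_width_counts(values, nbins):
    """Count integer values in nbins equal-width bins spanning [min, max].

    Uses integer arithmetic only, so there are no floating point edge cases.
    Bins are half-open on the right, except the last, which includes the max.
    """
    lo, hi = min(values), max(values)
    counts = [0] * nbins
    for v in values:
        idx = min((v - lo) * nbins // (hi - lo), nbins - 1)
        counts[idx] += 1
    return counts


# Stand-in data: each integer 1..19 exactly once, so a faithful plot is flat.
data = list(range(1, 20))

ten = equal_width_counts(data, 10)
thirty = equal_width_counts(data, 30)
print("10 bins:", ten)
print("30 bins:", thirty)
print("empty bins out of 30:", thirty.count(0))
```

<p>With 10 bins, one bin holds a single integer while the rest hold two, which shows up as a spurious dip in flat data.  With 30 bins, 11 of the bins are empty, which is where the gaps come from.</p>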
<p>Lessons: (1) always vary the bin sizes for histograms, especially if you&#8217;re using naive breakpoint selection, and (2) don&#8217;t ignore a century&#8217;s worth of statistical research on these issues.  And since it&#8217;s hard to learn a century&#8217;s worth of statistics, just use R, where they&#8217;ve compiled it in for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Bayes update view of pointwise mutual information</title>
		<link>http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/</link>
		<comments>http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/#comments</comments>
		<pubDate>Sun, 13 Nov 2011 18:41:03 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1093</guid>
		<description><![CDATA[This is fun. Pointwise Mutual Information (e.g. Church and Hanks 1990) between two variable outcomes \(x\) and \(y\) is \[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \] It&#8217;s called &#8220;pointwise&#8221; because Mutual Information, between two (discrete) variables X and Y, is the &#8230; <a href="http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<br />
<p>This is fun.  Pointwise Mutual Information (e.g. <a href="http://acl.ldc.upenn.edu/J/J90/J90-1003.pdf">Church and Hanks 1990</a>) between two variable outcomes \(x\) and \(y\) is</p>
<p>\[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \]</p>
<p>It&#8217;s called &#8220;pointwise&#8221; because <a href="http://en.wikipedia.org/wiki/Mutual_information">Mutual Information</a>, between two (discrete) variables X and Y, is the expectation of PMI over possible outcomes of X and Y: \( MI(X,Y) = \sum_{x,y} p(x,y) PMI(x,y) \).</p>
<p>One interpretation of PMI is it&#8217;s measuring how much deviation from independence there is &#8212; since \(p(x,y)=p(x)p(y)\) if X and Y were independent, so the ratio is how non-independent they (the outcomes) are.</p>
<p>You can get another interpretation of this quantity if you switch into conditional probabilities.  Looking just at the ratio, apply the definition of conditional probability:</p>
<p>\[ \frac{p(x,y)}{p(x)p(y)} = \frac{p(x|y)}{p(x)} \]</p>
<p>Think about doing a Bayes update for your belief about \(x\).  Start with the prior \(p(x)\), then learn \(y\) and you update to the posterior belief \(p(x|y)\).  How much your belief changes is measured by that ratio; the log-scaled ratio is PMI.  (Positive PMI = increase belief, negative PMI = decrease belief.  Positive vs. negative associations.)</p>
<p>Interestingly, it&#8217;s symmetric (obvious from the original definition of PMI, sure):<br />
\[ \frac{p(x|y)}{p(x)} = \frac{p(y|x)}{p(y)} \]</p>
<p>So under this measurement of &#8220;amount of information you learn,&#8221; the amount you learn about \(x\) from \(y\) is actually the same as how much you learn about \(y\) from \(x\).</p>
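<p>Here is a small sketch of the Bayes update view.  The joint distribution is invented for illustration; the code checks that the log of either belief-update ratio equals PMI, and computes MI as the expectation of PMI under the joint.</p>

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {
    ("rain", "umbrella"): 0.30,
    ("rain", "no_umbrella"): 0.10,
    ("dry", "umbrella"): 0.10,
    ("dry", "no_umbrella"): 0.50,
}


def marginal(axis):
    """Sum the joint over the other variable (axis 0 is x, axis 1 is y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


px = marginal(0)
py = marginal(1)


def pmi(x, y):
    """log p(x,y) / (p(x) p(y))."""
    return math.log(joint[(x, y)] / (px[x] * py[y]))


def update_x_given_y(x, y):
    """Posterior over prior for x after learning y: p(x|y) / p(x)."""
    return (joint[(x, y)] / py[y]) / px[x]


def update_y_given_x(x, y):
    """Posterior over prior for y after learning x: p(y|x) / p(y)."""
    return (joint[(x, y)] / px[x]) / py[y]


for (x, y) in joint:
    print(x, y, pmi(x, y),
          math.log(update_x_given_y(x, y)),
          math.log(update_y_given_x(x, y)))

# Mutual information is the expectation of PMI under the joint.
mi = sum(p * pmi(x, y) for (x, y), p in joint.items())
print("MI:", mi)
```

<p>Pairs that co-occur more than independence predicts get positive PMI, the others get negative PMI, and the three columns agree for every pair.</p>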
<p>This is closer to the information gain view of mutual information, where you decompose it into marginal and conditional entropies, \(MI(X,Y) = H(X) - H(X|Y)\); the current Wikipedia page has some of the derivations back and forth for them.</p>
<p>Lots more about this stuff on the <a href="http://en.wikipedia.org/wiki/Mutual_information">MI</a> and <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL Divergence</a> Wikipedia pages.  And early chapters of the (free) <a href="http://www.inference.phy.cam.ac.uk/mackay/itila/book.html">MacKay 2003 textbook</a>.  There seems to be lots of recent work using PMI for association scores between words or concepts and such (I did this with Facebook &#8220;Like&#8221; data at my internship there, it is quite fun); it&#8217;s nice because with MLE or fixed-Dirichlet-MAP estimation it only requires simple counts and no optimization/sampling, so you can use it on very large datasets, and it seems to give good pairwise association results in many circumstances.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Memorizing small tables</title>
		<link>http://brenocon.com/blog/2011/11/memorizing-small-tables/</link>
		<comments>http://brenocon.com/blog/2011/11/memorizing-small-tables/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 18:13:49 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1066</guid>
		<description><![CDATA[Lately, I&#8217;ve been trying to memorize very small tables, especially for better intuitions and rule-of-thumb calculations. At the moment I have these above my desk: The first one is a few entries in a natural logarithm table. There are all &#8230; <a href="http://brenocon.com/blog/2011/11/memorizing-small-tables/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><br />
Lately, I&#8217;ve been trying to memorize very small tables, especially for better intuitions and rule-of-thumb calculations.  At the moment I have these above my desk:</p>
<p><a href="http://brenocon.com/blog/2011/11/memorizing-small-tables/screen-shot-2011-11-11-at-1-04-28-pm-3/" rel="attachment wp-att-1074"><img src="http://brenocon.com/blog/wp-content/uploads/2011/11/Screen-shot-2011-11-11-at-1.04.28-PM1.jpg" alt="" title="Screen shot 2011-11-11 at 1.04.28 PM" width="1061" height="526" class="aligncenter size-full wp-image-1074" /></a></p>
<p>The first one is a few entries in a natural logarithm table.  There are all these stories about how in the slide rule era, people would develop better intuitions about the scale of logarithms because they physically engaged with them all the time.  I spend lots of time looking at log-likelihoods, log-odds-ratios, and logistic regression coefficients, so I think it would be nice to have quick intuitions about what they are.  (Though the <a href="http://www.stat.columbia.edu/~gelman/arm/">Gelman and Hill</a> textbook has an interesting argument against odds scale interpretations of logistic regression coefficients.)</p>
<p>The second one is a set of zsh filename manipulation <a href="http://www.rayninfo.co.uk/tips/zshtips.html">shortcuts</a>.  OK, this is more narrow than the others, but pretty useful for me at least.</p>
<p>The third one is a set of rough unit equivalencies for data rates over time.  I find this very important for quickly determining whether a long-running job is going to take a dozen minutes, or a few hours, or a few days.  In particular, many data transfer commands (scp, wget, s3cmd) immediately tell you a rate per second, which you can then scale up.  (And if you&#8217;re using a CPU-bound pipeline command, you can always use the amazing <a href="http://www.ivarch.com/programs/pv.shtml">pv</a> command to get a rate-per-second estimate.)  This table is inspired by the <a href="http://brenocon.com/dean_perf.html">&#8220;Numbers Everyone Should Know&#8221;</a> list.</p>
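<p>The scaling itself is just multiplication, but it is easy to slip a factor of 60.  A small sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB); the function names and the example numbers are illustrative and not taken from the sticky note:</p>

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def scale_rate(mb_per_second):
    """Scale a rate in MB/s up to (GB per hour, GB per day), with 1 GB = 1000 MB."""
    gb_per_hour = mb_per_second * SECONDS_PER_HOUR / 1000
    gb_per_day = mb_per_second * SECONDS_PER_DAY / 1000
    return gb_per_hour, gb_per_day

def hours_to_finish(total_gb, mb_per_second):
    """Estimated hours to move total_gb of data at a steady mb_per_second."""
    return total_gb * 1000 / mb_per_second / SECONDS_PER_HOUR

# Hypothetical job: scp reports 10 MB/s and there are 500 GB to copy.
per_hour, per_day = scale_rate(10)
eta_hours = hours_to_finish(500, 10)

print(per_hour, per_day, round(eta_hours, 1))
```

<p>So 10 MB/s is 36 GB per hour, or 864 GB per day, and the hypothetical 500 GB copy is an overnight job (roughly 14 hours) rather than a coffee break.</p>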
<p>The fourth one is the <a href="http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval">Clopper-Pearson</a> binomial confidence interval.  Actually, the more useful ones to memorize are <a href="http://brenocon.com/blog/2011/04/rough-binomial-confidence-intervals/">Wald binomial intervals</a>, which are easy because they&#8217;re close to \(\pm 1/\sqrt{n}\).  Good party trick.  This sticky is actually the relevant R calls (type <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/binom.test.html">binom.test</a> and press enter); I was using small-n binomial hypothesis testing a lot recently so wanted to get more used to it.  Maybe this one isn&#8217;t very useful.</p>
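<p>For the Wald rule of thumb, a quick sketch of why \(\pm 1/\sqrt{n}\) works: the 95% Wald half-width is \(1.96\sqrt{\hat{p}(1-\hat{p})/n}\), and at \(\hat{p}=0.5\) that is \(0.98/\sqrt{n}\).  The function names below are my own, and this covers only the Wald interval, not Clopper-Pearson.</p>

```python
import math

def wald_halfwidth(p_hat, n, z=1.96):
    """Half-width of the Wald (normal approximation) binomial interval.

    z = 1.96 corresponds to a 95% interval.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def rule_of_thumb(n):
    """The memorizable approximation: plus or minus 1/sqrt(n)."""
    return 1 / math.sqrt(n)

# With n = 100 observations and an observed proportion of 0.5:
exact = wald_halfwidth(0.5, 100)
approx = rule_of_thumb(100)

print(exact, approx)
```

<p>The rule of thumb is tightest near \(\hat{p}=0.5\); for proportions near 0 or 1 the Wald half-width is smaller than \(1/\sqrt{n}\), so the approximation is conservative there (and the Wald interval itself gets less trustworthy).</p>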
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/11/memorizing-small-tables/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Be careful with dictionary-based text analysis</title>
		<link>http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/</link>
		<comments>http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 16:15:36 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1050</guid>
		<description><![CDATA[OK, everyone loves to run dictionary methods for sentiment and other text analysis &#8212; counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done &#8230; <a href="http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>OK, everyone loves to run dictionary methods for sentiment and other text analysis &#8212; counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus.  In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers&#8217; intuitions), and then proclaim the output yields sentiment levels of the documents.  More and more papers come out every day that do this.  <a href="http://brenocon.com/oconnor_balasubramanyan_routledge_smith.icwsm2010.tweets_to_polls.pdf">I&#8217;ve done this myself.</a>  It&#8217;s interesting and fun, but it&#8217;s easy to get a bunch of meaningless numbers if you don&#8217;t carefully validate what&#8217;s going on.  There are certainly good studies in this area that do further validation and analysis, but it&#8217;s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning.  This happens more than it ought to.</p>
<p>I was happy to see a similarly critical view in a nice working paper by <a href="http://www.justingrimmer.org/">Justin Grimmer</a> and <a href="http://www.gov.harvard.edu/people/brandon-stewart">Brandon Stewart</a>, <a href="http://stanford.edu/~jgrimmer/tad2.pdf">Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts</a>.</p>
<p>Since I think these arguments need to be more widely known, here&#8217;s a long quote from Section 4.1 &#8230; see also the paper for more details (and lots of other interesting stuff).  Emphases are mine.</p>
<blockquote><p>
For dictionary methods to work well, the scores attached to words must closely align with how the words are used in a particular context. If a dictionary is developed for a specific application, then this assumption should be easy to justify. But <strong>when dictionaries are created in one substantive area and then applied to another problems, serious errors can occur</strong>. Perhaps the clearest example of this is shown in Loughran and McDonald (2011).  Loughran and McDonald (2011) critique the increasingly common use of off the shelf dictionaries to measure the tone of statutorily required corporate earning reports in the accounting literature. They point out that many words that have a negative connotation in other contexts, like <em>tax</em>, <em>cost</em>, <em>crude</em> (oil) or <em>cancer</em>, may have a positive connotation in earning reports. For example, a health care company may mention cancer often and oil companies are likely to discuss crude extensively. And words that are not identified as negative in off the shelf dictionaries may have quite negative connotation in earning reports (<em>unanticipated</em>, for example).</p>
<p>Dictionaries, therefore, should be used with substantial caution. Scholars must either explicitly establish that word lists created in other contexts are applicable to a particular domain, or create a problem specific dictionary. In either instance, scholars must validate their results. But <strong>measures from dictionaries are rarely validated. Rather, standard practice in using dictionaries is to assume the measures created from a dictionary are correct and then apply them to the problem.</strong> This is due, in part, to the exceptional difficulties in validating dictionaries. Dictionaries are commonly used to establish granular scales of a particular kind of sentiment, such as tone. While this is useful for applications, the granular measures insure that it is essentially impossible to derive gold standard evaluations based on human coding of documents, because of the difficulty of establishing reliable granular scales from humans (Krosnick, 1999).</p>
<p>The consequence of domain specificity and lack of validation is that <strong>most analyses based on dictionaries are built on shaky foundations.</strong> <strong>Yes, dictionaries are able to produce measures that are claimed to be about tone or emotion, but the actual properties of these measures &#8211; and how they relate to the concepts their attempting to measure &#8211; are essentially a mystery.</strong> Therefore, for scholars to effectively use dictionary methods in their future work, advances in the validation of dictionary methods must be made. We suggest two possible ways to improve validation of dictionary methods. First, the classification problem could be simplified. If scholars use dictionaries to code documents into binary categories (positive or negative tone, for example), then validation based on human gold standards and the methods we describe in Section 4.2.4 is straightforward. Second, scholars could treat measures from dictionaries similar to how we validations from unsupervised methods are conducted (see Section 5.5). This would force scholars to establish that their measures of underlying concepts have properties associated with long standing expectations.
</p></blockquote>
<p>And after an example analysis,</p>
<blockquote><p>
&#8230; we reiterate our skepticism of dictionary based measures. As is standard in the use of dictionary measures (for example, Young and Soroka (2011)) the measures are presented here without validation.  This lack of validation is due in part because <strong>it is exceedingly difficult to demonstrate that our scale of sentiment precisely measures differences in sentiment expressed</strong> towards Russia.  Perhaps this is because <strong>it is equally difficult to define what would constitute these differences in scale</strong>.
</p></blockquote>
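<p>For concreteness, here is a toy sketch of the kind of dictionary method discussed above, run on an earnings-report-style sentence.  The word lists, the scoring formula, and the example sentence are all made up for illustration; they are not from the paper or from any real lexicon.</p>

```python
import re
from collections import Counter

# Hypothetical general-purpose lexicon (not a real published word list).
POSITIVE = {"good", "great", "gain"}
NEGATIVE = {"bad", "cost", "tax", "cancer"}

def dictionary_score(text):
    """Return (pos - neg) / (pos + neg) over lexicon hits, or 0.0 if there are none."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

# A made-up sentence in the style of a health care company's report.
report = "Our cancer drug had a great year despite higher tax and cost."
score = dictionary_score(report)

print(score)
```

<p>The sentence is arguably good news for the company, but the score comes out negative because <em>cancer</em>, <em>tax</em> and <em>cost</em> each count as a negative hit.  That is the domain-mismatch problem in miniature, and nothing in the number itself tells you it happened.</p>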
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Information theory stuff</title>
		<link>http://brenocon.com/blog/2011/09/information-theory-stuff/</link>
		<comments>http://brenocon.com/blog/2011/09/information-theory-stuff/#comments</comments>
		<pubDate>Sun, 25 Sep 2011 21:28:59 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1010</guid>
		<description><![CDATA[Actually this post is mainly to test the MathJax installation I put into WordPress via this plugin. But information theory is great, why not? The probability of a symbol is . It takes bits to encode one symbol &#8212; sometimes &#8230; <a href="http://brenocon.com/blog/2011/09/information-theory-stuff/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<br />
<p>Actually this post is mainly to test the <a href="http://www.mathjax.org/">MathJax</a> installation I put into WordPress via <a href="http://wordpress.org/extend/plugins/mathjax-latex/">this plugin</a>.  But <a href="http://en.wikipedia.org/wiki/Information_theory">information theory</a> is great, why not?</p>
<p>The probability of a symbol is \(p\).</p>
<p>It takes \(\log \frac{1}{p} = -\log p\) bits to encode one symbol &#8212; sometimes called its &#8220;surprisal&#8221;.  Surprisal is 0 for a 100% probable symbol, and grows without bound (toward \(\infty\)) as the symbol&#8217;s probability approaches 0.  This is because you use a coding scheme that encodes common symbols as very short strings, and less common symbols as longer ones.  (E.g. <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman</a> or <a href="http://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic</a> coding.)  We should say logarithms are base-2 so information is measured in bits.\(^*\)</p>
<p>If you have a stream of such symbols and a probability distribution \(\vec{p}\) for them, where a symbol \(i\) comes at probability \(p_i\), then the average message size is the expected surprisal:</p>
<p>\[ H(\vec{p}) = \sum_i p_i \log \frac{1}{p_i} \]</p>
<p>This is the Shannon <b>entropy</b> of the probability distribution \( \vec{p} \), which is a measure of its uncertainty.  In fact, if you start with a few pretty reasonable axioms for how to design a measurement of uncertainty of a discrete probability distribution, you end up with the above equation as the only possible measure.  (I think. This is all in Shannon&#8217;s original paper.)</p>
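<p>A minimal sketch of surprisal and entropy in code, using base-2 logs so the units are bits.  The function names are mine, and zero-probability symbols are skipped in the sum, following the usual convention that \(0 \log \frac{1}{0} = 0\).</p>

```python
import math

def surprisal(p):
    """Bits needed to encode a symbol of probability p (requires p > 0)."""
    return math.log2(1 / p)

def entropy(probs):
    """Shannon entropy in bits: the expected surprisal under probs.

    Terms with p == 0 are skipped, i.e. treated as contributing 0.
    """
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])       # one bit per flip
four_way = entropy([0.25] * 4)        # two bits for four equally likely symbols
certain = entropy([1.0])              # no uncertainty, so zero bits
rare = surprisal(0.125)               # a 1-in-8 symbol costs three bits

print(fair_coin, four_way, certain, rare)
```

<p>The uniform distributions hit the maximum for their alphabet size, and the certain outcome costs nothing, which matches the &#8220;measure of uncertainty&#8221; reading.</p>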
<p>Now, what if you have symbols at a distribution \( \vec{p} \) but you encode them with the wrong distribution \( \vec{q} \)?  You pay \(\log\frac{1}{q_i}\) bits for symbol \(i\), but the expectation is under the true distribution \(\vec{p}\).  Then the average message size is called the <b>cross-entropy</b> between the distributions:</p>
<p>\[ H(\vec{p},\vec{q}) = \sum_i p_i \log \frac{1}{q_i} \]</p>
<p>How much worse is this coding compared to the optimal one?  (I.e. how much of a cost do you pay for encoding with the wrong distribution?)  The optimal one has size \( \sum_i -p_i \log p_i \), so the extra cost is just the difference</p>
<p>\[ \begin{align} KL(\vec{p} || \vec{q}) &#038;= \sum_i \left( -p_i \log q_i + p_i \log p_i \right) \\ &#038;= \sum_i p_i \log \frac{p_i}{q_i} \end{align} \]</p>
<p>which is called the <b>relative entropy</b> or <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>, and it&#8217;s a measurement of the dissimilarity of the distributions \(\vec{p}\) and \(\vec{q}\).  You can see it&#8217;s about dissimilarity because if \(\vec{p}\) and \(\vec{q}\) were the same, then the inner term \(\log\frac{p_i}{q_i}\) would always be 0 and the whole thing comes out to be 0.</p>
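<p>Here is a small numerical check that the KL divergence really is cross-entropy minus entropy.  It is only a sketch: the function names and the two toy distributions are made up, logs are base-2, and it assumes \(q_i &gt; 0\) wherever \(p_i &gt; 0\) (otherwise the divergence is infinite and the code would raise an error).</p>

```python
import math

def entropy(p):
    """H(p) in bits; zero-probability symbols contribute nothing."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: symbols drawn from p, code lengths chosen for q."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits: the extra cost of coding p with q's code."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over three symbols.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_p = entropy(p)            # optimal average message size
h_pq = cross_entropy(p, q)  # average size using the wrong code
kl = kl_divergence(p, q)    # the penalty
same = kl_divergence(p, p)  # identical distributions give zero

print(h_p, h_pq, kl, same)
```

<p>With these numbers the optimal code averages 1.5 bits per symbol, the mismatched code averages 1.75, and the KL divergence is the 0.25-bit difference.  Swapping the arguments generally gives a different value, since the divergence is not symmetric.</p>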
<p>For more, I rather like the early chapters of the free online textbook by <a href="http://www.cs.toronto.edu/~mackay/itila/book.html">David MacKay: &#8220;Information Theory, Inference, and Learning Algorithms&#8221;</a>.  That&#8217;s where I picked up the habit of saying surprisal is \( \log \frac{1}{p} \) instead of \(-\log p\); the former seems more intuitive to me, and then you don&#8217;t have a pesky negative sign in the entropy and cross-entropy equations.  In general the book is great at making things intuitive.  Its main weakness is you can&#8217;t trust the insane negative things he says about frequentist statistics, but that&#8217;s another discussion.</p>
<p>\(^*\) You can use natural logs or whatever and it&#8217;s just different sized units: &#8220;nats&#8221;, as you can see in the fascinating Chapter 18 of MacKay on codebreaking, which features Bletchley Park, Alan Turing, and Nazis.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/09/information-theory-stuff/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>End-to-end NLP packages</title>
		<link>http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/</link>
		<comments>http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 00:31:30 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=995</guid>
		<description><![CDATA[What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures? Lots of NLP research focuses on single tasks at a time, and thus produces software that does &#8230; <a href="http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures?  Lots of NLP research focuses on single tasks at a time, and thus produces software that does a single task at a time.  But for various applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it.</p>
<p>If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren&#8217;t a ton of such end-to-end, multilevel systems.  Here are ones I can think of.  Corrections and clarifications welcome.</p>
<ul>
<li><a href="http://nlp.stanford.edu/software/corenlp.shtml">Stanford CoreNLP</a>.  Raw text to <a href="http://nlp.stanford.edu/software/stanford-dependencies.shtml">rich syntactic dependencies</a> (<a href="http://en.wikipedia.org/wiki/Lexical_functional_grammar">LFG</a>-inspired).  Also POS, NER, coreference.</li>
<li><a href="http://svn.ask.it.usyd.edu.au/trac/candc/wiki">C&amp;C tools</a>.  From (sentence-segmented, tokenized?) text to rich syntactic dependencies (<a href="http://en.wikipedia.org/wiki/Combinatory_categorial_grammar">CCG</a>-based) and also a semantic representation.  POS and chunks on the way.  Does anyone use this much?  It seems underappreciated relative to its richness.</li>
<li><a href="http://ml.nec-labs.com/senna/">Senna</a>.  Sentence-segmented text -> parse trees, plus POS, NER, chunks, and semantic role labeling.  This one is quite new; is it as good?  It doesn&#8217;t give syntactic dependencies, though for some applications semantic role labeling is similar or better (or worse?).  I&#8217;m a little concerned that its documentation seems overly focused on competing in evaluation datasets, as opposed to trying to ensure they&#8217;ve made something more broadly useful.  (To be fair, they&#8217;re focused on developing algorithms that could be broadly applicable to different NLP tasks; that&#8217;s a whole other discussion.)</li>
</ul>
<p>If you want to quickly get some sort of shallow semantic relations, a.k.a. high-level syntactic relations, one of the above packages might be your best bet.  Are there others out there?</p>
<p>Restricting oneself to these full end-to-end systems is also funny since you can mix-and-match components to get better results for what you want.  One example: if you have constituent parse trees and want dependencies, you could swap in the <a href="http://nlp.stanford.edu/software/stanford-dependencies.shtml">Stanford Dependency</a> extractor (or another one like <a href="http://nlp.cs.lth.se/software/treebank_converter/">pennconverter</a>?) to post-process the parses.  Or you could drop the <a href="http://bllip.cs.brown.edu/resources.shtml">Charniak-Johnson</a> or <a href="http://code.google.com/p/berkeleyparser/">Berkeley</a> parser into the middle of the Stanford CoreNLP stack.  Or you could use a direct dependency parser (I think <a href="http://maltparser.org/">Malt</a> is the most popular?) and skip the phrase structure step.  Etc.</p>
<p>It&#8217;s worth noting several other NLP libraries that I see used a lot.  I believe that, unlike the above, they don&#8217;t focus on out-of-the-box end-to-end NLP analysis (though you can certainly use them to perform various parts of an NLP pipeline).</p>
<ul>
<li><a href="http://incubator.apache.org/opennlp/">OpenNLP</a> &#8212; I&#8217;ve never used it but lots of people like it.  Seems well-maintained now?  Does chunking, tagging, even coreference.</li>
<li><a href="http://alias-i.com/lingpipe/">LingPipe</a> &#8212; has lots of individual algorithms and high-quality implementations.  Only chunking and tagging (I think).  It&#8217;s only quasi-free.</li>
<li><a href="http://mallet.cs.umass.edu/">Mallet</a> &#8212; focuses on information extraction and topic modeling, so slightly different than the other packages listed here.</li>
<li><a href="http://www.nltk.org/">NLTK</a> &#8212; I always have a hard time telling what this actually does, compared to what it aims to teach you to do.  It seems to do various tagging and chunking tasks.  I use the nltk_data.zip archive all the time though (I can&#8217;t find a direct download link unfortunately), for its stopword lists and small toy corpora.  (Including the <a href="http://en.wikipedia.org/wiki/Brown_Corpus">Brown Corpus</a>!  I guess it now counts as a toy corpus since you can grep it in less than a second.)</li>
</ul>
<p>These packages are nice in terms of documentation and software engineering, but they don&#8217;t do any syntactic parsing or other shallow relational extraction.  (NLTK has some libraries that appear to do parsing and semantics, but it&#8217;s hard to tell how serious they are.)</p>
<p>Oh, finally, there&#8217;s also <a href="http://uima.apache.org/">UIMA</a>, which isn&#8217;t really a tool, but rather a high-level API for integrating your tools.  <a href="http://gate.ac.uk/">GATE</a> also heavily emphasizes the framework aspect, but does come with some sort of tools.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
	</channel>
</rss>

