Author Archives: brendano

Berkeley SDA and the General Social Survey

Posted on August 21, 2012

It is worth contemplating how grand the General Social Survey is. When playing around with the Statwing YC demo (which is very cool!) I was reminded of the very old-school SDA web tool for exploratory cross-tabulation analyses… They have the … Continue reading →

2 Comments

Posted on July 23, 2012

Re: “So I just wrote this hierarchical kernelized Boltzmann process in Prolog using ed on my iPhone. I can send you the RCS repository.” (ML Hipster, @ML_Hipster, July 19, 2012) The best I can do is: I once programmed … Continue reading →

1 Comment

p-values, CDF’s, NLP etc.

Posted on July 17, 2012

Update Aug 10: THIS IS NOT A SUMMARY OF THE WHOLE PAPER! It’s whining about one particular method of analysis before talking about other things further down. A quick note on Berg-Kirkpatrick et al., EMNLP-2012, “An Empirical Investigation of Statistical … Continue reading →

3 Comments

The $60,000 cat: deep belief networks make less sense for language than vision

Posted on July 4, 2012

There was an interesting ICML paper this year about very large-scale training of deep belief networks (a.k.a. neural networks) for unsupervised concept extraction from images. They (Quoc V. Le and colleagues at Google/Stanford) have a cute example of learning very … Continue reading →

3 Comments

F-scores, Dice, and Jaccard set similarity

Posted on April 11, 2012

The Dice similarity is the same as F1-score; and they are monotonic in Jaccard similarity. I worked this out recently but couldn’t find anything about it online so here’s a writeup. Let $A$ be the set of found items, and … Continue reading →
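
As a quick numerical check of the claim in this excerpt, here is a minimal Python sketch. The set names `found` and `true_items`, the example items, and the helper functions are illustrative assumptions, not taken from the original post. It computes F1 from precision and recall, computes Dice and Jaccard directly from the sets, and confirms that F1 equals Dice and that Dice equals 2J/(1+J), which is increasing in J on [0, 1].

```python
def f1_score(found, true_items):
    """F1 from precision and recall; assumes both sets are non-empty."""
    tp = len(found & true_items)
    if tp == 0:
        return 0.0
    precision = tp / len(found)
    recall = tp / len(true_items)
    return 2 * precision * recall / (precision + recall)


def dice(a, b):
    """Dice similarity: twice the overlap over the sum of the set sizes."""
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity: overlap over the size of the union."""
    return len(a & b) / len(a | b)


# Hypothetical example sets, chosen only for illustration.
found = {"a", "b", "c", "d"}
true_items = {"c", "d", "e"}

f1 = f1_score(found, true_items)
d = dice(found, true_items)
j = jaccard(found, true_items)

print(f1, d, j)
# Dice written as a function of Jaccard: D = 2J / (1 + J).
print(abs(d - 2 * j / (1 + j)) < 1e-12)
```

Because 2J/(1+J) is strictly increasing for J between 0 and 1, ranking candidate sets by Dice (or F1) gives the same order as ranking them by Jaccard.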

2 Comments

Cosine similarity, Pearson correlation, and OLS coefficients

Posted on March 13, 2012

Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product — tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors $x$ … Continue reading →
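
The inner-product view in this excerpt can be sketched in a few lines of plain Python. This is only an illustration under standard textbook definitions, not the derivation from the full post; all function names and example vectors are assumptions made for the sketch. Each quantity is an inner product of $x$ and $y$, differing only in whether the vectors are centered first and what the product is divided by.

```python
import math


def dot(x, y):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def center(x):
    """Subtract the mean from every element."""
    m = sum(x) / len(x)
    return [a - m for a in x]


def cosine(x, y):
    # Inner product normalized by both vector lengths.
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def pearson(x, y):
    # Cosine similarity after centering both vectors.
    return cosine(center(x), center(y))


def ols_slope_no_intercept(x, y):
    # Least-squares slope for y ~ b*x: normalized by x only.
    return dot(x, y) / dot(x, x)


def ols_slope_with_intercept(x, y):
    # Least-squares slope for y ~ a + b*x: center first, normalize by x only.
    xc = center(x)
    return dot(xc, center(y)) / dot(xc, xc)


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

print(cosine(x, y), pearson(x, y))
print(ols_slope_no_intercept(x, y), ols_slope_with_intercept(x, y))
```

Cosine and Pearson are symmetric in $x$ and $y$ and bounded in [-1, 1] because they normalize by both vectors; the OLS slopes normalize by $x$ alone, so they are asymmetric and unbounded.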

23 Comments

I don’t get this web parsing shared task

Posted on March 9, 2012

The idea for a shared task on web parsing is really cool. But I don’t get this one: Shared Task – SANCL 2012 (First Workshop on Syntactic Analysis of Non-Canonical Language). They’re explicitly banning: manually annotating in-domain (web) sentences; creating … Continue reading →

5 Comments

Save Zipf’s Law (new anti-credulous-power-law article)

Posted on February 14, 2012

To the delight of those of us enjoying the ride on the anti-power-law bandwagon (bandwagons are ok if it’s a backlash to another bandwagon), Cosma links to a new article in Science, “Critical Truths About Power Laws,” by Stumpf and … Continue reading →

4 Comments

Histograms — matplotlib vs. R

Posted on February 2, 2012

When possible, I like to use R for its really, really good statistical visualization capabilities. I’m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison … Continue reading →

8 Comments

Bayes update view of pointwise mutual information

Posted on November 13, 2011

This is fun. Pointwise Mutual Information (e.g. Church and Hanks 1990) between two variable outcomes $x$ and $y$ is \[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \] It’s called “pointwise” because Mutual Information, between two (discrete) variables X and Y, is the … Continue reading →
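
One way to read the "Bayes update" in the title, sketched here as an assumption since the excerpt is cut off before the argument: dividing the numerator and denominator of the PMI formula by $p(y)$ gives $PMI(x,y) = \log p(x \mid y) - \log p(x)$, so PMI is the amount added to the log prior of $x$ to get its log posterior after observing $y$. The toy joint distribution and all names below are made up for illustration.

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {
    ("rain", "umbrella"): 0.3,
    ("rain", "no_umbrella"): 0.1,
    ("dry", "umbrella"): 0.1,
    ("dry", "no_umbrella"): 0.5,
}


def marginal_x(x):
    """p(x): sum the joint over all y."""
    return sum(p for (a, _), p in joint.items() if a == x)


def marginal_y(y):
    """p(y): sum the joint over all x."""
    return sum(p for (_, b), p in joint.items() if b == y)


def pmi(x, y):
    """PMI(x, y) = log p(x, y) / (p(x) p(y)), in nats."""
    return math.log(joint[(x, y)] / (marginal_x(x) * marginal_y(y)))


prior = marginal_x("rain")                                       # p(x)
posterior = joint[("rain", "umbrella")] / marginal_y("umbrella")  # p(x | y)
update = pmi("rain", "umbrella")

# Log posterior = log prior + PMI.
print(math.log(posterior), math.log(prior) + update)
```

A positive PMI means observing $y$ raises the probability of $x$, a negative PMI means it lowers it, and zero means the observation leaves the prior unchanged.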