Category Archives: Uncategorized

What inputs do Monte Carlo algorithms need?

Monte Carlo sampling algorithms (MCMC or otherwise) aim to draw samples from a distribution. They can be organized by what inputs or prior knowledge about the distribution they require. This ranges from a low amount of knowledge, … Continue reading

2 Comments

Rise and fall of Dirichlet process clusters

Here’s Gibbs sampling for a Dirichlet process 1-d mixture of Gaussians, on 1000 data points that look like this. I gave it fixed variance and a fixed concentration, and over MCMC iterations it looks like this. The top is the … Continue reading
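The post's exact model and hyperparameters aren't shown in this excerpt; as a rough sketch, a collapsed Gibbs sampler for a DP mixture of fixed-variance 1-d Gaussians (Chinese-restaurant-process representation; the concentration `alpha`, likelihood variance `sigma2`, and the N(`mu0`, `tau2`) base measure are illustrative choices, not the post's) might look like:

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predictive(x, n, s, sigma2, mu0, tau2):
    # Posterior predictive density of x for a cluster currently holding
    # n points with sum s (Gaussian likelihood, Gaussian prior on the mean).
    prec = 1.0 / tau2 + n / sigma2
    post_mean = (mu0 / tau2 + s / sigma2) / prec
    post_var = 1.0 / prec
    return normal_pdf(x, post_mean, post_var + sigma2)

def dp_gibbs(data, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=10.0, iters=50, seed=0):
    rng = random.Random(seed)
    z = [0] * len(data)                 # start with everyone in one cluster
    counts = {0: len(data)}
    sums = {0: sum(data)}
    next_id = 1
    for _ in range(iters):
        for i, x in enumerate(data):
            k = z[i]
            counts[k] -= 1
            sums[k] -= x
            if counts[k] == 0:
                del counts[k], sums[k]
            # CRP prior weight times posterior predictive likelihood
            ks = list(counts)
            weights = [counts[k2] * predictive(x, counts[k2], sums[k2],
                                               sigma2, mu0, tau2) for k2 in ks]
            weights.append(alpha * predictive(x, 0, 0.0, sigma2, mu0, tau2))
            r = rng.random() * sum(weights)
            for k2, w in zip(ks + [next_id], weights):
                r -= w
                if r <= 0:
                    break
            if k2 == next_id:           # opened a new table
                next_id += 1
            z[i] = k2
            counts[k2] = counts.get(k2, 0) + 1
            sums[k2] = sums.get(k2, 0.0) + x
    return z

# Two well-separated bumps should end up in (at least) two clusters.
data = [-5.2, -5.0, -4.8, -5.1, -4.9, 4.8, 5.0, 5.2, 5.1, 4.9]
z = dp_gibbs(data, sigma2=0.5)
print(z)
```

The collapsed form integrates out each cluster's mean, so the state is just the assignment vector plus per-cluster counts and sums.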

2 Comments

Correlation picture

Paul Moore posted a comment pointing out this great discussion of the correlation coefficient: Joseph Lee Rodgers and W. Alan Nicewander. “Thirteen Ways to Look at the Correlation Coefficient.” The American Statistician, Vol. 42, No. 1. (Feb., 1988), pp. 59-66. … Continue reading
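To give a flavor of the paper's theme (that r has many equivalent faces), here is a small sketch with made-up data, showing r computed as standardized covariance and then recovered as the geometric mean of the two regression slopes — one of Rodgers and Nicewander's thirteen views:

```python
import math

def pearson_r(xs, ys):
    # r as the covariance of x and y scaled by both standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(xs, ys)

# Another of the "thirteen ways": r^2 equals the product of the
# y-on-x regression slope and the x-on-y regression slope.
mx, my = sum(xs) / 5, sum(ys) / 5
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
b_yx = sxy / sum((x - mx) ** 2 for x in xs)
b_xy = sxy / sum((y - my) ** 2 for y in ys)
assert math.isclose(r * r, b_yx * b_xy)
print(r)
```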

Leave a comment

R scan() for quick-and-dirty checks

One of my favorite R tricks is scan(). I was recently using it to verify a sampler I wrote, which was supposed to output numbers uniformly between 1 and 100 into a logfile; this loads the logfile, counts the … Continue reading
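The post's actual one-liner is in R; as an analogous quick-and-dirty check in Python (with simulated draws standing in for the logfile the post reads, since the real file isn't shown):

```python
import random
from collections import Counter

# Stand-in for the logfile: simulated output of a sampler that
# should be uniform on 1..100.
random.seed(0)
draws = [random.randint(1, 100) for _ in range(100_000)]

counts = Counter(draws)
# Quick sanity check: all 100 values appear, each near 1000 times.
print(len(counts), min(counts.values()), max(counts.values()))
```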

Leave a comment

Really liking whoever made @SottedReviewer, e.g. BEFORE I START TALKING ABOUT RANDOM FORESTS AND AUCS, I’LL THROW A BONE TO SOCIAL SCIENTISTS WITH A GRANOVETTER CITE. #ICWSM — Sotted Reviewer (@SottedReviewer) March 11, 2013 There’s an entire great story here … Continue reading

Leave a comment

Wasserman on Stats vs ML, and previous comparisons

Larry Wasserman has a new position paper (forthcoming 2013) with a great comparison of the Statistics and Machine Learning research cultures, “Rise of the Machines”. He has a very conciliatory view in terms of intellectual content, and a very pro-ML take … Continue reading

Leave a comment

Perplexity as branching factor; as Shannon diversity index

A language model’s perplexity is exponentiated negative average log-likelihood, $$\exp\left( -\frac{1}{N} \log p(x) \right)$$ where the inner term usually decomposes into a sum over individual items; for example, as \(\sum_i \log p(x_i | x_1..x_{i-1})\) or \(\sum_i \log p(x_i)\) depending on independence assumptions, … Continue reading
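The branching-factor reading of the formula above can be checked directly: a model that is uniform over a vocabulary of k items has perplexity exactly k. A minimal sketch:

```python
import math

def perplexity(logprobs):
    # logprobs: natural-log probability the model assigns to each item
    return math.exp(-sum(logprobs) / len(logprobs))

# Uniform model over 100 outcomes: every item gets probability 1/100,
# so perplexity — the "effective branching factor" — is 100.
lp = [math.log(1 / 100)] * 50
print(perplexity(lp))
```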

Leave a comment

Graphs for SANCL-2012 web parsing results

I was just looking at some papers from the SANCL-2012 workshop on web parsing from June this year, which are very interesting to those of us who wish we had good parsers for non-newspaper text. The shared task focus was … Continue reading

1 Comment

Powerset’s natural language search system

There’s a lot to say about Powerset, the short-lived natural language search company (2005-2008) where I worked after college. AI overhype, flying too close to the sun, the psychology of tech journalism and venture capitalism, etc. A year or two … Continue reading

1 Comment

CMU ARK Twitter Part-of-Speech Tagger – v0.3 released

We’re pleased to announce a new release of the CMU ARK Twitter Part-of-Speech Tagger, version 0.3. The new version is much faster (40x) and more accurate (89.2% → 92.8%) than before. We have also released new POS-annotated data, including a … Continue reading

Leave a comment