Category Archives: Uncategorized

FFT: Friedman + Fortran + Tricks

Posted on July 22, 2009

…is a tongue-in-cheek phrase from Trevor Hastie’s very fun-to-read useR-2009 presentation, from the merry trio of Hastie, Friedman, and Tibshirani, who brought us, among other things, the excellent Elements of Statistical Learning textbook. It’s a joy to read sophisticated … Continue reading →

1 Comment

Beta conjugate explorer

Posted on July 15, 2009

Here’s a little interactive explorer for the beta probability distribution, a conjugate prior for the Bernoulli under Bayesian inference… Ack, too much jargon. Simply press the right arrow every time you see the sun rise, the up arrow when it … Continue reading →

5 Comments
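
As a rough illustration of the conjugate update behind an explorer like this, here is a minimal sketch. It assumes a uniform Beta(1, 1) starting prior and a hypothetical `Beta` class; neither comes from the original post. Each observed success adds one to alpha, each failure adds one to beta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) belief about a Bernoulli success probability."""

    alpha: float = 1.0  # assumption: uniform Beta(1, 1) prior, not from the post
    beta: float = 1.0

    def update(self, success: bool) -> "Beta":
        # Conjugacy: the posterior is again a Beta, with one count incremented.
        if success:
            return Beta(self.alpha + 1, self.beta)
        return Beta(self.alpha, self.beta + 1)

    @property
    def mean(self) -> float:
        # Posterior mean of the success probability.
        return self.alpha / (self.alpha + self.beta)


# Hypothetical observation sequence: three successes, then one failure.
posterior = Beta()
for saw_success in [True, True, True, False]:
    posterior = posterior.update(saw_success)

print(posterior, round(posterior.mean, 3))
```

Because the update only increments a count, the whole history of observations is summarized by the two parameters.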

Michael Jackson in Persepolis

Posted on June 26, 2009

Michael Jackson just died while Iran is in turmoil. I am reminded of a passage in Marjane Satrapi’s wonderful graphic novel Persepolis, a memoir of growing up in revolutionary Iran in the ’80s. (Read the book to see how it … Continue reading →

2 Comments

Psychometrics quote

Posted on June 14, 2009

It is rather surprising that systematic studies of human abilities were not undertaken until the second half of the last century… An accurate method was available for measuring the circumference of the earth 2,000 years before the first systematic measures … Continue reading →

2 Comments

June 4

Posted on June 4, 2009

BBC News – June 4, 1989, Tiananmen Square Massacre. Also worth reading: Nicholas Kristof’s riveting firsthand account.

Where tweets get sent from

Posted on May 27, 2009

Playing around with stream.twitter.com/spritzer, ggplot2 and maps / mapdata: I think I like the top better, without the map lines, like those night satellite photos: pointwise ghosts of high-end human economic development. This data is a fairly extreme sample of … Continue reading →

Zipf’s law and world city populations

Posted on May 24, 2009

Will Fitzgerald just wrote about an excellent article by Steven Strogatz on Zipf’s Law for the populations of cities. If you look at the biggest city, then the next biggest city, etc., there tends to be a power-law fall-off in … Continue reading →

13 Comments
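
Zipf’s law is usually stated as a power law: the r-th largest city has roughly 1/r the population of the largest, so a plot of log size against log rank is close to a straight line with slope near -1. The sketch below checks that slope with an ordinary least-squares fit. The `loglog_slope` helper is hypothetical and the data is synthetic, generated from the ideal law rather than taken from real city populations.

```python
import math


def loglog_slope(sizes):
    """Least-squares slope of log(size) against log(rank), largest first."""
    ranked = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic data (assumption): 50 "cities" following size = 10,000,000 / rank.
synthetic = [10_000_000 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic)
print(round(slope, 6))
```

Real population data would scatter around the line, so the fitted slope would only approximate -1.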

Performance comparison: key/value stores for language model counts

Posted on April 22, 2009

I’m doing word and bigram counts on a corpus of tweets. I want to store and rapidly retrieve them later for language model purposes. So there’s a big table of counts that get incremented many times. The easiest way to … Continue reading →

28 Comments
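
For a sense of the workload, here is a minimal in-memory sketch of the counting step. It uses Python’s `collections.Counter` with naive whitespace tokenization, both of which are assumptions for illustration and not one of the stores compared in the post. The `count_ngrams` helper and the sample tweets are hypothetical.

```python
from collections import Counter


def count_ngrams(tweets):
    """Count words and adjacent word pairs across an iterable of tweet texts."""
    unigrams, bigrams = Counter(), Counter()
    for text in tweets:
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        tokens = text.lower().split()
        unigrams.update(tokens)
        # zip pairs each token with its successor, yielding the bigrams.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Hypothetical sample input.
tweets = ["good morning twitter", "Good morning world"]
unigrams, bigrams = count_ngrams(tweets)
print(unigrams["good"], bigrams[("good", "morning")])
```

Every token triggers an increment, which is why the update cost of whatever store backs the table matters so much at corpus scale.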

1 billion web page dataset from CMU

Posted on April 17, 2009

This is fun: Jamie Callan’s group at CMU LTI just finished a crawl of 1 billion web pages. It’s 5 terabytes compressed, big enough that they have to send it to you by mailing hard drives. Link: ClueWeb09 … Continue reading →

6 Comments

Pirates killed by President

Posted on April 15, 2009

A lesson in x-axis scaling, and choosing which data to compare. Two current graphs making their rounds on the internet: (about this.)