Author Archives: brendano

1 billion web page dataset from CMU

This is fun: Jamie Callan’s group at CMU LTI just finished a crawl of 1 billion web pages. It’s 5 terabytes compressed, big enough that they have to send it to you by mailing hard drives. Link: ClueWeb09 … Continue reading

6 Comments

Pirates killed by President

A lesson in x-axis scaling, and in choosing which data to compare. Two current graphs making the rounds on the internet: (about this.)

Binary classification evaluation in R via ROCR

A binary classifier makes decisions with confidence levels. Usually it’s imperfect: wherever you put a decision threshold, some items will fall on the wrong side, and those are the errors. I made this diagram a while ago for Turker voting; same principle … Continue reading
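
To make the threshold idea concrete, here is a minimal sketch in plain Python (not ROCR, and not code from the post). The scores and labels are made up for illustration, and `confusion_counts` is a hypothetical helper name. It counts which items land on the wrong side as the threshold moves.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1  # correctly flagged
        elif predicted_positive and not label:
            fp += 1  # wrong side: flagged but actually negative
        elif not predicted_positive and label:
            fn += 1  # wrong side: missed a true positive
        else:
            tn += 1  # correctly left alone
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Made-up confidence scores and true labels (1 = positive), illustration only.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0]

for threshold in (0.25, 0.50, 0.75):
    print(threshold, confusion_counts(scores, labels, threshold))
```

With these toy numbers no threshold is error-free: lowering it trades false negatives for false positives, and raising it does the reverse.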

5 Comments

Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis.  I think most people choose one based on what people around them use or what they learn in … Continue reading

185 Comments

La Jetee

From here.

2 Comments

“Logic Bomb”

Article: Fannie Mae Logic Bomb Would Have Caused Weeklong Shutdown | Threat Level from Wired.com. I love the term “logic bomb”.  Can you pair it with a statistics bomb?  Data-driven bomb?  Or maybe the point is a connectionist bomb.

SF conference for data mining mercenaries

I got an email from a promoter for Predictive Analytics World, a very expensive conference next month in San Francisco for business applications of data mining / machine learning / predictive analytics.  I’m not going because I don’t want to … Continue reading

2 Comments

Love it and hate it, R has come of age

Seeing a long, lavish article about R in the NEW YORK TIMES (!) really freaks me out. replicate(100, c( “OMG OMG, R is now famous?!”, “People used to make fun of me for learning R since Splus is SO OLD!”, … Continue reading

6 Comments

Facebook sentiment mining predicts presidential polls

I’m a bit late blogging this, but here’s a messy, exciting — and statistically validated! — new online data source. My friend Roddy at Facebook wrote a post describing their sentiment analysis system, which can evaluate positive or negative sentiment … Continue reading

7 Comments

Information cost and genocide

In 1994, the Rwandan genocide claimed 800,000 lives.  This genocide was remarkable for being very low-tech — lots of non-military, average people with machetes killing their neighbors.  Romeo Dallaire, the leader of the small UN peacekeeping mission there, saw it … Continue reading
