This should have been a blog post, but I got lazy and wrote a plaintext document instead.

For Twitter, context matters: 90% of a tweet is metadata and 10% is text. That’s measured by (an approximation of) information content; by raw data size, it’s 95/5.

Nice! Compressibility’s totally the way to measure this. The only question is at what scale you do the compression.

Implementations like zip wind up providing upper bounds on the amount of information. PPM with longer contexts does better, but it still doesn’t get close to simple smoothed language models because of its online constraints. There are some really cool nonparametric hierarchical Bayesian language models, better still, that Frank Wood, Nick Bartlett, David Pfau, Yee Whye Teh, and a few others developed:

http://www.stat.columbia.edu/~fwood/Papers/Bartlett-DCC-2011.pdf
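The compression-as-upper-bound idea is easy to try yourself. A minimal sketch using Python’s standard zlib: the compressed size in bits bounds the information content of a string from above, so you can compare a tweet’s text against its metadata (the metadata string here is purely hypothetical, for illustration).

```python
import zlib

def info_upper_bound_bits(text: str, level: int = 9) -> int:
    """Compressed size in bits: an upper bound on the information content.

    zlib has fixed per-stream overhead, so the bound is loose for short
    strings, but relative comparisons are still informative.
    """
    return 8 * len(zlib.compress(text.encode("utf-8"), level))

# Hypothetical tweet text and metadata, just to illustrate the comparison.
tweet_text = "Just landed in SF!"
tweet_metadata = '{"user_id": 12345, "followers": 678, "geo": [37.77, -122.42], "client": "web"}'

for label, s in [("text", tweet_text), ("metadata", tweet_metadata)]:
    print(label, info_upper_bound_bits(s), "bits")
```

Highly repetitive strings compress to far fewer bits than varied ones, which is exactly the behavior you want from an information estimate, even if zip-style compressors only give a loose upper bound.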

Thanks for the comment, Bob; I really appreciate it. Somehow I missed the literature comparing PPM to what are now standard smoothed LMs. That paper looks great.