How much text versus metadata is in a tweet?

This should have been a blog post, but I got lazy and wrote a plaintext document instead.

For Twitter, context matters: 90% of a tweet is metadata and 10% is text. That’s measured by (an approximation of) information content; by raw data size, it’s 95/5.
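For the curious, here’s a minimal sketch of the kind of measurement meant by “(an approximation of) information content”: zlib-compress the tweet’s text and the rest of its JSON separately and compare the sizes. The payload below is made up (loosely modeled on the old REST API tweet objects, with only a handful of fields), and compressing one short tweet in isolation inflates both numbers with compressor overhead, so a real estimate should concatenate many tweets per side; this only illustrates the bookkeeping.

    import json
    import zlib

    # Made-up tweet payload, loosely modeled on the old Twitter REST API JSON.
    # Field names and values are illustrative, not real data; actual tweet
    # objects carry far more metadata than this.
    tweet = {
        "id_str": "123456789012345678",
        "created_at": "Wed Aug 27 13:08:45 +0000 2008",
        "text": "just setting up my example tweet",
        "source": '<a href="http://twitter.com">web</a>',
        "lang": "en",
        "retweet_count": 0,
        "user": {
            "id_str": "987654321",
            "screen_name": "example_user",
            "name": "Example User",
            "created_at": "Mon Jan 01 00:00:00 +0000 2007",
            "followers_count": 42,
            "friends_count": 100,
        },
    }

    def compressed_size(s: str) -> int:
        """Approximate information content in bytes: zlib at maximum effort."""
        return len(zlib.compress(s.encode("utf-8"), 9))

    text = tweet["text"]
    metadata = json.dumps({k: v for k, v in tweet.items() if k != "text"})

    raw_text, raw_meta = len(text.encode("utf-8")), len(metadata.encode("utf-8"))
    zip_text, zip_meta = compressed_size(text), compressed_size(metadata)

    print(f"raw bytes:        text={raw_text}  metadata={raw_meta}  "
          f"text share={raw_text / (raw_text + raw_meta):.0%}")
    print(f"compressed bytes: text={zip_text}  metadata={zip_meta}  "
          f"text share={zip_text / (zip_text + zip_meta):.0%}")

With a realistic, fully populated tweet object the metadata side is far larger than this toy one, which is where ratios like 90/10 and 95/5 come from.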


2 Responses to How much text versus metadata is in a tweet?

  1. Bob says:

    Nice! Compressibility’s totally the way to measure this. The only question is what scale you do the compression on.

    With implementations like zip, we wind up providing upper bounds on the amount of information. PPM with longer contexts is even better, but it still doesn’t get close to simple smoothed language models because of its online constraints. (A toy sketch of this compressor-versus-model comparison follows these responses.) There are some really cool nonparametric hierarchical Bayesian language models that do even better, which Frank Wood, Nick Bartlett, David Pfau, Yee Whye Teh, and a few others developed:

    http://www.stat.columbia.edu/~fwood/Papers/Bartlett-DCC-2011.pdf

  2. brendano says:

    Thanks for the comment, Bob, I really appreciate it. Somehow I missed the literature that compares PPM to what are now standard smoothed LMs. That paper looks great.
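To make the upper-bound point from the first response concrete: a general-purpose compressor’s output size is one legitimate code length for a text, and so is the total −log2 probability assigned by any model that conditions only on characters it has already seen. A toy sketch, assuming nothing beyond an adaptive Laplace-smoothed character-bigram model (not PPM, and not the model in the linked paper), compares that code length against zlib on a made-up sample string:

    import math
    import zlib
    from collections import defaultdict

    def zlib_bits(text: str) -> float:
        """Code length from a real compressor: 8 * compressed size in bytes."""
        return 8.0 * len(zlib.compress(text.encode("utf-8"), 9))

    def online_bigram_bits(text: str, alphabet_size: int = 256) -> float:
        """Code length under an adaptive Laplace-smoothed character bigram model.

        Like zip and PPM, the model conditions only on characters it has
        already seen (the online constraint), so summing -log2 p(c | prev)
        gives a legitimate code length. Assumes at most 256 distinct symbols.
        """
        counts = defaultdict(lambda: defaultdict(int))  # counts[prev][c]
        totals = defaultdict(int)                       # totals[prev]
        bits, prev = 0.0, "\x00"                        # dummy start symbol
        for c in text:
            p = (counts[prev][c] + 1) / (totals[prev] + alphabet_size)
            bits -= math.log2(p)
            counts[prev][c] += 1
            totals[prev] += 1
            prev = c
        return bits

    # Made-up stand-in for a pile of tweet text with repetitive statistics.
    sample = "a made-up stand-in for a pile of tweet text, repeated over and over " * 50
    print(f"zlib:               {zlib_bits(sample):9.0f} bits")
    print(f"online char bigram: {online_bigram_bits(sample):9.0f} bits")

Either number, divided by the length of the text, is an upper bound on bits per character; better sequence models push that bound down, which is the appeal of the language models the comment points to.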