Earlier this week I asked the question,
People sent me lots of submissions! Some are great, some are a bit of a stretch.
- Overpriced by an order of magnitude.
- The letters of “art” are found embedded, in order, in “pharmaceuticals”.
- Search keywords that cost the most to advertise on?
- “Wyeth”: I think this means this, and this.
- “Romeo and Juliet” famously includes both “art” (wherefore art thou) and pharmaceuticals (poison!)
- Some art has been created out of pharmaceuticals.
- Some art has been created under the influence of pharmaceuticals.
I was asking because I was playing around with a dataset of 100,000 noun phrases’ appearances on the web, from the Reading the Web project at CMU. That is, for a noun like “art”, this data has a large list of phrases in which the word “art” is used, across some 200 million web pages. For two noun concepts, we can see what they have in common and what’s different by looking at examples of how people use them when writing. So, for “art” versus “pharmaceuticals”:
common contexts for “art” but not “pharmaceuticals” [7394 total] | common contexts for both “art” and “pharmaceuticals” [165 total] | common contexts for “pharmaceuticals” but not “art” [206 total] |
---|---|---|
’m into _ | areas such as _ | a greater amount of _ |
’s interested in _ | prices of _ | standards for _ |
A collection of _ | storage of _ | marketer of _ |
_ has been described by | producers of _ | market for _ |
structure of _ | _ designed for | prescriptions for _ |
study in _ | the provision of _ | the supply of _ |
_ have been shown in | _ sold in | the availability of _ |
The knowledge of _ | the same way as _ | advertising for _ |
_ is a commodity | _ are among | the appropriate use of _ |
_ is a creation | The production of _ | shipment of _ |
_ is a world | the analysis of _ | a cocktail of _ |
an exhibition of _ | advances in _ | classes of _ |
the commercialization of _ | specialising in _ | a complete inventory of _ |
the confinement of _ | a career in _ | _ related downloads |
_ is cast in | _ stolen from | new generations of _ |
The middle column, showing ways in which people talk about both “art” and “pharmaceuticals”, makes it pretty clear. What they have in common is that they’re both products: you can buy, sell, produce, and store them. (There’s also an intellectual goods aspect: they both can be stolen.) This really didn’t occur to me at first; silly me, I thought art was a thing of beauty removed from such mundane considerations. A number of the submitted answers, though, center around the theme of them both being expensive — so we have positive agreement between corpus statistics and human judgments!
Examining massive numbers of contexts like this follows what the infinitely wise Dinosaur Comics calls “a statistically-based descriptivist approach to semantics.” Or as linguist J.R. Firth put it, “You shall know a word by the company it keeps.” Many subtleties of the two concepts can be seen just in their context lists. For example, in the left column, we see that only art “is a commodity”. Well, certainly pharmaceuticals are a commodity too. But that’s so obvious it’s not worth saying. Proclaiming that “art is a commodity”, however, is interesting. Maybe that is why people write about this (possible) fact more often for art.
As for the data: it comes from 200 million web pages (500 million sentences), and is filtered to contexts that appear more than five hundred times in the data. It was collected as part of a research project that seeks to extract a database of knowledge from this information — “reading the web”. (Yes, Hadoop was involved.) To make the table, I took the contexts’ set differences and intersection and showed a random subsample from each.
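For the curious, here is a minimal sketch of that table-building step in Python. It is not the project's actual pipeline (that ran on Hadoop over far more data); the `(noun, context, count)` record layout, the function names `contexts_by_noun` and `compare`, and the toy counts are all assumptions made up for illustration. It keeps contexts whose total count in the data exceeds the threshold, then takes the two set differences and the intersection and draws a random subsample from each.

```python
import random
from collections import Counter, defaultdict


def contexts_by_noun(triples, min_context_count=500):
    """Map each noun phrase to the set of contexts it appears in.

    `triples` holds (noun, context, count) records. This layout is a
    hypothetical stand-in, not the real data format. A context is kept
    only if its total count across all nouns exceeds the threshold, so
    an individual noun/context count can still be small.
    """
    triples = list(triples)
    totals = Counter()
    for _noun, context, count in triples:
        totals[context] += count

    by_noun = defaultdict(set)
    for noun, context, _count in triples:
        if totals[context] > min_context_count:
            by_noun[noun].add(context)
    return by_noun


def compare(by_noun, a, b, k=15, seed=0):
    """Return {column name: (total size, random subsample)} for a vs. b."""
    rng = random.Random(seed)
    ca, cb = by_noun.get(a, set()), by_noun.get(b, set())
    columns = {
        f"{a} only": ca - cb,   # set difference
        "both": ca & cb,        # intersection
        f"{b} only": cb - ca,   # set difference
    }
    return {
        name: (len(s), rng.sample(sorted(s), min(k, len(s))))
        for name, s in columns.items()
    }


# Toy records with invented counts, just to exercise the functions.
toy = [
    ("art", "an exhibition of _", 900),
    ("art", "prices of _", 400),
    ("pharmaceuticals", "prices of _", 300),
    ("pharmaceuticals", "prescriptions for _", 800),
    ("art", "_ is rare", 3),
]
by_noun = contexts_by_noun(toy)
result = compare(by_noun, "art", "pharmaceuticals")
for name, (total, sample) in result.items():
    print(f"{name} [{total} total]: {sample}")
```

Sorting each set before sampling and fixing the seed only makes the subsample repeatable; drop both if you want a fresh sample each run.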
A final note. Will pointed out that in Alice in Wonderland, the Mad Hatter asks, “Why is a raven like a writing desk?” I tried that query on this data, but unfortunately, it didn’t contain many instances of “raven”. However, it does include a proper name “Raven” — which turns out to be an anime character. Not the first time I’ve seen the Internet’s massive amount of anime knowledge get in the way of a very serious semantic extraction system!
Many thanks to Adam, Joanna, Will, Vikas, and Michael for the submitted answers.
Nice post, Brendan. I wonder what the lists would look like if you replaced words with Wordnet synsets to make the “occurs with both art and pharmaceuticals” criterion less dependent on specific words chosen.
A minor correction: that should be “filtered to contexts that appear five hundred times in the data,” not “five hundred times for the noun phrase.” Co-occurrence counts can be as small as one.
thanks, fixed
Very cool. (nothing deeper to add; just very cool)