<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI and Social Science - Brendan O&#039;Connor &#187; Best Posts</title>
	<atom:link href="http://brenocon.com/blog/category/best-of/feed/" rel="self" type="application/rss+xml" />
	<link>http://brenocon.com/blog</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Mon, 08 May 2017 15:29:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Beautiful Data book chapter</title>
		<link>http://brenocon.com/blog/2009/08/beautiful-data-book-chapter/</link>
		<comments>http://brenocon.com/blog/2009/08/beautiful-data-book-chapter/#comments</comments>
		<pubDate>Wed, 12 Aug 2009 22:14:47 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=634</guid>
		<description><![CDATA[Today I received my copy of Beautiful Data, a just-released anthology of articles about, well, working with data.  Lukas and I contributed a chapter on analyzing social perceptions in web data.  See it here. After a long process of drafting, &#8230; <a href="http://brenocon.com/blog/2009/08/beautiful-data-book-chapter/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><img alt="" src="http://assets.doloreslabs.com/blog/beautiful-data.gif" class="alignright" width="85" height="112" /> Today I received my copy of <a href="http://oreilly.com/catalog/9780596157111/">Beautiful Data</a>, a just-released anthology of articles about, well, working with data.  <a href="http://blog.doloreslabs.com/2009/08/beautiful-data/">Lukas</a> and I contributed a chapter on analyzing social perceptions in web data.  <a href="http://anyall.org/bd">See it here.</a> After a long process of drafting, proofreading, re-drafting, and bothering the publishers under rather sudden deadlines, I&#8217;ve resolved to never use graphics again in anything I write :)</p>
<p>Here&#8217;s our final figure, a <a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> clustering of face photos via perceived social attributes (social <a href="http://en.wikipedia.org/wiki/Concept_learning">concepts/types</a>? with <a href="http://en.wikipedia.org/wiki/Prototype_theory">exemplars</a>?):<br />
<a href="http://anyall.org/cluster_table.png"><img src="http://anyall.org/cluster_table.png" alt="" title="cluster_table" width="500" height="593" class="aligncenter size-full wp-image-637" /></a></p>
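<p>For the curious, the core of k-means is tiny.  Here is a minimal numpy sketch of the algorithm itself, run on a hypothetical matrix of per-face attribute ratings (rows are faces, columns are perceived-attribute scores; the names and data are made up, and our chapter&#8217;s actual pipeline is of course more involved):</p>

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    # initialize centroids at k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distances from every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

<p>Each face then lands in a cluster, and showing the faces nearest each centroid gives you something like the exemplar table above.</p>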
<p>I just started reading the rest of the book and it&#8217;s very fun.  <a href="http://norvig.com/">Peter Norvig</a>&#8216;s chapter on language models is gripping.  (It does word segmentation, ciphers, and more, in that lovely python-centric tutorial style extending his previous <a href="http://norvig.com/spell-correct.html">spell correction article</a>.)  There are also chapters by many other great researchers and practitioners (some of whom you may have seen around this blog or its neighborhood) like <a href="http://www.stat.columbia.edu/~gelman/">Andrew Gelman</a>, <a href="http://had.co.nz/">Hadley Wickham</a>, <a href="http://mike.teczno.com/">Michal Migurski</a>, <a href="http://jheer.org/">Jeffrey Heer</a>, and still more&#8230;  I&#8217;m impressed just by the talent-gathering-and-organizing operation.  Big kudos to editors <a href="http://kiwitobes.com/">Toby Segaran</a> and <a href="http://www.linkedin.com/in/jhammerb">Jeff Hammerbacher</a>, and O&#8217;Reilly&#8217;s <a href="http://twitter.com/jsteeleeditor">Julie Steele</a>.</p>
<p>I also have an apparently secret code that gets you a discount, so email me if you want it.  I wonder if I&#8217;m not supposed to give out many of them.  Hm.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2009/08/beautiful-data-book-chapter/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Announcing TweetMotif for summarizing twitter topics</title>
		<link>http://brenocon.com/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/</link>
		<comments>http://brenocon.com/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/#comments</comments>
		<pubDate>Mon, 18 May 2009 17:40:03 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=515</guid>
		<description><![CDATA[Update (3/14/2010): There is now a TweetMotif paper. Last week, I, with my awesome friends David Ahn and Mike Krieger, finished hacking together an experimental prototype, TweetMotif, for exploratory search on Twitter. If you want to know what people are &#8230; <a href="http://brenocon.com/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><b>Update</b> (3/14/2010): There is now a <a href="http://anyall.org/oconnor_krieger_ahn.icwsm2010.tweetmotif.pdf">TweetMotif paper</a>.<br />
<hr />
Last week, I, with my awesome friends <a href="http://tweetmotif.com/about">David Ahn and Mike Krieger</a>, finished hacking together an experimental prototype, <a href="http://tweetmotif.com/">TweetMotif</a>, for exploratory search on Twitter.  If you want to know what people are thinking about something, the normal search interface <a href="http://search.twitter.com/">search.twitter.com</a> gives really cool information, but it&#8217;s hard to wade through hundreds or thousands of results.  We take tweets matching a query and group together similar messages, showing significant terms and phrases  that co-occur with the user query.  Try it out at <a href="http://tweetmotif.com">tweetmotif.com</a>.  Here&#8217;s an example for a current hot topic, <a href="http://tweetmotif.com/#%23wolframalpha">#WolframAlpha</a>:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/wa_top.png"><img class="aligncenter size-full wp-image-517" title="wa_top" src="http://anyall.org/blog/wp-content/uploads/2009/05/wa_top.png" alt="" width="500" height="85" /></a><br />
<a href="http://anyall.org/blog/wp-content/uploads/2009/05/wa_bot.png"><img class="aligncenter size-full wp-image-518" title="wa_bot" src="http://anyall.org/blog/wp-content/uploads/2009/05/wa_bot.png" alt="" width="500" height="240" /></a></p>
<p>It&#8217;s currently showing tweets that match both <a href="http://tweetmotif.com/#%23wolframalpha">#WolframAlpha</a> and two interesting bigrams: &#8220;queries failed&#8221; and &#8220;google killer&#8221;.  TweetMotif doesn&#8217;t attempt to derive the meaning of, or sentiment toward, the phrases &#8212; NLP is hard, and doing this much is hard enough! &#8212; but it&#8217;s easy for you to look at the tweets themselves and figure out what&#8217;s going on.</p>
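<p>To give a flavor of the idea (this is a toy sketch, not TweetMotif&#8217;s actual scoring): take the tweets matching a query, count the bigrams that co-occur with it, and group tweets under the most frequent ones.</p>

```python
from collections import Counter, defaultdict

def top_cooccurring_bigrams(tweets, query, n=3):
    """Toy phrase grouping: among tweets matching `query`, count
    bigrams (skipping ones containing the query term itself) and
    group the tweets under the top-n bigrams."""
    matching = [t.lower().split() for t in tweets if query in t.lower()]
    counts = Counter()
    for toks in matching:
        for a, b in zip(toks, toks[1:]):
            if query not in (a, b):
                counts[(a, b)] += 1
    top = [bg for bg, _ in counts.most_common(n)]
    groups = defaultdict(list)
    for toks in matching:
        for bg in zip(toks, toks[1:]):
            if bg in top:
                groups[bg].append(" ".join(toks))
    return top, groups
```

<p>Real significance scoring would compare counts against a background corpus rather than use raw frequency, but the shape of the computation is the same.</p>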
<p>Here&#8217;s another fun example right now, a query for <a href="http://tweetmotif.com/#Dollhouse">Dollhouse</a>:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_top.png"><img class="aligncenter size-full wp-image-519" title="dollhouse_top" src="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_top.png" alt="" width="500" height="26" /></a><br />
<a href="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_bot.png"><img class="aligncenter size-full wp-image-520" title="dollhouse_bot" src="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_bot.png" alt="" width="500" height="253" /></a></p>
<p>I love that the #wolframalpha topic has &#8220;infected&#8221; the dollhouse space.  Someone pointed out a connection between them, but really they&#8217;re connected through bot spam.  TweetMotif&#8217;s duplicate detection algorithm found 22 messages here where each is basically a list of all the trending topics.  This seems to be a popular format for twitter spambots.</p>
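<p>A much-simplified sketch of the duplicate-collapsing idea (not our actual algorithm): normalize away the parts spammers vary, like URLs, @mentions, and case, then key each tweet on its remaining set of tokens.</p>

```python
import re

def dedup_key(tweet):
    """Collapse near-duplicates: drop URLs and @mentions, lowercase,
    and key on the sorted set of remaining word tokens."""
    t = re.sub(r"https?://\S+|@\w+", "", tweet.lower())
    return tuple(sorted(set(re.findall(r"\w+", t))))

def dedup(tweets):
    """Keep only the first tweet seen for each normalized key."""
    seen, kept = set(), []
    for t in tweets:
        k = dedup_key(t)
        if k not in seen:
            seen.add(k)
            kept.append(t)
    return kept
```

<p>With a scheme like this, 22 trending-topic-list spam tweets that differ only in their shortened links all collapse to one.</p>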
<p>I learned a ton making this system, and I&#8217;ll try to write more about the technical details in a future post.  It&#8217;s interesting to hear people speculate on how it works; everyone gives a different answer.  I guess this goes to show you that search/NLP is still a pretty unsettled, not-completely-understood area.</p>
<p>There are lots of interesting TweetMotif examples.  More prosaic, less news-y queries like <a href="http://tweetmotif.com/#sandwich">sandwich</a> yield cool things like major ingredients of sandwiches and types of sandwiches.  (These are basically distributional similarity candidates for synonym and meronym acquisition, though a bit too noisy to use in its current form.)  And in a few cases, like for understanding currently unfolding events, TweetMotif might even be useful!  It would be nice to expand the set of usefully served queries.  We&#8217;re occasionally posting interesting queries at <a href="http://twitter.com/tweetmotif">twitter.com/tweetmotif</a>.</p>
<p>And oh yeah.  We have a beautiful iPhone interface!</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/astro_mike_iphone_cropped.png"><img class="aligncenter size-medium wp-image-523" title="astro_mike_iphone_cropped" src="http://anyall.org/blog/wp-content/uploads/2009/05/astro_mike_iphone_cropped.png" alt="" width="200" height="300" /></a></p>
<p>Check it out folks.  This is a functional prototype, so you can play with it right now at <a href="http://tweetmotif.com/">tweetmotif.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata</title>
		<link>http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/</link>
		<comments>http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/#comments</comments>
		<pubDate>Mon, 23 Feb 2009 20:18:59 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=421</guid>
		<description><![CDATA[Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis.  I think most people choose one based on what people around them use or what they learn in &#8230; <a href="http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://lukasbiewald.com/">Lukas</a> and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis.  I think most people choose one based on what people around them use or what they learn in school, so I&#8217;ve found it hard to find comparative information.  I&#8217;m posting the table here in hopes of useful comments.</p>
<table border="1" cellspacing="0" cellpadding="3" width="100%">
<tbody>
<tr>
<td><strong>Name</strong></td>
<td><strong>Advantages</strong></td>
<td><strong>Disadvantages</strong></td>
<td><strong>Open source?</strong></td>
<td valign="top"><strong>Typical </strong><strong>users</strong></td>
</tr>
<tr>
<td>R</td>
<td>Library support; visualization</td>
<td>Steep learning curve</td>
<td>Yes</td>
<td valign="top">Finance; Statistics</td>
</tr>
<tr>
<td>Matlab</td>
<td>Elegant matrix support; visualization</td>
<td>Expensive; incomplete statistics support</td>
<td>No</td>
<td valign="top">Engineering</td>
</tr>
<tr>
<td>SciPy/NumPy/Matplotlib</td>
<td>Python (general-purpose programming language)</td>
<td>Immature</td>
<td>Yes</td>
<td valign="top">Engineering</td>
</tr>
<tr>
<td>Excel</td>
<td>Easy; visual; flexible</td>
<td>Large datasets</td>
<td>No</td>
<td valign="top">Business</td>
</tr>
<tr>
<td valign="top">SAS</td>
<td valign="top">Large datasets</td>
<td valign="top">Expensive; outdated programming language</td>
<td valign="top">No</td>
<td valign="top">Business; Government</td>
</tr>
<tr>
<td valign="top">Stata</td>
<td valign="top">Easy statistical analysis</td>
<td valign="top"></td>
<td valign="top">No</td>
<td valign="top">Science</td>
</tr>
<tr>
<td valign="top">SPSS</td>
<td valign="top" colspan="4">Like Stata but more expensive and worse</td>
</tr>
</tbody>
</table>
<p>[<b>7/09 update:</b> tweaks incorporating some of the excellent comments below, esp. for SAS, SPSS, and Stata.]</p>
<p>There&#8217;s a bunch more to be said for every cell.  Among other things: <span id="more-421"></span></p>
<ul>
<li>Two big divisions on the table: The more programming-oriented solutions are R, Matlab, and Python.  The more packaged, analysis-oriented solutions are Excel, SAS, Stata, and SPSS.</li>
<li>Python &#8220;immature&#8221;: matplotlib, numpy, and scipy are all separate libraries that don&#8217;t always get along.  Why does matplotlib come with &#8220;pylab&#8221; which is supposed to be a unified namespace for everything?  Isn&#8217;t scipy supposed to do that?  Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg)?  And then there&#8217;s package compatibility version hell.  You can use SAGE or Enthought but neither is standard (yet).  In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.</li>
<li>Matlab&#8217;s language is certainly weak.  It sometimes doesn&#8217;t seem to be much more than a scripting language wrapping the matrix libraries.  Python is clearly better on most counts.  R&#8217;s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library.  Everyone says SAS is very bad.</li>
<li>Matlab is the best for developing new mathematical algorithms.  Very popular in machine learning.</li>
<li>I&#8217;ve never used the Matlab Statistical Toolbox.  I&#8217;m wondering, how good is it compared to R?</li>
<li>Here&#8217;s an <a href="http://www.reddit.com/r/programming/comments/7fg6i/why_are_sasstata_the_default_statistical_tools/">interesting reddit thread</a> on SAS/Stata vs R.</li>
<li>SPSS and Stata in the same category: they seem to have a similar role so we threw them together.  Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses.  I personally haven&#8217;t used either&#8230;</li>
<li>SPSS and Stata for &#8220;Science&#8221;: we&#8217;ve seen biologists and social scientists use lots of Stata and SPSS.  My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines.  (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.)  Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn&#8217;t fit into these traditional methods.</li>
<li>Another important thing about SAS, from my perspective at least, is that it&#8217;s used mostly by an older crowd.  I know dozens of people under 30 doing statistical stuff and only one knows SAS.  At that <a href="http://dataspora.com/blog/predictive-analytics-using-r/">R meetup</a> last week, <a href="http://www.meetup.com/R-Users/members/7654264/">Jim Porzak</a> asked the audience if there were any recent grad students who had learned R in school.  Many hands went up.  Then he asked if SAS was even offered as an option.  All hands went down.  There were boatloads of SAS representatives at that conference and they sure didn&#8217;t seem to be on the leading edge.</li>
<li>But: is there ANY package besides SAS that can do analysis for datasets that don&#8217;t fit into memory?  That is, ones that mostly have to stay on disk?  And exactly how good are SAS&#8217;s capabilities here anyway?</li>
<li>If your dataset can&#8217;t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI), but it&#8217;s an open question what the standard distributed data analysis framework will be.  (Hive? Pig?  Or quite possibly something else.)</li>
<li>(This was an interesting point at the R meetup.  Porzak was talking about how going to MySQL gets around R&#8217;s in-memory limitations.  But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn&#8217;t come close to touching, at least right now.  It&#8217;s just a whole different ballgame with that large a dataset.)</li>
<li>SAS people complain about poor graphing capabilities.</li>
<li>R vs. Matlab visualization support is controversial.  One view I&#8217;ve heard is, R&#8217;s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs.  Matlab&#8217;s interactive plots are super nice though.  Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.</li>
<li>Excel has a far, far larger user base than any of these other options.  That&#8217;s important to know.  I think it&#8217;s underrated by computer-scientist types.  But it does massively break down at &gt;10k or certainly &gt;100k rows.</li>
<li>Another option: Fortran and C/C++.  They are super fast and memory efficient, but tricky and error-prone to code, have to spend lots of time mucking around with I/O, and have zero visualization and data management support.  Most of the packages listed above run Fortran numeric libraries for the heavy lifting.</li>
<li>Another option: Mathematica.  I get the impression it&#8217;s more for theoretical math, not data analysis.  Can anyone prove me wrong?</li>
<li>Another option: the pre-baked data mining packages.  The open-source ones I know of are Weka and Orange.  I hear there are zillions of commercial ones too.  Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design.  (<a href="http://www-stat.stanford.edu/~jhf/ftp/dm-stat.pdf">Here</a>; the article that inspired <a href="http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/">this rant</a>.)</li>
<li>I think knowing where the typical users come from is very informative for what you can expect to see in the software&#8217;s capabilities and user community.  I&#8217;d love more information on this for all these options.</li>
</ul>
<p>What do people think?</p>
<hr />
<p>Aug 2012 update: <a href="http://science.webhostinggeeks.com/poredenje-paketa-analize">Serbo-Croatian translation</a>.<br />
Apr 2015 update: <a href="http://offclouds.com/edu/primerjava-analize-podatkov-paketov-r-matlab-scipy-excel-sas-spss-stata/">Slovenian translation</a>.<br />
May 2017 update: <a href="https://www.homeyou.com/~edu/analise-de-dados">Portuguese translation</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/feed/</wfw:commentRss>
		<slash:comments>185</slash:comments>
		</item>
		<item>
		<title>Statistics vs. Machine Learning, fight!</title>
		<link>http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/</link>
		<comments>http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/#comments</comments>
		<pubDate>Wed, 03 Dec 2008 08:56:30 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=364</guid>
		<description><![CDATA[10/1/09 update &#8212; well, it&#8217;s been nearly a year, and I should say not everything in this rant is totally true, and I certainly believe much less of it now. Current take: Statistics, not machine learning, is the real deal, &#8230; <a href="http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><b>10/1/09 update</b> &#8212; well, it&#8217;s been nearly a year, and I should say not everything in this rant is totally true, and I certainly believe much less of it now.  Current take: <i>Statistics</i>, not machine learning, is the real deal, but unfortunately suffers from bad marketing.  On the other hand, to the extent that bad marketing includes misguided undergraduate curriculums, there&#8217;s plenty of room to improve for everyone.</p>
<hr />
So it&#8217;s pretty clear by now that statistics and machine learning aren&#8217;t very different fields. I was recently pointed to <a href="http://www-stat.stanford.edu/~tibs/stat315a/glossary.pdf">a very amusing comparison</a> by the excellent statistician &#8212; and machine learning expert &#8212; <a href="http://www-stat.stanford.edu/~tibs/">Robert Tibshiriani</a>.  Reproduced here:</p>
<table border="1" cellpadding="2" style="margin: 0 auto">
<tr>
<td colspan="2" align="center"><b>Glossary</b></td>
</tr>
<tr>
<th>Machine learning</th>
<th>Statistics</th>
</tr>
<tr>
<td>network, graphs</td>
<td>model</td>
</tr>
<tr>
<td>weights</td>
<td>parameters</td>
</tr>
<tr>
<td>learning</td>
<td>fitting</td>
</tr>
<tr>
<td>generalization</td>
<td>test set performance</td>
</tr>
<tr>
<td>supervised learning</td>
<td>regression/classification</td>
</tr>
<tr>
<td>unsupervised learning</td>
<td>density estimation, clustering</td>
</tr>
<tr>
<td>large grant = $1,000,000</td>
<td>large grant = $50,000</td>
</tr>
<tr>
<td>nice place to have a meeting:<br />Snowbird, Utah, French Alps</td>
<td>nice place to have a meeting:<br />Las Vegas in August</td>
</tr>
</table>
<p>Hah.  Or rather, ouch!  I had two thoughts reading this.  (1) Poor statisticians.  Machine learners invent annoying new terms, sound cooler, and have all the fun.  (2) What&#8217;s wrong with statistics?  They have way less funding and influence than it seems they might deserve.</p>
<p>There are several issues going on here, both substantive and cultural: <span id="more-364"></span></p>
<p>There might be too much re-making-up of terms on the ML side.  But lots of these are useful.  &#8220;Weights&#8221; is a great, intuitive term for the parameters of a linear model.  I use it all the time to explain classifiers and regressions to non-experts.  I was surprised to see &#8220;test set&#8221; on the statistics side; I&#8217;m used to thinking of held-out test set accuracy as an extremely common ML technique, while in statistics model fit is assessed with parametric assumptions for standard errors and such.   I really like cross-validation and bootstrapping as ways of thinking about generalization &#8212; again, something that&#8217;s far easier to grasp than sampling and hypothesis testing approaches to parameter inference &#8212; which keep getting taught to and misunderstood by generations of confused Introduction to Statistics students.  For example, how many times has it been explained that: No, a p-value is NOT the probability your model is wrong.  But scientific papers regularly treat significance levels in that manner (look how many stars are on this result!).  On the other hand, cross-validation accuracy *is* something you can interpret as being related to the probability your model is right.</p>
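<p>Since cross-validation keeps coming up: the whole idea fits in a few lines of plain Python.  This is a generic sketch; the <code>fit</code> and <code>predict</code> callables are placeholders for whatever model you like, not any particular library&#8217;s API.</p>

```python
import random

def kfold_accuracy(data, fit, predict, k=5, seed=0):
    """Estimate generalization by held-out accuracy:
    train on k-1 folds, score on the held-out fold, average."""
    data = data[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(train)
        accs.append(sum(predict(model, x) == y for x, y in test) / len(test))
    return sum(accs) / k
```

<p>The number that comes out is directly about prediction on unseen data, which is exactly the quantity the p-value keeps getting mistaken for.</p>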
<p>I&#8217;ll also note that there are definitely a number of topics in ML that aren&#8217;t very related to statistics or probability.  Max-margin methods: if all we care about is prediction, why bother using a probability model at all?  Why not just optimize the spatial geometry instead?  SVM&#8217;s don&#8217;t require a lick of probability theory to understand.  (Of course probability-based approaches are huge in ML, but it&#8217;s important to remember they&#8217;re not the only game in town, and there is no necessary reason they must be.)  And then there are non-traditional settings such as <a href="http://hunch.net/?p=277">online learning</a>, <a href="http://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a>, and <a href="http://en.wikipedia.org/wiki/Supervised_learning#Active_Learning">active learning</a>, where the structure of access to information is in play.  There are certainly plenty of things in statistics that aren&#8217;t considered part of ML &#8212; say, regression diagnostics and significance testing.  Finally, many ML problems involve large, high dimensional data and models, where computational issues are very important.  For example, in statistical machine translation, alignment models are described with probability theory and fit to data, but their structure is complex enough that optimal inference is intractable, and how you do approximate inference (EM, Viterbi, beam search, etc.) is a very major issue.</p>
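<p>To make the no-probability-needed point concrete, here is the classic perceptron, the geometric ancestor of max-margin methods, in a few lines of numpy.  (An SVM additionally maximizes the margin; this sketch just finds some separating hyperplane, and there is not a probability in sight.)</p>

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """A purely geometric linear classifier: no probability model,
    just nudge the weight vector toward misclassified points.
    Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                b += yi
    return w, b
```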
<p>But the most interesting differences between stats and ML are institutional.</p>
<p>I&#8217;ve been hearing lots of friends compare two dueling courses at Stanford: <a href="http://cs229.stanford.edu/">CS229</a>, the CS department&#8217;s &#8220;machine learning&#8221; course taught by <a href="http://robotics.stanford.edu/~ang/">Andrew Ng</a>; and Stat 315 <a href="http://www-stat.stanford.edu/~tibs/stat315a.html">A</a>/<a href="http://www.stanford.edu/class/stats315b/">B</a>, the Statistics department&#8217;s &#8220;statistical learning&#8221; sequence taught by some combination of <a href="http://www-stat.stanford.edu/~tibs/">Tibshirani</a>, <a href="http://www-stat.stanford.edu/~jhf/">Jerome Friedman</a>, and <a href="http://www-stat.stanford.edu/~hastie/">Trevor Hastie</a>.  These people are all top-of-the-line researchers in the field.  Their courses&#8217; contents are extremely similar; I&#8217;d bet any of them could teach most of the material from the other side.</p>
<p>What differs most is the teaching style.  CS has <a href="http://www.stanford.edu/class/cs229/materials.html">far better lecture notes</a>.  Of course, the stats people wrote a <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">very good book</a>; but better lecture notes win because I can access them later and send them to people for free.  CS students I&#8217;ve talked to think the CS course is better taught; I can&#8217;t find stats students who take the CS course.  (My sample is biased, though I know people in both.)  Finally, the CS course has a big, open-ended project component; the Stats course follows more of a traditional problem set and tests format.</p>
<p>I think this is reflective of the differences in institutional culture between CS and Stats. There&#8217;s an interesting John Langford post on part of the issue, which he calls <a href="http://hunch.net/?p=318">&#8220;The Stats Handicap&#8221;</a>.  He points out that stats Ph.D.&#8217;s have a big disadvantage in the job market because statistics has an old-school journal-oriented publishing culture, so students publish much less and have less experience engaging with a research community.  CS is conference-oriented &#8212; certain conferences have a higher prestige than many journals (e.g. <a href="http://nips.cc/">NIPS</a> in ML, <a href="http://www.sigchi.org/sigchi/chi2008/">CHI</a> in HCI) &#8212; and this results in faster turnaround, dissemination, and collaboration.  (I&#8217;ve heard others make similar comparisons between CS and psychology.)  I&#8217;d expect any discipline with a larger conference emphasis to have better courses since they should reward presentation/teaching skills &#8212; or at least encourage practice &#8212; more than in journal world.</p>
<p>ML sounds like it&#8217;s young, vibrant, interesting to learn, and growing; Stats does not.</p>
<p>Is marketing a problem?  Machine learning terms definitely sound pretty cool.  Maybe the perspective of computational intelligence lends itself to cool names.  Though the Stanford statisticians certainly know how to play this game &#8212; for example, they made up their own names for variants of L1 and L2-regularized regression, leaving annoyed people like me forever googling <a href="http://www-stat.stanford.edu/~tibs/lasso.html">&#8220;lasso&#8221;</a> and <a href="http://en.wikipedia.org/wiki/Ridge_regression">&#8220;ridge&#8221;</a> trying to remember which is which.  (On the other hand, perhaps that&#8217;s child&#8217;s play compared to the true original sin of ML nomenclature: tossing around the highly deceptive term &#8220;neural network&#8221; for a stack of linear functions paired with a <a href="http://en.wikipedia.org/wiki/Backpropagation">wonky, overhyped training algorithm</a>; the combination of which, many years later, still <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2008/07/predicting-deat.html">causes confusion</a>.  Definitely blame CS for that one.)</p>
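<p>For anyone else who keeps forgetting which is which: ridge is the L2 penalty (a closed form exists, and coefficients shrink smoothly), lasso is the L1 penalty (no closed form; its characteristic operation is soft-thresholding, which zeroes small coefficients out entirely).  A minimal numpy sketch of both facts:</p>

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression (L2 penalty) has a closed form:
    w = (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(z, t):
    """The lasso's L1 penalty leads to soft-thresholding, which
    sets small coefficients exactly to zero (hence sparsity)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

<p>In the orthonormal-design case the contrast is stark: ridge scales every coefficient down by the same factor, while the lasso subtracts a constant and kills anything that crosses zero.</p>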
<p>Another issue is the definition of statistics itself.  In 1997, Jerome Friedman wrote an extremely interesting analysis of the situation: <a href="http://www-stat.stanford.edu/~jhf/ftp/dm-stat.pdf">&#8220;Data Mining and Statistics: What&#8217;s the Connection?&#8221;</a>.  He points out, quite correctly, the statistical impoverishment of some common approaches to data mining.  You can certainly blame statistics for not marketing its ideas well enough, or blame CS for ignoring statistics.  For example there&#8217;s a good case that lots of genetic algorithms and neural network research was much ado about nothing &#8212; that is, over-complicated cool-sounding hammers looking for nails when all you needed were some time-honored statistical and optimization techniques.  (E.g. why NN when you haven&#8217;t tried a straight-up <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">GLM</a>?  Why GA when you haven&#8217;t tried <a href="http://en.wikipedia.org/wiki/Nelder-Mead_method">Nelder-Mead</a>?)  But this problem has been rectified somewhat &#8212; for example, NLP has seen a big move to simple linear models as the default technique, and NN&#8217;s and GA&#8217;s have fallen from grace in mainstream ML.</p>
<p>Friedman argues part of the problem is in how statisticians approach problems and the world:</p>
<blockquote><p>
One can catalog a long history of Statistics (as a field) ignoring useful methodology developed in other data related fields.  Here are some of them that had seminal beginnings in Statistics but for the most part were subsequently ignored in our field: Pattern Recognition, Neural Networks, Machine Learning, Graphical Models, Chemometrics, Data Visualization.
</p></blockquote>
<p>That is not to say statistics is not important &#8212; it&#8217;s incredibly important.  He quotes Efron as saying &#8220;Statistics has been the most successful information science.&#8221;  However, information science is becoming bigger and broader and more exciting, thanks to computation and ever-increasing amounts of data.  What should statisticians do?  Friedman continues (light editing and emphasis is mine):</p>
<blockquote><p>
One view says that our field should concentrate on that small part of information science that we do best, namely probabilistic inference based on mathematics. If this view is adopted, we should become resigned to the fact that the role of Statistics as a player in the &#8220;information revolution&#8221; will steadily diminish over time.</p>
<p>Another point of view holds that statistics ought to be concerned with <b>data analysis</b>.  The field should be defined in terms of a set of <i>problems</i> &#8212; rather than a set of tools &#8212; that pertain to data.  Should this point of view ever become the dominant one, a big change would be required in our practice and academic programs.</p>
<p>First and foremost, we would have to make peace with computing.  It&#8217;s here to stay; that&#8217;s where the data is.  This has been one of the most glaring omissions in the set of tools that have so far defined Statistics.  Had we incorporated computing methodology from its inception as a fundamental statistical tool (as opposed to simply a convenient way to apply our existing tools) many of the other data related fields would not have needed to exist.  They would have been part of our field.
</p></blockquote>
<p>Friedman wrote this article more than 10 years ago.  All his observations about the importance and increasing prevalence of data and computing power are even more true today than back then.  Has the field of statistics changed?  Not clear.  (I&#8217;d appreciate seeing evidence to the contrary.)</p>
<p>On the other hand a world of data *has* to be increasingly statistical.  The positive spin <a href="http://www-stat.stanford.edu/~ckirby/brad/papers/2005NEWModernScience.pdf">from Efron</a>:</p>
<blockquote><p>
A new generation of scientiﬁc devices, typiﬁed by microarrays, produce data on a gargantuan scale – with millions of data points and thousands of parameters to consider at the same time. These experiments are “deeply statistical”. Common sense, and even good scientiﬁc intuition, won’t do the job by themselves. Careful statistical reasoning is the only way to see through the haze of randomness to the structure underneath. Massive data collection, in astronomy, psychology, biology, medicine, and commerce, is a fact of 21st Century science, and a good reason to buy statistics futures if they are ever offered on the NASDAQ.
</p></blockquote>
<p>I know that I&#8217;m interested in quantitative information science, including statistics and data analysis.  Machine learning has many strengths, but it is definitely an odd way to go about analysis.  But there&#8217;s a good case that statistics, as traditionally defined, will play a smaller and smaller role in the future.  &#8220;Data mining&#8221; sounds more relevant, but does it even exist as a coherent subject?  Maybe it&#8217;s time to study a more applied statistical field like <a href="http://en.wikipedia.org/wiki/Econometrics">econometrics</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/feed/</wfw:commentRss>
		<slash:comments>132</slash:comments>
		</item>
		<item>
		<title>It is accurate to determine a blog&#8217;s bias by what it links to</title>
		<link>http://brenocon.com/blog/2008/10/it-is-accurate-to-determine-a-blogs-bias-by-what-it-links-to/</link>
		<comments>http://brenocon.com/blog/2008/10/it-is-accurate-to-determine-a-blogs-bias-by-what-it-links-to/#comments</comments>
		<pubDate>Sat, 11 Oct 2008 10:12:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=189</guid>
		<description><![CDATA[Here&#8217;s a great project from Andy Baio and Joshua Schachter: they assessed the political biases of different blogs based on which articles they tend to link to. Using these political bias scores, they made a cool little Firefox extension that colors &#8230; <a href="http://brenocon.com/blog/2008/10/it-is-accurate-to-determine-a-blogs-bias-by-what-it-links-to/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://waxy.org/2008/10/memeorandum_colors/">Here&#8217;s a great project from Andy Baio and Joshua Schachter</a>: they assessed the political biases of different blogs based on which articles they tend to link to.  Using these political bias scores, they made a cool little Firefox extension that colors the names of different sources on the news aggregator site <a href="http://memeorandum.com/">Memeorandum</a>, like so:</p>
<p><img src="http://waxy.org/random/images/weblog/memeorandum_beforeafter.jpg"></p>
<p>How they computed these biases is pretty neat.  Their data source was the Memeorandum site itself, which shows a particular news story, then a list of different news sites that have written articles about the topic.  Scraping out that data, Joshua constructed the adjacency matrix of sites vs. articles they linked to and ran good ol&#8217; <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">SVD</a> on it, an algorithm that can be used to summarize the very high-dimensional article linking information in just several numbers (&#8220;components&#8221; or &#8220;dimensions&#8221;) for each news site.  Basically, the algorithm groups together sites that tend to link to the same articles.  It&#8217;s not exactly clustering though; rather, it projects them into a space where sites close to each other had similar linking patterns.  People have used this technique analogously to construct a political spectrum for Congress, by analyzing which legislators tend to vote together.</p>
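<p>To see what the SVD is doing here, a toy sketch in Python may help.  The matrix below is made up (their real one came from scraped Memeorandum links), but the mechanics are the same: each site is a row of a binary site-by-article matrix, the SVD re-expresses the rows in a handful of components, and sites with similar linking patterns land near each other.</p>

```python
import numpy as np

# Toy site-by-article adjacency matrix (rows = sites, cols = articles);
# a 1 means the site linked to that article.  These entries are invented.
A = np.array([
    [1, 1, 1, 0, 0, 0],  # site 0: links to one cluster of articles
    [1, 1, 0, 0, 0, 1],  # site 1: mostly the same cluster
    [0, 0, 0, 1, 1, 1],  # site 2: a different cluster
    [0, 0, 1, 1, 1, 0],  # site 3: mostly that other cluster
], dtype=float)

# SVD factors A into U @ diag(s) @ Vt.  Scaling U's rows by the singular
# values gives each site coordinates in "linking-pattern space".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
site_coords = U * s

def dist(i, j):
    """Distance between two sites in the projected space."""
    return float(np.linalg.norm(site_coords[i] - site_coords[j]))
```

<p>Because the projection is orthogonal, sites that linked to the same articles (0 and 1 above) stay closer together than sites that didn&#8217;t (0 and 2).</p>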
<p>So here they found that the second dimension of the SVD&#8217;s projected outputs seemed to strongly correlate with their own intuitions of sites&#8217; political biases.  Talk about getting lucky!  This score is used for their coloring visualization, and I personally found the examples pretty accurate.  And they helpfully posted all of their output data <a href="http://waxy.org/2008/10/memeorandum_colors/">with the blog post</a>.</p>
<p>There is a concern though.  The funny thing about SVD (and related algorithms like factor analysis and <a href="http://en.wikipedia.org/wiki/Principal_components_analysis">PCA</a>) is that the numbers that fall out of it don&#8217;t necessarily mean anything.  In fact there have been great controversies when researchers try to interpret its outputs.  For example, if you run PCA on scores from different types of IQ tests, you get a &#8220;<i>g</i> factor&#8221;.  Is it a measure of <a href="http://en.wikipedia.org/wiki/General_intelligence_factor">general human intelligence</a>?  Or is <i>g</i> just a <a href="http://cscs.umich.edu/~crshalizi/weblog/523.html">meaningless statistical artifact</a>?  No one&#8217;s sure.</p>
<p>But for this problem, there is a fair, objective validation &#8212; use 3rd party, human judgments from the web!  I&#8217;ve found before that you can <a href="http://blog.doloreslabs.com/2008/03/crowdsourcing-to-find-media-bias-hillary-vs-obama/">assess media bias on AMT</a> pretty well; but for this, I simply went to a pre-existing site called <a href="http://www.skewz.com">Skewz</a>, which <a href="http://www.skewz.com/source/compare_sources">collects people&#8217;s ratings</a> of the bias of individual articles from news sites.  About 150 sites were rated on Skewz as well as included in the Memeorandum/SVD analysis.</p>
<p>Within that set, it turns out that the SVD&#8217;s second component <b>significantly correlates</b> with Skewz users&#8217; judgments of political bias!  First, here&#8217;s the scatterplot of the &#8220;v2&#8221; SVD dimension against Skewz ratings.  Higher numbers are conservative, lower are liberal:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2008/10/pic1.png"><img src="http://anyall.org/blog/wp-content/uploads/2008/10/pic1.png" alt="" title="SVD correlates with political bias judgments" width="500" height="515" class="aligncenter size-full wp-image-193" /></a></p>
<p>So SVD tends to give most sites a neutral score, but when it assigns a strong score, it&#8217;s often right &#8212; or at least, correlates with Skewz users.  Some of the disagreements are interesting &#8212; for example, Skewz thinks The New Republic is liberal, whereas SVD thinks it&#8217;s slightly conservative.  That might mean that TNR links to lots of stories that conservatives tend to like, though its actual content and stances are liberal.  (But don&#8217;t take any particular data point too seriously &#8212; the Skewz data is probably fairly noisy, and the bridging between the datasets introduces more noise too, since Memeorandum and Skewz are based on different sets of articles and such.)</p>
<p>Here&#8217;s a zoom-in on that narrow band in the middle.  There&#8217;s some more successful correlation in there:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2008/10/pic2.png"><img src="http://anyall.org/blog/wp-content/uploads/2008/10/pic2.png" alt="" title="SVD correlates with political bias judgments -- zoom-in" width="500" height="515" class="aligncenter size-full wp-image-194" /></a></p>
<p>Here are the actual correlation coefficients with the different SVD outputs.  It turns out the first dimension slightly correlates with political bias as well.  (Joshua explained it as the overall volume of linking.  Do liberals tend to link more?)  But the third through fifth dimensions, which they say were very hard for them to interpret, don&#8217;t correlate at all with these political bias ratings.</p>
<table>
<tr>
<th>SVD component (output dimension)</th>
<th>v1</th>
<th>v2</th>
<th>v3</th>
<th>v4</th>
<th>v5</th>
</tr>
<tr>
<td>Correlation to Skewz ratings</td>
<td>+.112</td>
<td>+.392</td>
<td>-.011</td>
<td>-.057</td>
<td>-.047</td>
</tr>
</table>
<p>In conclusion &#8230; this overall result makes me really happy.  A completely unsupervised algorithm, based purely on similarity of linking patterns, gets you a systematic correlation with independent judges&#8217; assessments of bias.  That&#8217;s just sweet.</p>
<p>
Here&#8217;s the entire dataset for the above graphs.  &#8220;score_svd&#8221; is their rescaled version of &#8220;v2&#8221;.  (<a href="http://spreadsheets.google.com/ccc?key=ph4VVGBJ3IpLykg-S2MifWA">Click here</a> to see and download all of it).</p>
<p><iframe width="500" height="200" frameborder="0" src="http://spreadsheets.google.com/pub?key=ph4VVGBJ3IpLykg-S2MifWA"></iframe></p>
<p><b>Update:</b> See the comments below.  You can also fit a linear model against all of v1..v5 to predict the Skewz rating as the response.  This fits a little better than using just v2.  Here&#8217;s the scatterplot for the model&#8217;s predictions.<br />
<a href="http://anyall.org/blog/wp-content/uploads/2008/10/pic3.png"><img src="http://anyall.org/blog/wp-content/uploads/2008/10/pic3.png" alt="" title="SVD correlates with political bias judgments" width="290" height="300" class="aligncenter size-medium wp-image-193" /></a></p>
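<p>The fit itself is just ordinary least squares of the Skewz rating on v1..v5 plus an intercept.  A sketch in Python with synthetic stand-in data (the real values are in the spreadsheet above):</p>

```python
import numpy as np

# Synthetic stand-ins: 150 "sites", five SVD components, and a rating
# driven mostly by the second component, plus noise.
rng = np.random.default_rng(1)
V = rng.normal(size=(150, 5))                    # columns play v1..v5
skewz = 0.1 * V[:, 0] + 0.9 * V[:, 1] + rng.normal(scale=0.3, size=150)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(len(V)), V])
coef, *_ = np.linalg.lstsq(X, skewz, rcond=None)
pred = X @ coef
```

<p>With real data you&#8217;d then compare the correlation of <code>pred</code> with the ratings against the correlation using v2 alone.</p>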
<hr />
<p>Code: I put the Skewz scraper, data, and scripts <a href="http://gist.github.com/16240">up here</a>.</p>
<p>Final note: The correlation coefficients above are via <a href="http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient">Kendall-Tau</a>, which is invariant to any monotonic rescaling of the data.  This data has all sorts of odd spikes and such, and Joshua and Andy themselves rescaled the data for the coloring plugin, so this seemed safest.  And don&#8217;t worry about the small sample size; the v1 correlation&#8217;s p-value is .04, and the v2 correlation&#8217;s p-value is tiny.</p>
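<p>Since Kendall-Tau only looks at how pairs are ordered, the invariance is easy to check directly.  A pure-Python sketch with made-up data (not their dataset):</p>

```python
import math
import random

def kendall_tau(x, y):
    """Naive O(n^2) Kendall rank correlation, assuming no ties:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(50)]
y = [xi + random.gauss(0, 0.5) for xi in x]  # noisy monotone relationship

tau = kendall_tau(x, y)
# A strictly increasing rescaling preserves every pairwise ordering,
# so tau comes out identical:
tau_rescaled = kendall_tau([math.exp(3 * xi) for xi in x], y)
```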
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/10/it-is-accurate-to-determine-a-blogs-bias-by-what-it-links-to/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Turker classifiers and binary classification threshold calibration</title>
		<link>http://brenocon.com/blog/2008/06/turker-classifiers-and-binary-classification-threshold-calibration/</link>
		<comments>http://brenocon.com/blog/2008/06/turker-classifiers-and-binary-classification-threshold-calibration/#comments</comments>
		<pubDate>Wed, 18 Jun 2008 09:25:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://blog.anyall.org/?p=122</guid>
		<description><![CDATA[I wrote a big Dolores Labs blog post a few days ago. Click here to read it. I am most proud of the pictures I made for it:]]></description>
				<content:encoded><![CDATA[<p>I wrote a big Dolores Labs blog post a few days ago.  <a href="http://blog.doloreslabs.com/?p=61">Click here to read it</a>.  I am most proud of the pictures I made for it:</p>
<p><img src="http://blog.doloreslabs.com/wp-content/uploads/2008/06/vertthresh.png"></p>
<p><img src="http://blog.doloreslabs.com/wp-content/uploads/2008/06/confusionbars.png"></p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/06/turker-classifiers-and-binary-classification-threshold-calibration/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Are women discriminated against in graduate admissions? Simpson&#8217;s paradox via R in three easy steps!</title>
		<link>http://brenocon.com/blog/2008/04/are-women-discriminated-against-in-graduate-admissions-simpsons-paradox-via-r-in-three-easy-steps/</link>
		<comments>http://brenocon.com/blog/2008/04/are-women-discriminated-against-in-graduate-admissions-simpsons-paradox-via-r-in-three-easy-steps/#comments</comments>
		<pubDate>Sun, 13 Apr 2008 09:04:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://blog.anyall.org/?p=116</guid>
		<description><![CDATA[R has a fun built-in package, datasets: a whole bunch of easy-to-use, interesting tables of data. I found the famous UC Berkeley admissions data set, from a 1970&#8217;s study of whether sex discrimination existed in graduate admissions. It&#8217;s famous for &#8230; <a href="http://brenocon.com/blog/2008/04/are-women-discriminated-against-in-graduate-admissions-simpsons-paradox-via-r-in-three-easy-steps/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href=http://www.statmethods.net/>R</a> has a fun built-in package, <a href=http://finzi.psych.upenn.edu/R/library/datasets/html/00Index.html>datasets</a>: a whole bunch of easy-to-use, interesting tables of data.  I found the famous UC Berkeley admissions data set, from a 1970&#8217;s study of whether sex discrimination existed in graduate admissions.  It&#8217;s famous for illustrating a particular statistical paradox.  Thanks to R&#8217;s awesome mosaic plots interface, we can see this really easily.</p>
<p>UCBAdmissions is a three-dimensional table (like a matrix): Admit Status x Gender x Dept, with counts for each category as the matrix&#8217;s values.  R&#8217;s default printing shows the basics just fine.  Here&#8217;s the data for just the first of six departments:</p>
<blockquote><pre>&gt; <i>UCBAdmissions</i>
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

...
</pre>
</blockquote>
<p>Overall, <b>women have a lower admittance rate than men</b>:</p>
<blockquote><pre>&gt; <i>apply(UCBAdmissions,c(1,2),sum)</i>

          Gender
Admit         M    F
  Admitted 1198  557
  Rejected 1493 1278
</pre>
</blockquote>
<p>This is the phenomenon that prompted a lawsuit against Berkeley, which in turn led to the study that collected this data.</p>
<p>R&#8217;s plot function is overloaded to do a mosaic plot for this sort of categorical data.  Very cool.  With just</p>
<blockquote><pre>&gt; plot(UCBAdmissions)
</pre>
</blockquote>
<p>or, playing around after reading <a href=http://www.statmethods.net/advgraphs/mosaic.html>Quick-R&#8217;s page on this</a>:</p>
<blockquote><pre>&gt; install.packages("vcd")
&gt; library(vcd)
&gt; mosaic(UCBAdmissions, condvars=c('Dept'))
</pre>
</blockquote>
<p>We have a plot showing admittance and gender breakdowns per department:</p>
<p><center><img src="http://farm4.static.flickr.com/3155/2409960034_1fb97ffbd1_o.png"></center></p>
<p><b>In each department, women have similar admittance rates as men.</b>  This seems to be at odds with the fact that women have a lower admittance rate overall.  This discrepancy is an example of <a href=http://en.wikipedia.org/wiki/Simpson%27s_paradox>Simpson&#8217;s paradox</a>.</p>
<p>This mosaic also shows the explanation: <b>Selective departments have more female applicants.</b>  It&#8217;s easy to see since the departments are ordered by selectiveness.  Departments A and B let in many applicants, but they&#8217;re mostly male.  The reverse is true for the rest.  This means that the overall female population takes big admittance hits in departments C through F, while lots of males get in via departments A and B.</p>
<p>I think these mosaic plots are impressive for visualizing categorical proportions for high dimensional data sets.  Well, by “high” I think I mean, more than 2.  I can&#8217;t think of a better way to see several cross relationships in categorical data at once.  And the only tuning I needed to do was play around a bit with the order of those three dimensions.</p>
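<p>The arithmetic behind the paradox is easy to replay by hand.  Here&#8217;s a sketch in Python rather than R, with invented counts for just two departments (the real table has six), built to show the same reversal:</p>

```python
# counts[dept][gender] = (admitted, rejected); the numbers are made up.
counts = {
    "easy": {"M": (80, 20), "F": (9, 1)},    # lenient dept, mostly male
    "hard": {"M": (2, 8),   "F": (25, 75)},  # selective dept, mostly female
}

def rate(admitted, rejected):
    return admitted / (admitted + rejected)

# Per department, women are admitted at the HIGHER rate in both cases...
per_dept = {d: {g: rate(*counts[d][g]) for g in ("M", "F")} for d in counts}

# ...yet pooled over departments, men come out ahead, because most
# women applied to the selective department.
def overall(g):
    admitted = sum(counts[d][g][0] for d in counts)
    rejected = sum(counts[d][g][1] for d in counts)
    return rate(admitted, rejected)
```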
<p>Sources:</p>
<ul>
<li>R&#8217;s <a href=http://finzi.psych.upenn.edu/R/library/datasets/html/UCBAdmissions.html>UCBAdmissions</a> help page.  It comes with the standard download of R.
<li>R&#8217;s <a href=http://finzi.psych.upenn.edu/R/library/vcd/html/mosaic.html>vcd::mosaic</a> function.  I recommend the <a href=http://cran.r-project.org/web/packages/vcd/vignettes/strucplot.pdf>PDF vignette</a> about it, which has many more pictures of cool mosaic plots.
<li><strike>I would post the original 1975 Science paper, but it&#8217;s not freely available.  I hate academic publishers.</strike>  Here&#8217;s the paper, at least for now:
<ul>
<li> Bickel, P. J., Hammel, E. A., and O&#8217;Connell, J. W. (1975) Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398–403.  <a href="http://anyall.org/science_1975_sex_bias_graduate_admissions_data_berkeley.pdf" rel="nofollow">[PDF]</a></ul>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/04/are-women-discriminated-against-in-graduate-admissions-simpsons-paradox-via-r-in-three-easy-steps/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>color name study i did</title>
		<link>http://brenocon.com/blog/2008/03/color-name-study-i-did/</link>
		<comments>http://brenocon.com/blog/2008/03/color-name-study-i-did/#comments</comments>
		<pubDate>Tue, 18 Mar 2008 16:54:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://blog.anyall.org/?p=108</guid>
		<description><![CDATA[Link: Where does “Blue” end and “Red” begin? I&#8217;m writing some posts on blog.doloreslabs.com and this is the best one so far. Methodology-wise, along the lines of my earlier Amazon Mechanical Turk moral decisions survey&#8230;]]></description>
				<content:encoded><![CDATA[<p>Link: <a href="http://blog.doloreslabs.com/?p=11">Where does “Blue” end and “Red” begin?</a></p>
<p>I&#8217;m writing some posts on blog.doloreslabs.com and this is the best one so far.  Methodology-wise, along the lines of my earlier Amazon Mechanical Turk <a href="http://socialscienceplusplus.blogspot.com/2008/01/moral-psychology-on-amazon-mechanical.html">moral decisions survey</a>&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/03/color-name-study-i-did/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Food Fight</title>
		<link>http://brenocon.com/blog/2008/01/food-fight/</link>
		<comments>http://brenocon.com/blog/2008/01/food-fight/#comments</comments>
		<pubDate>Thu, 31 Jan 2008 07:21:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://blog.anyall.org/?p=104</guid>
		<description><![CDATA[Absolutely amazing &#8212; a short film chronicling conflicts from World War II &#8212; as food. I think this has to have the highest amount of Wikipedia-linkable references per second of any film I&#8217;ve seen. Yes, it&#8217;s U.S.-centric, but so is &#8230; <a href="http://brenocon.com/blog/2008/01/food-fight/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Absolutely amazing &#8212; a <a href="http://homepage.mac.com/stefannadelman/foodfight/index.htm">short film</a> chronicling conflicts from World War II &#8212; as food.</p>
<p><embed src="http://www.atomfilms.com:80/a/autoplayer/shareEmbed.swf?keyword=food_fight" width="426" height="350"></embed></p>
<p>I think this has to have the highest number of Wikipedia-linkable references per second of any film I&#8217;ve seen.  Yes, it&#8217;s U.S.-centric, but so is Wikipedia, which makes cataloguing it easier.  At the very least:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/World_War_II">World War II</a>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Racial_policy_of_Nazi_Germany">Persecution of Jews</a> and <a href="http://en.wikipedia.org/wiki/The_Holocaust">The Holocaust</a></li>
<li><a href="http://en.wikipedia.org/wiki/Battle_of_France">Invasion of France</a></li>
<li><a href="http://en.wikipedia.org/wiki/The_Blitz">The Blitz</a></li>
<li><a href="http://en.wikipedia.org/wiki/Attack_on_Pearl_Harbor">Pearl Harbor</a>, <a href="http://en.wikipedia.org/wiki/Pacific_Ocean_theater_of_World_War_II">Pacific Theater</a></li>
<li><a href="http://en.wikipedia.org/wiki/Normandy_Campaign">D-Day</a></li>
<li><a href="http://en.wikipedia.org/wiki/Liberation_of_Paris">Liberation of France</a><br />Invasion of Germany &#8211; <a href="http://en.wikipedia.org/wiki/Western_Front_%28World_War_II%29">Western</a> and <a href="http://en.wikipedia.org/wiki/Eastern_Front_%28World_War_II%29">Eastern</a> fronts</li>
<li><a href="http://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki">Atomic bombing of Hiroshima</a></li>
</ul>
</li>
<li><a href="http://en.wikipedia.org/wiki/1948_Arab-Israeli_War">1948 Arab-Israeli War</a></li>
<li><a href="http://en.wikipedia.org/wiki/Korean_War">Korean War</a></li>
<li><a href="http://en.wikipedia.org/wiki/Cuban_Missile_Crisis">Cuban Missile Crisis</a></li>
<li><a href="http://en.wikipedia.org/wiki/Vietnam_War">Vietnam War</a>
<ul>
<li><a href="http://en.wikipedia.org/wiki/First_Indochina_War">French Indochina War</a></li>
<li><a href="http://en.wikipedia.org/wiki/Role_of_United_States_in_the_Vietnam_War">U.S. involvement</a></li>
</ul>
</li>
<li><a href="http://en.wikipedia.org/wiki/Nuclear_arms_race">US/USSR nuclear arms race</a></li>
<li><a href="http://en.wikipedia.org/wiki/First_Gulf_War">First Gulf War</a>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Iraqi_invasion_of_Kuwait">Invasion of Kuwait</a></li>
</ul>
</li>
<li><a href="http://en.wikipedia.org/wiki/Israeli-Palestinian_conflict">Israeli-Palestinian conflict</a>
<ul>
<li>Judging from its timing in the film, maybe specifically the <a href="http://en.wikipedia.org/wiki/First_Intifada">First Intifada</a>?</li>
</ul>
</li>
<li><a href="http://en.wikipedia.org/wiki/September_11%2C_2001_attacks">9/11 attacks</a></li>
<li><a href="http://en.wikipedia.org/wiki/War_in_Afghanistan_(2001%E2%80%93present)">War in Afghanistan</a>
<ul>
<li>Taliban falls, but <a href="http://en.wikipedia.org/wiki/Islamic_Emirate_of_Waziristan">some escape</a>, with <a href="http://en.wikipedia.org/wiki/Location_of_Osama_bin_Laden">Osama bin Laden</a></li>
</ul>
</li>
<li><a href="http://en.wikipedia.org/wiki/Iraq_War">2003 Iraq war</a> and <a href="http://en.wikipedia.org/wiki/Post-invasion_Iraq,_2003%E2%80%932006">aftermath</a></li>
</ul>
<p>Many lessons learned about food violence, 20th century war, and Wikipedia <a href="http://en.wikipedia.org/wiki/Meronymy">meronymic</a> article relationships&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/01/food-fight/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Moral psychology on Amazon Mechanical Turk</title>
		<link>http://brenocon.com/blog/2008/01/moral-psychology-on-amazon-mechanical-turk/</link>
		<comments>http://brenocon.com/blog/2008/01/moral-psychology-on-amazon-mechanical-turk/#comments</comments>
		<pubDate>Sun, 20 Jan 2008 01:44:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://blog.anyall.org/?p=102</guid>
		<description><![CDATA[There&#8217;s a lot of exciting work in moral psychology right now. I&#8217;ve been telling various poor fools who listen to me to read something from Jonathan Haidt or Joshua Greene, but of course there&#8217;s a sea of too many articles &#8230; <a href="http://brenocon.com/blog/2008/01/moral-psychology-on-amazon-mechanical-turk/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>There&#8217;s a lot of exciting work in moral psychology right now.  I&#8217;ve been telling various poor fools who listen to me to read something from <a href="http://people.virginia.edu/~jdh6n/">Jonathan Haidt</a> or <a href="http://www.wjh.harvard.edu/~jgreene/">Joshua Greene</a>, but of course there&#8217;s a sea of too many articles and books of varying quality and intended audience.  But just last week Steven Pinker wrote a great NYT magazine article, <a href="http://www.nytimes.com/2008/01/13/magazine/13Psychology-t.html">&#8220;The Moral Instinct,&#8221;</a> which summarizes current research and tries to spell out a few implications.  I recommend it highly, if just for presenting so many awesome examples.  (Yes, this blog has <a href=/2007/11/how-did-freud-become-respected-humanist.html>poked fun</a> at Pinker before.  But in any case, he is a brilliant expository writer.  <a href=http://pinker.wjh.harvard.edu/books/tli/>The Language Instinct</a> is still one of my favorite popular science books.)</p>
<p>For a while now I&#8217;ve been thinking that recruiting subjects online could lend itself to collecting some really interesting behavioral science data.  A few months ago I tried doing this with <a href=http://www.mturk.com/>Amazon Mechanical Turk</a>, a horribly misnamed web service that actually lets you create web-based tasks and pay online workers to do them.  Its canonical commercial applications include tedious tasks like <a href=http://www.techcrunch.com/2007/10/21/powerset-testing-results-at-mechanical-turk/>search quality</a> evaluation or image labeling, where you really need human data to perform well.  You put up, say, several thousand images you want classified as &#8220;porn&#8221; or &#8220;not-porn&#8221;, say you&#8217;ll pay workers $0.01 to label ten images, then sit back and watch the data roll in.</p>
<p>So AMT advertises itself as a data annotation or machine learning substitute system, but I think its main innovation is finding out that there are lots and lots of people with free time willing to do online work for very, very low amounts of money.  You can run any task you want, including surveys, and people happily respond for mere pennies.  (Far below minimum wage, I might add &#8212; their motivation seems to be more like casual gaming or so.)  To that end, I tried out running one of the standard moral psych survey questions to see what would happen &#8212; the so-called <a href=http://en.wikipedia.org/wiki/Trolley_problem>&#8220;trolley problem&#8221;</a>:</p>
<blockquote><p>A runaway trolley is hurtling down a track towards five people who have been tied down in its path.  If nothing happens, they will be killed.  Fortunately, you have a switch which would divert the trolley to a different track.  Unfortunately, the other track has one person tied down to it.  Should you flip the switch?</p></blockquote>
<p>It&#8217;s supposed to be a classic dilemma of <a href=http://plato.stanford.edu/entries/consequentialism/>consequentialist</a> vs. <a href=http://plato.stanford.edu/entries/ethics-deontological/>deontological</a> moral reasoning.  Is it acceptable to sacrifice for the greater good?  Is it permissible to take an action that will cause a preventable death?  And so on.  I think it&#8217;s neat just because when I pose it to people, different folks really do disagree, give different answers, and are willing to argue about it.  There are some interesting recent fMRI findings (due to Greene I think?) that people who refuse to flip the switch seem to be engaged in a more emotional response, whereas those who do seem to be using deliberative reasoning systems.  (Some, like Greene and Pinker, seem to go further and argue this is a substantive normative reason to favor flipping the switch; whether you feel like getting sucked into that debate, though, there&#8217;s clearly something interesting happening here.)</p>
<p>So I ran this on AMT; the participants (they call themselves <a href=http://turkers.proboards80.com/>&#8220;turkers&#8221;</a>) had to answer yes or no.  Turns out 77% say they&#8217;d flip the switch.</p>
<p>I also ran two variant scenarios of the same logical dilemma, to sacrifice one person to save five:</p>
<blockquote><p>A trolley is hurtling down a track towards five people.  You are on a bridge under which it will pass, and you can stop it by dropping a heavy weight in front of it.  As it happens, there is a very fat man next to you &#8211; your only way to stop the trolley is to push him over the bridge and onto the track, killing him to save five.  Should you proceed?</p></blockquote>
<p>and</p>
<blockquote><p>A brilliant transplant surgeon has five patients, each in need of a different organ, each of whom will die without that organ. Unfortunately, there are no organs available to perform any of these five transplant operations. A healthy young traveler, just passing through the city the doctor works in, comes in for a routine checkup. In the course of doing the checkup, the doctor discovers that his organs are compatible with all five of his dying patients. Suppose further that if the young man were to disappear, no-one would suspect the doctor.  Should the doctor sacrifice the man to save his other patients?</p></blockquote>
<p>These two, of course, feel a lot harder to say &#8220;Yes&#8221; to, but if you were willing to say &#8220;Yes&#8221; to the original question, it is hard to justify why.  The participants&#8217; responses followed what you would expect: fewer said &#8220;Yes&#8221; to these scenarios.  Here are the Yes/No responses to each of the questions (100 responses for each):</p>
<table border=1>
<tr>
<th>Question</th>
<th>Yes</th>
<th>No</th>
</tr>
<tr><td>surgeon</td><td>2</td><td>98</td></tr>
<tr><td>fat man</td><td>30</td><td>70</td></tr>
<tr><td>switch, save 5</td><td>77</td><td>23</td></tr>
<tr><td>switch, save 10</td><td>82</td><td>18</td></tr>
<tr><td>switch, save 15</td><td>83</td><td>17</td></tr>
<tr><td>switch, save 20</td><td>83</td><td>17</td></tr>
</table>
<p>Only two people thought it was acceptable to sacrifice for organs, and only half as many would push the fat man as would flip the switch.  I also ran variants of the switch version with more and more people on the tracks; the Yes response creeps upwards but never reaches 100%.  The differences among the first three questions are statistically significant (unpaired t-tests, all p<.001 (this seems like the wrong test, can anyone correct me?)).</p>
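<p>On the test question: for yes/no counts like these, one standard choice is a two-proportion z-test with a pooled standard error (a normal approximation) rather than a t-test.  A sketch using the &#8220;switch, save 5&#8221; vs. &#8220;fat man&#8221; counts from the table above:</p>

```python
import math

def two_proportion_z(yes1, n1, yes2, n2):
    """z statistic for testing equality of two binomial proportions,
    using the pooled estimate for the standard error."""
    p1, p2 = yes1 / n1, yes2 / n2
    pooled = (yes1 + yes2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 77/100 said Yes to flipping the switch; 30/100 to pushing the fat man.
z = two_proportion_z(77, 100, 30, 100)  # well past 1.96, so p << .05
```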
<p>What&#8217;s amazing is how fast responses happen.  I started getting responses just minutes after posting the question.  I actually posted each of the six questions as a separate, standalone task; but many of the turkers who did one found the rest in the task pool and did them too.  (So what was supposed to be a between-subjects design fell into something else, oops!)  The whole thing cost $6 and was done in a matter of hours.  It&#8217;s very encouraging &#8212; AMT allows you to very quickly iterate and try out different designs and such.  It&#8217;s a bit of a pain to use, though; Amazon has certainly done a poor job in exploiting its full potential.  (They have a form builder which was good enough to quickly write up these tasks, but to do anything moderately sophisticated, even just getting your data back out, you have to write programs against their <a href=http://aws.amazon.com/mturk>somewhat mediocre API</a>; you have to know how to use an XML parser, etc.  <a href=http://doloreslabs.com/>Hm.</a>)</p>
<p>I also tried an explicitly within-subject version, where each participant answered the three basic versions.  I was interested in consistency &#8212; presumably very few people would sacrifice for organs but refuse to divert the trolley.  For 141 participants, here are the frequencies of the different answer triples:</p>
<table border=1>
<tr>
<th>% with this response triple</th>
<th>flip switch?</th>
<th>push fat man?</th>
<th>sacrifice traveler for organs?</th>
</tr>
<tr><td>42.6</td><td>Y</td><td>N</td><td>N</td></tr>
<tr><td>29.8</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>20.6</td><td>N</td><td>N</td><td>N</td></tr>
<tr><td>5.0</td><td>Y</td><td>Y</td><td>Y</td></tr>
<tr><td>0.7</td><td>Y</td><td>N</td><td>Y</td></tr>
<tr><td>0.7</td><td>N</td><td>Y</td><td>Y</td></tr>
<tr><td>0.7</td><td>N</td><td>Y</td><td>N</td></tr>
</table>
<p>I personally find the most common responses coherent with my own gut reactions &#8212; from left to right, I feel less and less good about sacrificing in each case.  Perhaps all people feel the same gut reactions, and use different ad hoc reasons to draw the line in different places?</p>
<p>I&#8217;m sorry that this post started with neat moral psychology then degenerated into methodology, but hey, it&#8217;s fun.  I&#8217;ve seen only two instances of research papers written using AMT, both by computer scientists: here&#8217;s a nice blog post on an <a href=http://behind-the-enemy-lines.blogspot.com/2007/08/experiences-using-amazon-mechanical.html>information retrieval experiment</a> (it&#8217;s a great blog, btw), and someone mentioned <a href=http://www2007.org/htmlpapers/paper461/>this one on data processing accuracy</a> as well.  Anyone know of any others?  It&#8217;s clearly an interesting approach.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2008/01/moral-psychology-on-amazon-mechanical-turk/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
	</channel>
</rss>
