AI and Social Science - Brendan O'Connor

Java and IDEs for the R/Python world

brendano — Tue, 16 Jun 2015 00:57:26 +0000

(Some tips on how to use Java if you’re from R or Python; some thoughts on software platforms and programming for data-science-or-whatever-we-call-it-now.)

Most of my research these days uses Python, R, or Java. It’s terrific that so many people are using Python and R as their primary languages now; this is way better than the bad old days when people overused Java just because that’s what they learned in their intro CS course. Python/R are better for many things. But fast, compiled, static languages are still important[1], and Java still seems to be a very good cross-platform approach for this[2], or at the very least, it’s helpful to know how to muck around with CoreNLP or Mallet. I think in undergrad I kept annoying my CS professors that we needed to stop using Java and do everything in Python, but I honestly think we now have the opposite problem: I’ve met many people recently who do lots of programming without traditional CS training (e.g. from the natural sciences, social sciences, statistics, humanities, etc.), who need to pick up some Java but find it fairly different than the lightweight languages they first learned. I don’t know of good overall introductions to the language for this audience, but here’s a little bit of information about development tools which make it easier.

Unlike R or Python, Java is really hard to program with just a text editor. You have to import tons of packages to do anything basic, and the names for everything are long and easy to misspell, which is extra bad because it takes more lines of code to do anything. While it’s important to learn the bare basics of compiling and running Java from the command line (at the very least because you need to understand it to run Java on a server), the only good way to write Java for real is with an IDE. This is more complicated than a text editor, but once you get the basics down it is much more productive for most things. In many ways Java is an outdated, frustrating language, but Java plus an IDE is actually pretty good.

The two most popular IDEs seem to be Eclipse and IntelliJ. I’ve noticed really good Java programmers often prefer IntelliJ. It’s probably better. I use Eclipse only because I learned it a long time ago. They’re both free.

The obvious things an IDE gives you include things like autosuggestion to tell you method names for a particular object, or instantly flagging misspelled variable names. But the most useful and underappreciated features, in my opinion, are for code navigation and refactoring. I feel like I became many times more productive when I learned how to use them.

For example:

Go to a definition (Eclipse name: “Open Declaration”). Hold “Command” then all the function names, class names, and variable names will get underlines. You can click one to navigate to where it’s declared. This is really helpful to follow method calls. You basically are following the path your program would take at runtime. You can even navigate into the code for any library or the standard library.
Back: this is a button on the toolbar. After you’ve navigated to a declaration, use this to go back to where you were before. This lets you do things like go to a method just to quickly refresh your memory about what’s going on, or maybe go to a class to remember what things are in it, then after a second go right back to what you were working on. This lets you effectively deal with a lot more complexity without holding it all in your head at once.

(The “Command” key is for Mac with Eclipse; there are equivalents for Linux and Windows and other IDEs too.)

With these two commands, you can move through your code, and other people’s code, like it’s a web browser. Enabling keyboard shortcuts makes it work even better. Then you can press a keyboard shortcut to navigate to the function currently under your cursor, and press another to go back to where you were. I think that by default these two commands don’t both have shortcuts; it’s worth adding them yourself (in Preferences). I actually mapped them to be like Chrome or Safari, using Command-[ and Command-] for Back and Open Declaration, respectively. I use them constantly when writing Java code.

But that’s just one navigational direction. You can also traverse in other directions with:

See all references (Eclipse: right-click, “References”; or, Cmd-Shift-G). You invoke this on a function name in the code. Then you’ll get a listing on the sidebar of all places that call that function, and you can click on them to go to them. As opposed to going to a declaration, this lets you go backwards in a hypothetical call stack. It’s like being able to navigate to all inbound links, like all “cited by” in Google Scholar. And it’s useful for variables and classes, too. By invoking this on different things in your code, you quickly get little ego-network snapshots of your codebase’s dependency graph. This not only helps you track down bugs, but helps you figure out how to refactor or restructure your code.

There are many other useful navigational features as well, such as navigating to a class by typing a prefix of its name; and many other IDE features too. Different people tend to use different ones so it’s worth looking at what different people use.

Finally, besides navigation, a very useful feature is rename refactoring: any variable or function or class can be renamed, and all references to it get renamed too. Since names are pretty important for comprehension, this actually makes it much easier to write the first draft of code, because you don’t have to worry about getting the name right on the first try. When I write large Python programs, I find I have to spend lots of time thinking through the structure and naming so I don’t hopelessly confuse myself later. There’s also move refactoring, where you can move functions between different files.

Navigation and refactoring aren’t just things for Java; they’re important things you want to do in any language. There are certainly IDEs and editor plugins for lightweight languages as well which support these things to greater or lesser degrees (e.g. RStudio, PyCharm, Syntastic…). And without IDE support, there are unix-y alternatives like CTags, perl -pi, grep, etc. These are good, but their accuracy relative to the semantics you care about often is less than 100%, which changes how you use them.
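To make the "less than 100%" point concrete, here is a small sketch of what a purely textual rename does. The code being renamed is a made-up example (the `mean` and `report` functions are hypothetical), and the regex substitution stands in for something like `perl -pi`; it is an illustration, not a claim about how any particular tool behaves.

```python
import re

# Hypothetical source file: we only want to rename `count` inside mean().
source = '''
def mean(values):
    count = len(values)
    return sum(values) / count

def report(words):
    count = {}  # a different variable that happens to share the name
    for w in words:
        count[w] = count.get(w, 0) + 1
    print("count of words:", len(count))
    return count
'''

# Text-level rename, roughly what `perl -pi -e 's/\bcount\b/n/g'` would do.
renamed, n_replaced = re.subn(r"\bcount\b", "n", source)

# How many occurrences we actually meant to touch (those inside mean()).
intended = len(re.findall(r"\bcount\b", source.split("def report")[0]))

print("replacements made:", n_replaced)
print("replacements intended:", intended)
print(renamed)
```

The substitution also rewrites the unrelated variable in `report` and the text inside the string literal, because it matches characters rather than declarations. An IDE rename works from the resolved declaration, which is why you can trust it without reviewing every hit.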

Java and IDE-style development feel almost retrospective in some ways. To me at least, they’re associated with a big-organization, top-heavy, bureaucratic software engineering approach to programming, which feels distant from the needs of computational research or startup-style lightweight development. And they certainly don’t address some of the major development challenges facing scientific programming, like dependency management for interwoven code/data pipelines, or data/algorithm visualization done concurrently with code development. But these tools are still the most effective ones for a large class of problems, so they’re worth doing well if you’re going to do them at all.

[1]: An incredibly long and complicated discussion addressed in many other places, but in my own work, static languages are necessary over lightweight ones for (1) algorithms that need more speed, especially ones that involve graphs or linguistic structure, or sample/optimize over millions of datapoints; (2) larger programs, say more than a few thousand lines of code, which is when dynamic typing starts to turn into a mess while static typing and abstractions start to pay off; (3) code with multiple authors, or that develops or uses libraries with nontrivial APIs; in theory dynamic types are fine if everyone is super good at communication and documentation, but in practice explicit interfaces make things much easier. If none of (1-3) are true, I think Python or R is preferable.

[2]: Long story here and depends on your criteria. Scala is similar to Java in this regard. The main comparison is to C and C++, which have a slight speed edge over Java (about the same for straightforward numeric loops, but gains in BLAS/LAPACK and other low-level support), are way better for memory usage, and can more directly integrate with your favorite more-productive high-level language (e.g. Python, R, or Matlab). But the interface between C/C++ and the lightweight language you care about is cumbersome. Cython and Rcpp do this better — and especially good if you’re willing to be tied to either Python or R — but they’re still awkward enough they slow you down and introduce new bugs. (Julia is a better approach since it just eliminates this dichotomy, but is still maturing.) C/C++’s weaknesses compared to Java include nondeterministic crashing bugs (due to the memory model), high conceptual complexity to use the better C++ features, time-wasting build issues, and no good IDEs. At the end of the day I find that I’m usually more productive in Java than C/C++, though the Cython or Rcpp hybrids can get to similar outcomes. These main criteria somewhat assume a Linux or Mac platform; people on Microsoft Windows are in a different world where there’s a great C++ IDE and C# is available, which is (mostly?) better than Java. But very few people in my work/research world use Windows and it’s been like this for many years, for better or worse.

Replot: departure delays vs flight time speed-up

brendano — Sat, 26 Apr 2014 19:32:38 +0000

Here’s a re-plotting of a graph in this 538 post. It looks at whether pilots speed up the flight when there’s a delay, and finds that this seems to be the case. This is averaged data for flights on several major transcontinental routes.

I’ve replotted the main graph as follows. The x-axis is departure delay. The y-axis is the total trip time — number of minutes since the scheduled departure time. For an on-time departure, the average flight is 5 hours, 44 minutes. The blue line shows what the total trip time would be if the delayed flight took that long. Gray lines are uncertainty (I think the CI due to averaging).
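The blue reference line is just arithmetic on the on-time flight time. A minimal sketch follows; the function and constant names are mine, and the only number taken from the post is the 5 hour 44 minute average.

```python
# Average flight time for an on-time departure, from the post: 5 h 44 min.
ON_TIME_FLIGHT_MINUTES = 5 * 60 + 44


def trip_time_without_speedup(departure_delay_minutes):
    """Total minutes since scheduled departure if the flight itself is no faster.

    This is the value the blue line shows for a given departure delay.
    """
    return departure_delay_minutes + ON_TIME_FLIGHT_MINUTES


def minutes_made_up(departure_delay_minutes, observed_trip_minutes):
    """How far an observed total trip time falls below the blue line."""
    return trip_time_without_speedup(departure_delay_minutes) - observed_trip_minutes


for delay in (0, 10, 40):
    print(delay, trip_time_without_speedup(delay))
```

Points below the blue line are flights that made up time in the air; `minutes_made_up` is the vertical gap.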

What’s going on is, the pilots seem to be targeting a total trip time of 370-380 minutes or so. If the departure is only slightly delayed by 10 minutes, the flight time is still the same, but delays in the 30-50 minutes range see a faster flight time which makes up for some of the delay.

The original post plotted the y-axis as the delta against the expected travel time (delta against 5hr44min). It’s good at showing that the difference does really exist, but it’s harder to see the apparent “target travel time”.

Also, I wonder if the grand averaging approach — which averages totally different routes — is necessarily the best. It seems like the analysis might be better by adjusting for different expected times for different routes. The original post is also interested in comparing average flight times by different airlines. You might have to go to linear regression to do all this at once.

I got the data by pulling it out of 538’s plot using the new-to-me tool WebPlotDigitizer. I found it pretty handy! I put files and plotting code at github/brendano/flight_delays.

What the ACL-2014 review scores mean

brendano — Wed, 19 Feb 2014 23:01:17 +0000

I’ve had several people ask me what the numbers in ACL reviews mean — and I can’t find anywhere online where they’re described. (Can anyone point this out if it is somewhere?)

So here’s the review form, below. They all go from 1 to 5, with 5 the best. I think the review emails to authors only include a subset of the below — for example, “Overall Recommendation” is not included?

The CFP said that they have different types of review forms for different types of papers. I think this one is for a standard full paper. I guess what people really want to know is what scores tend to correspond to acceptances. I really have no idea and I get the impression this can change year to year. I have no involvement with the ACL conference besides being one of many, many reviewers.

APPROPRIATENESS (1-5)
Does the paper fit in ACL 2014? (Please answer this question in light of the desire to broaden the scope of the research areas represented at ACL.)

5: Certainly.
4: Probably.
3: Unsure.
2: Probably not.
1: Certainly not.

CLARITY (1-5)
For the reasonably well-prepared reader, is it clear what was done and why? Is the paper well-written and well-structured?

5 = Very clear.
4 = Understandable by most readers.
3 = Mostly understandable to me with some effort.
2 = Important questions were hard to resolve even with effort.
1 = Much of the paper is confusing.

ORIGINALITY (1-5)
Is there novelty in the developed application or tool? Does it address a new problem or one that has received little attention? Alternatively, does it present a system that has significant benefits over other systems, either in terms of its usability, coverage, or success?

5 = Surprising: Significant new problem, or a major advance over other applications or tools that attack this problem.
4 = Noteworthy: An interesting new problem, with clear benefits over other applications or tools that attack this problem.
3 = Respectable: A nice research contribution that represents a notable extension of prior approaches.
2 = Marginal: Minor improvements on existing applications or tools in this area.
1 = The system does not represent any advance in the area of natural language processing.

IMPLEMENTATION AND SOUNDNESS (1-5)
Has the application or tool been fully implemented or do certain parts of the system remain to be implemented? Does it achieve its claims? Is enough detail provided that one might be able to replicate the application or tool with some effort? Are working examples provided and do they adequately illustrate the claims made?

5 = The application or tool is fully implemented, and the claims are convincingly supported. Other researchers should be able to replicate the work.
4 = Generally solid work, although there are some aspects of the application or tool that still need work, and/or some claims that should be better illustrated and supported.
3 = Fairly reasonable work. The main claims are illustrated to some extent with examples, but I am not entirely ready to accept that the application or tool can do everything that it should (based on the material in the paper).
2 = Troublesome. There are some aspects that might be good, but the application or tool has several deficiencies and/or limitations that make it premature.
1 = Fatally flawed.

SUBSTANCE (1-5)
Does this paper have enough substance, or would it benefit from more ideas or results?
Note that this question mainly concerns the amount of work; its quality is evaluated in other categories.

5 = Contains more ideas or results than most publications in this conference; goes the extra mile.
4 = Represents an appropriate amount of work for a publication in this conference. (most submissions)
3 = Leaves open one or two natural questions that should have been pursued within the paper.
2 = Work in progress. There are enough good ideas, but perhaps not enough in terms of outcome.
1 = Seems thin. Not enough ideas here for a full-length paper.

EVALUATION (1-5)
To what extent has the application or tool been tested and evaluated? Have there been any user studies?

5 = The application or tool has been thoroughly tested. Rigorous evaluation on a large corpus or via formal user studies support the claims made for the system. Critical analysis of the results yields many insights into the limitations (if any).
4 = The application or tool has been tested and evaluated on a reasonable corpus or with a small set of users. The results support the claims made. Critical analysis of the results yields some insights into the limitations (if any).
3 = The application or tool has been tested and evaluated to a limited extent. The results have been critically analyzed to gain insight into the system's performance.
2 = A few test cases have been run on the application or tool but no significant evaluation or user study has been performed.
1 = The application or tool has not been tested or evaluated.

MEANINGFUL COMPARISON (1-5)
Do the authors make clear where the presented system sits with respect to existing literature? Are the references adequate? Are the benefits of the system/application well-supported and are the limitations identified?

5 = Precise and complete comparison with related work. Benefits and limitations are fully described and supported.
4 = Mostly solid bibliography and comparison, but there are a few additional references that should be included. Discussion of benefits and limitations is acceptable but not enlightening.
3 = Bibliography and comparison are somewhat helpful, but it could be hard for a reader to determine exactly how this work relates to previous work or what its benefits and limitations are.
2 = Only partial awareness and understanding of related work, or a flawed comparison or deficient comparison with other work.
1 = Little awareness of related work, or insufficient justification of benefits and discussion of limitations.

IMPACT OF IDEAS OR RESULTS (1-5)
How significant is the work described? Will novel aspects of the system result in other researchers adopting the approach in their own work? Does the system represent a significant and important advance in implemented and tested human language technology?

5 = A major advance in the state-of-the-art in human language technology that will have a major impact on the field.
4 = Some important advances over previous systems, and likely to impact development work of other research groups.
3 = Interesting but not too influential. The work will be cited, but mainly for comparison or as a source of minor contributions.
2 = Marginally interesting. May or may not be cited.
1 = Will have no impact on the field.

IMPACT OF ACCOMPANYING SOFTWARE (1-5)
If software was submitted or released along with the paper, what is the expected impact of the software package? Will this software be valuable to others? Does it fill an unmet need? Is it at least sufficient to replicate or better understand the research in the paper?

5 = Enabling: The newly released software should affect other people's choice of research or development projects to undertake.
4 = Useful: I would recommend the new software to other researchers or developers for their ongoing work.
3 = Potentially useful: Someone might find the new software useful for their work.
2 = Documentary: The new software is useful to study or replicate the reported research, although for other purposes it may have limited interest or limited usability. (Still a positive rating)
1 = No usable software released.

IMPACT OF ACCOMPANYING DATASET (1-5)
If a dataset was submitted or released along with the paper, what is the expected impact of the dataset? Will this dataset be valuable to others in the form in which it is released? Does it fill an unmet need?

5 = Enabling: The newly released datasets should affect other people's choice of research or development projects to undertake.
4 = Useful: I would recommend the new datasets to other researchers or developers for their ongoing work.
3 = Potentially useful: Someone might find the new datasets useful for their work.
2 = Documentary: The new datasets are useful to study or replicate the reported research, although for other purposes they may have limited interest or limited usability. (Still a positive rating)
1 = No usable datasets submitted.

RECOMMENDATION (1-5)
There are many good submissions competing for slots at ACL 2014; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented?

In deciding on your ultimate recommendation, please think over all your scores above. But remember that no paper is perfect, and remember that we want a conference full of interesting, diverse, and timely work. If a paper has some weaknesses, but you really got a lot out of it, feel free to fight for it. If a paper is solid but you could live without it, let us know that you're ambivalent. Remember also that the authors have a few weeks to address reviewer comments before the camera-ready deadline.

Should the paper be accepted or rejected?

5 = This paper changed my thinking on this topic and I'd fight to get it accepted;
4 = I learned a lot from this paper and would like to see it accepted.
3 = Borderline: I'm ambivalent about this one.
2 = Leaning against: I'd rather not see it in the conference.
1 = Poor: I'd fight to have it rejected.

REVIEWER CONFIDENCE (1-5)
5 = Positive that my evaluation is correct. I read the paper very carefully and am familiar with related work.
4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math, experimental design, or novelty.
2 = Willing to defend my evaluation, but it is fairly likely that I missed some details, didn't understand some central points, or can't be sure about the novelty of the work.
1 = Not my area, or paper is very hard to understand. My evaluation is just an educated guess.

PRESENTATION FORMAT
Papers at ACL 2014 can be presented either as poster or as oral presentations. If this paper were accepted, which form of presentation would you find more appropriate?
Note that the decisions as to which papers will be presented orally and which as poster presentations will be based on the nature rather than on the quality of the work. There will be no distinction in the proceedings between papers presented orally and those presented as poster presentations.

RECOMMENDATION FOR BEST LONG PAPER AWARD (1-3)
3 = Definitely.
2 = Maybe.
1 = Definitely not.

Scatterplot of KN/PYP language model results

brendano — Tue, 18 Feb 2014 01:29:17 +0000

I should make a blog where all I do is scatterplot results tables from papers. I do this once in a while to make them easier to understand…

I think the following results are from Yee Whye Teh’s paper on hierarchical Pitman-Yor language models, which in particular compares them to Kneser-Ney and hierarchical Dirichlets. They’re specifically from these slides by Yee Whye Teh (page 25), which show model perplexities. Every dot is for one experimental condition, which has four different results, one from each of the models. So a pair of models can be compared in one scatterplot.

where

ikn = interpolated kneser-ney
mkn = modified kneser-ney
hdlm = hierarchical dirichlet
hpylm = hierarchical pitman-yor

My reading: the KN’s and HPYLM are incredibly similar (as Teh argues should be the case on theoretical grounds). MKN and HPYLM edge out IKN. HDLM is markedly worse (this is perplexity, so lower is better). While HDLM is a lot worse, it does best, relatively speaking, on shorter contexts — that’s the green dot, the only bigram model that was tested, where there’s only one previous word of context. The other models have longer contexts, so I guess the hierarchical summing of pseudocounts screws up the Dirichlet more than the PYP, maybe.

The scatterplot matrix is from this table (colored by N-1, meaning the n-gram size):

tanh is a rescaled logistic sigmoid function

brendano — Thu, 31 Oct 2013 20:34:08 +0000

This confused me for a while when I first learned it, so in case it helps anyone else:

The logistic sigmoid function, a.k.a. the inverse logit function, is

\[ g(x) = \frac{ e^x }{1 + e^x} \]

Its outputs range from 0 to 1, and are often interpreted as probabilities (in, say, logistic regression).

The tanh function, a.k.a. hyperbolic tangent function, is a rescaling of the logistic sigmoid, such that its outputs range from -1 to 1. (There’s horizontal stretching as well.)

\[ \tanh(x) = 2 g(2x) - 1 \]

It’s easy to show the above leads to the standard definition \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \). The (-1,+1) output range tends to be more convenient for neural networks, so tanh functions show up there a lot.
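Spelling out that step, using only the definition of \(g\) above:

\[ 2 g(2x) - 1 = \frac{2 e^{2x}}{1 + e^{2x}} - 1 = \frac{e^{2x} - 1}{e^{2x} + 1} = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

where the last equality multiplies the numerator and denominator by \(e^{-x}\). The identity can also be checked numerically. Here is a minimal sketch in Python using only the standard library; the function names are mine, for illustration.

```python
import math


def logistic(x):
    """Logistic sigmoid g(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def tanh_via_logistic(x):
    """tanh written as a rescaled logistic sigmoid: 2 g(2x) - 1."""
    return 2.0 * logistic(2.0 * x) - 1.0


# Compare against the standard library's tanh at a few points.
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert math.isclose(tanh_via_logistic(x), math.tanh(x), abs_tol=1e-12)
    print(x, tanh_via_logistic(x), math.tanh(x))
```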

The two functions are plotted below. Blue is the logistic function, and red is tanh.

Response on our movie personas paper

brendano — Fri, 13 Sep 2013 21:58:10 +0000

Update (2013-09-17): See David Bamman’s great guest post on Language Log on our latent personas paper, and the big picture of interdisciplinary collaboration.

I’ve been informed that an interesting critique of my, David Bamman’s and Noah Smith’s ACL paper on movie personas has appeared on the Language Log, a guest post by Hannah Alpert-Abrams and Dan Garrette. I posted the following as a comment on LL.

Thanks everyone for the interesting comments. Scholarship is an ongoing conversation, and we hope our work might contribute to it. Responding to the concerns about our paper,

We did not try to make a contribution to contemporary literary theory. Rather, we focus on developing a computational linguistic research method of analyzing characters in stories. We hope there is a place for both the development of new research methods, as well as actual new substantive findings. If you think about the tremendous possibilities for computer science and humanities collaboration, there is far too much to do and we have to tackle pieces of the puzzle to move forward. Clearly, our work falls more into the first category — it was published at a computational linguistics conference, and we did a lot of work focusing on linguistic, statistical, and computational issues like:

how to derive useful semantic relations from current syntactic parsing and coreference technologies,
how to design an appropriate probabilistic model on top of this,
how to design a Bayesian inference algorithm for the model,

and of course, all the amazing work that David did in assembling a large and novel dataset — which we have released freely for anyone else to conduct research on, as noted in the paper. All the comments above show there are a wealth of interesting questions to further investigate. Please do!

We find that, in these multidisciplinary projects, it’s most useful to publish part of the work early and get scholarly feedback, instead of waiting for years before trying to write a “perfect” paper. Our colleagues Noah Smith, Tae Yano, and John Wilkerson did this in their research on Congressional voting; Brendan did this with Noah and Brandon Stewart on international relations events analysis; there’s great forthcoming work from Yanchuan Sim, Noah, Brice Acree and Justin Gross on analyzing political candidates’ ideologies; and at the Digital Humanities conference earlier this year, David presented his joint work with the Assyriologist Adam Anderson on analyzing social networks induced from Old Assyrian cuneiform texts. (And David’s co-teaching a cool digital humanities seminar with Christopher Warren in the English department this semester — I’m sure there will be great cross-fertilization of ideas coming out of there!)

For example, we’ve had useful feedback here already: besides comments from the computational linguistics community through the ACL paper, just in the discussion on LL there have been many interesting theories and references presented. We’ve also been in conversation with other humanists (as we stated in our acknowledgments, noted by one commenter), though apparently not the same humanists that Alpert-Abrams and Garrette would rather we had talked to. This is why it’s better to publish early and participate in the scholarly conversation.

For what it’s worth, some of these high-level debates on whether it’s appropriate to focus on progress in quantitative methods, versus directly on substantive findings, have been playing out for decades in the social sciences. (I’m thinking specifically about economics and political science, both of which are far more quantitative today than they were just 50 years ago.) And as several commenters have noted, and as we tried to in our references, there’s certainly been plenty of computational work in literary/cultural analysis before. But I do think the quantitative approach still tends to be seen as novel in the humanities, and as the original response notes, there have been some problematic proclamations in this area recently. I just hope there’s room to try to advance things without being everyone’s punching bag for whether or not they liked the latest Steven Pinker essay.

Probabilistic interpretation of the B3 coreference resolution metric

brendano — Sat, 31 Aug 2013 01:14:38 +0000

Here is an intuitive justification for the B3 evaluation metric often used in coreference resolution, based on whether mention pairs are coreferent. If a mention from the document is chosen at random,

B3-Recall is the (expected) proportion of its actual coreferents that the system thinks are coreferent with it.
B3-Precision is the (expected) proportion of its system-hypothesized coreferents that are actually coreferent with it.

Does this look correct to people? Details below:

In B3’s basic form, it’s a clustering evaluation metric, to evaluate a gold-standard clustering of mentions against a system-produced clustering of mentions.

Let \(G\) mean a gold-standard entity and \(S\) mean a system-predicted entity, where an entity is a set of mentions. \(i\) refers to a mention; there are \(n\) mentions in the document. \(G_i\) means the gold entity that contains mention \(i\); and \(S_i\) means the system entity that has \(i\).

The B3 precision and recall for a document are usually defined as (though it doesn’t seem to be exactly this in the original paper…):
\begin{align}
B3Prec &= \frac{1}{n} \sum_i \frac{|G_i \cap S_i|}{|S_i|} \\
B3Rec &= \frac{1}{n} \sum_i \frac{|G_i \cap S_i|}{|G_i|}
\end{align}
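As a sanity check on the notation, here is a minimal Python sketch of these two formulas. It assumes the gold and system clusterings partition exactly the same set of mentions; the function name and the toy clusters are made up for illustration.

```python
def b3_scores(gold_clusters, system_clusters):
    """Return (B3Prec, B3Rec) for two clusterings of the same mentions."""
    # G_i and S_i: the gold / system entity containing mention i.
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    sys_of = {m: frozenset(c) for c in system_clusters for m in c}
    if gold_of.keys() != sys_of.keys():
        raise ValueError("clusterings must cover the same mentions")
    n = len(gold_of)
    prec = sum(len(gold_of[i] & sys_of[i]) / len(sys_of[i]) for i in gold_of) / n
    rec = sum(len(gold_of[i] & sys_of[i]) / len(gold_of[i]) for i in gold_of) / n
    return prec, rec


# Toy document with five mentions; the system over-splits the first entity.
gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c"}, {"d", "e"}]
precision, recall = b3_scores(gold, system)
print(precision, recall)
```

In this toy case every system entity is pure, so precision is 1, while recall is 11/15 because mentions a, b and c each miss some of their true coreferents.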

Consider B3Prec. Think about it like,

\begin{align}
B3Prec
&= E_{ment}\left[ \frac{ |G_i \cap S_i| }{ |S_i| } \right] \\
&= E_{ment}\left[ P(G_j = G_i \mid j \in S_i) \right]
\end{align}

The first step is the expectation under the distribution of “pick a mention \(i\) at random from the document”. The second step is from restating \(|G_i \cap S_i|\) as: out of the system-hypothesized coreferents of \(i\), how many are in the same gold cluster as \(i\)? Thus \(|G_i \cap S_i|/|S_i|\) is: if you choose a mention \(j\) randomly out of \(S_i\), how often does it have the same gold cluster as \(i\)? (I think that last line might be collapsible via the law of total expectation, but tracking those two random variables nested in there makes me confused.)

Similarly, \(B3Rec = E_{ment}[ P(S_j = S_i \mid j \in G_i) ]\).
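The two-stage reading (pick \(i\) uniformly, then pick \(j\) uniformly from \(S_i\)) can be written out directly. The sketch below enumerates both choices exactly rather than sampling, and assumes both clusterings cover the same mentions. Note that \(j\) is allowed to equal \(i\), which is what the set-intersection formula implies. The function name and toy data are mine.

```python
def expected_same_cluster(condition_clusters, target_clusters):
    """E_i[ P(target cluster of j == target cluster of i | j in C_i) ].

    i is uniform over all mentions; j is uniform over C_i, the cluster of i
    in `condition_clusters` (j may equal i).
    """
    target_id = {m: k for k, c in enumerate(target_clusters) for m in c}
    total = 0.0
    n = 0
    for cluster in condition_clusters:
        for i in cluster:
            same = sum(1 for j in cluster if target_id[j] == target_id[i])
            total += same / len(cluster)
            n += 1
    return total / n


gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]

# Precision conditions on the system cluster and asks about gold agreement;
# recall conditions on the gold cluster and asks about system agreement.
b3_prec = expected_same_cluster(system, gold)
b3_rec = expected_same_cluster(gold, system)
print(b3_prec, b3_rec)
```

For this toy example both come out to 11/15, matching what the set-based formulas give.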

Does the above look correct? I hadn’t seen this intuitive justification given anywhere before and that’s at least how I’m used to thinking about B3 so I was curious what other people think. This is why I like B3: I can explain it in terms of mention pairs.

I think this also gives an additional justification to Cai and Strube (2010)’s proposal to handle divergent gold versus system mentions. So say the system produces a spurious mention \(i\) that isn’t part of the gold standard’s mentions (a “twinless” mention). If you assume that mentions not in the gold standard should be considered to have no coreferents, then all of \(i\)’s system-hypothesized coreferents are false positives. Therefore, to think about precision under this assumption, the system’s non-gold-mentions should be added to the gold as singleton entities, before computing precision. And analogously for recall (add gold-only mentions as system-side singletons: the system has failed to find any coreference links to them). I think this is what they call \(B^3_{sys}\) (section 2.2.2).
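Here is a rough sketch of the adjustment as described in this paragraph only; it is not checked against Cai and Strube's exact procedure. Two further assumptions of mine: precision is averaged over the system's mentions and recall over the gold mentions, and mentions missing from the side being averaged are simply ignored. All names are hypothetical.

```python
def mean_overlap(clusters, reference):
    """Average over mentions i in `clusters` of |C_i & R_i| / |C_i|."""
    ref_of = {m: frozenset(c) for c in reference for m in c}
    ratios = [len(set(c) & ref_of[i]) / len(c) for c in clusters for i in c]
    return sum(ratios) / len(ratios)


def b3_with_twinless(gold_clusters, system_clusters):
    """B3 precision/recall when gold and system mention sets differ."""
    gold_mentions = set().union(*gold_clusters)
    sys_mentions = set().union(*system_clusters)
    # Precision: system-only mentions become gold singletons (no coreferents).
    gold_for_prec = list(gold_clusters) + [{m} for m in sys_mentions - gold_mentions]
    # Recall: gold-only mentions become system singletons (no links found).
    sys_for_rec = list(system_clusters) + [{m} for m in gold_mentions - sys_mentions]
    prec = mean_overlap(system_clusters, gold_for_prec)
    rec = mean_overlap(gold_clusters, sys_for_rec)
    return prec, rec


# "x" is a spurious system mention attached to a real entity.
gold = [{"a", "b"}, {"c"}]
system = [{"a", "b", "x"}, {"c"}]
prec, rec = b3_with_twinless(gold, system)
print(prec, rec)
```

In the toy case the spurious mention drags precision down to 2/3 (it pollutes the entity for a and b, and is itself mostly linked to things it should not be), while recall stays at 1.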

I also like the pairwise linking metric since it’s defined only in terms of mentions; to be analogous to the presentation of B3 here,

Pairwise-Prec: choose a pair of mentions the system thinks are coreferent. How often are they actually coreferent?
Pairwise-Rec: choose a pair of coreferent mentions. How often does the system think they’re coreferent?

Or algorithmically: take all entities to be fully connected mention graphs and compute link recovery precision/recall. But pairwise Prec/Rec/F1 doesn’t seem as popular as B3 Prec/Rec/F1; in particular, it’s not part of the CoNLL-2011 scoring script everyone seems to use now (or I guess its bugfixed version I kept hearing mutterings about at conferences this summer, though I don’t see any information about this online — I was told a series of bugs were discovered in it recently…). (Unlike pairwise, B3 does not have quadratic scaling effects in cluster and document size, though I never understood why that’s an important consideration a priori?) It is apparent, though, that the Cai and Strube method can be adapted to pairwise metrics, maybe including BLANC, under the same justification given here for why it should apply to B3.
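The "fully connected mention graph" view of the pairwise metric can be sketched like this, assuming clusters are sets of sortable mention ids. The function names and toy data are illustrative only; note that singleton clusters contribute no links at all, and the number of links grows quadratically with cluster size.

```python
from itertools import combinations


def coreferent_pairs(clusters):
    """All unordered mention pairs sharing a cluster (edges of the complete graphs)."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_prf(gold_clusters, sys_clusters):
    """Link-recovery precision, recall, and F1 (0.0 where undefined)."""
    gold_links = coreferent_pairs(gold_clusters)
    sys_links = coreferent_pairs(sys_clusters)
    correct = len(gold_links & sys_links)
    prec = correct / len(sys_links) if sys_links else 0.0
    rec = correct / len(gold_links) if gold_links else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


# Toy example: the system moves "c" into the wrong cluster.
pw_prec, pw_rec, pw_f1 = pairwise_prf(
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d", "e"}],
)
```

In the toy case there are four gold links and four system links, two of which agree (a-b and d-e), so precision, recall, and F1 are all 0.5.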

(As far as I know B3 hasn’t been proposed before as a pure clustering metric … you could actually think of it in comparison to Rand index, VI, etc. I think it has some sort of relationship to VI if you think about Renyi entropies and precision/recall averaging — VI is a kind of precision/recall average, except log-loss variants of precision/recall, i.e. conditional Shannon entropies… that’s another long story though.)

Some analysis of tweet shares and “predicting” election outcomes

brendano — Tue, 20 Aug 2013 03:19:38 +0000

Everyone recently seems to be talking about this newish paper by DiGrazia, McKelvey, Bollen, and Rojas (pdf here) that examines the correlation of Congressional candidate name mentions on Twitter against whether the candidate won the race. One of the coauthors also wrote a Washington Post Op-Ed about it. I read the paper and I think it’s reasonable, but their op-ed overstates their results. It claims:

“In the 2010 data, our Twitter data predicted the winner in 404 out of 435 competitive races”

But this analysis is nowhere in their paper. Fabio Rojas has now posted errata/rebuttals about the op-ed and described this analysis they did here. There are several major issues off the bat:

1. They didn’t ever predict 404/435 races; they only analyzed 406 races they call “competitive,” getting 92.5% (in-sample) accuracy, then extrapolated to all races to get the 435 number.
2. They’re reporting about in-sample predictions, which is really misleading to a non-scientific audience; more notes on this further below.
3. These aren’t predictions from just Twitter data, but a linear model that includes incumbency status and a bunch of other variables. (Noted by Jonathan Nagler, who guessed this even before Rojas posted the errata/rebuttal.)

Given that the op-ed uses their results to proclaim that social media “will undermine the polling industry,” this sort of scrutiny is entirely fair. Let’s take #3. If you look at their Figure 1, as Nagler reproduces, it’s obvious that tweet share alone gives much less than that much accuracy. I’ve reproduced it again and added a few annotations:

Their original figure is nice and clear. “Tweet share” is: out of the name mentions of the two candidates in the race, the percentage that are of the Republican candidate. “Vote margin” is: how many more votes the Republican candidate got. One dot per race. Thus, if you say “predict the winner to be whoever got more tweet mentions,” then the number of correct predictions would be the number of dots in the shaded yellow areas, and the accuracy rate is that number divided by the total number of dots. This looks like much less than 93% accuracy. [1]
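Counting the dots in the shaded areas amounts to the following, sketched with made-up numbers (these are not the paper's data). The function name and the tie-breaking rule (a share of exactly 50% or a margin of exactly 0 counts as the Democrat) are arbitrary assumptions.

```python
def tweet_share_accuracy(races):
    """races: list of (republican_tweet_share_pct, republican_vote_margin)."""
    correct = 0
    for share, margin in races:
        predicted_rep_win = share > 50.0   # more tweet mentions than the opponent
        actual_rep_win = margin > 0        # more votes than the opponent
        if predicted_rep_win == actual_rep_win:
            correct += 1                   # this dot falls in a shaded area
    return correct / len(races)


# Hypothetical races, for illustration only.
toy_races = [
    (80.0, 12000),    # tweets favor R, R wins: correct
    (65.0, -3000),    # tweets favor R, D wins: wrong
    (30.0, -20000),   # tweets favor D, D wins: correct
    (45.0, 5000),     # tweets favor D, R wins: wrong
    (100.0, 40000),   # all tweets for R, R wins: correct
]
toy_accuracy = tweet_share_accuracy(toy_races)
```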

It’s also been pointed out that incumbency alone predicts most House races; are tweets really adding anything here? The main contribution of the paper is to test tweets alongside many controlling variables, including incumbency status. The most convincing analysis the authors could have done would be to add an ablation test: use the model with the tweet share variable, and a model without it, and see how different the accuracies are. This isn’t in the paper. However, we can look at the regression coefficients to get an idea of relative variable importance, and the authors do a nice job reporting this. I took their coefficient numbers from their “Table 1” in the paper, and plotted them, below:

The effect sizes and their standard errors are on the right. Being the incumbent is worth, on average, 49,000 votes, and it is much more important than all the other variables. One additional percentage point of tweet share is worth 155 votes. [2] The predictive effect of tweet share is significant, but small. In the paper they point out that a standard deviation’s worth of tweet share margin comes out to around 5000 votes — so roughly speaking, tweet shares are 10% as important as incumbency? [3] In the op-ed Rojas calls this a “strong correlation”; another co-author, Johan Bollen, called it a “strong relation.” I guess it’s a matter of opinion whether you call Figure 1 a “strong” correlation.
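The rough 10% comparison, and the standardized version discussed in footnote [3], can be checked with a few lines of arithmetic. This is a sketch that assumes incumbency is coded as a 0-1 indicator for the 165 Republican incumbents out of 406 races, so its standard deviation is that of a Bernoulli variable; the variable names are made up here.

```python
import math

incumbency_votes = 49000   # coefficient on incumbency, in (normalized) votes
tweet_sd_votes = 5000      # votes per one standard deviation of tweet share
n_races = 406
n_rep_incumbents = 165

# Raw comparison: one sd of tweet share versus the full incumbency effect.
raw_ratio = tweet_sd_votes / incumbency_votes            # about 0.10

# Standardized comparison: one sd of tweet share versus one sd of incumbency.
p = n_rep_incumbents / n_races
incumbency_sd = math.sqrt(p * (1 - p))                   # about 0.49
standardized_incumbency = incumbency_votes * incumbency_sd
standardized_ratio = tweet_sd_votes / standardized_incumbency   # roughly 0.2
```

Either way, tweet share comes out as a fraction of the incumbency effect, somewhere between a tenth and a fifth depending on which scaling you prefer.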

On the other hand, tweet share is telling something that those greyed-out, non-significant demographic variables aren’t, so something interesting might be happening. The paper also has some analysis of the outliers where the model fails. Despite being clearly oversold, this is hardly the worst study of Twitter and elections; I learned something from reading it.

As always, I recommend Daniel Gayo-Avello’s 2012 review of papers on Twitter and election prediction:

“I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper” — A Balanced Survey on Election Prediction using Twitter Data

… (Update: there’s also a newer review) and also see Metaxas and Mustafaraj (2012) for a broader and higher level overview of social media and elections. (Gayo-Avello also wrote a lengthy but sensible post on this paper.)

Next, point #2 — this “prediction” analysis shares a sadly often-repeated flaw, that the so-called “predictions” are evaluated on the training data (in ML-speak), i.e. they’re in-sample predictions (in socialscience-speak). This is cheating: it’s awfully easy to predict what you’ve already seen! XKCD has a great explanation of election model overfitting. As we should know by now, the right thing to do is report accuracy on an out-of-sample, held-out test set; and the best test is to make forecasts about the future and wait to see if they turn out true.
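A deliberately extreme toy illustration of why in-sample accuracy is misleading: a "model" that just memorizes the outcome of every race it has seen, and falls back to the most common training outcome otherwise. Everything here (district ids, labels, function names) is hypothetical and has nothing to do with the paper's actual model.

```python
from collections import Counter


def fit_memorizer(train):
    """train: list of (features, label). Returns a predict function."""
    table = {features: label for features, label in train}
    # For anything unseen, fall back to the most common training label.
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: table.get(features, fallback)


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


# Features are just district ids, so nothing generalizes to new districts.
train = [(("d1",), "R"), (("d2",), "D"), (("d3",), "R"), (("d4",), "R")]
held_out = [(("d5",), "D"), (("d6",), "R"), (("d7",), "D"), (("d8",), "D")]

predict = fit_memorizer(train)
in_sample_acc = accuracy(predict, train)      # perfect: it has seen every answer
held_out_acc = accuracy(predict, held_out)    # only as good as the fallback guess
```

The memorizer gets every training race right and only one of the four held-out races, which is the gap that reporting in-sample "predictions" hides.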

It’s scientifically irresponsible to take the in-sample predictions and say “we predicted N number of races correctly” in the popular press. It sounds like you mean predicting on new data. Subsequent press articles that Rojas links to use verbs like “foretell” and “predict elections” — it’s pretty clear what people actually care about, and how they’re going to interpret a researcher using the term “prediction.” In-sample predictions are a pretty technical concept and I think it’s misleading to call them “predictions.” [4]

Finally, somewhere in this whole kerfuffle hopefully there’s a lesson about cool social science and press coverage. I feel a little bad for the coauthors given how many hostile messages I’ve seen about their paper on Twitter and various blogs; presumably this motivates what Rojas says at the end of their errata/rebuttal:

The original paper is a non-peer reviewed draft. It is in the process of being corrected, updated, and revised for publication. Many of these criticisms have already been incorporated into the current draft of the paper, which will be published within the next few months.

That sounds great and I look forward to seeing the final and improved version of the paper. But, I feel like in the area of Twitter research, you have to be really cautious about your claims; they will get overblown by readers and the media otherwise. Here, the actual paper is reasonable if limited; the problem is they wrote an op-ed in a major newspaper with incredibly expansive and misleading claims about this preliminary research! This is going to bring out some justifiable criticism from the scientific community, I’m afraid.

[1] Also weird: many of the races have a 100% tweet share to one candidate. Are the counts super low, like 3-vs-0? Does it need smoothing or priors? Are these from astroturfing or spamming efforts? Do they create burstiness/overdispersion? Name mention frequency is an interesting but quite odd sort of variable that needs more analysis in the future.

[2] These aren’t literal vote counts, but number of votes normalized by district size; I think it might be interpretable as, expected number of votes in an average-sized city. Some blog posts have complained they don’t model vote share as a percentage, but I think their normalization preprocessing actually kind of handles that, albeit in a confusing/non-transparent way.

[3] I guess we could compare the variables’ standardized coefficients. Incumbency as a 0-1 indicator, for 165 Republican incumbents out of 406 total in their dataset, is stdev ~ 0.5; so I guess that’s more like, a standardized unit of tweet share is worth 20% of standardized impact of incumbency? I’m not really sure what’s the right way to compare here… I still think difference in held-out accuracy on an ablation test is the best way to tell what’s going on with one variable, if you really care about it (which is the case here).

[4] I wish we had a different word for “in-sample predictions,” so we can stop calling them “predictions” to make everything clearer. They’re still an important technical concept since they’re very important to the math and intuitions of how these models are defined. I guess you could say “yhat” or “in-sample response variable point estimate”? Um, need something better… Update: Duh, how about “fitted value” or “in-sample fits” or “model matched the outcome P% of the time”… (h/t Cosma)

[5] Numbers and graphics stuff I did are here.

Confusion matrix diagrams

brendano — Mon, 17 Jun 2013 21:37:19 +0000

I wrote a little note and diagrams on confusion matrix metrics: Precision, Recall, F, Sensitivity, Specificity, ROC, AUC, PR Curves, etc.

brenocon.com/confusion_matrix_diagrams.pdf

also, graffle source.
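For quick reference, the basic count-based metrics can be sketched in code using their standard textbook definitions (this is not taken from the PDF itself). The function name and the example counts are made up, and the sketch assumes none of the denominators is zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    precision = tp / (tp + fp)       # of predicted positives, how many are right
    recall = tp / (tp + fn)          # a.k.a. sensitivity, true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 8 true positives, 2 false positives,
# 4 false negatives, 86 true negatives.
metrics = confusion_metrics(tp=8, fp=2, fn=4, tn=86)
```

ROC and PR curves come from recomputing these quantities while sweeping the classifier's decision threshold.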

Movie summary corpus and learning character personas

brendano — Wed, 08 May 2013 04:01:38 +0000

Here is one of our exciting just-finished ACL papers. David and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc — that are used in movies.

To do this we collected a brand new dataset: 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre. We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them. Did you see that NYT article on quantitative analysis of film scripts? This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money? We have released the data here.

Our focus, though, is on narrative analysis. We investigate character personas: familiar character types that are repeated over and over in stories, like “Hero” or “Villain”; maybe grand mythical archetypes like “Trickster” or “Wise Old Man”; or much more specific ones, like “Sassy Best Friend” or “Obstructionist Bureaucrat” or “Guy in Red Shirt Who Always Gets Killed”. They are defined in part by what they do and who they are — which we can glean from their actions and descriptions in plot summaries.

Our model clusters movie characters, learning posteriors like this:

Each box is one automatically learned persona cluster, along with actions and attribute words that pertain to it. For example, characters like Dracula and The Joker are always “hatching” things (hatching plans, presumably).

One of our models takes the metadata features, like movie genre and gender and age of an actor, and associates them with different personas. For example, we learn the types of characters in romantic comedies versus action movies. Here are a few examples of my favorite learned personas:

One of the best things I learned about during this project was the website TVTropes (which we use to compare our model against).

We’ll be at ACL this summer to present the paper. We’ve posted it online too:

Learning Latent Personae of Film Characters.
David Bamman, Brendan O’Connor, and Noah A. Smith.
ACL 2013, Sofia, Bulgaria, August 2013
Dataset