The idea for a shared task on web parsing is really cool. But I don’t get this one:
Shared Task – SANCL 2012 (First Workshop on Syntactic Analysis of Non-Canonical Language)
They’re explicitly banning:
- Manually annotating in-domain (web) sentences
- Creating new word clusters, or any other resources, from as much text data as possible
… instead restricting participants to the data sets they release.
Isn’t a cycle of annotation, error analysis, and further annotation (a self-training + active-learning loop, with smarter decisions driven by the error analysis) the hands-down best way to build an NLP tool for a new domain? Are people scared of this reality? Am I off base?
I am, of course, just advocating for our Twitter POS tagger approach, where we annotated some data, made a supervised tagger, and iterated on features. The biggest weakness in that paper is that we didn’t do additional iterations of error analysis. Our lack of semi-supervised learning was not a weakness.
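To make the loop concrete, here is a rough sketch of what I have in mind. It is purely illustrative, not code from our paper: `train_tagger`, `annotate_by_hand`, and the `model.confidence()` call are placeholders for whatever supervised tagger and annotation workflow a team actually uses.

```python
# Illustrative sketch of an annotate / train / error-analyze loop (not code from
# the Twitter POS tagger paper). train_tagger and annotate_by_hand are hypothetical
# callables, and model.confidence(sentence) is an assumed per-sentence score.

def adaptation_loop(seed_data, unlabeled_pool, train_tagger, annotate_by_hand,
                    rounds=3, batch_size=200):
    labeled = list(seed_data)
    model = train_tagger(labeled)                      # supervised baseline on seed annotations
    for _ in range(rounds):
        # Active-learning step: rank unlabeled sentences by how unsure the model is.
        ranked = sorted(unlabeled_pool, key=lambda sent: model.confidence(sent))
        batch, unlabeled_pool = ranked[:batch_size], ranked[batch_size:]
        # Error analysis happens here in practice: inspect the model's mistakes on the
        # batch, add features or fix guidelines, then get human labels for the batch.
        labeled += annotate_by_hand(batch)
        model = train_tagger(labeled)                  # retrain on the enlarged data set
    return model
```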
The purpose of a shared task is to compare models and algorithms, not human annotation and error-analysis skill, which depends mainly on a team’s supply of human labor. As for “Our lack of semi-supervised learning was not a weakness”: how do you know? Have you shown that a better semi-supervised learning algorithm is impossible?
A couple of points. First, as the website points out:
“It is permissible to use previously constructed lexicons, word clusters or other resources provided that they are made available for other participants.”
So you can use clusters, but in the spirit of open competition, we ask that these resources be made available.
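For concreteness, “using word clusters” usually means loading a previously built cluster file, for example Brown cluster bit strings, and adding cluster-path prefixes as features in a tagger or parser. The sketch below is only an illustration under the assumption of a tab-separated “bitstring&lt;TAB&gt;word” file format; it is not official task tooling.

```python
# Illustrative only (not official SANCL tooling): load a pre-built word-cluster
# file, assumed to contain tab-separated lines of "cluster_bitstring<TAB>word",
# and turn cluster-path prefixes into features for a tagger or parser.

def load_clusters(path):
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                clusters[fields[1]] = fields[0]        # word -> cluster bit string
    return clusters

def cluster_features(word, clusters, prefix_lengths=(4, 6, 10)):
    """Cluster-prefix features for one token; out-of-vocabulary words get none."""
    bits = clusters.get(word.lower())
    if bits is None:
        return []
    return ["cluster%d=%s" % (n, bits[:n]) for n in prefix_lengths]
```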
Second, I agree that taking a domain, running system X on it, doing an error analysis, and then adding features, changing the model, or annotating more data is a very good way to adapt systems. I don’t think anyone is ‘scared’ of this approach. In fact, outside academia this is the standard way of doing business, not the exception. However, it is not as easy as it sounds. First, you need the resources (human resources, that is) to do this for every web domain, or any other domain, you might be interested in. Second, the annotations you wish to collect must be easy to create, either by you or via a system like Mechanical Turk. It is one thing to annotate short Twitter posts with 12-15 part-of-speech tags and a whole other thing to annotate consumer reviews with syntactic structure. I have tried both; they are not comparable. Even the former cannot be done reliably by Turkers, which means you will need grad students, staff research scientists, or costly third-party vendors every time you want to study a new domain.
So the spirit of the competition was to see, from a modeling/algorithm perspective, what the best methods are for training robust syntactic analyzers on the data currently available. By limiting the resources we are trying to make this as much of an apples-to-apples comparison as we can. Even this is impossible in the strict sense: some parsers require lexicons that might have been tuned for specific domains, etc.
Understanding this is still valuable for the “analyze, annotate and iterate” approach: don’t you want to start off with the best baseline to reduce the amount of human labor required?
re: semi-supervised learning for Twitter POS tagging. I don’t know the details, but there was one experiment that turned out negative. It is easy to believe more feature engineering could be a better use of researcher time, though it is hard to prove such a negative, of course.
re: new annotations being the standard way of doing business. Yes. I remember when Powerset finally hired an annotation expert; that was way more useful than quite a bit of the other work we were doing.
Thank you for the explanation of your goals; it is helpful.
Pingback: Graphs for SANCL-2012 web parsing results | AI and Social Science – Brendan O'Connor