OK, everyone loves to run dictionary methods for sentiment and other text analysis — counting words from a predefined lexicon in a big corpus in order to explore or test hypotheses about that corpus. In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers’ intuitions), and then proclaim that the output gives the sentiment levels of the documents. More and more papers come out every day that do this. I’ve done this myself. It’s interesting and fun, but it’s easy to get a bunch of meaningless numbers if you don’t carefully validate what’s going on. There are certainly good studies in this area that do further validation and analysis, but it’s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning. This happens more than it ought to.
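To be concrete about what “counting words from a lexicon” means in practice, here is a minimal sketch. The word lists and the tokenizer are toy placeholders (not any published lexicon), but the core logic of these analyses rarely gets much more complicated than this:

```python
import re
from collections import Counter

# Toy placeholder lexicon: a real study would load a published polarity
# word list; these few entries are illustrative only.
POSITIVE = {"good", "gain", "improve", "success", "strong"}
NEGATIVE = {"bad", "loss", "decline", "failure", "weak"}

def tokenize(text):
    """Lowercase and keep runs of letters (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def dictionary_sentiment(text):
    """Return (positive count, negative count, net score per token) for one document."""
    counts = Counter(tokenize(text))
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    total = sum(counts.values()) or 1   # avoid division by zero on empty documents
    return pos, neg, (pos - neg) / total

print(dictionary_sentiment("Revenue was strong, but the loss from the crude oil business was bad."))
```

Note that the numbers this produces depend entirely on the lexicon, the tokenization, and the domain of the text; nothing in the procedure itself tells you whether the scores mean what you claim they mean.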
I was happy to see a similarly critical view in a nice working paper by Justin Grimmer and Brandon Stewart, Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.
Since I think these arguments need to be more widely known, here’s a long quote from Section 4.1 … see also the paper for more details (and lots of other interesting stuff). Emphases are mine.
For dictionary methods to work well, the scores attached to words must closely align with how the words are used in a particular context. If a dictionary is developed for a specific application, then this assumption should be easy to justify. But when dictionaries are created in one substantive area and then applied to other problems, serious errors can occur. Perhaps the clearest example of this is shown in Loughran and McDonald (2011). Loughran and McDonald (2011) critique the increasingly common use of off-the-shelf dictionaries to measure the tone of statutorily required corporate earnings reports in the accounting literature. They point out that many words that have a negative connotation in other contexts, like tax, cost, crude (oil) or cancer, may have a positive connotation in earnings reports. For example, a health care company may mention cancer often and oil companies are likely to discuss crude extensively. And words that are not identified as negative in off-the-shelf dictionaries may have quite negative connotations in earnings reports (unanticipated, for example).
Dictionaries, therefore, should be used with substantial caution. Scholars must either explicitly establish that word lists created in other contexts are applicable to a particular domain, or create a problem-specific dictionary. In either instance, scholars must validate their results. But measures from dictionaries are rarely validated. Rather, standard practice in using dictionaries is to assume the measures created from a dictionary are correct and then apply them to the problem. This is due, in part, to the exceptional difficulties in validating dictionaries. Dictionaries are commonly used to establish granular scales of a particular kind of sentiment, such as tone. While this is useful for applications, the granular measures ensure that it is essentially impossible to derive gold standard evaluations based on human coding of documents, because of the difficulty of establishing reliable granular scales from humans (Krosnick, 1999).
The consequence of domain specificity and lack of validation is that most analyses based on dictionaries are built on shaky foundations. Yes, dictionaries are able to produce measures that are claimed to be about tone or emotion, but the actual properties of these measures – and how they relate to the concepts they are attempting to measure – are essentially a mystery. Therefore, for scholars to effectively use dictionary methods in their future work, advances in the validation of dictionary methods must be made. We suggest two possible ways to improve validation of dictionary methods. First, the classification problem could be simplified. If scholars use dictionaries to code documents into binary categories (positive or negative tone, for example), then validation based on human gold standards and the methods we describe in Section 4.2.4 is straightforward. Second, scholars could treat measures from dictionaries similarly to how validations of unsupervised methods are conducted (see Section 5.5). This would force scholars to establish that their measures of underlying concepts have properties associated with long-standing expectations.
And after an example analysis,
… we reiterate our skepticism of dictionary-based measures. As is standard in the use of dictionary measures (for example, Young and Soroka (2011)), the measures are presented here without validation. This lack of validation is due in part to the fact that it is exceedingly difficult to demonstrate that our scale of sentiment precisely measures differences in sentiment expressed towards Russia. Perhaps this is because it is equally difficult to define what would constitute these differences in scale.
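For what it’s worth, the paper’s first suggestion (collapse the dictionary output to binary categories and check it against human coding) is straightforward to set up. Here is a rough sketch of that kind of validation; the labels below are made up for illustration, and a real check would need a properly sampled, independently hand-coded set of documents:

```python
# Hypothetical validation sketch: compare binary labels produced by a
# dictionary scorer against human gold-standard codes for the same documents.

def validate(predicted, gold):
    """Report accuracy plus per-class precision and recall against human coding."""
    assert len(predicted) == len(gold)
    accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    print(f"accuracy = {accuracy:.2f}")
    for cls in ("positive", "negative"):
        tp = sum(p == cls and g == cls for p, g in zip(predicted, gold))
        precision = tp / max(1, sum(p == cls for p in predicted))
        recall = tp / max(1, sum(g == cls for g in gold))
        print(f"{cls}: precision = {precision:.2f}, recall = {recall:.2f}")

# Made-up labels for illustration: 'predicted' would come from thresholding a
# dictionary score per document, 'gold' from trained human coders.
predicted = ["positive", "negative", "positive", "negative", "positive"]
gold      = ["positive", "negative", "negative", "negative", "positive"]
validate(predicted, gold)
```

This only works because the task has been simplified to a binary judgment that human coders can make reliably; it sidesteps, rather than solves, the problem of validating a fine-grained sentiment scale.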
I am more worried about young scholars using sentiment analysis as a “tool”, reporting sentiment scores, and basing their research on such so-called analysis. What’s more worrying is that they use such dictionary-based approaches on informal content like Twitter. One needs to be very careful before taking such steps.
Also, generating domain-specific lexicons is a well-known and established problem in opinion mining.
Good article. I am currently writing a thesis on lexicon-based sentiment analysis. To be honest, I was apprehensive when building the assessment dictionary. As a first step, I look for each word in a formal dictionary; if it is not there, I look it up in a dictionary of informal (slang) terms – there happens to be a website that collects slang words by language. Third, I look at how often, and how, the word is used in everyday conversation. Are these three steps enough to determine a word’s value, given that I am the only coder?
Sorry for my English, thank you.