Patches to Rainbow, the old text classifier that won’t go away

I’ve been reading several somewhat recent finance papers (Antweiler and Frank 2005, Das and Chen 2007) that use Rainbow, the text classification software originally written by Andrew McCallum back in 1996. The last version is from 2002 and the homepage announces he isn’t really supporting it any more.

However, as far as I can tell, it might still be the easiest-to-use text classifier package out there. You don’t have to program (you just invoke it with command-line arguments), and it can accommodate reasonably sized datasets, does tokenization, stopword filtering, etc. for you, and has some useful feature selection and other options. Based on my limited usage, it seems well-implemented. If anyone knows of a better one, I’d love to hear about it. I once looked at, among other things, GATE and UIMA, and they seemed too hard to use if you just wanted to download something that did simple text classification; or else, maybe they just didn’t have documentation on how to use them in that manner. Rainbow does. If I had to recommend a text classifier to a social scientist today, I might say they should use Rainbow.

(GATE and UIMA call themselves “architectures”. I usually don’t want an architecture, I want a program that does stuff. LingPipe was the only other system I found that had good web documentation saying how to use it to do text classification. It looks like a good option, if you’re willing to write some code. There are numerous academic efforts to make automated content analysis systems that at a high level sound like the right sort of thing, but nearly all of them have poor web docs, so it’s hard to tell whether they do what you want.)
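For a rough sense of what a package like Rainbow takes care of for you, here is a minimal Python sketch of the steps mentioned above: tokenization, stopword filtering, and classification. This is not Rainbow's code. The function names, the tiny stopword list, and the made-up buy/sell training snippets are all hypothetical, and the classifier is plain multinomial naive Bayes with add-one smoothing, chosen only because it is short.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; real packages ship far longer ones).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "for"}

def tokenize(text):
    """Lowercase, keep runs of letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def train(labeled_docs):
    """Collect per-class word counts from (label, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing so unseen class/word pairs are not zero.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training snippets, purely for illustration.
training = [
    ("buy", "strong earnings, the stock is going up"),
    ("buy", "great quarter and rising profits"),
    ("sell", "weak earnings, the stock is going down"),
    ("sell", "terrible quarter and falling profits"),
]
model = train(training)
print(classify(model, "profits are rising"))
```

Even a toy like this means making decisions about tokenization, smoothing, and unknown words, which is the kind of thing a ready-made command-line tool spares you from.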

In the meantime, the current Rainbow download has issues compiling on modern GCC and Mac OS X; some of the issues are documented here. I worked through them and put my patched version (only tested on GCC 4.0, OS X 10.5) up here:

5 Responses to Patches to Rainbow, the old text classifier that won’t go away

  1. jhofman says:

    nltk is also decent. easy to use, nice documentation.

  2. brendano says:

    Right. Somehow I often forget about NLTK. I think I tried to use it for something more complex once, and it didn’t work out, so it’s unfairly tarnished in my mind. It does have great tutorials, at least for someone learning the material.

  3. Jeff says:

    What about weka? It even has a GUI. LingPipe is a good choice too.

  4. DBACL is a pretty good command line text classifier.

  5. Pedro Lopes says:

    Great post, glad to check this out. I recently hacked the bow to OSX, but your port is much more complete. Thanks for sharing.