What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures? Lots of NLP research focuses on single tasks at a time, and thus produces software that does a single task at a time. But for various applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it.
If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren’t a ton of such end-to-end, multilevel systems. Here are ones I can think of. Corrections and clarifications welcome.
- Stanford CoreNLP. Raw text to rich syntactic dependencies (LFG-inspired), plus POS tagging, NER, and coreference. (A minimal usage sketch follows this list.)
- C&C tools. From (sentence-segmented, tokenized?) text to rich syntactic dependencies (CCG-based) plus a semantic representation, with POS tags and chunks along the way. Does anyone use this much? It seems underappreciated relative to its richness.
- Senna. Sentence-segmented text -> parse trees, plus POS, NER, chunks, and semantic role labeling. This one is quite new; is it as good as the others? It doesn’t give syntactic dependencies, though for some applications semantic role labeling is similar or better (or worse?). I’m a little concerned that its documentation seems overly focused on competing on evaluation datasets, as opposed to ensuring they’ve made something more broadly useful. (To be fair, they’re focused on developing algorithms that could apply broadly across NLP tasks; that’s a whole other discussion.)
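To make the “end-to-end” point concrete, here’s a minimal sketch of driving Stanford CoreNLP from Java. The annotator names and classes follow its documentation; the input sentence is just a made-up example, and in practice you’d tune which annotators you enable.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class CoreNlpDemo {
    public static void main(String[] args) {
        // One pipeline, raw text in: tokenization, sentence splitting, tagging,
        // NER, parsing, and coreference all run off a single properties list.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Stanford University is in California. It is private.");
        pipeline.annotate(document);

        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            // The dependency layer is the part you'd hand to downstream
            // relation extraction or shallow-semantics code.
            SemanticGraph deps = sentence.get(
                SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(deps);
        }
    }
}
```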
If you want to quickly get some sort of shallow semantic relations, a.k.a. high-level syntactic relations, one of the above packages might be your best bet. Are there others out there?
Restricting oneself to these full end-to-end systems is also a little funny, since you can mix and match components to get better results for what you want. One example: if you have constituent parse trees and want dependencies, you could swap in the Stanford Dependencies extractor (or another converter, like pennconverter?) to post-process the parses; a sketch of this step follows below. Or you could swap the Charniak-Johnson or Berkeley parser into the middle of the Stanford CoreNLP stack. Or you could use a direct dependency parser (I think MaltParser is the most popular?) and skip the phrase-structure step entirely. Etc.
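For instance, here’s a minimal sketch of the constituency-to-dependency step, using the converter classes shipped with the Stanford parser. The bracketed tree is a toy stand-in for whatever your parser of choice emits.

```java
import java.util.Collection;

import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TypedDependency;

public class TreeToDeps {
    public static void main(String[] args) {
        // A toy phrase-structure tree, standing in for Charniak-Johnson or
        // Berkeley parser output in Penn Treebank bracketing.
        Tree tree = Tree.valueOf(
            "(ROOT (S (NP (DT The) (NN parser)) (VP (VBZ emits) (NP (NNS trees))) (. .)))");

        // Post-process the constituency tree into Stanford typed dependencies.
        GrammaticalStructureFactory gsf =
            new PennTreebankLanguagePack().grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
        Collection<TypedDependency> deps = gs.typedDependenciesCCprocessed();
        for (TypedDependency td : deps) {
            System.out.println(td);  // e.g. det(parser-2, The-1)
        }
    }
}
```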
It’s worth noting several other NLP libraries that I see used a lot. I believe that, unlike the above, they don’t focus on out-of-the-box end-to-end NLP analysis (though you can certainly use them to perform various parts of an NLP pipeline).
- OpenNLP — I’ve never used it, but lots of people like it. It seems well-maintained now? Does chunking, tagging, even coreference. (See the tagging sketch below.)
- LingPipe — has lots of individual algorithms and high-quality implementations. Only chunking and tagging (I think). It’s only quasi-free.
- Mallet — focuses on information extraction and topic modeling, so it’s slightly different from the other packages listed here.
- NLTK — I always have a hard time telling what this actually does, compared to what it aims to teach you to do. It seems to do various tagging and chunking tasks. I do use the nltk_data.zip archive all the time, though (I can’t find a direct download link, unfortunately), for its stopword lists and small toy corpora. (Including the Brown Corpus! I guess it now counts as a toy corpus, since you can grep it in less than a second.)
These packages are nice in terms of documentation and software engineering, but they don’t do any syntactic parsing or other shallow relational extraction. (NLTK has some libraries that appear to do parsing and semantics, but it’s hard to tell how serious they are.)
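For contrast with the pipeline systems above, here’s a minimal OpenNLP sketch: individual components you wire up yourself. It assumes you’ve separately downloaded a pretrained model file (named en-pos-maxent.bin here, following OpenNLP’s model downloads) into the working directory.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;

public class OpenNlpDemo {
    public static void main(String[] args) throws Exception {
        // Load a pretrained POS model, downloaded separately from the OpenNLP site.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(modelIn);
            POSTaggerME tagger = new POSTaggerME(model);

            // One component at a time, not a pipeline: you tokenize and tag explicitly.
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize("The quick brown fox jumps.");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}
```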
Oh, finally, there’s also UIMA, which isn’t really a tool but rather a high-level API for integrating your tools. GATE also heavily emphasizes the framework aspect, though it does come with some tools of its own.