What freely available end-to-end natural language processing (NLP) systems are out there that start with raw text and output parses and semantic structures? Lots of NLP research focuses on one task at a time, and thus produces software that performs only that one task. But for many applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it.
If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren’t a ton of such end-to-end, multilevel systems. Here are ones I can think of. Corrections and clarifications welcome.
- Stanford CoreNLP. Raw text to rich syntactic dependencies (LFG-inspired). Also POS, NER, coreference.
- C&C tools. From (sentence-segmented, tokenized?) text to rich syntactic dependencies (CCG-based) and also a semantic representation. POS tags and chunks along the way. Does anyone use this much? It seems underappreciated relative to its richness.
- Senna. Sentence-segmented text -> parse trees, plus POS, NER, chunks, and semantic role labeling. This one is quite new; is it as good as the others? It doesn’t give syntactic dependencies, though for some applications semantic role labeling is similar or better (or worse?). I’m a little concerned that its documentation seems overly focused on competing on evaluation datasets rather than on making sure they’ve built something more broadly useful. (To be fair, they’re focused on developing algorithms that could be broadly applicable across NLP tasks; that’s a whole other discussion.)
If you want to quickly get some sort of shallow semantic relations, a.k.a. high-level syntactic relations, one of the above packages might be your best bet. Are there others out there?
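To make “end-to-end” concrete, here is a minimal sketch of driving Stanford CoreNLP over a raw text file from Python. The install path, memory setting, and output directory are assumptions to adjust for your own download; the annotator names are the standard ones from the CoreNLP documentation.

```python
# Minimal sketch: run Stanford CoreNLP end-to-end on a raw text file.
# The jar location and memory flag are assumptions for your install.
import subprocess

corenlp_dir = "/path/to/stanford-corenlp"  # assumed unzipped distribution

subprocess.check_call([
    "java", "-Xmx3g",
    "-cp", corenlp_dir + "/*",  # Java expands the classpath wildcard itself
    "edu.stanford.nlp.pipeline.StanfordCoreNLP",
    "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref",
    "-file", "input.txt",           # raw text in
    "-outputDirectory", ".",        # analysis (input.txt.xml) written here
])
```

The result should be one XML file per input document containing sentences, tokens, POS tags, NER, parse trees, dependencies, and coreference chains.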
Restricting oneself to these full end-to-end systems is also funny, since you can mix and match components to get better results for what you want. One example: if you have constituent parse trees and want dependencies, you could swap in the Stanford Dependency extractor (or another one like pennconverter?) to post-process the parses. Or you could swap the Charniak-Johnson or Berkeley parser into the middle of the Stanford CoreNLP stack. Or you could use a direct dependency parser (I think Malt is the most popular?) and skip the phrase structure step. Etc.
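As a concrete (and hedged) illustration of that post-processing step, here is a sketch that converts Penn Treebank-style constituency trees into Stanford dependencies using the converter class bundled with the Stanford parser; the jar path and input file name are assumptions.

```python
# Sketch: post-process constituency parses (one bracketed tree per line)
# into Stanford dependencies with the converter shipped in the Stanford
# parser jar. Paths are placeholders for your installation.
import subprocess

subprocess.check_call([
    "java", "-cp", "/path/to/stanford-parser.jar",
    "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "-treeFile", "trees.mrg",  # e.g. output of the Charniak-Johnson or Berkeley parser
    "-basic",                  # basic (tree-shaped) dependencies
    "-conllx",                 # CoNLL-X output, convenient for downstream tools
])
```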
It’s worth noting several other NLP libraries that I see used a lot. I believe that, unlike the above, they don’t focus on out-of-the-box end-to-end NLP analysis (though you can certainly use them to perform various parts of an NLP pipeline).
- OpenNLP — I’ve never used it but lots of people like it. Seems well-maintained now? Does chunking, tagging, even coreference.
- LingPipe — has lots of individual algorithms and high-quality implementations. Only chunking and tagging (I think). It’s only quasi-free.
- Mallet — focuses on information extraction and topic modeling, so it’s slightly different from the other packages listed here.
- NLTK — I always have a hard time telling what this actually does, compared to what it aims to teach you to do. It seems to do various tagging and chunking tasks. I use the nltk_data.zip archive all the time though (I can’t find a direct download link, unfortunately) for its stopword lists and small toy corpora; a quick usage sketch appears below. (Including the Brown Corpus! I guess it now counts as a toy corpus since you can grep it in less than a second.)
These packages are nice in terms of documentation and software engineering, but they don’t do any syntactic parsing or other shallow relational extraction. (NLTK has some libraries that appear to do parsing and semantics, but it’s hard to tell how serious they are.)
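Here is the quick NLTK sketch promised above: grabbing the English stopword list and scanning the bundled Brown Corpus. It assumes the relevant corpora have already been installed into nltk_data.

```python
# Quick sketch: stopword lists and the Brown Corpus from nltk_data.
# Assumes the 'stopwords' and 'brown' corpora are already downloaded.
from nltk.corpus import stopwords, brown

stops = set(stopwords.words("english"))     # common English function words
print(len(stops))

# Brown is small enough to scan in well under a second on modern hardware.
content_words = [w.lower() for w in brown.words() if w.lower() not in stops]
print(len(content_words), content_words[:10])
```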
Oh, finally, there’s also UIMA, which isn’t really a tool but rather a high-level API for integrating your tools. GATE also heavily emphasizes the framework aspect, but it does come with some tools of its own.
Check out our open information extraction package called reverb. It takes raw text as input and outputs binary relationships like (citrus fruit, excellent source of, vitamin c). It was designed to scale to massive corpora (billions of documents).
http://reverb.cs.washington.edu
Thanks for the link! I hadn’t heard of this yet. Looks neat.
I sometimes call SenseClusters and WordNet::SenseRelate::AllWords end-to-end systems, as they take raw text as input and output clusters and sense tagged text…
http://senseclusters.sourceforge.net
http://senserelate.sourceforge.net
Both have web interfaces available so you can run them without needing to install (but you can also download and install them too).
Anyway, nice topic for a post – end to end systems are indeed relatively rare, and it’s nice to draw attention to what is out there…
Enjoy,
Ted
You should have a look at gramlab (www.gramlab.org). This project aims to develop open-source NLP components and a platform targeted at non-specialists.
+ Apache Stanbol, “an open source modular software stack and reusable set of components for semantic content management.
Apache Stanbol components are meant to be accessed over RESTful interfaces to provide semantic services for content management. The current code is written in Java and based on the OSGi modularization framework.”
http://incubator.apache.org/stanbol/
http://stanbol.demo.nuxeo.com/
While I find it a bit weird to describe a parser as an “end-to-end” NLP system, I do find it a bit annoying that most tools assume that things such as sentence splitting and tokenization are taken care of elsewhere.
So, I take this as an opportunity to advertise my dependency parser, which produces Stanford dependencies (with some of the non-tree relations) and is bundled together with a sentence splitter (Dan Gillick’s splitta), the Stanford tagger, and a bash script to run the complete pipeline from raw text to parse trees (it is much faster than the Stanford parser, and produces parses of similar accuracy).
Ted, thanks for the links.
Hugues and Stefane — the websites for these projects don’t say what they actually do, in terms of what types of semantic analysis are already supported.
Yoav, looks handy!
Next question ..
What “end-to-end” NLP packages are available in the “cloud” ?
i.e. .. corpus (ePub) | {cloud} | natural language i/o
I’m dreaming of a (open) cloud-based framework for a plethora of pluggable APIs ..
;^)
Have you tried http://GateCloud.net? You can run pre-packaged NLP services based on plugins shipped with GATE, or bring your own GATE-compatible application to run on the cloud. One uses GATE Developer to create, test, and export the application, so there’s typically zero adaptation cost. All open APIs, of course.
Although NLTK was indeed developed as a toolkit for teaching NLP, we have tried to take it far beyond that and to incorporate more state-of-the-art tools, either as interfaces or directly. There are also more than toy corpora. For example, there is now an interface to read and process the Europarl corpus, and a way to use the Stanford tagger from inside your Python code. As for installing the corpora, you can just run ‘import nltk; nltk.download()’ and an interface should pop up allowing you to install some or all of the corpora that are bundled with NLTK. Also, there’s an NLTK O’Reilly book that makes things much clearer as to how NLTK can be used to do many different things. Finally, look at Jacob Perkins’ nltk-trainer project, which makes it much easier to train different types of classifiers and taggers on the various corpora.
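For reference, a sketch of the corpus-download step and the Stanford tagger interface mentioned above. Class names and file paths vary across NLTK versions and installs, so treat them as assumptions to check against your release.

```python
# Sketch: install a corpus non-interactively, then tag with the Stanford
# tagger through NLTK. Model and jar paths are placeholders; the class
# name differs in older NLTK releases.
import nltk

nltk.download("brown")  # or plain nltk.download() for the interactive picker

from nltk.tag.stanford import StanfordPOSTagger

tagger = StanfordPOSTagger(
    "/path/to/english-bidirectional-distsim.tagger",  # assumed model path
    "/path/to/stanford-postagger.jar",                # assumed jar path
)
print(tagger.tag("This is an end-to-end example .".split()))
```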
GATE does have the components of an end-to-end NLP system built in, unlike UIMA. So yes, it is an API as well, but we also bundle in lots of ready-made NLP tools. And not just for English either.
Kalina, thanks for the information. Sorry if I misrepresented it. The last time I spent a while trying to read through the GATE documentation, I wasn’t able to find them… I guess that was several years ago, though.
I’ve been happy using gensim.
http://radimrehurek.com/gensim/
Some of those other packages, like C&C Tools and Senna, are in the same “quasi-free” category as LingPipe, in the sense that they’re released under what their authors call “non-commercial” licenses. The intent of the LingPipe license was a little different in that we didn’t single out academia as a special class of users. We do allow free use for research purposes for industrialists and academics alike. We also provide a “developers” license that explicitly gives you this right, which makes some users (and their organizations) feel better.
For instance, none of the Senna, C&C, or LingPipe licenses are compatible with GPL-ed code. Senna goes so far as to prohibit derived works altogether.
Stanford NLP’s license sounds like it was written by someone who didn’t quite understand the GPL. Their page says “The Stanford CoreNLP code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software.” Their wording here makes it seem like “free” or “research” have some special status under the GPL, which they don’t. The terms “research” and “academia” don’t even show up in the license, and although “free” does, it’s meant as “free as in free speech”, not “free as in free beer”:
http://www.gnu.org/licenses/gpl.txt
The basis of the GPL is simple: if you redistribute code based on GPL-ed code, you have to release the redistributed code under the GPL (in some cases, you can get away with using a less restrictive license like LGPL or BSD for your mods or interacting libraries, though you can’t change the underlying GPL-ed source’s license). Stanford’s right in saying you can’t distribute proprietary (by which I assume they mean closed-source) code that uses GPL libraries (the meaning of “uses” here is a bit complex, but think of linking in C/C++ terms). You can charge for GPL-ed code if you can find someone to pay you. That’s what RedHat’s doing with Linux, what Revolution R’s doing with R, and what Enthought’s doing with Python. It’s not what MySQL AB did with MySQL or what we do with LingPipe; in both those cases, the company owned all the IP and thus could negotiate for any kind of license it wanted, as well as distributing open source.
You can also set up a software service, for example on Amazon’s Elastic Compute Cloud (EC2) or on your own servers, that’s entirely driven by GPL-ed software, like say Stanford NLP or Weka, and then charge users for accessing it. Because you’re not redistributing the software itself, you can modify it any way you like and write code around it without releasing your own software. GNU introduced the Affero GPL (AGPL), a license even more restrictive than the GPL that tries to close this server loophole for the basic GPL.
With GPL, you just have to abide by the terms of distributing your dependent code with a compatible license. There’s no free ride for academics here — you can’t take GPL-ed code, use it to build a research project for your thesis, then give an executable away for free without also distributing your code with a compatible license. And you can’t restrict the license to something research only. Similarly, you couldn’t roll a GPL-ed library into Senna or C&C or LingPipe and redistribute them under their own licenses. Academics are often violating these terms because they somehow think “research use only” is special.
Also, keep in mind that as an academic, your university (or lab) probably has a claim to your intellectual property developed using their resources. Here’s some advice from GNU on that front:
http://www.gnu.org/philosophy/university.html
Thanks Bob. FYI, by “quasi-free” I was referring to the pricing, less so the redistribution rights. I wrote more here: http://lingpipe-blog.com/2011/11/03/academic-licenses-gpl-and-free-software/#comment-16594
Hi Bob, I don’t think it was a matter of misunderstanding the GPL; it was just an attempt to say something positive about allowed uses before heading straight to the things you can’t do. Nevertheless, you’re the second person who hasn’t liked this wording, and I agree with your objection that it is actually quite possible to make “research” use of these tools in violation of the GPL. So, for my new year spring cleaning, I have revised the wording. It now just says “which allows many free uses”, which hopefully will be okay with everyone…
p.s. The usage of the word “proprietary” is straight from the FSF.
I wrote up an updated comparison of a number of semantic role labelers / shallow semantic parsers (they usually include end-to-end NLP pipelines), if anyone is interested.