Performance comparison: key/value stores for language model counts

I’m doing word and bigram counts on a corpus of tweets. I want to store and rapidly retrieve them later for language model purposes. So there’s a big table of counts that get incremented many times. The easiest way to get something running is to use an open-source key/value store; but which? There’s recently been some development in this area so I thought it would be good to revisit and evaluate some options.

Here are timings for a single counting process: iterate over 45,000 short text messages, tokenize them, then increment counters for their unigrams and bigrams. (The speed of the data store is only one component of overall performance.) There are about 17 increments per tweet, for roughly 400k unique terms and 750k total increments. This is substantially smaller than the data I ultimately need to handle, but it’s small enough to test easily. I used several very different architectures and packages, explained below.
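
For concreteness, here’s a minimal sketch of the counting loop against the in-memory baseline. The whitespace tokenizer and the tweets iterable are stand-ins for my actual setup, not the real code:

    from collections import defaultdict

    counts = defaultdict(int)            # term -> count; missing keys start at 0

    def tokenize(text):
        return text.lower().split()      # stand-in tokenizer

    for tweet in tweets:                 # tweets: any iterable of message strings
        toks = tokenize(tweet)
        for unigram in toks:
            counts[unigram] += 1
        for bigram in zip(toks, toks[1:]):
            counts[' '.join(bigram)] += 1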

architecture                 name                               speed
-------------------------    --------------------------------   ---------------
in-memory, within-process    python dictionary                  2700 tweets/sec
on-disk, within-process      berkeleydb hashtable                340 tweets/sec
on-disk, within-process      tokyo cabinet hashtable            1400 tweets/sec
in-memory, over socket       memcached                           120 tweets/sec
on-disk, over socket         memcachedb                          0.5 tweets/sec
in-memory, over socket       tokyo tyrant, memcached protocol     85 tweets/sec
on-disk, over socket         tokyo tyrant, memcached protocol     85 tweets/sec
on-disk, over socket         tokyo tyrant, binary protocol       225 tweets/sec

Eventually, I’ll want a purely in-memory, distributed table; that’s why I was interested in Memcached. But for development purposes, it’s very convenient to use an on-disk database. In the past I’ve used BerkeleyDB for this. (An SQL database is also possible, but seems like overkill.) Ideally it would be nice to have a distributed key/value store with a heavy caching layer. Check out Leonard Lin’s post on the subject. Unfortunately, these experiments were limited to a single node with a small dataset.

More details on the options:

  • Python dictionary: defaultdict(int) is the simplest and most obvious implementation. It’s the baseline and the fastest. This is the best option for many types of experimental NLP code, since it can just be serialized to disk for use later. Only if you want many processes to build it concurrently and incrementally, or want many processes to access the model but not have to hold it in their own process space, do the other options start becoming relevant.

  • BerkeleyDB: a well-known key/value store that I’ve used for a while. Unfortunately it’s been removed from the Python standard library, and there are version hell issues every time I see people try to use it. (Every Linux/Unix distribution seems to carry a different version, and none of them are compatible with each other.)
  • Tokyo Cabinet is a newish key/value store with some impressive benchmarks. I just learned about it from Leonard’s post, and I also found it to be excellent. If Cabinet keeps being so awesome, I might never use BerkeleyDB again. (Though installation is more of a hassle than BerkeleyDB, since it’s new enough not to be a common package; e.g. I found it on MacPorts but not Debian.)
  • Memcached: the most standard in-memory key/value store for use over sockets. It’s usually used for caching results from database queries in web applications, because in-memory caching is way faster than hitting disk on a database query. All data in Memcached disappears if you turn it off. Clients talk to it via a plaintext protocol over sockets.
    • The fact that it was slower than the dictionary or BDB or Cabinet means that the communication overhead was high. The nice thing about Memcached for keeping running counts like this is that it should distribute well: have lots of different processes/machines processing data and asking a central Memcached cluster to increment counters. It might be a little unfair to compare Memcached’s performance to BerkeleyDB or Cabinet, since it’s designed for the situation of communicating with many clients at once. It’s usually considered a win if Memcached is faster than a concurrently accessed RDBMS, which is very much the case.

    • I wonder how this architecture would compare to a Hadoop/HDFS/MapReduce stack for batch term-counting performance. Jimmy Lin & other Maryland folks wrote an interesting report (2009) about using Memcached during a Hadoop job in a similar way for, among other things, this same language model use case. In general, lots of machine learning algorithms really don’t parallelize very well in the MapReduce architecture; parameter updates in Gibbs sampling, EM, and any online algorithm (e.g. SGD) are other examples. (An earlier paper on a better-than-MapReduce approach for EM parameters: Jason Wolfe et al. 2008; slides, paper.) A Memcached-like system could be a component of more client-server-ish parallel processing models for these use cases.
    • Note of warning: there are actually 3 different Python libraries for talking to Memcached: (1) memcache.py, aka python-memcached; (2) cmemcache, which wraps the C library libmemcache; and (3) cmemcached.pyx, aka python-libmemcached, which wraps a different C library, libmemcached. For each one, the X in import X maps poorly onto the project’s name. Bleah. Option #3 seems newest, or at least has the best-maintained website, so I used that. (A minimal client sketch appears after this list.)
  • MemcacheDB is a BerkeleyDB-backed, Memcached-protocol server. Initially I had hoped it was just Memcached plus BDB persistence; unfortunately, that is clearly not the case. Its name is so similar, yet its performance is so different from Memcached’s! As Leonard points out, there are lots of half-assed solutions out there. It’s easy for anyone to create a system that works well for their own needs, but it’s harder to make something more general.
  • Tokyo Tyrant is a server on top of Cabinet that exposes a similar key/value API over sockets. It’s incredibly flexible; it was very easy to run it in several different configurations. The first is an in-memory data store, communicating via the memcached protocol. This is, of course, *exactly* comparable to Memcached — behaviorally indistinguishable! — and it does worse. The second option is the same, except with an on-disk data store. It’s pretty ridiculous that that’s still the same speed: communication overhead completely dominates the time. Fortunately, Tyrant also comes with a binary protocol. Using that substantially improves performance past Memcached levels, though it stays below a direct in-process database. Yes, communication across processes incurs overhead; no news here, I guess.
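
To make the over-socket setups concrete, here is roughly what an increment looks like through the memcached protocol, written against the python-memcached client (option #1 above) since its API is the most widely documented; the server address and the seed-then-retry pattern are illustrative assumptions, not my exact benchmark code. The same snippet works against Memcached, MemcacheDB, or Tyrant in memcached-protocol mode:

    import memcache                      # memcache.py, aka python-memcached

    mc = memcache.Client(['127.0.0.1:11211'])   # assumed local server, default port

    def incr_count(key):
        # incr fails (returns None) if the key doesn't exist yet, so seed it
        # with add(), which does nothing if the key was created in the meantime.
        if mc.incr(key) is None:
            if not mc.add(key, 1):       # lost the race: another client added it
                mc.incr(key)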

I can’t say this evaluation tells us too much about the server systems, since it’s all for a single process, which really isn’t their use case. It is interesting, however, to see memcached’s plaintext protocol causing a big performance hit compared to a binary one. There’s a lot of talk, and perhaps code, for a binary memcached protocol, but I couldn’t find any docs suggesting whether it currently works. Tyrant’s binary protocol, at least, seems to work great.
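
For reference, talking to Tyrant’s binary protocol from Python looks roughly like this via the pytyrant module. I’m sketching from memory, so treat the details (the dict-style interface, the default port, KeyError on a missing key) as assumptions; a real counting client would also want Tyrant’s atomic add operations rather than this read-modify-write:

    import pytyrant                  # pure-Python client for Tyrant's binary protocol

    t = pytyrant.PyTyrant.open('127.0.0.1', 1978)   # 1978 is Tyrant's default port

    def incr_count(key):
        # values are byte strings, so counts are stored as decimal strings
        try:
            n = int(t[key])
        except KeyError:             # assumed behavior for a missing key
            n = 0
        t[key] = str(n + 1)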

The biggest takeaway is that Tokyo Cabinet is awesome. It has very complete English-language documentation — something sadly lacking in many otherwise fine Japanese open-source projects — and appears to be highly performant and very flexible. This presentation by its author (Mikio Hirabayashi) shows a pretty impressive array of things the suite of packages can do. At the very least, I’ll probably abandon BerkeleyDB if Cabinet keeps working this well; and hopefully, distribution and remote access will be easy to add via Tyrant.
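
To give a flavor of the in-process usage, here’s a minimal Cabinet hashtable session via the pytc binding; the file name is arbitrary, and the exact flag and method names are from memory, so treat them as assumptions:

    import pytc                      # Python binding for Tokyo Cabinet

    db = pytc.HDB()
    db.open('counts.tch', pytc.HDBOWRITER | pytc.HDBOCREAT)   # open/create on-disk hashtable

    db.put('new york', '1')          # keys and values are byte strings
    print(db.get('new york'))        # -> '1'
    db.close()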

Final note: it’s interesting how many of these new low-latency datastore systems come out of open-sourced projects from social network companies. Tokyo Cabinet/Tyrant is from Mixi, a large Japanese social networking site; Cassandra is from Facebook; and Voldemort is from LinkedIn. (Hadoop HDFS, approximately from Yahoo, is another open-source non-RDBMS distributed datastore, though it’s not really low-latency enough to be comparable.) Then there are lots of commercial low-latency and distributed systems for data warehousing (Oracle, Greenplum, Vertica, Aster, …), but all these large web companies seem happy to open-source their infrastructure. This is great for me, but it sucks to be a database company.


28 Responses to Performance comparison: key/value stores for language model counts

  1. Great post, Brendan!

    I’m curious … where are you getting your Tweet corpus?

  2. brendano says:

    just search.twitter.com

  3. brendano says:

    one more datapoint .. redis = 90 tweets/sec. on-disk, over-socket. unlike memcached and tyrant, the protocol library is pure python.

  4. The stats for language models vary dramatically based on (a) corpus size, (b) n-gram length, and (c) fan-out. With only 45K tweets, you’re only looking at 4.5MB of text or so, assuming they’re 100 chars long each. At 5 chars/token, that’s fewer than 1M bigrams of 10 chars or so each, even if none of them are repeated, so it’s trivial to fit in memory. It’d probably even fit in L2 cache on the big quad-core Xeons, so your bottleneck would almost certainly become I/O at that point.

    When you start getting corpora several orders of magnitude larger (e.g. MEDLINE or Wikipedia), and go beyond bigrams, you have to work a bit harder to stay in memory, and PAT-trie-structure-based approaches start to dominate. You also start to get substantial coverage of the n-grams you have stored, so lookup and increment starts to dominate allocation in the amortized analysis. We’ve managed to get character 8-grams for 50GB of text in memory (OK, it was around 13GB of memory, but still).

    And when you hit the big data collection, like you mention in your last post, you’ll need to spread this out over multiple machines and use something like map-reduce, or you’ll need to adopt a DB-backed approach. You can also use this approach (which is really like what Lucene uses for building search indexes) to get large corpora counts collected on a single machine if you’re willing to be patient.

  5. brendano says:

Yeah, this generic key/value hashtable definitely doesn’t scale as well as a system custom-designed for the keys (trie structures are designed for language) and custom-designed for the values (counts only need 4 or 8 bytes, but bdb, cabinet, and memcached are designed for fairly arbitrary-sized payloads). I’m seeing 10 or 20 megs on disk for this little dataset, which i’m sure is way more than what tries would take.

    the distribution issue is interesting. there’s an issue of whether to do lots of computation locally, then only occasional merges across processors/machines, versus doing more communication more incrementally.

  6. Andy says:

    Have you tried MySQL?

Heard accounts of people using MySQL as a key/value store and achieving performance higher than Memcached’s. It would be interesting to see if that is borne out in your benchmark.

  7. Andy says:

It doesn’t seem like your “on-disk” options were really flushing to disk.

    Seek time for a disk is about 10ms, so if you’re doing random writes, you shouldn’t be able to do more than about 100 disk IOs per second.

    For tokyo cabinet at 1400 tweets/sec and 17 increments per tweet, that’s 23,800 increments/sec. If each increment needs just one disk IO, that’s 23,800 disk IOs per second, way more than the ~100 seeks per second limit. The same goes for the numbers for berkeleydb and tokyo tyrant.

    Perhaps that explains why the numbers for tokyo tyrant in memory and on disk are exactly the same: the data store doesn’t sync to disk after each write.

  8. David Mathers says:

“Final note: it’s interesting how many of these new low-latency datastore systems come out of open-sourced projects from social network companies.”

And memcached is from LiveJournal.

  9. brendano says:

    Andy: yeah, there was only occasional flushing to disk for the Tokyo Cabinet case.

    David: oops I forgot that! I’m so used to memcached being a standard piece of open-source infrastructure it slipped my mind.

  10. ehsanul says:

    Great write up, thanks! :)

I suggest you give MongoDB a try. It’s backed by 10gen, who are basically just developing MongoDB and open-sourcing it; apparently they plan to make money off service licenses. Anyhow, it seems like a great project, and it’s designed for speed and multiple machines.

  11. Dave Spencer says:

See also Burst Tries for when you want to keep more of the dataset in RAM before flushing to disk – this is consistent with Bob C’s comment above about PAT-tries.
    - dave

  12. Mike Dirolf says:

I’m biased because I work on it, but I agree with ehsanul that you might want to give MongoDB a shot. It uses memory-mapped files, so it is very fast, and you can perform richer queries than with a normal key/value store. Anyway, let me know if you decide to give it a shot or have any questions.
    - mike

  13. jamie says:

The BerkeleyDB numbers seem surprisingly low. Do you know what you’ve got the cache size set to? I’ve seen some shockingly fast numbers out of bdb in the past.

from bsddb3 import db

    env = db.DBEnv()

    env.set_cachesize(0, 524288000, 1)  # 500MB cache in a single region

I can’t believe this is still the case, but it looks like the default berkeley db cache is 256KB, which I think equates to only 16 pages in memory before you incur a disk operation. Letting BerkeleyDB do as much of its thing as possible with a big working set helps performance immensely.

    http://www.oracle.com/technology/documentation/berkeley-db/db/api_c/env_set_cachesize.html

Also, make sure your environment is opened with a memory pool. Here are the default env args I use:

    TXN_ARGS = db.DB_CREATE | db.DB_INIT_LOCK | \
    db.DB_INIT_LOG | db.DB_INIT_MPOOL | \
    db.DB_INIT_TXN | db.DB_RECOVER | db.DB_THREAD

There’s some strange mythology around BerkeleyDB floating around the internet; much of it, I think, stems from Subversion’s poor use of it. My experience has been anything but: wickedly fast, extremely stable, and best of all, easy to debug. The db_stat tool is insanely powerful and helps pinpoint the exact reason your database is slow.

  14. Pingback: gaemon's me2DAY

  15. jamie: sounds like berkeleydb has a pretty lousy configuration out of the box! if people misuse it under those circumstances, it really should get some of the blame.

    even if the only difference between bdb and tokyo is out-of-the-box config, that makes tokyo a much more useful tool for me.

  16. Bart Gysens says:

    Did you test gettext as an option?

  17. Evil Bob says:

Tokyo Cabinet only occasionally flushes to disk because it memory-maps the database file; as a result, synchronisation is controlled by the kernel’s virtual memory manager.

    The advantage of this approach is that you no longer need to manage the synchronisation, and you gain extra performance since your program no longer needs to make fsync system calls. The disadvantage, as some of you have pointed out, is that you can no longer control the synchronisation, and if you don’t agree with the virtual memory manager’s synchronisation algorithm, you’re stuck.

  18. Tim says:

    FYI, I’ve also made a performance test for Redis, Tokyo Tyrant and MemcacheDB.
(requests per second)
    http://timyang.net/data/mcdb-tt-redis/

  19. Pingback: Shane K Johnson » Blog Archive » How I learned to say ‘No’ to SQL

  20. Chris says:

Thanks a lot for this writeup. We’ve got some large Berkeley DB files and we’re looking into migrating to TokyoCabinet for performance reasons. Your post was certainly useful.

    We have found pretty huge differences with the BSDDB cache size, and we’ve done tons of performance tweaking. I’m definitely looking forward to checking out how hard it is to port to TokyoCabinet, and how much time we can save.

  21. jd says:

    @Evil Bob,
    msync() to sync to the filesystem and fsync() to sync to disk?

  22. Pingback: Performance comparison: key/value stores for language model counts › ec2base

  23. Pranas says:

I completely agree with Bob. It makes no sense comparing mapping implementations if you need something different ;-).
    Still, it is nice that you share your observations.

    Thanks

  24. Samona says:

Hi, I find this site not bad at all. Thanks a lot for the ideas. Bye.

  25. Holger says:

Hello, I have been using BerkeleyDB for more than five years to store key/value pairs for my search engine at http://www.amidalla.de. The client is in Perl, and I don’t see many performance differences when setting the cache environment!? Maybe this is just Perl-specific; I did not test this with the C++ client.