update 2012-10-25: I’ve been informed there is a new maintainer for Mawk, who has probably fixed the bugs I’ve been seeing.
From: Gert Hulselmans
[The bugs you have found are] indeed true with mawk v1.3.3 which comes standard with Debian/Ubuntu. This version is almost not developed the last 10 years.
I now already use mawk v1.3.4 maintained by another developer (Thomas E. Dickey)
for more than a year on huge datafiles (sometimes several GB).
The problems/wrong results I had with mawk v1.3.3 sometimes are gone. In his version, normally all open/known bugs are fixed.
This version can be downloaded from: http://invisible-island.net/mawk/
update 2010-04-30: I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe.
When one of these newfangled “Big Data” sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you’re dealing with hundreds of megabytes of data, even simple operations can take plenty of time.
For one recent ad-hoc task I had — reformatting 1GB of textual feature data into a form Matlab and R can read — I tried writing implementations in several languages, with help from my classmate Elijah. The results really surprised us:
|Language||Time (min:sec)||Speed (vs. gawk)||Lines of code||Notes||Type|
|mawk||1:06||7.8x||3||Mike Brennan’s Awk, system default on Ubuntu/Debian Linux.||VM|
|java||1:20||6.4x||32||version 1.6 (-server didn’t matter)||VM+JIT|
|c-ish c++||1:35||5.4x||42||g++ 4.0.1 with -O3, using stdio.h||Native|
|python||2:15||3.8x||20||version 2.5, system default on OSX 10.5||VM|
|perl||3:00||2.9x||17||version 5.8.8, system default on OSX 10.5||VM|
|nawk||6:10||1.4x||3||Brian Kernighan‘s “One True Awk”, system default on OSX, *BSD||?|
|c++||6:50||1.3x||48||g++ 4.0.1 with -O3, using fstream, stringstream||Native|
|ruby||7:30||1.1x||22||version 1.8.4, system default on OSX 10.5; also tried 1.9, but was slower||Interpreted|
|gawk||8:35||1x||3||GNU Awk, system default on RedHat/Fedora Linux||Interpreted|
To be clear, the problem is to take several files of (item name, feature name, value) triples, like:
000794107-10-K-19960401 limited 1 000794107-10-K-19960401 colleges 1 000794107-10-K-19960401 code 2 ... 004334108-10-K-19961230 recognition 1 004334108-10-K-19961230 gross 8 ...
And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.
This task is simple, but it’s representative of many data munging tasks out there. It inputs and outputs textual data. It’s probably one-off. The algorithm is easy — especially since it’s a subtask of something larger — but still complex enough you’ll need a debug cycle or two. You want to get it done fast so you can get on to the real work. Complex programming tools, like debuggers, are of little use — you figure out what’s going on by inspecting the output. Complex data processing environments, like Hadoop or an RDBMS, are also of little use — you have to munge in the first place to load data into them.
It turns out, this is a task AWK was made for. It’s a language dating from the original Bell Labs Unix era — circa 1977 — and it’s extremely specialized for processing delimited text files in a single pass. Perl was created in part to supersede it, but for this core use case, Awk is much more elegant and clearer. The implementation here is only 8 lines of code, expanded from merely 3 when I first wrote it.
Since it’s a standardized language, many implementations exist. One of them, MAWK, is incredibly efficient. It outperforms all other languages, including statically typed compiled ones like Java and C++! It wins on both LOC and performance criteria — a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.
[There's another big pro-VM story here: Java beat C++, both in LOC/ease-of-programming as well as performance. C++ was, as usual, a total nightmare to write. What you don't see in the LOC numbers is the sheer amount of time spent googling through every weird issue. For example, apparently g++ requires you to define the hash function in order to use a hash_map of string keys. Or, there are a zillion different ways to split a string, none of them standard. Then there are 2 different I/O and 2 different string libraries given its C heritage, and if you make the wrong choices, performance is terrible. I'm totally sure that given some more rewriting, the C++ implementation can be made the fastest. It's just a question of how much pain you go through to find the right rewrites. All the other implementations were written in the most straightforward, simplest way possible. C++ abjectly fails the "get it done quick" criterion.]
My most pleasant surprise learning Awk was its shortcuts for reading and writing files. Like shell, there is no concept of a file handle — you don’t open and close files, you just specify the filename and the VM figures it out. Even Perl, king of syntactic shorthand, doesn’t have this useful feature; and even Ruby, with its elegant open()-block-cleanup construct, is clunkier. It sounds minor, but this eliminates a number of bugs you can make in scripts.
But what I most appreciate about Awk is the discourse structure of its programs: every clause is a potentially conditional action to be performed with the current record. If you want actions to be taken upon program start or exit, you declare special clauses for that. Awk manages all these features while being staying incredibly small and simple — the advantages of being a domain-specific language. I think it feels a little more like a super-flexible, index-challenged version of SQL than it does a standard scripting language. I suspect Awk’s simplicity and specialization is part of why Mike Brennan was able to make Mawk so insanely fast. If your only datatypes are strings and hashes, then compile-time type inference is pretty easy.
Awk is also well-suited to the “Disk is the New Tape” era. That is, right now, hard drive sizes are rapidly growing — allowing very large datasets — but random access seek times aren’t catching up. In this setting, the only way to process data is via linear scans, accessing one item of data at a time. (E.g. running variance, online learning, column stores, etc.) This is the core philosophy behind Hadoop’s computation model — and Awk’s. If hard drives are like tape drives, then it’s worth looking in to other blast-from-the-past technologies! (Similar point about SAS, in fact.)
There are many Awk tutorials on the web. This one is decent, though I strongly recommend Ken Church’s classic tutorial Unix for Poets. It shows how to do all sorts of great things with Unix text processing tools, including Awk.
All the code, results, and data can be obtained at github.com/brendano/awkspeed. I’d love to see results for more languages. And I hope someday someone tries writing an LLVM Awk — will it be even faster?