<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Don&#8217;t MAWK AWK &#8211; the fastest and most elegant big data munging language!</title>
	<atom:link href="http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/feed/" rel="self" type="application/rss+xml" />
	<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Tue, 25 Nov 2025 13:11:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: AWK for Human Beings &#124; Thoughts and Scribbles &#124; MicroDevSys.com</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1853090</link>
		<dc:creator>AWK for Human Beings &#124; Thoughts and Scribbles &#124; MicroDevSys.com</dc:creator>
		<pubDate>Thu, 21 Aug 2014 18:18:17 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1853090</guid>
		<description><![CDATA[[...] &#160;Well now. &#160;Though there should be a warning. &#160;MAWK does come with an accuracy disclaimer and that is really the choice maker between the two. &#160;More on mawk can be found [...]]]></description>
		<content:encoded><![CDATA[<p>[...] &nbsp;Well now. &nbsp;Though there should be a warning. &nbsp;MAWK does come with an accuracy disclaimer and that is really the choice maker between the two. &nbsp;More on mawk can be found [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1745897</link>
		<dc:creator>Eric</dc:creator>
		<pubDate>Tue, 22 Jul 2014 16:56:44 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1745897</guid>
		<description><![CDATA[I stopped reading when I saw that c++ is 1.3x faster than awk while java is 6.4x. Do you seriously suggest that c++ is 5x SLOWER than java?]]></description>
		<content:encoded><![CDATA[<p>I stopped reading when I saw that c++ is 1.3x faster than awk while java is 6.4x. Do you seriously suggest that c++ is 5x SLOWER than java?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510503</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:30:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510503</guid>
		<description><![CDATA[Come to think of it, you can do:

for (ii in imap) delete imap[ii]

instead of delete imap,

which should keep the largest allocation of imap&#039;s hash, and indeed gives you a bit more gawk speed.]]></description>
		<content:encoded><![CDATA[<p>Come to think of it, you can do:</p>
<p>for (ii in imap) delete imap[ii]</p>
<p>instead of delete imap,</p>
<p>which should keep the largest allocation of imap&#8217;s hash, and indeed gives you a bit more gawk speed.</p>
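<p>A minimal, runnable sketch of the loop-delete idea (the input below is invented for illustration; <code>delete arr</code> with no subscript is a gawk/mawk extension, while the per-element loop is portable awk and, per the comment, can let the interpreter reuse the array&#8217;s allocation):</p>

```shell
# Count keys, clear the array with the loop form, count again.
printf 'a\nb\na\n' | awk '
  { seen[$1]++ }
  END {
    n = 0; for (k in seen) n++
    print "keys before clear: " n
    for (k in seen) delete seen[k]   # loop-delete: portable across awks
    n = 0; for (k in seen) n++
    print "keys after clear: " n
  }'
# prints "keys before clear: 2" then "keys after clear: 0"
```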
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510493</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:24:55 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510493</guid>
		<description><![CDATA[Well, you get the idea...

Oddly, my mawk is not faster with these rewrites:

loui@loui-desktop:~/gawk$ mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF             32767
sprintf buffer      1020
loui@loui-desktop:~/gawk$ awk -W version
GNU Awk 3.1.8
Copyright (C) 1989, 1991-2010 Free Software Foundation.]]></description>
		<content:encoded><![CDATA[<p>Well, you get the idea&#8230;</p>
<p>Oddly, my mawk is not faster with these rewrites:</p>
<p>loui@loui-desktop:~/gawk$ mawk -W version<br />
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan</p>
<p>compiled limits:<br />
max NF             32767<br />
sprintf buffer      1020<br />
loui@loui-desktop:~/gawk$ awk -W version<br />
GNU Awk 3.1.8<br />
Copyright (C) 1989, 1991-2010 Free Software Foundation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510466</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:08:11 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510466</guid>
		<description><![CDATA[Sorry, but it seems my getline code and indenting have been eaten by HTML.  Trying again:


BEGIN {
  for (i=2; i in ARGV; i++) {
    inf = ARGV[i]
    outf = inf &quot;n&quot;
    delete imap
    I=0
    while (getline &lt; inf) {
      if (!imap[$1]) imap[$1] = ++I
      if (!jmap[$2]) jmap[$2] = ++J
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf
    }
  }
  for (v in jmap) print v &gt; &quot;vocab&quot;
}
]]></description>
		<content:encoded><![CDATA[<p>Sorry, but it seems my getline code and indenting have been eaten by HTML.  Trying again:</p>
<p>BEGIN {<br />
  for (i=2; i in ARGV; i++) {<br />
    inf = ARGV[i]<br />
    outf = inf &quot;n&quot;<br />
    delete imap<br />
    I=0<br />
    while (getline &lt; inf) {<br />
      if (!imap[$1]) imap[$1] = ++I<br />
      if (!jmap[$2]) jmap[$2] = ++J<br />
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf<br />
    }<br />
  }<br />
  for (v in jmap) print v &gt; &quot;vocab&quot;<br />
}</p>
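<p>For anyone who wants to try it, here is a self-contained run on an invented three-line input (the file name <code>pairs</code> is arbitrary; the loop starts at ARGV[1] because this invocation has no extra leading argument, and the getline return value is checked so a read error cannot loop forever):</p>

```shell
cd "$(mktemp -d)"
printf 'apple red 3\npear green 5\napple green 2\n' > pairs
awk 'BEGIN {
  for (i = 1; i in ARGV; i++) {
    inf = ARGV[i]
    outf = inf "n"                  # output file: input name + "n"
    for (k in imap) delete imap[k]  # portable per-file reset
    I = 0
    while ((getline < inf) > 0) {   # > 0: stop on EOF *and* on error
      if (!imap[$1]) imap[$1] = ++I # renumber column 1 per file
      if (!jmap[$2]) jmap[$2] = ++J # renumber column 2 globally
      print imap[$1] " " jmap[$2] " " $3 > outf
    }
    close(inf); close(outf)
  }
  for (v in jmap) print v > "vocab"
}' pairs
cat pairsn
# prints "1 1 3", "2 2 5", "1 2 2"
```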
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510464</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:06:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510464</guid>
		<description><![CDATA[FILENAME != lastf { lastf = FILENAME; delete imap; I=0 }
!imap[$1] { imap[$1] = ++I }
!jmap[$2] { jmap[$2] = ++J }
{ print imap[$1], jmap[$2], $3 &gt; (lastf &quot;n&quot;) }
END { for (v in jmap) print v &gt; &quot;vocab&quot; }

will save you 10-30% in my gawk, depending on the length of your files.

Now, if Arnold Robbins would give us back the pre-allocate hash size option at the command line, we could get that faster using extra wide hashes.

Note that I am taking advantage of the serial file processing.  I am a huge fan of awk/gawk, data pipeline processing, and pragmatics, such as how much time it takes you to write, debug, maintain, explain, etc.

Since Stallman showed me gawk in &#039;92, I&#039;ve been a huge fan.  Good to see awk/nawk/gawk/mawk get some well-deserved respect.  Solaris nearly killed nawk by giving us the worst implementation in history.

A few language extensions could speed this up even more.  Array-in and array-out would reduce the file-i/o overhead.  ENDFILE and BEGINFILE would remove an annoying conditional.  For long lines, a parse-to-n-and-quit, like the split in bash and perl, would also save time, since $n, n&gt;3, is never referenced.  It&#039;s possible that a two-stage &quot;i in array&quot; test would be faster than hash-collision-chain traversal, where the first stage is a bloom filter.

In fact, this is another 10% faster...

BEGIN {
  for (ifile=2; ifile in ARGV; ifile++) {
    inf = ARGV[ifile]
    outf = inf &quot;n&quot;
    delete imap
    I=0
    while (getline &lt; inf) {
      if (!imap[$1]) imap[$1] = ++I
      if (!jmap[$2]) jmap[$2] = ++J
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf
    }
  }
  for (v in jmap) print v &gt; &quot;vocab&quot;
}

awk &quot;experts&quot; abhor the use of BEGIN for the main loop, but the fact is, you get a lot more control, you can process multiple streams, you can increase readability and correctness, AND you can popularize the language for more general scripting use.  

So happy to see discussions like this.]]></description>
		<content:encoded><![CDATA[<p>FILENAME != lastf { lastf = FILENAME; delete imap; I=0 }<br />
!imap[$1] { imap[$1] = ++I }<br />
!jmap[$2] { jmap[$2] = ++J }<br />
{ print imap[$1], jmap[$2], $3 &gt; (lastf &quot;n&quot;) }<br />
END { for (v in jmap) print v &gt; &quot;vocab&quot; }</p>
<p>will save you 10-30% in my gawk, depending on the length of your files.</p>
<p>Now, if Arnold Robbins would give us back the pre-allocate hash size option at the command line, we could get that faster using extra wide hashes.</p>
<p>Note that I am taking advantage of the serial file processing.  I am a huge fan of awk/gawk, data pipeline processing, and pragmatics, such as how much time it takes you to write, debug, maintain, explain, etc.</p>
<p>Since Stallman showed me gawk in &#8217;92, I&#8217;ve been a huge fan.  Good to see awk/nawk/gawk/mawk get some well-deserved respect.  Solaris nearly killed nawk by giving us the worst implementation in history.</p>
<p>A few language extensions could speed this up even more.  Array-in and array-out would reduce the file-i/o overhead.  ENDFILE and BEGINFILE would remove an annoying conditional.  For long lines, a parse-to-n-and-quit, like the split in bash and perl, would also save time, since $n, n&gt;3, is never referenced.  It&#8217;s possible that a two-stage &#8220;i in array&#8221; test would be faster than hash-collision-chain traversal, where the first stage is a bloom filter.</p>
<p>In fact, this is another 10% faster&#8230;</p>
<p>BEGIN {<br />
  for (ifile=2; ifile in ARGV; ifile++) {<br />
    inf = ARGV[ifile]<br />
    outf = inf &quot;n&quot;<br />
    delete imap<br />
    I=0<br />
    while (getline &lt; inf) {<br />
      if (!imap[$1]) imap[$1] = ++I<br />
      if (!jmap[$2]) jmap[$2] = ++J<br />
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf<br />
    }<br />
  }<br />
  for (v in jmap) print v &gt; &quot;vocab&quot;<br />
}</p>
<p>awk &#8220;experts&#8221; abhor the use of BEGIN for the main loop, but the fact is, you get a lot more control, you can process multiple streams, you can increase readability and correctness, AND you can popularize the language for more general scripting use.  </p>
<p>So happy to see discussions like this.</p>
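<p>As a quick check, the five-line pattern-action version at the top of this comment runs as-is on two invented files (<code>a</code> and <code>b</code> are made-up names; column-1 ids restart per file while column-2 ids stay global, as intended):</p>

```shell
cd "$(mktemp -d)"
printf 'apple red 3\npear green 5\n' > a
printf 'apple blue 1\n' > b
awk '
FILENAME != lastf { lastf = FILENAME; for (k in imap) delete imap[k]; I = 0 }
!imap[$1] { imap[$1] = ++I }
!jmap[$2] { jmap[$2] = ++J }
{ print imap[$1], jmap[$2], $3 > (lastf "n") }
END { for (v in jmap) print v > "vocab" }
' a b
cat an bn
# prints "1 1 3", "2 2 5", then "1 3 1" (apple renumbered to 1 again in file b)
```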
]]></content:encoded>
	</item>
	<item>
		<title>By: Gert</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1330863</link>
		<dc:creator>Gert</dc:creator>
		<pubDate>Sun, 16 Mar 2014 22:41:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1330863</guid>
		<description><![CDATA[The developer of the upstream mawk says it is a limitation of the printf function (to keep it fast):

&gt; This is a known limitation: mawk&#039;s format for %d is limited by the format.
&gt; The limitation is done to improve performance.
&gt; 
&gt; You can get more precision using one of the floating formats (and can construct
&gt; one which prints like a %d, e.g., by putting a &quot;.0&quot; on the end of the format).
http://code.google.com/p/original-mawk/issues/detail?id=23]]></description>
		<content:encoded><![CDATA[<p>The developer of the upstream mawk says it is a limitation of the printf function (to keep it fast):</p>
<p>&gt; This is a known limitation: mawk&#8217;s format for %d is limited by the format.<br />
&gt; The limitation is done to improve performance.<br />
&gt;<br />
&gt; You can get more precision using one of the floating formats (and can construct<br />
&gt; one which prints like a %d, e.g., by putting a &quot;.0&quot; on the end of the format).<br />
<a href="http://code.google.com/p/original-mawk/issues/detail?id=23" rel="nofollow">http://code.google.com/p/original-mawk/issues/detail?id=23</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brendano</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-941119</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Sat, 11 Jan 2014 19:24:47 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-941119</guid>
		<description><![CDATA[Unfortunately, limitations like this seem to keep cropping up in mawk...]]></description>
		<content:encoded><![CDATA[<p>Unfortunately, limitations like this seem to keep cropping up in mawk&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-934488</link>
		<dc:creator>Andrew</dc:creator>
		<pubDate>Fri, 10 Jan 2014 19:01:59 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-934488</guid>
		<description><![CDATA[I&#039;ve loved reading this comment set.  It&#039;s old (in internet years) but very informative.  Thanks for all the commentary.  I hope there might be a solution to my problem.  A lot of my work requires summing over long periods of data, such as the total GET&#039;d data on a busy web cluster in a month.  Mawk is fantastic-- I&#039;m getting about 3.5x the performance I saw in gawk.  
The problem I&#039;m seeing is the 2147483647 max on %d-formatted numbers.  Here&#039;s output from the same script, where the first line is &lt;b&gt;sprintf %d&lt;/b&gt;&#039;d and the second is simply &lt;b&gt;print&lt;/b&gt;&#039;d:

&lt;code&gt;total_size, total_count, average:       2147483647 50586 2493242
total_size, total_count, average:       1.26123e+11 50586 2.49324e+06&lt;/code&gt;

Sadly, the exponential notation doesn&#039;t work for the report&#039;s audience.  Is there some workaround for this?  With careful direction I don&#039;t mind editing the source code, but it&#039;s really not my forte.

Again, thanks for all the useful comments.]]></description>
		<content:encoded><![CDATA[<p>I&#8217;ve loved reading this comment set.  It&#8217;s old (in internet years) but very informative.  Thanks for all the commentary.  I hope there might be a solution to my problem.  A lot of my work requires summing over long periods of data, such as the total GET&#8217;d data on a busy web cluster in a month.  Mawk is fantastic&#8211; I&#8217;m getting about 3.5x the performance I saw in gawk.<br />
The problem I&#8217;m seeing is the 2147483647 max on %d-formatted numbers.  Here&#8217;s output from the same script, where the first line is <b>sprintf %d</b>&#8217;d and the second is simply <b>print</b>&#8217;d:</p>
<p><code>total_size, total_count, average:       2147483647 50586 2493242<br />
total_size, total_count, average:       1.26123e+11 50586 2.49324e+06</code></p>
<p>Sadly, the exponential notation doesn&#8217;t work for the report&#8217;s audience.  Is there some workaround for this?  With careful direction I don&#8217;t mind editing the source code, but it&#8217;s really not my forte.</p>
<p>Again, thanks for all the useful comments.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gert</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-836485</link>
		<dc:creator>Gert</dc:creator>
		<pubDate>Thu, 26 Dec 2013 22:07:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-836485</guid>
		<description><![CDATA[You need to recompile mawk (see next post).]]></description>
		<content:encoded><![CDATA[<p>You need to recompile mawk (see next post).</p>
]]></content:encoded>
	</item>
</channel>
</rss>
