<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Don&#8217;t MAWK AWK &#8211; the fastest and most elegant big data munging language!</title>
	<atom:link href="http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/feed/" rel="self" type="application/rss+xml" />
	<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Tue, 25 Nov 2025 13:11:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: AWK for Human Beings &#124; Thoughts and Scribbles &#124; MicroDevSys.com</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1853090</link>
		<dc:creator>AWK for Human Beings &#124; Thoughts and Scribbles &#124; MicroDevSys.com</dc:creator>
		<pubDate>Thu, 21 Aug 2014 18:18:17 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1853090</guid>
		<description><![CDATA[[...] &#160;Well now. &#160;Though there should be a warning. &#160;MAWK does come with an accuracy disclaimer and that is really the choice maker between the two. &#160;More on mawk can be found [...]]]></description>
		<content:encoded><![CDATA[<p>[...] &nbsp;Well now. &nbsp;Though there should be a warning. &nbsp;MAWK does come with an accuracy disclaimer and that is really the choice maker between the two. &nbsp;More on mawk can be found [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1745897</link>
		<dc:creator>Eric</dc:creator>
		<pubDate>Tue, 22 Jul 2014 16:56:44 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1745897</guid>
		<description><![CDATA[I stopped reading when I saw that c++ is 1.3x faster than awk while java is 6.4x. Do you seriously suggest that c++ is 5x SLOWER than java?]]></description>
		<content:encoded><![CDATA[<p>I stopped reading when I saw that c++ is 1.3x faster than awk while java is 6.4x. Do you seriously suggest that c++ is 5x SLOWER than java?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510503</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:30:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510503</guid>
		<description><![CDATA[Come to think of it, you can do:

for (ii in imap) delete imap[ii]

instead of delete imap,

which should keep the largest allocation of imap&#039;s hash, and indeed gives you a bit more gawk speed.]]></description>
		<content:encoded><![CDATA[<p>Come to think of it, you can do:</p>
<p>for (ii in imap) delete imap[ii]</p>
<p>instead of delete imap,</p>
<p>which should keep the largest allocation of imap&#8217;s hash, and indeed gives you a bit more gawk speed.</p>
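<p>A minimal, runnable sketch of the loop-delete idea (the input below is invented for illustration; <code>delete arr</code> with no subscript is a gawk/mawk extension, while the per-element loop is portable awk and, per the comment, can let the interpreter reuse the array&#8217;s allocation):</p>

```shell
# Count keys, clear the array with the loop form, count again.
printf 'a\nb\na\n' | awk '
  { seen[$1]++ }
  END {
    n = 0; for (k in seen) n++
    print "keys before clear: " n
    for (k in seen) delete seen[k]   # loop-delete: portable across awks
    n = 0; for (k in seen) n++
    print "keys after clear: " n
  }'
# prints "keys before clear: 2" then "keys after clear: 0"
```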
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510493</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:24:55 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510493</guid>
		<description><![CDATA[Well, you get the idea...

Oddly, my mawk is not faster with these rewrites:

loui@loui-desktop:~/gawk$ mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF             32767
sprintf buffer      1020
loui@loui-desktop:~/gawk$ awk -W version
GNU Awk 3.1.8
Copyright (C) 1989, 1991-2010 Free Software Foundation.]]></description>
		<content:encoded><![CDATA[<p>Well, you get the idea&#8230;</p>
<p>Oddly, my mawk is not faster with these rewrites:</p>
<p>loui@loui-desktop:~/gawk$ mawk -W version<br />
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan</p>
<p>compiled limits:<br />
max NF             32767<br />
sprintf buffer      1020<br />
loui@loui-desktop:~/gawk$ awk -W version<br />
GNU Awk 3.1.8<br />
Copyright (C) 1989, 1991-2010 Free Software Foundation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510466</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:08:11 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510466</guid>
		<description><![CDATA[Sorry, but it seems my getline code and indenting have been eaten by HTML.  Trying again:


BEGIN {
  for (i=2; i in ARGV; i++) {
    inf = ARGV[i]
    outf = inf &quot;n&quot;
    delete imap
    I=0
    while (getline &lt; inf) {
      if (!imap[$1]) imap[$1] = ++I
      if (!jmap[$2]) jmap[$2] = ++J
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf
    }
  }
  for (v in jmap) print v &gt; &quot;vocab&quot;
}
]]></description>
		<content:encoded><![CDATA[<p>Sorry, but it seems my getline code and indenting have been eaten by HTML.  Trying again:</p>
<p>BEGIN {<br />
  for (i=2; i in ARGV; i++) {<br />
    inf = ARGV[i]<br />
    outf = inf &quot;n&quot;<br />
    delete imap<br />
    I=0<br />
    while (getline &lt; inf) {<br />
      if (!imap[$1]) imap[$1] = ++I<br />
      if (!jmap[$2]) jmap[$2] = ++J<br />
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf<br />
    }<br />
  }<br />
  for (v in jmap) print v &gt; &quot;vocab&quot;<br />
}</p>
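<p>For anyone who wants to try it, here is a self-contained run on an invented three-line input (the file name <code>pairs</code> is arbitrary; the loop starts at ARGV[1] because this invocation has no extra leading argument, and the getline return value is checked so a read error cannot loop forever):</p>

```shell
cd "$(mktemp -d)"
printf 'apple red 3\npear green 5\napple green 2\n' > pairs
awk 'BEGIN {
  for (i = 1; i in ARGV; i++) {
    inf = ARGV[i]
    outf = inf "n"                  # output file: input name + "n"
    for (k in imap) delete imap[k]  # portable per-file reset
    I = 0
    while ((getline < inf) > 0) {   # > 0: stop on EOF *and* on error
      if (!imap[$1]) imap[$1] = ++I # renumber column 1 per file
      if (!jmap[$2]) jmap[$2] = ++J # renumber column 2 globally
      print imap[$1] " " jmap[$2] " " $3 > outf
    }
    close(inf); close(outf)
  }
  for (v in jmap) print v > "vocab"
}' pairs
cat pairsn
# prints "1 1 3", "2 2 5", "1 2 2"
```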
]]></content:encoded>
	</item>
	<item>
		<title>By: Ronald Loui</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1510464</link>
		<dc:creator>Ronald Loui</dc:creator>
		<pubDate>Sun, 11 May 2014 21:06:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1510464</guid>
		<description><![CDATA[FILENAME != lastf { lastf = FILENAME; delete imap; I=0 }
!imap[$1] { imap[$1] = ++I }
!jmap[$2] { jmap[$2] = ++J }
{ print imap[$1], jmap[$2], $3 &gt; (lastf &quot;n&quot;) }
END { for (v in jmap) print v &gt; &quot;vocab&quot; }

will save you 10-30% in my gawk, depending on the length of your files.

Now, if Arnold Robbins would give us back the pre-allocate hash size option at the command line, we could get that faster using extra wide hashes.

Note that I am taking advantage of the serial file processing.  I am a huge fan of awk/gawk, data pipeline processing, and pragmatics, such as how much time it takes you to write, debug, maintain, explain, etc.

Since Stallman showed me gawk in &#039;92, I&#039;ve been a huge fan.  Good to see awk/nawk/gawk/mawk get some well-deserved respect.  Solaris nearly killed nawk by giving us the worst implementation in history.

A few language extensions could speed this up even more.  Array-in and array-out would reduce the file-i/o overhead.  ENDFILE and BEGINFILE would remove an annoying conditional.  For long lines, a parse-to-n-and-quit, like the split in bash and perl, would also save time, since $n, n&gt;3, is never referenced.  It&#039;s possible that a two-stage &quot;i in array&quot; test would be faster than hash-collision-chain traversal, where the first stage is a bloom filter.

In fact, this is another 10% faster...

BEGIN {
  for (ifile=2; ifile in ARGV; ifile++) {
    inf = ARGV[ifile]
    outf = inf &quot;n&quot;
    delete imap
    I=0
    while (getline &lt; inf) {
      if (!imap[$1]) imap[$1] = ++I
      if (!jmap[$2]) jmap[$2] = ++J
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf
    }
  }
  for (v in jmap) print v &gt; &quot;vocab&quot;
}

awk &quot;experts&quot; abhor the use of BEGIN for the main loop, but the fact is, you get a lot more control, you can process multiple streams, you can increase readability and correctness, AND you can popularize the language for more general scripting use.  

So happy to see discussions like this.]]></description>
		<content:encoded><![CDATA[<p>FILENAME != lastf { lastf = FILENAME; delete imap; I=0 }<br />
!imap[$1] { imap[$1] = ++I }<br />
!jmap[$2] { jmap[$2] = ++J }<br />
{ print imap[$1], jmap[$2], $3 &gt; (lastf &quot;n&quot;) }<br />
END { for (v in jmap) print v &gt; &quot;vocab&quot; }</p>
<p>will save you 10-30% in my gawk, depending on the length of your files.</p>
<p>Now, if Arnold Robbins would give us back the pre-allocate hash size option at the command line, we could get that faster using extra wide hashes.</p>
<p>Note that I am taking advantage of the serial file processing.  I am a huge fan of awk/gawk, data pipeline processing, and pragmatics, such as how much time it takes you to write, debug, maintain, explain, etc.</p>
<p>Since Stallman showed me gawk in &#8217;92, I&#8217;ve been a huge fan.  Good to see awk/nawk/gawk/mawk get some well-deserved respect.  Solaris nearly killed nawk by giving us the worst implementation in history.</p>
<p>A few language extensions could speed this up even more.  Array-in and array-out would reduce the file-i/o overhead.  ENDFILE and BEGINFILE would remove an annoying conditional.  For long lines, a parse-to-n-and-quit, like the split in bash and perl, would also save time, since $n, n&gt;3, is never referenced.  It&#8217;s possible that a two-stage &#8220;i in array&#8221; test would be faster than hash-collision-chain traversal, where the first stage is a bloom filter.</p>
<p>In fact, this is another 10% faster&#8230;</p>
<p>BEGIN {<br />
  for (ifile=2; ifile in ARGV; ifile++) {<br />
    inf = ARGV[ifile]<br />
    outf = inf &quot;n&quot;<br />
    delete imap<br />
    I=0<br />
    while (getline &lt; inf) {<br />
      if (!imap[$1]) imap[$1] = ++I<br />
      if (!jmap[$2]) jmap[$2] = ++J<br />
      print imap[$1] &quot; &quot; jmap[$2] &quot; &quot; $3 &gt; outf<br />
    }<br />
  }<br />
  for (v in jmap) print v &gt; &quot;vocab&quot;<br />
}</p>
<p>awk &#8220;experts&#8221; abhor the use of BEGIN for the main loop, but the fact is, you get a lot more control, you can process multiple streams, you can increase readability and correctness, AND you can popularize the language for more general scripting use.  </p>
<p>So happy to see discussions like this.</p>
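<p>As a quick check, the five-line pattern-action version at the top of this comment runs as-is on two invented files (<code>a</code> and <code>b</code> are made-up names; column-1 ids restart per file while column-2 ids stay global, as intended):</p>

```shell
cd "$(mktemp -d)"
printf 'apple red 3\npear green 5\n' > a
printf 'apple blue 1\n' > b
awk '
FILENAME != lastf { lastf = FILENAME; for (k in imap) delete imap[k]; I = 0 }
!imap[$1] { imap[$1] = ++I }
!jmap[$2] { jmap[$2] = ++J }
{ print imap[$1], jmap[$2], $3 > (lastf "n") }
END { for (v in jmap) print v > "vocab" }
' a b
cat an bn
# prints "1 1 3", "2 2 5", then "1 3 1" (apple renumbered to 1 again in file b)
```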
]]></content:encoded>
	</item>
	<item>
		<title>By: Gert</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-1330863</link>
		<dc:creator>Gert</dc:creator>
		<pubDate>Sun, 16 Mar 2014 22:41:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-1330863</guid>
		<description><![CDATA[The developer of the upstream mawk says it is a limitation of the printf function (to keep it fast):

&gt; This is a known limitation: mawk&#039;s format for %d is limited by the format.
&gt; The limitation is done to improve performance.
&gt; 
&gt; You can get more precision using one of the floating formats (and can construct
&gt; one which prints like a %d, e.g., by putting a &quot;.0&quot; on the end of the format).
http://code.google.com/p/original-mawk/issues/detail?id=23]]></description>
		<content:encoded><![CDATA[<p>The developer of the upstream mawk says it is a limitation of the printf function (to keep it fast):</p>
<p>&gt; This is a known limitation: mawk&#8217;s format for %d is limited by the format.<br />
&gt; The limitation is done to improve performance.<br />
&gt;<br />
&gt; You can get more precision using one of the floating formats (and can construct<br />
&gt; one which prints like a %d, e.g., by putting a &quot;.0&quot; on the end of the format).<br />
<a href="http://code.google.com/p/original-mawk/issues/detail?id=23" rel="nofollow">http://code.google.com/p/original-mawk/issues/detail?id=23</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brendano</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-941119</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Sat, 11 Jan 2014 19:24:47 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-941119</guid>
		<description><![CDATA[Unfortunately, limitations like this seem to keep cropping up in mawk...]]></description>
		<content:encoded><![CDATA[<p>Unfortunately, limitations like this seem to keep cropping up in mawk&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-934488</link>
		<dc:creator>Andrew</dc:creator>
		<pubDate>Fri, 10 Jan 2014 19:01:59 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-934488</guid>
		<description><![CDATA[I&#039;ve loved reading this comment set.  It&#039;s old (in internet years) but very informative.  Thanks for all the commentary.  I hope there might be a solution to my problem.  A lot of my work requires summing over long periods of data, such as the total GET&#039;d data on a busy web cluster in a month.  Mawk is fantastic-- I&#039;m getting about 3.5x the performance I saw in gawk.  
The problem I&#039;m seeing is the 2147483647 max on %d-formatted numbers.  Here&#039;s output from the same script, where the first line is &lt;b&gt;sprintf %d&lt;/b&gt;&#039;d and the second is simply &lt;b&gt;print&lt;/b&gt;&#039;d:

&lt;code&gt;total_size, total_count, average:       2147483647 50586 2493242
total_size, total_count, average:       1.26123e+11 50586 2.49324e+06&lt;/code&gt;

Sadly, the exponential notation doesn&#039;t work for the report&#039;s audience.  Is there some workaround for this?  With careful direction I don&#039;t mind editing the source code, but it&#039;s really not my forte.

Again, thanks for all the useful comments.]]></description>
		<content:encoded><![CDATA[<p>I&#8217;ve loved reading this comment set.  It&#8217;s old (in internet years) but very informative.  Thanks for all the commentary.  I hope there might be a solution to my problem.  A lot of my work requires summing over long periods of data, such as the total GET&#8217;d data on a busy web cluster in a month.  Mawk is fantastic&#8211; I&#8217;m getting about 3.5x the performance I saw in gawk.<br />
The problem I&#8217;m seeing is the 2147483647 max on %d-formatted numbers.  Here&#8217;s output from the same script, where the first line is <b>sprintf %d</b>&#8217;d and the second is simply <b>print</b>&#8217;d:</p>
<p><code>total_size, total_count, average:       2147483647 50586 2493242<br />
total_size, total_count, average:       1.26123e+11 50586 2.49324e+06</code></p>
<p>Sadly, the exponential notation doesn&#8217;t work for the report&#8217;s audience.  Is there some workaround for this?  With careful direction I don&#8217;t mind editing the source code, but it&#8217;s really not my forte.</p>
<p>Again, thanks for all the useful comments.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gert</title>
		<link>http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comment-836485</link>
		<dc:creator>Gert</dc:creator>
		<pubDate>Thu, 26 Dec 2013 22:07:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=652#comment-836485</guid>
		<description><![CDATA[You need to recompile mawk (see next post).]]></description>
		<content:encoded><![CDATA[<p>You need to recompile mawk (see next post).</p>
]]></content:encoded>
	</item>
</channel>
</rss>
