It’s fairly obvious that an average can be calculated online, but interestingly, there’s also a way to calculate a running variance and standard deviation. Read all about it here.
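The core of the trick is the standard one-pass update: keep a count, a running mean, and a running sum of squared deviations, and fold each new value into all three. A rough Python sketch of the idea (my illustration, not code from the linked post):

class RunningStat:
    """Accumulate count, mean, and variance in one pass, O(1) extra memory."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # note: uses the *updated* mean

    def variance(self):
        # population variance (ddof=0), same convention as numpy.std's default
        return self.m2 / self.n if self.n else 0.0

    def std(self):
        return self.variance() ** 0.5

rs = RunningStat()
for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    rs.push(v)
print(rs.mean, rs.std())   # approximately 5.0 and 2.0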
I’m playing around with the Netflix Prize data of 100 million movie ratings, and a huge problem is figuring out how to load and compute everything in memory. I’m having success with NumPy, the numerical library for Python, because it stores arrays compactly in C/Fortran binary layouts. For example, 100 million 32-bit floats = 100M * 4 bytes = 400 MB of memory, which is manageable. And it’s much easier to play around interactively in ipython/matplotlib than to write C++ for everything.
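A quick back-of-the-envelope check of that arithmetic (my own snippet, nothing Netflix-specific):

import numpy as np
n = 100_000_000
print(n * np.dtype(np.float32).itemsize)   # 400000000 bytes, i.e. 400 MB
print(n * np.dtype(np.float64).itemsize)   # 800000000 bytes if the data ends up as doubles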
Unfortunately, the simple ways to calculate variance on an array of that size create wasteful intermediate arrays as large as the original.
>>> mean( (x-mean(x)) ** 2 )                 # two intermediate structures
>>> tmp=x-mean(x); tmp**=2; mean(tmp)        # one intermediate structure
That’s an extra 400 or 800 megs of memory being thrown around. (And if x was an array of integers, the x-mean(x) step implicitly converts to 64-bit doubles which, well, doubles things again!)
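You can see that promotion directly with a small array:

import numpy as np
ints = np.arange(10, dtype=np.int32)
print((ints - ints.mean()).dtype)   # float64: the int32 array is upcast when you subtract the float mean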
In : from numpy import *
In : x = arange(1e8)                      # python RSIZE = 774 MB
In : timeit -n1 -r5 std(x)                # RSIZE goes as high as 2.2 GB
1 loops, best of 5: 4.01 s per loop
In : import running_stat
In : timeit -n1 -r5 running_stat.std(x)   # RSIZE = 774 MB the whole time
1 loops, best of 5: 1.66 s per loop
The C++ implementation is very simple and can be ripped out of running_stat.cc.
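The same idea also works in pure NumPy if you walk the array in fixed-size blocks and merge each block’s count, mean, and sum of squared deviations into running totals. Here’s a rough sketch (running_std and its block size are just illustrative names, not the interface of running_stat):

import numpy as np

def running_std(x, block=1_000_000):
    # Block-wise one-pass std: peak extra memory is one block, never a full copy.
    # Merges each block's (count, mean, M2) into the running totals.
    n, mean, m2 = 0, 0.0, 0.0
    for start in range(0, len(x), block):
        chunk = np.asarray(x[start:start + block], dtype=np.float64)
        bn = chunk.size
        bmean = chunk.mean()
        bm2 = ((chunk - bmean) ** 2).sum()        # temporaries here are only block-sized
        delta = bmean - mean
        total = n + bn
        mean += delta * bn / total
        m2 += bm2 + delta * delta * n * bn / total
        n = total
    return np.sqrt(m2 / n)                        # population std, like numpy.std's default

On the arange(1e8) example above this should agree with std(x) up to floating-point rounding, while allocating only one block’s worth of temporaries at a time.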
I wonder whether Haskell’s laziness, perhaps along with my friend Patrick’s Haskell BLAS bindings, might magically avoid some of the memory overhead of the naive implementation.