It’s fairly obvious that an average can be calculated online, but interestingly, there’s also a way to calculate a running variance and standard deviation. Read all about it here.

I’m playing around with the Netflix Prize data of 100 million movie ratings, and a huge problem is figuring out how to load and calculate everything in memory. I’m having success with NumPy, the numeric library for Python, because it compactly stores arrays with C/Fortran binary layouts. For example, 100 million 32-bit floats = 100M * 4 = 400MB of memory, which is manageable. And it’s much easier to play around interactively in ipython/matplotlib rather than write C++ for everything.

Unfortunately, the simple ways to calculate variance on an array of that size create wasteful intermediate data structures as long as the original array.

>>> mean( (x - mean(x)) ** 2 )                 # two intermediate structures
>>> tmp = x - mean(x); tmp **= 2; mean(tmp)    # one intermediate structure

That’s an extra 400 or 800 megs of memory being thrown around. (And if x was an array of integers, the x-mean(x) step implicitly converts to 64-bit doubles which, well, doubles things again!)

So, following John Cook’s explanation, I wrote running_stat, a C++ and Python implementation of running variance. It takes almost no memory and is faster than NumPy’s native variance function. Demo:
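The core recurrence is Welford's one-pass update, which John Cook's post walks through. Here's a minimal Python sketch of that recurrence (the real running_stat is C++ with a NumPy wrapper, so the class and method names below are just illustrative):

```python
class RunningStat:
    """One-pass (Welford) mean and variance: O(1) memory per stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # update mean incrementally
        self.m2 += delta * (x - self.mean)   # uses old and new mean

    def variance(self):
        # population variance; use (n - 1) for the sample estimate
        return self.m2 / self.n if self.n else 0.0
```

Each `push` touches only three scalars, which is why the memory footprint stays flat no matter how long the input array is.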

In [1]: from numpy import *
In [2]: x = arange(1e8)                      # python RSIZE = 774 MB
In [3]: timeit -n1 -r5 std(x)                # RSIZE goes as high as 2.2 GB
1 loops, best of 5: 4.01 s per loop
In [4]: import running_stat
In [5]: timeit -n1 -r5 running_stat.std(x)   # RSIZE = 774 MB the whole time
1 loops, best of 5: 1.66 s per loop

The C++ implementation is very simple and can be ripped out of running_stat.cc.

Link: github.com/brendano/running_stat

I wonder if Haskell’s laziness, perhaps along with my friend Patrick’s Haskell BLAS bindings, might magically solve some of the memory usage in the naive implementation.

Thanks for the plug, but BLAS is pretty irrelevant here. Any decent array library will support a fold operation to give you the result with a constant amount of memory.

Also, in python, couldn’t you just use a for loop?

could use a python for loop, but it’s slower than having numpy do the array operation, since numpy’s loop runs in C :)

i guess figuring out how to push array operation into low-level libraries is a step simpler than matrix operations in low-level libraries…

I thought you were after a windowed variance.

If you are ever interested, you can compute the windowed max-min in constant time per element, irrespective of the window size! This is not entirely trivial, by the way. See my code:

http://code.google.com/p/maxminfilters/
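The standard trick behind window-size-independent max filters is a monotonic deque. This Python sketch shows the idea for the max side (it is not the linked C++ code, just an illustration of the technique):

```python
from collections import deque

def windowed_max(xs, w):
    """Sliding-window maximum in O(n) total work.

    Keep a deque of indices whose values are strictly decreasing;
    the front of the deque is always the current window's maximum.
    """
    dq = deque()
    out = []
    for i, x in enumerate(xs):
        # drop elements that can never be a window max again
        while dq and xs[dq[-1]] <= x:
            dq.pop()
        dq.append(i)
        # drop the front index once it falls out of the window
        if dq[0] <= i - w:
            dq.popleft()
        if i >= w - 1:
            out.append(xs[dq[0]])
    return out
```

Each index enters and leaves the deque at most once, so the amortized cost per element is constant regardless of `w`; the min filter is the same with the comparison flipped.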

Pretty cool, thanks!

Is windowed variance even harder?

Thanks for posting this code! It was super-informative and a great shortcut to implementing standard deviation & variance in javascript with streams – https://github.com/tmcw/stream-statistics