# Calculating running variance in Python and C++

It’s fairly obvious that an average can be calculated online, but interestingly, there’s also a way to calculate a running variance and standard deviation. Read all about it here.

I’m playing around with the Netflix Prize data of 100 million movie ratings, and a huge problem is figuring out how to load and calculate everything in memory. I’m having success with NumPy, the numeric library for Python, because it compactly stores arrays with C/Fortran binary layouts. For example, 100 million 32-bit floats = 100M * 4 = 400MB of memory, which is manageable. And it’s much easier to play around interactively in ipython/matplotlib rather than write C++ for everything.

Unfortunately, the simple ways to calculate variance on an array of that size create wasteful intermediate data structures as long as the original array.

>>> mean( (x-mean(x)) ** 2 )            # two intermediate structures
>>> tmp=x-mean(x); tmp**=2; mean(tmp)   # one intermediate structure


That’s an extra 400 or 800 megs of memory being thrown around. (And if x was an array of integers, the x-mean(x) step implicitly converts to 64-bit doubles which, well, doubles things again!)

So, following John Cook’s explanation, I wrote running_stat, a C++ and Python implementation of running variance. It takes almost no memory and is faster than NumPy’s native variance function. Demo:

In [1]: from numpy import *
In [2]: x = arange(1e8)                      # python RSIZE = 774 MB

In [3]: timeit -n1 -r5 std(x)                # RSIZE goes as high as 2.2 GB
1 loops, best of 5: 4.01 s per loop

In [4]: import running_stat
In [5]: timeit -n1 -r5 running_stat.std(x)   # RSIZE = 774 MB the whole time
1 loops, best of 5: 1.66 s per loop


The C++ implementation is very simple and can be ripped out of running_stat.cc.

I wonder if Haskell’s laziness, perhaps along with my friend Patrick’s Haskell BLAS bindings, might magically solve some of the memory usage in the naive implementation.

This entry was posted in Uncategorized. Bookmark the permalink.

### 23 Responses to Calculating running variance in Python and C++

1. Patrick says:

Thanks for the plug but BLAS is pretty irrelevant here. Any decent array library will support a fold opration to give you the result with a constant amount of memory.

Also, in python, couldn’t you just use a for loop?

2. brendano says:

could use a python for loop, but it’s slower than having numpy do an array operation because it’s in C :)

i guess figuring out how to push array operation into low-level libraries is a step simpler than matrix operations in low-level libraries…

3. I thought you were after an windowed variance.

If you are ever interested, you can compute the windowed max-min in constant time, irrespective of the window size! This is not entirely trivial, by the way. See my code:

4. brendano says:

Pretty cool, thanks!

Is windowed variance even harder?

5. Thanks for posting this code! It was super-informative and a great shortcut to implementing standard deviation & variation in javascript with streams – https://github.com/tmcw/stream-statistics

6. Coloring books are normally utilized by children,
though coloring books for adults will also be
available. The etiquette of business may be the set of written and unwritten rules
of conduct that will make social interactions run more smoothly.
Keep your vision and ears open for virtually any new site polices regarding this issue.

7. Asking questions are actually pleasant thing if
you are not understanding anything entirely, however this post gives good understanding even.

8. Hey there! This is kind of off topic but I need some help from an established blog.
Is it hard to set up your own blog? I’m not very techincal but I can
figure things out pretty quick. I’m thinking about setting up my own but I’m not
sure where to start. Do you have any tips or suggestions?
Thanks

my webpage – tanki online cheats

9. Quality posts is the key to interest the people
to pay a quick visit the website, that’s what this website
is providing.

10. I think the admin of this web site is really working hard in favor of his website,
as here every information is quality based data.

11. Greetings from Carolina! I’m bored at work so I decided to check out your blog on my iphone during lunch break.

I love the information you provide here and can’t wait to take a look when I get home.

I’m surprised at how quick your blog loaded on my cell phone ..
I’m not even using WIFI, just 3G .. Anyhow, great blog!

12. It seems that there are some differences from one manufacturer car 3d
games to another. The used car salesmen will use. The very used car you are interested in any model you can put doctor on the car since the dealership
purchased it?

13. You are in reality a fantastic webmaster. The website loading pace
is incredible. It appears that you’re doing any special trick.
Additionally, the contents are a masterpiece. you have performed a wonderful task on this subject!

my web site :: baking soda acne mask

14. warez mp3 says:

It makes it the first choice of the users and also the most wanted and cozy P2P torrent engine ever produced in the world.
Sometimes your ISP may limit your bandwidth per month. Agora is the first film to get
shot entirely on Malta.

15. Pretty! This has been an incredibly wonderful post. Thank you for supplying this information.

My homepage :: best robotic pool cleaner

16. It’s in fact very complicated in this busy life to listen news on
Television, thus I simply use web for that purpose, and take the hottest
news.

my website … inversion table reviews

17. Great goods from you, man. I have understand your stuff previous
to and you are just extremely great. I actually like what you’ve acquired here,
certainly like what you are saying and the way in which you say
it. You make it enjoyable and you still take care
of to keep it wise. I can not wait to read far more from you.
This is really a terrific website.

My weblog: best masticating juicer

18. There is definately a lot to learn about this issue. I really like all of the points

Feel free to visit my homepage … best skate shoes

19. Wow that was unusual. I just wrote an incredibly long comment but after I clicked submit my comment didn’t appear.
Grrrr… well I’m not writing all that over again.

Anyhow, just wanted to say superb blog!

My website … chainsaw reviews

20. A glass splash back for kitchen, bathrooms and wall paneling.
A splash back is a surface that lines the wall behind your cooker and hob to protect
the splash 7 a celebration of light surface from spills and splashes.
Now there is another innovative use glass splash backs
are available in every size to fit any type of kitchens and bathrooms.

21. Its like you read my mind! You appear to know so much
I think that you could do with some pics to
drive the message home a bit, but instead of that, this is excellent blog.
A fantastic read. I’ll certainly be back.

22. wiersze says:

What’s up to every one, it’s truly a good for me to go to see this web page, it consists of important Information.

23. Customers can travel up and down Stop & Shop’s aisles without getting up
off of their sofas. Freelance – By – U’s Forum – sign
up for the forum ASAP for the best and more recent freebies
and opportunities here:. Consider shopping at specific places or
checking for coupon matching before going out, since something like Staples coupons for office supplies may only be good for someone
living near a Staples if a trip isn’t carefully planned to
avoid extra gas and time expense.