Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis. I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information. I’m posting the table here in hopes of useful comments.
|Name||Advantages||Disadvantages||Open source?||Typical users|
|R||Library support; visualization||Steep learning curve||Yes||Finance; Statistics|
|Matlab||Elegant matrix support; visualization||Expensive; incomplete statistics support||No||Engineering|
|SciPy/NumPy/Matplotlib||Python (general-purpose programming language)||Immature||Yes||Engineering|
|Excel||Easy; visual; flexible||Breaks down on large datasets||No||Business|
|SAS||Handles large datasets||Expensive; outdated programming language||No||Business; Government|
|Stata||Easy statistical analysis|| ||No||Science|
|SPSS||Like Stata but more expensive and worse|| ||No||Science|
[7/09 update: tweaks incorporating some of the excellent comments below, esp. for SAS, SPSS, and Stata.]
There’s a bunch more to be said for every cell. Among other things:
- Two big divisions on the table: The more programming-oriented solutions are R, Matlab, and Python. More analytic solutions are Excel, SAS, Stata, and SPSS.
- Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along. Why does matplotlib come with “pylab” which is supposed to be a unified namespace for everything? Isn’t scipy supposed to do that? Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg)? And then there’s package compatibility version hell. You can use SAGE or Enthought but neither is standard (yet). In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
- Matlab’s language is certainly weak. It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries. Python is clearly better on most counts. R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library. Everyone says SAS is very bad.
- Matlab is the best for developing new mathematical algorithms. Very popular in machine learning.
- I’ve never used the Matlab Statistical Toolbox. I’m wondering, how good is it compared to R?
- Here’s an interesting reddit thread on SAS/Stata vs R.
- SPSS and Stata in the same category: they seem to have a similar role so we threw them together. Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses. I personally haven’t used either…
- SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS. My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines. (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.) Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
- Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd. I know dozens of people under 30 doing statistical stuff and only one knows SAS. At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school. Many hands went up. Then he asked if SAS was even offered as an option. All hands went down. There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
- But: is there ANY package besides SAS that can do analysis for datasets that don’t fit into memory? That is, ones that mostly have to stay on disk? And exactly how good are SAS’s capabilities here anyway?
- If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI) but it’s an open question what the standard distributed data analysis framework will be. (Hive? Pig? Or quite possibly something else.)
- (This was an interesting point at the R meetup. Porzak was talking about how going to MySQL gets around R’s in-memory limitations. But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now. It’s just a whole different ballgame with that large a dataset.)
- SAS people complain about poor graphing capabilities.
- R vs. Matlab visualization support is controversial. One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs. Matlab’s interactive plots are super nice though. Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
- Excel has a far, far larger user base than any of these other options. That’s important to know. I think it’s underrated by computer scientist sort of people. But it does massively break down at >10k or certainly >100k rows.
- Another option: Fortran and C/C++. They are super fast and memory efficient, but they are tricky and error-prone to code, you have to spend lots of time mucking around with I/O, and they have zero visualization and data management support. Most of the packages listed above run Fortran numeric libraries for the heavy lifting.
- Another option: Mathematica. I get the impression it’s more for theoretical math, not data analysis. Can anyone prove me wrong?
- Another option: the pre-baked data mining packages. The open-source ones I know of are Weka and Orange. I hear there are zillions of commercial ones too. Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design. (Here; the article that inspired this rant.)
- I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community. I’d love more information on this for all these options.
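On the question above about datasets that mostly have to stay on disk, and Porzak's point about pushing the data into a database: here is a rough sketch of what that pattern can look like from Python. It is only an illustration under some assumptions. The standard-library `sqlite3` module stands in for MySQL, the `measurements` table and the `streaming_mean_variance` function are made-up names, and the toy data lives in an in-memory database purely so the example runs on its own. The idea is that the rows are streamed from the database one at a time and the statistics are updated incrementally (Welford's single-pass method), so the full dataset never has to sit in memory. It says nothing about how SAS handles this internally.

```python
import sqlite3


def streaming_mean_variance(rows):
    """Single-pass (Welford) mean and sample variance over an iterable of
    1-tuples, so the values never need to be held in memory all at once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for (value,) in rows:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


# Hypothetical stand-in data. A real table would live in a file or on a
# database server; ":memory:" is used here only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")
conn.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    ((float(i),) for i in range(1, 100001)),
)

# Iterating over the cursor fetches rows incrementally instead of
# materializing the whole result set as a list.
cursor = conn.execute("SELECT value FROM measurements")
count, mean, variance = streaming_mean_variance(cursor)
print(count, mean, variance)
conn.close()
```

This only covers statistics that can be updated one row at a time. Anything that needs the whole matrix at once (most model fitting, for instance) takes more work, which is presumably where the real differences between packages show up.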
What do people think?
8/12 update: Serbo-Croatian translation.