Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis.  I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information.  I’m posting the table here in hopes of useful comments.

Name | Advantages | Disadvantages | Open source? | Typical users
R | Library support; visualization | Steep learning curve | Yes | Finance; Statistics
Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering
SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering
Excel | Easy; visual; flexible | Large datasets | No | Business
SAS | Large datasets | Expensive; outdated programming language | No | Business; Government
Stata | Easy statistical analysis | | No | Science
SPSS | Like Stata but more expensive and worse | | |

[7/09 update: tweaks incorporating some of the excellent comments below, esp. for SAS, SPSS, and Stata.]

There’s a bunch more to be said for every cell.  Among other things:

  • Two big divisions on the table: The more programming-oriented solutions are R, Matlab, and Python. More analytic solutions are Excel, SAS, Stata, and SPSS.
  • Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along.  Why does matplotlib come with “pylab” which is supposed to be a unified namespace for everything?  Isn’t scipy supposed to do that?  Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg)?  And then there’s package compatibility version hell.  You can use SAGE or Enthought but neither is standard (yet).  In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
  • Matlab’s language is certainly weak.  It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries.  Python is clearly better on most counts.  R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library.  Everyone says SAS is very bad.
  • Matlab is the best for developing new mathematical algorithms.  Very popular in machine learning.
  • I’ve never used the Matlab Statistical Toolbox.  I’m wondering, how good is it compared to R?
  • Here’s an interesting reddit thread on SAS/Stata vs R.
  • SPSS and Stata in the same category: they seem to have a similar role so we threw them together.  Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses.  I personally haven’t used either…
  • SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS.  My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines.  (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.)  Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
  • Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd.  I know dozens of people under 30 doing statistical stuff and only one knows SAS.  At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school.  Many hands went up.  Then he asked if SAS was even offered as an option.  All hands went down.  There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
  • But: is there ANY package besides SAS that can do analysis for datasets that don’t fit into memory?  That is, ones that mostly have to stay on disk?  And exactly how good are SAS’s capabilities here anyway?
  • If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI) but it’s an open question what the standard distributed data analysis framework will be.  (Hive? Pig?  Or quite possibly something else.)
  • (This was an interesting point at the R meetup.  Porzak was talking about how going to MySQL gets around R’s in-memory limitations.  But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now.  It’s just a whole different ballgame with that large a dataset.)
  • SAS people complain about poor graphing capabilities.
  • R vs. Matlab visualization support is controversial.  One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs.  Matlab’s interactive plots are super nice though.  Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
  • Excel has a far, far larger user base than any of these other options.  That’s important to know.  I think it’s underrated by computer scientist sort of people.  But it does massively break down at >10k or certainly >100k rows.
  • Another option: Fortran and C/C++.  They are super fast and memory efficient, but tricky and error-prone to code; you have to spend lots of time mucking around with I/O, and they have zero visualization and data management support.  Most of the packages listed above run Fortran numeric libraries for the heavy lifting.
  • Another option: Mathematica.  I get the impression it’s more for theoretical math, not data analysis.  Can anyone prove me wrong?
  • Another option: the pre-baked data mining packages.  The open-source ones I know of are Weka and Orange.  I hear there are zillions of commercial ones too.  Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design.  (Here; the article that inspired this rant.)
  • I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community.  I’d love more information on this for all these options.
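The MySQL workaround mentioned in the list above amounts to letting a database do the grouping and summarizing, so that only a small result table ever reaches the analysis environment. Here is a minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names, and sample rows are made up for illustration.

```python
import sqlite3

def build_demo_db(path=":memory:"):
    """Create a tiny demo table; a real dataset would live in a file on disk."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)",
        [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
    )
    conn.commit()
    return conn

def group_means(conn):
    """Return {group: mean(value)}; the aggregation runs inside the database."""
    rows = conn.execute(
        "SELECT grp, AVG(value) FROM measurements GROUP BY grp ORDER BY grp"
    )
    return {grp: mean for grp, mean in rows}

conn = build_demo_db()
means = group_means(conn)
```

With a file-backed database the raw rows stay on disk and only one row per group comes back. This does nothing for the multi-machine case discussed above.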

What do people think?


8/12 update: Serbo-Croatian translation.


179 Responses to Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

  1. Eric Sun says:

    >>I know dozens of people under 30 doing statistical stuff and only one knows SAS.

    I’m assuming the “one” is me, so I’ll just say a few points:
    I’m taking John Chambers’s R class at Stanford this quarter, so I’m slowly and steadily becoming an R convert.
    That said, I don’t think anything besides SAS can do well with datasets that don’t fit in memory. We used SAS in litigation consulting because we frequently had datasets in the 1-20 GB range (i.e. can fit easily on one hard disk but difficult to work with in R/Stata where you have to load it all in at once) and almost never larger than 20GB. In this relatively narrow context, it makes a lot of sense to use SAS: it’s very efficient and easy to get summary statistics, look at a few observations here and there, and do lots of different kinds of analyses. I recall a Cournot Equilibrium-finding simulation that we wrote using the SAS macro language, which would be quite difficult in R, I think. I don’t have quantitative stats on SAS’s capabilities, but I would certainly not think twice about importing a 20 GB file into SAS and working with it in the same way as I would a 20 MB file.

    That said, if you have really huge internet-scale data that won’t fit on one hard drive, then SAS won’t be too useful either. I’ll be very interested if this R + Hadoop system ever becomes mature: http://www.stat.purdue.edu/~sguha/rhipe/

    In my work at Facebook, Python + RPy2 is a good solution for large datasets that don’t need to be loaded into memory all at once (for example, analyzing one facebook network at a time). If you have multiple machines, these computations can be sped up using IPython’s parallel computing facilities.
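Working through data a piece at a time, rather than loading it all at once, can be sketched in plain Python. The example below uses Welford's online algorithm to get a mean and standard deviation in one pass with constant memory. It is only an illustration of the general idea, not a description of how SAS or RPy2 work internally, and the column name and sample values are invented.

```python
import csv
import io
import math

def streaming_mean_std(rows, column):
    """One pass over an iterable of dict rows, keeping O(1) state (Welford)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for row in rows:
        x = float(row[column])
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two rows")
    return n, mean, math.sqrt(m2 / (n - 1))  # sample standard deviation

# Demo on an in-memory "file"; csv.DictReader reads a real file lazily
# in the same way, one row at a time.
demo = io.StringIO("amount\n2\n4\n4\n4\n5\n5\n7\n9\n")
n, mean, std = streaming_mean_std(csv.DictReader(demo), "amount")
```

Because only three numbers are kept between rows, the same function works whether the file is 20 MB or 20 GB; the cost is that anything needing random access to the rows has to be rethought.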

    Also, R’s graphical capabilities continue to surprise me; you can actually do a lot of advanced stuff. I don’t do much graphics, but perhaps check out “R Graphics” by Murrell or Deepayan Sarkar’s book on Lattice Graphics.

  2. Eric Sun says:

    I thought that most people consider SAS to have the steepest learning curve, certainly steeper than R. But maybe I’m mistaken about that.

  3. Justin says:

    Calling scipy immature sounds somehow “wrong”. The issues you come up with are more of early design flaws that will not go away, no matter how “mature” scipy is getting.

    That said, these are flaws, but they seem pretty minor to me.

  4. I’ve recently seen GNU DAP mentioned as an open-source equivalent to SAS. Know if it’s any good?

  5. TS Waterman says:

    Have you considered Octave in this regard? It’s a GNU-licensed Matlab clone. Very nice graphing capability, Matlab syntax and library functions, open source.

    http://www.gnu.org/software/octave/FAQ.html#MATLAB-compatibility

  6. brendano says:

    @Eric – oops, yeah should’ve put SAS as hardest. Good point that the standard of judging how good large dataset support is, is whether you can manipulate a big dataset the same way you manipulate a small dataset. I’ve loaded 1-2 GB of data into R and you definitely have to do things differently (e.g. never use by()).

    @Justin – scipy certainly seems like it keeps improving. I just keep comparing it to matlab and it’s constantly behind. I remember once watching someone try to make a 3d plot. He spent quite a while going through various half-baked python solutions that didn’t work. Then he booted up matlab and had one in less than a minute. Matlab’s functionality is well-designed, well-put-together and well-documented.

    @Edward – I have seen it mentioned too. From glancing at its home page, it seems like a pretty small-time project.

  7. brendano says:

    @TS – yeah, i used octave just once for something simple. it worked fine. my issues were: first, i’m not impressed with gnuplot graphing. second, the interactive environment isn’t too great. third, trying to clone the matlab language seems crazy since it’s kind of crappy. i think i’d usually pick scipy over octave if being free is a requirement, else go with matlab if i have access to it.

    otoh it looks like it supports some nice things like sparse matrices that i’ve had a hard time with lately in R and scipy. i guess worth another look at some point…

  8. Brendan,

    Nice overview, I think another dimension you don’t mention — but which Bo Cowgill alluded to at our R panel talk — is performance. Matlab is typically stronger in this vein, but R has made significant progress with more recent versions. Some benchmark results can be found at:

    http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php

    MD

  9. Mike says:

    In high energy particle physics, ROOT is the package of choice. It’s distributed by CERN, but it’s open source, and is multi-platform (though the Linux flavor is best supported). It does solve some of the problems you mentioned, like running over large datasets that can’t be entirely memory-resident. The syntax is C++ based, and has both an interpreter and the ability to compile/execute scripts from the command line.

    There are lots of reasons to prefer other packages (like R) over ROOT for certain tasks, but in the end there’s little that can be done with other packages that one cannot do with ROOT.

    • A friend of mine wrote a program in C++ with ROOT; he wanted to map energy levels and produce nice diagrams (Semi-automatic level scheme solver for nuclear spectroscopy is the title of his thesis). Anyway, I’ve never heard anyone swear so much over a package. And the stuff that went wrong was just plain weird. I’m sure you can do anything you want with it, but having a degree from Miskatonic University seems to be a prerequisite to keep your sanity intact.

  10. This is obviously oversimplified – but that is the point of a succinct comparison. I would add that you are missing a lot of disadvantages for Excel – it has incomplete statistics support and an outdated “language” :)

    Python actually really shines above the others for handling large datasets using memmap files or a distributed computing approach. R obviously has a stronger statistics user base and more complete libraries in that area – along with better “out-of-the-box” visualizations. Also, some of the benefits overlap – using numpy/scipy you get that same elegant matrix support / syntax that matlab has, basically slicing arrays and wrapping lapack.

    The advantages of having a real programming language and all the additional non-statistical libraries & frameworks available to you make Python the language of choice for me. If there is something scipy is weak at that I need, I’ll also use R in a pinch or move down to C. I think you are basically operating at a disadvantage if you are using the other packages at this point. The only other reason I can see to use them is if you have no choice, for example if you inherited a ton of legacy code within your organization.
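The memmap approach mentioned in this comment can be illustrated with the standard library alone. This is a rough sketch using the mmap module rather than numpy's memmap (which is presumably what the comment has in mind). The file layout (raw native float64 values) and the helper names are assumptions for the example.

```python
import array
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("d")  # bytes per native float64

def write_doubles(path, values):
    """Write values to disk as raw native-endian float64."""
    with open(path, "wb") as f:
        array.array("d", (float(v) for v in values)).tofile(f)

def mean_of_slice(path, start, stop):
    """Mean of elements [start, stop), read through a memory map.

    Only the bytes backing the requested slice are copied into Python;
    the rest of the file is never read into a Python object.
    """
    if stop <= start:
        raise ValueError("empty slice")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[start * ITEM:stop * ITEM]
    values = struct.unpack(f"{len(chunk) // ITEM}d", chunk)
    return sum(values) / len(values)

path = os.path.join(tempfile.mkdtemp(), "values.bin")
write_doubles(path, range(1000))
slice_mean = mean_of_slice(path, 100, 200)
```

The operating system pages data in on demand, so the file can be much larger than RAM as long as each slice you touch is not.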

  11. John says:

    I’m sure you’ve stirred up a lot of controversy. Thanks for calling ‘em like you see ‘em.

    As for Mathematica, I haven’t used it for statistics beyond some basic support for common distributions. But one thing it does well is very consistent syntax. I used it when it first came out, then didn’t use it for years, and then started using it again. When I came back to it, I was able to pick it up right where I left off. I can’t put R down for a week and remember the syntax. Mathematica may not do everything, but what it does do, it does elegantly.

  12. jessy says:

    it would be awesome to have an informal, hands on tutorial comparison of several of these languages (looking at ease, performance, features, etc.). maybe a meetup at something like super happy dev house, or even something separate. just a thought!

  13. brendano says:

    @Michael Driscoll – good point! I was afraid to make performance claims since I’ve heard that Matlab is getting faster, they have a JIT or a nice compiler or something now, and I haven’t used it too much recently. (That benchmark page doesn’t even say which matlab version was used, though I emailed the guy…) I’m also suspicious of performance comparisons since I’d expect much of it to be very dependent on the matrix library and there are several LAPACKs out there (ATLAS and others) and many compiletime parameters to fiddle with. I think I read something claiming many binary builds of R don’t use the best LAPACK they could. I’m not totally sure of this though. But if it’s true that Matlab knows how to vectorize for-loops, that’s really impressive.

    @Mike – ah yes, i remember looking at ROOT a long time ago and thinking it was impressive. But then I forgot about it because all the cs/stats people whose stuff I usually read don’t know about it. I think it just goes to show you that the data analysis tools problem is tackled so differently by different groups of people, it’s very easy to miss out on better options just due to lack of information!

    @Pete – yeah I whine about python. but I seem to use numpy plenty still :) actually its freeness is a huge win over matlab for cluster environments since you don’t have to pay for a zillion licenses…

    Hm I seem to be talking myself into thinking it’s down to R vs Python vs Matlab. then the rosetta stone http://mathesaurus.sourceforge.net/matlab-python-xref.pdf should be my guide…

    @John – very interesting. I think many R users have had the experience of quickly forgetting how to do basic things.

  14. brendano says:

    From David Knowles, who did the comparison Mike Driscoll linked to (http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php):

    > Nice comparison. I would add to the pros of R/Python that the data
    > structures are much richer than Matlab. The big pro of Matlab still
    > seems to be performance (and maybe the GUI for some people). On top of
    > being expensive Matlab is a nightmare if you want to run a program on
    > lots of nodes because you need a license for every node!
    >
    > It’s 2008b I did the comparison with – I should mention that!

  15. From Rob Slaza’s statistics toolbox tutorials, it *seems* like using MATLAB for stats is reasonably simple…

  16. Gaurav says:

    On top of being expensive Matlab is a nightmare if you want to run a program on lots of nodes because you need a license for every node!

    @Brendan:

    Re David Knowles’ comment…

    There are specialized parallel/distributed computing tools available from MathWorks for writing large-scale applications (for clusters, grid etc.). You should check out: http://www.mathworks.com/products/parallel-computing.

    Running full-fledged desktop MATLAB on a huge number of nodes is messy and of course very expensive not to mention that a single user would take away several licenses for which other users will have to wait.

    Disclosure: I work for the parallel computing team at The MathWorks

  17. brendano says:

    Another guy from Mathworks, their head of Matlab product management Scott Hirsch, contacted me about the language issue and was very kind and helpful in clarifying things. The most interesting bits below.

    On Tue, Feb 24, 2009 at 7:20 AM, Scott Hirsch wrote:
    >> Brendan –
    >>
    >> Thanks for the interesting discussion you got rolling on several popular
    >> data analysis packages
    [...]
    >> I’m always very interested to hear the perspectives of MATLAB users, and
    >> appreciate your comments about what you like and what you don’t like. I was
    >> interested in following up on this comment:
    >>
    >> “Matlab’s language is certainly weak. It sometimes doesn’t seem to be
    >> much more than a scripting language wrapping the matrix libraries. “
    >>
    >> I have my own assumptions about what you might mean, but I’d be very
    >> interested in hearing your perspectives here. I would greatly appreciate it
    >> if you could share your thoughts on this subject.
    >
    > sure. most of my experiences are with matlab 6. just briefly,
    >
    > * leave out semicolon => print the expression. that is insane.
    > * each function has to be defined in its own file
    > * no optional arguments
    > * no named arguments
    > * no way to group variables together in a structure. (i don’t need object
    > orientation, just a bunch of named items)
    > * no perl/python-style hashes
    > * no object orientation (or just a message dispatch system) … less
    > important
    > * poor/no support for text
    > * or other things a general purpose language knows how to do (sql, networks,
    > etc etc)

    On Tue, Feb 24, 2009 at 11:27 AM, Scott Hirsch wrote:
    > Thanks, Brendan. This is very helpful. Some of the things have been
    > addressed, but not all. Here are some quick notes on where we are today.
    > Just to be clear – I have no intention (or interest) in changing your
    > perspectives, just figured I could let you know in case you were curious.
    >
    >
    >
    > > * leave out semicolon => print the expression. that is insane.
    > No plans to change this. Our solution is a bit indirect, but doesn’t break
    > the behavior that lots of users have come to expect. We have a code
    > analysis tool (M-Lint) that will point out missing semi-colons, either while
    > you are editing a file, or in a batch process for all files in a directory.
    >
    > > * each function has to be defined in its own file
    > You can include multiple functions in a file, but it introduces unique
    > semantics – primarily that the scope of these functions is limited to within
    > the file.

    [[ addendum from me: yeah, exactly. if you want to make functions that are
    shared in different pieces of your code, you usually have to do 1 function per
    file. ]]

    > > * no optional arguments
    > Nothing yet.
    >
    > > * no named arguments
    > Nope.
    >
    > > * no way to group variables together in a structure. (i don’t need object
    > orientation, just a bunch of named items)
    > We’ve had structures since MATLAB 5.

    [[ addendum from me: well, structures aren't very conventional in standard
    matlab style, or at least certainly not the standard library. most algorithm
    functions return a tuple of variables, instead of packaging things together
    into a structure. ]]

    > > * no perl/python-style hashes
    > We just added a Map container last year.
    >
    > > * no object orientation (or just a message dispatch system) … less
    > important
    > We had very weak OO capabilities in MATLAB 6, but introduced a modern system
    > in R2008a.
    >
    > > * poor/no support for text
    > This has gotten a bit better, primarily through the introduction of regular
    > expressions, but can still be awkward.
    >
    > > * or other things a general purpose language knows how to do (sql, networks,
    > etc etc)
    > Not much here, other than a smattering (Database Toolbox for SQL,
    > miscellaneous commands for web interaction, WSDL, …)
    >
    > Thanks again. I really do appreciate getting your perspective. It’s
    > helpful for me to understand how MATLAB is perceived.
    >
    > -scott
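For contrast with the wish list in this exchange, here is roughly what optional arguments, named arguments, a lightweight structure, and a hash look like in Python. It is only an illustrative sketch; the function name, field names, and data are invented for the example.

```python
from collections import namedtuple

# A lightweight "bunch of named items"; no object orientation required.
FitResult = namedtuple("FitResult", ["slope", "intercept", "n"])

def fit_line(xs, ys, through_origin=False):
    """Least-squares line; through_origin is optional and can be passed by name."""
    n = len(xs)
    if through_origin:
        slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return FitResult(slope, 0.0, n)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return FitResult(slope, my - slope * mx, n)

result = fit_line([0, 1, 2, 3], [1, 3, 5, 7])                  # default used
forced = fit_line([1, 2, 3], [2, 4, 6], through_origin=True)   # named argument

# A perl/python-style hash, here counting words.
counts = {}
for word in "the quick the lazy the".split():
    counts[word] = counts.get(word, 0) + 1
```

Returning one named structure instead of a tuple of loose variables is the convention the addendum above finds missing from standard Matlab style.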

  18. brendano says:

    @Gaurav – it sure would be nice if i could see how much this parallel toolbox costs without having to register for a login!

  19. There is another good numpy/matlab comparison here:

    http://www.scipy.org/NumPy_for_Matlab_Users

    As of the last year, a standard ipython install ( “easy_install IPython[kernel]” ) now includes parallel computing right out of the box, no licenses required:

    http://ipython.scipy.org/doc/rel-0.9.1/html/parallel/index.html

    If this is going to turn into a performance shootout, then I’ll add that from what I’ve seen Python with numpy/scipy outperforms Matlab for vectorized code.

    My impression has been that performance order is Numpy > Matlab > R, but as my friend Mike Salib used to say – “All benchmarks are lies”. Anyway, competition is good and discussions like this keep everyone thinking about how to improve their platforms.

    Also, keep in mind that performance is often a sticking point for people when it need not be. One of the things I’ve found with dynamically typed languages is that ease of use often trumps raw performance – and you can always move the intensive stuff down to a lower level.

    For people who like poking at numbers:

    http://www.scipy.org/PerformancePython
    http://www.mail-archive.com/numpy-discussion@scipy.org/msg14685.html
    http://www.mail-archive.com/numpy-discussion@scipy.org/msg01282.html

    Sturla has some strong points here:
    http://www.mail-archive.com/numpy-discussion@scipy.org/msg14697.html
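For anyone who wants to poke at numbers themselves, here is a small sketch of a micro-benchmark using Python's standard timeit module. It compares an explicit interpreter-level loop with a version that pushes the loop into built-ins implemented in C, which is a miniature of the "move the intensive stuff down to a lower level" idea. The function names and problem size are arbitrary, and in the spirit of "all benchmarks are lies", the timings will vary by machine and Python version.

```python
import math
import operator
import timeit

def loop_sum_squares(values):
    """Sum of squares with an explicit Python-level loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def mapped_sum_squares(values):
    """Same computation, but sum() and map() run their loops in C."""
    return sum(map(operator.mul, values, values))

def time_call(func, values, repeat=5, number=20):
    """Best-of-repeat wall time in seconds for `number` calls of func(values)."""
    return min(timeit.repeat(lambda: func(values), repeat=repeat, number=number))

data = [float(i) for i in range(10_000)]
timings = {
    f.__name__: time_call(f, data)
    for f in (loop_sum_squares, mapped_sum_squares)
}
```

Taking the minimum over several repeats reduces noise from other processes, but it still measures one workload on one machine. The two functions can also differ in the last few bits of the result, since the order and method of floating-point summation is not guaranteed to be identical.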

  20. thrope says:

    @brendano – I think it might be a case of “if you have to ask you can’t afford it” :)

  21. devicerandom says:

    What about Origin (and Linux/Unix open source clones like Qtiplot)? I know a lot of people using them, and they allow fast, easy statistical analysis with beautiful graphs out of the box. Qtiplot is quite immature but it is Python-scriptable, which is a definite plus for me. I don’t know about Origin.

  22. Stefan says:

    Hi. I think this is a very incomplete comparison. If you want to make a real comparison, it should be more complete than this wiki article. And to give a bit of personal feedback:
    I know 2 people using STATA (social science), 2 people using Excel (philosophy and economics), several using LabView (engineers), some using R (statistical science, astronomy), several using S-Lang (astronomy), several using Python (astronomy) and by using Python, I mean that they are using the packages they need, which might be numpy, scipy, matplotlib, mayavi2, pymc, kapteyn, pyfits, pytables and many more. And this is the main advantage of using a real language for data analysis: you can choose among the many solutions the one that fits you best. I also know several people who use IDL and ROOT (astronomy and physics).
    I have used IDL, ROOT, PDL, (Excel if you really want to count that in) and Python and I like Python best :-)
    @brendano: One other note: I think that you really have to distinguish between data analysis and data visualization. In astronomy this is often handled by completely different software. The key here is to support standardized file storage/ exchange formats. In your example the people used scipy which does not offer a single visualization routine, so you can not blame scipy for difficulties with 3D plots…

  23. david says:

    I am a core scipy/numpy developer, and I don’t think calling them immature from a user POV is totally unfair. Every time someone tries numpy/scipy/matplotlib and cannot plot something simple in a couple of minutes, that is a failure on our side. I can only say that we are improving – projects like pythonxy or enthought are really helpful too for people who want something more integrated.

    There is no denying that if you are into an integrated solution, numpy/scipy is not the best solution of the ones mentioned today – it may well be the worst (I don’t know them all, but I am very familiar with matlab, and somewhat familiar with R). There is a fundamental problem for all those integrated solutions: once you hit their limitations, you can’t go beyond that. Not being able to handle data which do not fit in memory in matlab, that’s a pretty fundamental issue, for example. Not having basic data structures (hashmap, tree, etc…) is another one. Making advanced UI in matlab is not easy either.

    You can build your own solution with the python stack: the numpy array capabilities are far beyond matlab’s, for example (broadcasting and advanced indexing are much more powerful than matlab’s current capabilities). The C API is complete, and you can do things which are simply not possible with matlab. You want to handle very big datasets? pytables gives you a database-like API on top of hdf5. Things like cython are also very powerful for people who need speed. I believe those are partially consequences of not being integrated.

    Concerning the flaws you mentioned (scipy.linalg vs numpy.linalg, etc…): those are mostly legacies, or exist because removing them would be too costly. There are some efforts to remove redundancy, but not all of them will disappear. They are confusing for a newcomer (they were for me), but they are pretty minor IMHO, compared to other problems.

  24. bill says:

    You forgot support and continuity. In my experience, SAS offers very good support and continuity. Others claim SPSS does, too (I have no experience there). In a commercial environment, the programs need to outlive the analyst and the whims of the academic/grad student support/development. For one-off disposable projects, R has lots of advantages. For commercial systems, not so many.

  25. Lou Pecora says:

    I’ve looked at several of the “packages” mentioned here (R, Octave, MATLAB, C, C++, Fortran, Mathematica). I’m a physicist who is often working in new fields where understanding the phenomena is the main goal. This means my colleagues and I are often developing new numerical/theoretical/data-analysis approaches. For anyone in this situation I unequivocally recommend:

    Python.

    Why? Because given my situation there often are no canned routines. That means sooner or later (usually sooner) I will be programming. Of all the languages and packages I’ve used Python has no equal. It is object oriented, has very forgiving run-time behavior, fast turn around (no edit, compile, debug cycles; just edit and run cycles), great built in structures, good modularity, and very good libraries. And, it’s easy to learn. I want to spend my time getting results, not programming, but I have to go through code development since often nothing like what I want to do exists and I’ve got to link the numerics to I/O and maybe some interactive things that make it easy to use and run smoothly. I’ve taken on projects that I would not want to attempt in any of the packages/languages I’ve listed.

    I agree that Python is not wart-free. The version compatibility can sometimes be frustrating. “One-stop shopping” for a complete Python package is not here, yet (although Enthought is making good progress). It will never be as fast as MATLAB for certain things (JIT compiling, etc. makes MATLAB faster at times). Python plotting is certainly not up to Mathematica standards (although it is good).

    However, the Python community is very nice and very responsive. Python now has several easy ways to add extensions written in C or C++ for faster numerics. And for all my desire not to spend time coding, I must admit I find Python programming fun to do. I cannot say that for anything else I’ve used.

  26. There is good reason for the duplication of “linalg” in SciPy. SciPy’s brand has more features which probably aren’t of as much use to as wide an audience, and (perhaps more importantly) one of the requirements for NumPy is that it not depend critically on a Fortran compiler. SciPy relaxes this requirement, and thus can leverage a lot of existing Fortran code. At least that’s my understanding.

  27. These packages change and it’s easy to get locked-in ideas from the past. I haven’t used Matlab since the 1990s, but the last time I used it, its I/O and singular value decomposition was so slow that we switched to S-Plus just to finish in our lifetimes.

    Can any of these packages compute sparse SVDs like folks have used for Netflix (500K x 25K matrix with 100M partial entries)? Or do regressions with millions of items and hundreds of thousands of coefficients? I typically wind up writing my own code to do this kind of thing in LingPipe, as do lots of other folks (e.g. Langford et al.’s Vowpal Wabbit, Bottou et al.’s SGD, Madigan et al.’s BMR).
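Hand-rolled code of the kind described here typically streams over examples and updates only the coefficients that a sparse example actually touches. Below is a minimal sketch of stochastic gradient descent for least-squares regression with dict-based sparse features. It is not the algorithm used by LingPipe, Vowpal Wabbit, or any of the other packages named above, and the learning rate, epoch count, and toy data are arbitrary assumptions.

```python
import random
from collections import defaultdict

def sgd_regress(examples, epochs=200, learning_rate=0.05, seed=0):
    """Fit y ~ sum_j w[j] * x[j] by SGD on squared error.

    `examples` is a list of (features, y) pairs, where `features` is a dict
    mapping feature name -> value; absent features are implicitly zero.
    """
    rng = random.Random(seed)
    weights = defaultdict(float)
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            features, y = examples[i]
            prediction = sum(weights[j] * v for j, v in features.items())
            error = prediction - y
            # Only coefficients present in this example are updated.
            for j, v in features.items():
                weights[j] -= learning_rate * error * v
    return dict(weights)

# Noise-free toy data generated from a=2, b=-1, c=0.5.
toy = [
    ({"a": 1.0}, 2.0),
    ({"b": 1.0}, -1.0),
    ({"a": 1.0, "b": 1.0}, 1.0),
    ({"a": 2.0, "c": 1.0}, 4.5),
    ({"c": 1.0}, 0.5),
]
weights = sgd_regress(toy)
```

The per-example cost depends only on the number of nonzero features, which is what makes this style of code workable with very large numbers of coefficients. A real implementation would also need regularization, a learning-rate schedule, and streaming input.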

    What’s killing me now is scaling Gibbs samplers. BUGS is even worse than R in terms of scaling, but I can write my own custom samplers that fly in some cases and easily scale. I think we’ll see more packages like Daume’s HBC for this kind of thing.
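As a concrete picture of what a custom sampler involves, here is a minimal Gibbs sampler for a toy target: a standard bivariate normal with correlation rho, where each coordinate is drawn in turn from its conditional distribution given the other. It has nothing to do with the internals of BUGS or HBC, and the target, burn-in, and sample count are arbitrary choices for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Draw (x, y) pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of x | y and of y | x
    x = y = 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # draw x given the current y
        y = rng.gauss(rho * x, cond_sd)  # draw y given the new x
        if step >= burn_in:
            samples.append((x, y))
    return samples

def correlation(samples):
    """Empirical Pearson correlation of a list of (x, y) pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples)
    sxx = sum((x - mx) ** 2 for x, _ in samples)
    syy = sum((y - my) ** 2 for _, y in samples)
    return sxy / math.sqrt(sxx * syy)

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

Successive draws are correlated, so 20,000 samples carry less information than 20,000 independent ones. That slow mixing is one reason scaling these samplers is hard.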

    R itself tends to just wrap the real computing in layers of scripts to massage data and do error checking. The real code is often Fortran, but more typically C. That must be the same for SciPy given how relatively inefficient Python is at numerical computing. It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.

    A problem I see with the way R and BUGS work is that they typically try to compile a declarative model (e.g. a regression equation in R’s glm package or a model specification in BUGS), rather than giving you control over the basic functionality (optimization or sampling).

    The other thing to consider with these things from a commercial perspective is licensing. R may be open source, but its GNU license means we can’t really deploy any commercial software on top of it. SciPy has a mixed bag of licenses that is also not redistribution friendly. I don’t know what licensing/redistribution looks like for the other packages.

    @bill Support and continuity (by which I assume you mean stability of interfaces and functionality) is great in the core R and BUGS. The problem’s in all the user-contributed packages. Even there, the big ones like lmer are quite stable.

    As for the rather large speed gains made by recent MATLAB releases that Lou noted, I believe this is due in most part to their switch to the Intel Math Kernel Library in place of a well-tuned ATLAS (I’m not completely sure if that’s what they used before, but it’s a good bet). This hung a good number of people with PowerPC G5s out to dry rather quickly as newer MATLABs apparently only run on Intel Macs (probably so they don’t have to maintain two separate BLAS backends).

    Accelerated linear algebra routines written by people who know the processors inside and out will result in big wins, obviously. You can also license the MKL separately and use it to compile NumPy (if I recall correctly, David Cournapeau who commented above was largely responsible for this capability, so bravo!). I figure it’s only a matter of time before somebody like Enthought latches onto the idea of selling a Python environment with MKL baked in, so you can get the speedups without the hassle.

  29. Stefan says:

    @ben The SciPy team was also unhappy about the licensing issue, so you’ll be glad to hear that SciPy 0.7 was released under a single, BSD license.

    You said “It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.” We are currently working on ways to expose the mathematical functions underlying NumPy to C, so that you can access them in your extension code. During the last Google Summer of Code, the Cython team implemented a friendly interface between Cython and NumPy. This means that you can code your algorithms in Python, but still have the speed benefits of C.

    A number of posts above refer to plotting in 3D. I can recommend Enthought’s Mayavi2, which makes interactive data visualisation a pleasure:

    http://code.enthought.com/projects/mayavi/

    We are always glad for suggestions on how to improve SciPy, so if you do try it out, please join the mailing list and tell us more about your experience.

  30. Stewart says:

    You should probably add GenStat to your list; this is a UK package specialising in the biosciences. It’s a relative heavyweight in stats, having come from Rothamsted Research (home of Fisher, Yates and Nelder). Nelder was the actual originator of GenStat. GenStat is also free for teaching worldwide and free for research in the developing world. Its popularity is mainly within Europe, Africa and Oceania, which is why many US researchers may not have heard of it. I hope this helps.

  31. brendano says:

    Wow, this is the funnest language flamewar I’ve seen.

    I will note that no one defended SAS. Maybe those people don’t read blogs.

  32. bill says:

    brendano,
    Hmm, I thought I did. I do production work in SAS and mess around (test new stuff, experimental analyses) in R.
    Bill

  33. brendano says:

    Oops. Yes yes. My bad!

    OK: no one has defended Stata!

  34. John Dudley says:

    My company has been using StatSoft’s Statistica for years and it does all of the things that you found to be shortcomings of SAS, SPSS and Matlab…

    It’s fast, the graphs are great and there are virtually no limitations. I’m surprised it wasn’t listed as one of the packages reviewed. We have been using it for years and it is absolutely critical to our business model.

  35. Andy Malner says:

    StatSoft is the only major package with R integration…The best of both worlds.

  36. Abhijit says:

    In stats there seems to be the S-Plus/R schools and the SAS schools. SAS people find R obtuse with poor documentation, and the R people say the same about SAS (myself included). R wins in graphics and flexibility and customizability (though I certainly won’t argue with a SAS pro who can whip up macros). SAS seems a bit better with large data sets. R is ever expanding, and has improved greatly for simulations/looping and memory management. Recently for large datasets (bioinformatic, not the 5-10G financial ones), I’ve used a combination of Python and R to great effect, and am very pleased with the workflow. I think rpy2 is a great addition to Python and works quite well. For some graphs I actually prefer matplotlib to R.

    I’m also a big fan of Stata for more introductory level stuff as well as for epidemiology-related stuff. It is developing a programming language that seems useful. One real disadvantage in my book is its ability to hold only one dataset at a time, as well as a limit on the data size.

    I’ve also used Matlab for a few years. Its statistics toolbox is quite good, and Matlab is pretty fast and has great graphics. It’s limited in terms of regression modeling to some degree, as well as survival methods. Syntactically I find R more intuitive for modeling (though that is the lineage I grew up with). The other major disadvantage of Matlab is distribution of programs, since Matlab is expensive. The same complaint applies to SAS, as well :)

  37. Pingback: Comparing statistical packages: R, SAS, SPSS, etc. — The Endeavour

  38. John Johnson says:

    I’ll sing the same song here as I do elsewhere on this topic.

    In large-scale production, SAS is second to none. Of course, large-scale production shops usually have the $$$ to fork over, and SAS’s workflow capabilities (and, to a lesser extent, large dataset handling capabilities) save enough billable hours to justify the cost. However, for graphics, exploratory data analysis, and analysis beyond the well-established routines, you have to venture into the world of SAS/IML, which is a rather painful place to be. Its PRNGs are also stuck in the last century, top of the line of a class obsolete for anything other than teaching.

    R is great for simulation, exploratory data analysis, and graphics. (I disagree with the assertion that R can’t do high-quality graphics, and, like some commenters above, recommend Paul Murrell’s book on the topic.) Its language, while arcane, is powerful enough to write outside-the-box analyses. For example, I was able to quickly write, debug, and validate an unconventional ROC analysis based on a paper I read. As another example, bootstrapping analyses are much easier in R than SAS.
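
    To make the bootstrapping remark concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a mean, written in Python (the other scripting language discussed in this thread) rather than R or SAS. The `bootstrap_ci` helper, the sample values, and the resample count are all illustrative assumptions, not anything taken from the comment above.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up sample, for illustration only.
sample = [4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 5.1, 4.4, 6.2]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

    Passing a different `stat` (for example `statistics.median`) reuses the same resampling loop, which is the sort of flexibility the comment is pointing at.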

    In short, I keep both SAS and R around, and use both frequently.

    I can’t comment too much on Python. MATLAB (or Octave or Scilab) is great for roll-your-own statistical analyses as well, though I can’t see using it for, e.g., a conventional linear models analysis unless I wanted the experience. R’s matrix capabilities are enough for me at this point. I used Mathematica some time ago for some chaos theory and Fourier/wavelet analysis of images and it performed perfectly well. If I could afford to shell out the money for a non-educational license, I would just to have it around for the tasks it does really well, like symbolic manipulation.

    I used SPSS a long time ago, and have no interest in trying it again.

  39. Jon Peck says:

    SPSS has for several years been offering smooth integration with both Python and R. There are extensive APIs for both. Check out the possibilities at http://www.spss.com/devcentral. See also my blog at insideout.spss.com.

    You can even easily build SPSS Statistics dialog boxes and syntax for R and Python programs. DevCentral has a collection of tools to facilitate this.

    This integration is free with SPSS Base.

  40. Pingback: A lot of Stuff « Blog Pra falar de coisas

  41. Sean says:

    I used Matlab, R, stata, spss and SAS over the years.

    To me, the only reason for using SAS is its ability to handle large data. Otherwise, it is a very, very bad program. From day one, it trains its users to be third-rate programmers.
    The learning curve for SAS is actually very steep, particularly for a very logical person. Why? The whole syntax in SAS is pretty illogical and inconsistent.
    Sometimes it is ‘/out’, sometimes it is ‘output’.

    In 9.2, SAS started to make variables inside a macro local variables by default.
    This is ridiculous!! The SAS company has existed for at least 30 years. How can such a basic programming rule be implemented only after 30 years?!

    Also, if a variable is uninitialized, SAS will still let the code run. At one company I worked for, this simple, stupid SAS design flaw cost our project 3 weeks of delay (there was one uninitialized variable among 80k lines of log, all blue). A couple of PhDs on the project who used C and Matlab could not believe that SAS would make such a stupid mistake. Yes, to their great disbelief, it did!

    My ranking is that Matlab and R are about the same; Matlab is better at plots most times, and R is better at manipulating datasets. Stata and SAS are at the same level.
    After taking cost into account, the answer is more obvious.

  42. bill r says:

    SAS was not designed by a language maven, like Pascal. It grew from its PL/1 and Fortran roots. It is a collection of working tools, added to meet the demands of working statisticians and IT folk, that has grown since its start in the late ’60s and early ’70s. SAS clearly has cruft that shows its growth over time. Sort of like the UNIX tools, S, and R, actually.

    And, really, what competent programmer would ever use a variable without initializing or testing it first? That’s a basic programming rule I learned back in the mid ’60s, after branching off of uninitialized registers, and popping empty stacks.

    Bah, you kids. Get off of my lawn!

  43. tom p says:

    i work for a retail company that deploys SAS for their large datasets and complex analysis. just about everything else is done in excel.

    we had a demo of omniture’s discover onpremise (formerly visual sciences), and the visualization tools are fairly amazing. it seems like an interesting solution for trending real time evolving data, but we aren’t pulling the trigger on it now.

  44. draegtun says:

    For reference PDL (Perl Data Language) can be found at pdl.perl.org/ and is also available via CPAN

    /I3az/

  45. draegtun says:

    oops.. link screwed up… here goes again ;-)

    pdl.perl.org

  46. Giles says:

    Have you seen Resolver One? It’s a spreadsheet like Excel, but has built-in Python support, and allows cells in the grid to hold objects. This means that numpy mostly works, and you can have one cell in the grid hold a complete dataset, then manipulate that dataset in bulk using spreadsheet-like formulae. Someone has also just built an extension that allows you to connect it to R, too. In theory, this means that you can get the best of all three — spreadsheet, numpy, and R — in your model, using the right tool for each job.

    On the other hand, the integration with both numpy and R is quite new, so it’s immature as a stats tool compared to the other packages in this list.

    Full transparency: I work for Resolver Systems, so obviously I’m biased towards it :-) Still, we’re very keen on feedback, and we’re happy to give out free copies for non-commercial research and for open source projects.

  47. Being the resident MATLAB enthusiast in a house built on another tool, I will pitch in my two cents, by suggesting another spectrum along which these tools lie: “canned procedures” versus “roll your own”. Use of general-purpose programming languages, such as has been suggested in the comments for Fortran or C/C++ clearly anchor one end of this dimension, whereas the statistical software sporting canned routines lie all the way at the other. A tool like MATLAB, which provides some but not complete direct statistical support, is somewhere in the middle. The trade-off here, naturally, is the ability to customize analysis vs. convenience.

  48. Jude Ryan says:

    Most of the users on this post are biased towards packages like R, rather than packages like SAS, and I want to offer my perspective of the relative advantages and disadvantages of SAS relative to R.

    I am primarily a SAS user (over 20 years) who has been using R as needed (a few years) to do things that SAS cannot do (like MARS splines), or cannot do as well (like exploratory data analysis and graphics), or requires expensive SAS products like Enterprise Miner to do (like decision trees, neural networks, etc).

    I have worked primarily for financial services (credit card) companies. SAS is the primary statistical analysis tool in these companies partly due to history (S, the precursor to S+ and R, was not yet developed) and partly because it can run on mainframes (another legacy system) accessing huge amounts of data stored on tapes, which I am not sure any other statistical package can. Furthermore, businesses that have the $ will be the last to embrace open source software like R, as they generally require quick support when they get stuck trying to solve a business problem, and researching the problem in a language like R is generally not an option in a business setting.

    Also, SAS’ capabilities for handling large volumes of data are unmatched. I have read huge compressed files of online data (Double Click), having over 2 billion records, using SAS, to filter the data and keep only the records I needed. Each of the resulting SAS datasets was anywhere from 35 GB to 60 GB in size. As far as I know, no other statistical tool can process such large volumes of data programmatically. First we had to be able to read in the data and understand it. Sampling the data for modeling purposes came later. I would run the SAS program overnight, and it would generally take anywhere from 6 to 12 hours to complete, depending on the load on the server. In theory, any statistical software that works with records one at a time should be able to process such large volumes of data, and maybe the Python-based tools can do this. I do not know as I have never used them. But I do know that R, and even tools like WEKA, cannot process such volumes of data. Reading the data from a database, using R, can mitigate the large data problems encountered in R (as does using packages like biglm), but SAS is the clear leader in handling large volumes of data.
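
    The "one record at a time" idea in the paragraph above can be sketched in plain Python using only the standard library. This is an illustration under stated assumptions, not a claim about how SAS works or how it would perform at the 2-billion-record scale: the `filter_records` helper, the file names, and the column names are all hypothetical.

```python
import csv
import gzip
import os
import tempfile


def filter_records(src_path, dst_path, keep):
    """Stream a gzipped CSV row by row, writing only rows where keep(row) is true."""
    kept = 0
    with gzip.open(src_path, "rt", newline="") as src, \
         gzip.open(dst_path, "wt", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:  # only the current record is held in memory
            if keep(row):
                writer.writerow(row)
                kept += 1
    return kept


# Tiny made-up input so the sketch runs end to end.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "clicks.csv.gz")
dst = os.path.join(workdir, "kept.csv.gz")
with gzip.open(src, "wt", newline="") as f:
    f.write("user,campaign,clicks\n")
    f.write("a,spring,3\nb,summer,0\nc,spring,7\n")

n_kept = filter_records(src, dst, lambda r: r["campaign"] == "spring")
print(n_kept)
```

    Because the loop never materializes the whole file, memory use stays flat regardless of input size; run time is a separate question that this sketch does not address.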

    R on the other hand is better suited for academics and research, as cutting edge methodologies can be and are implemented much more rapidly in R than in SAS, as R’s programming language has more elegant support for vectors and matrices than SAS (proc IML). R’s programming language is much more elegant and logically consistent, while SAS’ programming language(s) are more ad hoc with non-standard programming constructs. Furthermore, people who prefer R generally have a stronger “theoretical” programming background (most have programmed in C, Perl, or object-oriented languages) or are able to pick up programming faster, while most users who feel comfortable with SAS have less of a programming background and can tolerate many of SAS’ non-standard programming constructs and inconsistencies. These people do not need a comprehensive programming language to accomplish their tasks, and it takes much less effort to program in base SAS than in R if one has no “theoretical” programming background. SAS macros take more time to learn and many programming languages have no equivalent (one exception I know of is C’s pre-processor commands). But languages like R do not need anything like SAS macros and can achieve the same results all in one, logically consistent, programming language, and do more, like enabling R users to write their own functions. The SAS equivalent of writing a function in R is to program a new proc in C and know how to integrate it with SAS: an extremely steep learning curve. SAS is more of a suite of products, many of them with inconsistent programming constructs (base SAS is totally different from SCL – formerly Screen Control Language but now SAS Component Language), and proc SQL and proc IML are different from data step programming.

    So while SAS has a shallow learning curve initially (learn only base SAS), the user can only accomplish tasks of “limited” sophistication with SAS, without resorting to proc IML (which is quite ugly). For the business world this is generally adequate. R, on the other hand, has a steeper learning curve initially, but tasks of much greater sophistication can be handled more easily in R than in SAS, once R’s steeper learning curve is behind you.

    I foresee an increased use of R relative to SAS over time, as many statistical departments at universities have started teaching R (sometimes replacing SAS with R) and students graduating from these universities will be more conversant with R, or equally conversant with both SAS and R. Many of these students entering the workforce will gravitate towards R, and to the extent the companies they work for do not mandate which statistical software to use, the use of R is bound to increase over time. With memory becoming cheaper, and Microsoft-based 64-bit operating systems becoming more prevalent, bigger data sets can be stored in RAM, and R’s limitations in handling large volumes of data are starting to matter less. But the amount of data is also starting to grow, thanks to the internet, scanners (used in grocery chains), etc., and the volume of data may very well grow so rapidly that even cheaper RAM and 64-bit operating systems may not be able to cope with the data deluge. But not every organization works with such large datasets.

    For someone who has started their career using SAS, SAS is more than adequate to solve all problems faced in the business world, and there may seem to be no real reason, or even justification, to learn packages like R or other statistical tools. To learn R, I have put in much personal time and effort, and I do like R. I have been using it, and foresee using it more frequently over time, for exploratory data analysis, in areas where I want to implement cutting edge methodologies, and where I am not hampered by large data issues. Personally, both SAS and R will always be part of my “tool kit” and I will leverage the strengths of both. For those who do not currently use R, it would be wise to start doing so, as R is going to be more widely used over time. The number of R users has already reached critical mass, and since R is free, this is bound to increase the usage of R as the R community grows. Furthermore, the R Help Digest, and the incredibly talented R users that support it, is an invaluable aid to anyone interested in learning R.

    • robert king says:

      Mr. Ryan,

      Your commentary in the blog brenacon.com/2009/comparison of data
      analysis packages has been very interesting because I am in a
      perplexing situation.
      I developed an algorithm using Excel. Some values are 16 to 20
      decimal places; Excel calculates using 15.

      An equation generating the correct value in cell E9 will give
      an incorrect value in any cell entered thereafter.
      I am searching for a solution (C++/C, Fortran,??).

      My programming experience is limited: Fortran WATFIV, Pascal.
      I considered hiring a programmer from a local college or university.
      Your suggestion would be appreciated. Thank you.
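
      One possible direction for the precision question above, sketched in Python with the standard-library `decimal` module. This is only an illustration: the 20-decimal-place value and the precision of 30 digits are made-up assumptions, and the original algorithm is not shown in the comment, so nothing here reproduces it.

```python
from decimal import Decimal, getcontext

getcontext().prec = 30  # significant digits to carry; 30 is an arbitrary choice

# A value with 20 decimal places (made up for illustration).
# It is passed as a string so that no digits are lost on the way in.
x = Decimal("0.12345678901234567890")

# Converting to a binary double, the kind of number a spreadsheet cell
# holds, cannot preserve all 20 digits.
as_float = float(x)

# Decimal arithmetic keeps every digit, up to the configured precision.
total = x + Decimal("1")
print(total)  # 1.12345678901234567890
print(Decimal(as_float) == x)  # False: the double is only an approximation
```

      The design choice is to trade speed for exactness: `Decimal` arithmetic is slower than hardware floating point, but the number of digits carried is set explicitly rather than fixed at roughly 15.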

  49. Pingback: Dailycious 14.03.09 « cendres.net

  50. Y-H Chen says:

    Interesting. I don’t think I would have put SPSS and Stata in the same category. I haven’t spent a tremendous amount of time working with SPSS, but I have spent a fair amount of time with Stata, and my biased perspective is that Stata is more sophisticated and powerful than SPSS. Certainly, Stata’s language isn’t as powerful as R’s, but I definitely wouldn’t say it’s “weak.” Stata’s not my favorite statistical program in the world (that would, of course, be R), but there are definitely things I like about it; it’s a definite second to R in my book.

    By the way, here’s my (unfair) generalization regarding usage:
    – R: academic statisticians
    – SAS: statisticians and data-y people in non-academic settings, plus health scientists in academic and non-academic settings
    – SPSS: social scientists
    – Stata: health scientists

  51. Pingback: Walking Randomly » R Compared to MATLAB (or ‘learning a thing or two from your students’)

  52. xin says:

    Sean:
    I am a junior SAS user with only 3 years’ experience. But even I know that you need to press ‘ctrl’ and ‘F’ to search for ‘uninitialized’ and ‘more than’ in the SAS log to ensure everything is OK.
    As far as the couple of C++ PhDs in your group are concerned, they need to learn to play by the rules of whatever system they are using……

  53. xin says:

    by the way, I found the comments left by SAS people more tolerant and open-minded (maybe they are older, lol). Instead, the majority of ‘R’ers on this thread act like a bunch of rebellious teens…..

  54. Joe says:

    I am a big fan of Stata over SAS for medium and small businesses. SAS is the Mercedes-Benz of stats, I’ll admit, for Govt and Big business. I use Stata a LOT for economics; it has all the most-used predictive methods (OLS, MLE, GLS, 2SLS, binary choice, etc.) built in. I think a model would have to be pretty esoteric not to be found in Stata.

    I ran Stata on a Linux server with 16GB RAM and about 2TB of disk storage. The hardware config was about $12K. I would not recommend using virtual memory for Stata. That said, you can stick a lot of data in 16GB RAM! If I pay attention to the variable sizes (keep textual ones out), I can get 100s of millions of rows into memory.
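
    The "100s of millions of rows" figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below is in Python, and the row layout (four 4-byte floats plus two 2-byte integers, no text) is purely a hypothetical example, not the commenter's actual dataset.

```python
import struct

# Hypothetical row layout: four 4-byte floats and two 2-byte integers,
# with no text columns. "<" asks struct for a packed layout without padding.
row_format = "<4f2h"
bytes_per_row = struct.calcsize(row_format)  # 4*4 + 2*2 = 20 bytes

ram_bytes = 16 * 1024**3  # 16 GB of RAM, ignoring OS and program overhead
max_rows = ram_bytes // bytes_per_row
print(bytes_per_row, max_rows)  # 20 858993459
```

    Under those assumptions the ceiling is roughly 860 million rows; adding even one 50-byte text column would cut that to about 245 million, which is why keeping textual variables out matters so much.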

    Stata supports scripting (*.do files), and the scripts are very easy to use, as is the GUI. The GUI is probably the best feature.

    The hardware ($12,000) + software ($3,000 for a 2-user license) cost $15,000. The equivalent SAS software was about $100,000. You do the math.

    I’ve used SPSS, but that was a while ago. At that time I felt Stata was the superior product.

  55. brendano says:

    Finally a direct Stata vs SAS comparison! Very interesting. Thanks for posting. I can’t believe SAS = $100,000.

    > I ran Stata on linux server with 16GB ram and about 2TB of disk storage.
    > I would not recommend using virtual memory for Stata.

    In my experience, virtual memory is *always* a bad idea. I remember working with ops guys who would consider a server as good as dead once it started using swap.

    All programs that effectively use hard disks always have custom code to control when to move data on and off the disk. Disk seeks and reads are just too slow and cumbersome compared to RAM to have the OS try to automatically handle it.

    This would be my guess why SAS handles on-disk data so well – they put a lot of engineering work into supporting that feature. Same for SQL databases, data warehouses, and inverted text indexes. (Or the widespread popularity of Memcached among web engineers.) R, Matlab, Stata and the rest were originally written for in-memory data and still work pretty much only in that setting.

  56. brendano says:

    And also, on the RAM vs hard disk issue — according to Jude Ryan’s very interesting comment above, SAS has a heritage of working with datasets on *tape* drives. Tape, of course, is even further along the size-vs-latency spectrum than RAM or hard disk. Now hard disk sizes are rapidly growing but seek times are not catching up, so people like to say “hard disk is the new tape” — therefore, if your software was originally designed for tape, it may do best! :)

  57. brendano says:

    Here’s an overly detailed comparison of Stata, SAS, and SPSS. Basically no coverage of R beyond the complaint that it’s too hard.
    http://www.ats.ucla.edu/stat/technicalreports/

    There’s also an interesting reply from Patrick Burns, defending R and comparing it to those 3.
    http://www.ats.ucla.edu/stat/technicalreports/Number1/R_relative_statpack.pdf

    (Found linked from a comment on John D. Cook’s blog here:
    http://www.johndcook.com/blog/2009/05/01/r-the-good-parts/ )

  58. Jaime says:

    I feel so old. Been using SAS for many years. But what the hell is this R ?????? That’s what the kids are using now?

  59. Gye Greene says:

    Great comparison of SPSS, SAS, and Stata by Acock (a summary of his findings here — http://www.ocair.org/files/KnowledgeBase/willard/StatPgmEvalAb.pdf)

    Below is a summary of the summary — !!! — with my own observations added on.

    SAS: Scripting language is awkward, but it’s great for manipulating complex data structures; folks that analyze relational DBs (e.g. govt. folks) tend to use it.

    SPSS: Great for the “weekend warriors”; strongly GUI-based; has a scripting language, but it’s in-elegant. They charge a license for **each** “module” (e.g. correlations? linear regressions? Poisson regressions? A separate fee!). Also, charge an annual license. Can read Excel files directly. Used to have nicer graphs and charts than Stata (but, see below).

    Stata: Elegant, short-’n’-punchy scripting language; CLI and script-oriented, but also allows GUI. Strong user base, with user-written add-ons available for D/L. **Excellent** tech support! The most recent version (Stata 10) now has some pretty powerful chart/graph editing options (GUI, plus CLI, your choice) that make it competitive with the SPSS graphs. (Minor annoyance: every few versions, they make the data format NOT back-compatible with the previous version: you have to remember to “Save As” last year’s version, or else what you save at work won’t open at home…)

    My background: Took a course on SAS, but haven’t had a reason to use it. I’ve used SPSS and Stata both, on a reasonably regular basis: I currently teach “Intro to Methods” courses with SPSS, but use Stata for my own work. I dislike how SPSS handles missing values. Unlike SPSS, Stata sells a one-time license: once you buy a version, it’s yours to keep until you feel it’s too obsolete to use.

    –GG

  60. Gye Greene says:

    This may be an unfair generalization, but my personal observation is that SPSS users (within the social sciences, at least) tend to have less quantitative training than Stata users. Probably highly correlated with the GUI vs. CLI orientations of the two packages (although each of them allows for both).

    Another way of differentiating between various statistical software packages is by their Geek Cred. I usually tell my Intro to Research Methods (for the social sciences) students that…

    (On a scale of 0-10…)

    R, Matlab, etc. = 9

    SAS = 7

    Stata = 5

    SPSS = 3

    Excel = 2

    YMMV. :)

    COMMENT ON EXCEL: It’s a spreadsheet, first and foremost — so it doesn’t treat rows (cases) as “locked together”, like statistical software does. Thus, when you highlight a column and ask it to sort, it sorts **only** that column. I got burned by this once, back in my first year of grad school, T.A.-ing: sorted HW #1 scores (out of curiosity), and didn’t notice that the rest of the scores had stayed put. Oops.

    I now keep my gradebooks in Stata. :)

    –GG
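
    The Excel sorting pitfall described above comes down to whether rows are treated as units. A minimal Python sketch of the difference, using a made-up three-student gradebook (the names and scores are illustrative assumptions):

```python
# Made-up gradebook: each tuple is one student's row (name, HW1, HW2).
gradebook = [
    ("Ana", 72, 88),
    ("Ben", 95, 61),
    ("Cy", 84, 90),
]

# Sorting one column by itself detaches the scores from their students,
# which is the spreadsheet mistake described in the comment.
hw1_only = sorted(row[1] for row in gradebook)

# Sorting whole rows by a key keeps each case "locked together".
by_hw1 = sorted(gradebook, key=lambda row: row[1])
print(by_hw1)  # [('Ana', 72, 88), ('Cy', 84, 90), ('Ben', 95, 61)]
```

    Statistical packages behave like the second form by default, because the row (case) is their basic unit of data.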

  61. Chuck Moore says:

    I began programming in SAS every day at a financial exchange in 1995. SAS has three main benefits over all other Statistical/Data Analysis packages, as far as I know.

    1) Data size = truly unlimited. I learned to span 6 DASD (Direct Access Storage Devices) = disk drives on the mainframe for when I was processing > 100 million records = quotes and trading activity from all exchanges. When we went to Unix, we used 100 GB worth of temp “WORK” space, and were processing > 1 billion transactions a day in < 1 hour (IBM p630 with 4x 1.45 GHz processors and 32 GB of memory, only the processing actually used < 4 GB).

    2) Tons and tons of preprogrammed statistical functions with just about every option possible.

    3) SAS can read data from almost anything: tapes, disk, etc.; fixed-field flat files, delimited text files (any delimiters, not just comma or tab or space), XML, most any database, all mainframe data file types. It also translates most any text value into data, and supports custom input and output formats.

    SAS is difficult for most real programmers (I took my first programming class in 1977, and have programmed in more languages than I care to share) because it has a data centric perspective as opposed to machine/control centric. It is meant to simplify the processing of large amounts of data for non-programmers.

    SAS used to have incredible documentation and support, at incredibly reasonable prices. Unfortunately, the new generation of programmers and product managers have lost their way, and I agree that SAS has been becoming a beast.

    For adhoc work, I immediately fell in love with SAS/EG = Enterprise Guide. Unfortunately, EG is written in .net and is not that well written. I would have preferred it being written in Java so that the interface was more portable and supported a better threading model. Oh well.

    One of the better features of SAS is that it is not an interpreted programming language, but from the start in 197? it was JIT. Basically, a block of code is read, compiled, and then executed. This is why it is so efficient at processing huge amounts of data. The concept of the “data step” does allow for some built-in inefficiencies from the standpoint of multiple passes through the data, but that is because of SAS’s convenience. A C programmer would have done more things, in fewer passes, but the C programmer would have spent many more hours writing the program than SAS’s few minutes to do the same thing. I know this because I’ve done it.

    Some place I read a complaint about SAS holding only one observation in memory at a time. That is a gross misunderstanding/mistake. SAS holds one or more blocks of observations (records) in memory at a time. The number held is easily configurable. Each observation can be randomly accessed, whether in memory or not.

    SAS 9.2 finally fixes one of the bigger complaints, with PROC FCMP allowing the creation of custom functions. Originally SAS did not support custom functions; SAS wanted to write them for you.

    The most unfortunate thing about SAS currently is that it has such a long legacy on uniprocessor machines that it is having difficulty getting going in the SMP world, being able to properly take advantage of multi-threading and multi-processing. I believe this is due to lack of proper technical vision and leadership. As such, I believe a Java language HPC derivative and tools will eventually take over, providing superior ease of use, visualization, portability, and processing speed on today’s servers and clusters. Since most data will come from an RDBMS these days, flat file input won’t carry enough weight.

    But, for my current profession = Capacity Planning for computer systems, you still can’t beat SAS + Excel. On the other hand, it looks like I’m going to have to look into R.

  62. Chuck Moore says:

    On a side note. As a “real” programmer, having been an expert in Pascal and C and having programmed in, oh I don’t want to list them all, but I have also done more than just take classes in Java. Anyway, Macros have a place in programming. There have been a few times I wished Java supported macros and not just assertions, out of my own laziness. I am a firm believer in the right tool for the job, and that not everything is a nail, so I need more than just a hammer. The unfortunate thing is that macros can be abused, just like goto’s and programming labels and global variables.

    To me, SAS is/was the greatest data processing language/system on the planet. But, I still also program in Java, C, ksh, VBScript, Perl, etc. as appropriate. I’d like to see someone do an ARIMA forecast in Excel, or run a regression that does outlier elimination in only 3 lines of code!
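
    For readers wondering what "a regression that does outlier elimination" involves outside SAS, here is one simple reading of it as a Python sketch: fit a line, drop points whose residual exceeds k residual standard deviations, then refit once. The helper names, the cutoff `k=2.0`, and the data are all assumptions made for illustration; this is not the SAS procedure the comment has in mind, and it is certainly more than 3 lines.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def fit_without_outliers(xs, ys, k=2.0):
    """Fit, drop points with |residual| > k residual std devs, then refit once."""
    a, b = fit_line(xs, ys)
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    cutoff = k * statistics.pstdev(resid)
    kept = [(x, y) for x, y, r in zip(xs, ys, resid) if abs(r) <= cutoff]
    kept_xs, kept_ys = zip(*kept)
    return fit_line(kept_xs, kept_ys), len(xs) - len(kept)


# Made-up data: y = 1 + 2x, with one corrupted point at x = 5.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[5] = 40
(intercept, slope), n_dropped = fit_without_outliers(xs, ys)
print(intercept, slope, n_dropped)
```

    With the corrupted point removed, the refit recovers an intercept near 1 and a slope near 2, which is the point of the exercise; a canned routine hides exactly this loop behind an option.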

  63. tom m says:

    If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work.

    One thing you have to consider, is that using SciPy, you get all of the python libraries for free. That includes the Apache Hadoop code, if you choose to use that. And as someone above pointed out, there is now parallel processing built right in in the most recent distributions (but I have no personal knowledge of that) for MPI or whatever.

    Coming from an engineer in industry (not academia), the really neat thing that I like about SciPy is the ease of creating web-based tools (as in, deployed to a web server for others to use) via deployment on an apache installation and mod_python. If you can get other engineers using your analysis, without sending them a excel spreadsheet, or a .m file (for which they need a matlab license), etc. it makes your work much more visible.

  64. sohan says:

    hello everyone…
    i want to know about the comparative study between SAS, R, SPSS in data analysis.
    can anyone provide me the papers related to those.

  65. ed says:

    having used sas, spss, matlab, gauss and r, let me say that describing stata as having a weak programming language is a sign of ignorance.

    it has a very powerful interpreted scripting language which allows one to easily extend stata. there is a very active community and many user written add-ons are available. see: http://ideas.repec.org/s/boc/bocode.html

    stata also has a full fledged matrix programming language called (mata) comparable to matlab with a c-like syntax, which is compiled and therefore very fast.

    managing and preparing data for analysis is a breeze in stata.

    finally stata is easy to learn.

    obviously not many people use stata around here.

    some more biased opinions:

    sas is handy if you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good to manage data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?

    excel sucks obviously infinitely more than sas. apart from its (lack of) statistical capabilities and reliability, any point-and-click only software is an obvious no-no from the point of view of scientific reproducibility

    i don’t care for spss and cannot imagine anyone does.

    matlab is nice, but expensive. not so great for preparing/managing data.

    have not used scipy/numpy myself, but have colleagues who love it. one big advantage is that it uses python (ie good language to master and use)

    r is great, but more difficult to get into. i don’t like the loose syntax too much though. it is also a bitch with big datasets.

  66. Willem says:

    On high quality graphics in R, one should certainly check out the Cairo-package. Many graphics can be output in hip formats like SVG.

  67. Mathias says:

    On the point of Excel breaking down at 10,000+ rows, apparently Excel 2010 will come with Gemini, an add-on developed by the Excel and SQL team, aiming at handling large datasets:
    Project Gemini sneak preview
    I doubt this would make Excel the platform of choice for doing anything fancy with large datasets anyways, but I am intrigued.

  68. Jay Verkuilen says:

    Some reax, as I’ve used most of these at some point:

    SAS has great support for large files even on a modest machine. A few years ago I did a bunch of sims on my dissertation using it and it worked happily away without so much as batting an eyelash on a crappy four year old Windoze XP machine with 1.5 GB of memory. Also, programs like NLP (nonlinear optimization), NLMIXED, MIXED, and GLIMMIX are really great for various mixed model applications; this is quite broad as many common models can be cast in the mixed model framework. NLMIXED in particular lets you write some pretty interesting models that would otherwise require special coding. Documentation in SAS/STAT is really solid and their tech support is great. Graphics suck and I don’t like the various attempts at a GUI.

    I prefer Stata for most “everyday” statistical analysis. Don’t knock that, as it’s pretty common even for a methodologist such as myself to need to fit logistic regression or whatever and not want to have to waste a lot of time on it, which Stata is fantastic for. Stata 11 looks to be even better, as it incorporates procedures such as Multiple Imputation easily. The sheer amount of time spent doing MI followed by logistic regression (or whatever) is irritating. Stata speeds that up. Also when you own Stata you own it all and the upgrade pricing is quite reasonable. Tech support is also solid.

    SPSS has a few gems in its otherwise incomprehensible mass of utter bilge. IMO it’s a company with highly predatory licensing, too.

    R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility. I like it for class because it’s free, there are nice books for it, and it lets me bypass IT as it’s possible to put a working R system on a USB drive. I love the graphics.

    Matlab has made real strides as a programming language and has superb numerics in it (or did), at least according to the numerics people I know (including my numerical analysis professor). However, Statistics Toolbox is iffy in terms of what procedures it supports, though it might have been updated. Graphics are also nice. But it is expensive.

    Mathematica is nice for symbolic calculation. With the MathStatica addon (sadly this has been delayed for an unconscionable amount of time) it’s possible to do quite sophisticated theoretical computations. It’s not a replacement for your theoretical knowledge, but is very helpful for doing all the inaccurate and tedious calculations necessary.

  69. Brett D says:

    I started in Matlab, moved on to R, looked at Octave, and am just getting into SciPy.

    Matlab is good for linear algebra and related multivariate stats. I could never get any nice plotting out of it. It can do plenty of things I never learnt about, but I can’t afford to buy it, so I can’t use it now anyway.

    R is powerful, but can be very awkward. It can write jpeg, png, and pdf files, make 3D plots and nice 2D plots as well. Two things put me off it: it’s an absolute dog to debug (how does “duplicate row names are not allowed” help as an entire error message when I’ve got 1000 lines of code spread between 4 functions?), and its data types have weird eccentricities that make programming difficult (like transposing a data frame turns it into a matrix, and using sapply to loop over something returns a data frame of factors… I hate factors). There are a lot of packages that can do some really nice things, although some have pretty thin documentation (that’s open source for you).

    Octave is nicer to use than R ( = Matlab is nicer to use than R), but I found it lacking in most things I wanted to do, and the development team seem to wait for something to come out in Matlab before they’ll do it themselves, so they’re always one step behind someone else.

    I’m surprised how quickly I’m picking up SciPy. It’s much easier to write, read and debug than R, and the code looks nicer. I haven’t done much plotting yet, but it looks promising. The only trick with Python is its assignments for mutable data types, which I’m still getting my head around.
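
    The "trick" with mutable types mentioned above can be shown in a few lines. This is a small sketch with made-up variable names: assignment binds a second name to the same object rather than copying it, so changes made through one name are visible through the other.

```python
import copy

a = [1, 2, 3]
b = a              # no copy: b is a second name for the same list
b.append(4)
print(a)           # the change made through b is visible through a
print(a is b)      # both names refer to one object

c = list(a)        # an explicit (shallow) copy is independent at the top level
c.append(5)
print(a, c)

# With nested lists, a shallow copy still shares the inner lists.
nested = [[0, 0], [0, 0]]
shallow = copy.copy(nested)
deep = copy.deepcopy(nested)
nested[0][0] = 9
print(shallow[0][0], deep[0][0])   # shallow sees the change, deep does not
```

    The same rule applies to NumPy arrays, where plain assignment also does not copy the data.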

  70. Mike says:

    Mathematica is also able to link to R via a third party add-on distributed by ScienceOps. The numeric capabilities of Mathematica were “ramped” up 6 years ago so should be thought of as more than a symbolic (only) environment. Further info here:

    http://reference.wolfram.com/mathematica/note/SomeNotesOnInternalImplementation.html#28959

    (I work for Wolfram Research)

  71. brendano says:

    R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility.

    Hah!

    Everyone really likes Stata. Interesting.

  72. I use Python/Matlab for most analysis, but Mathematica is really nice for building demos and custom visualization interfaces (and for debugging your formulas)

    For instance, here’s an example of taking some mutual fund data, and visualizing those mutual funds (from 3 different categories) in a Fisher Linear Discriminant transformed space (down to 3 dimensional from initial 57 or so)

    http://yaroslavvb.com/upload/strands/dim-reduce/dim-reduce.html

  73. brendano says:

    Also, a discussion looking for solutions that are both fast to prototype and fast to execute: suitable functional language for scientific/statistical computing

  74. Cristian says:

    I do not understand why SAS is so much hailed here because it handles large datasets. I use Matlab almost exclusively in finance and when I have problems with how large the data sets are then I don’t use SAS but I use mysql server instead. Matlab can talk to mysql server and thus I do not see why SAS is needed in this case.

    • Josh says:

      I am working with hundreds of experiments, and while these are all extremely small, the issue is that Matlab does not ship with the database toolkit. It’s outrageous! It makes it basically impossible to work with it unless you get the toolbox. This is true as well for the Curve Fitting Toolbox. Just FYI.

  75. Mike says:

    I have used Stata and R but for my purposes I actually prefer and use Mathematica. Unsurprisingly nobody has discussed its use so I guess I will.

    I work in ecology and I use Mathematica almost exclusively for modeling. I’ve found that the elegance of the programming language lends itself to easily using it for statistical analysis as well. Although it isn’t really a statistics package, being able to generate large amounts of data and then process them in the same place is extremely useful. To make up for the lack of built in statistical analysis I’ve built my own package over time by collecting and refining the tests I’ve used.
    For most people I would say using Mathematica for statistics is way more work than it is worth. Nevertheless, those who already use it for other things may find it is more than capable of performing almost any data analysis you can come up with using relatively little code. The addition of functionality targeted at statistics in versions 6 and 7 has made this use simpler, although the built in ANOVA package is still awkward and poorly documented. One thing it and Matlab beat other packages at hands down is list/matrix manipulation which can be extremely useful.

  76. Paul Kim says:

    I am using MATLAB along with SPSS. Does anyone know about how to connect SPSS with MATLAB? Or can we use any form of programming (e.g., “for” loops and “if”) in SPSS to connect with MATLAB?
    Thank you.

    Paul

  77. Mattia says:

    I worked at the International Monetary Fund so I thought I’d add the government perspective, which is pretty much the same as the business one. You need software that solves the following equation

    maximize amount of useful output
    such that: salaries of staff * hours worked + cost of software < budget

    It turns out IMF achieves that by letting every economist work with whatever they want. As a matter of fact, economists end up using Stata.

    Consider that most economics datasets are smaller than 1Gb. Stata MultiProcessor will work comfortably with up to 4Gb on the available machines. Stata has everything you need for econometrics, including a matrix language that is just like Matlab and state of the art maximum likelihood optimization, so you can create your own “odd” statistical estimators. Programming has a steeper learning curve than Matlab but once you know the language it’s much more powerful, including very nice text data support and I/O (not quite python, but good enough). If you don’t need some of the fancy add-on packages that engineers use, like say “hydrodynamics simulation”, that’s all you need. But most importantly importing, massaging and cleaning data with Stata is so unbelievably efficient that every time I have to use another program I feel like I am walking knee-deep in mud.

    So why do I have to use other programs, and which?

    IMF has one copy of SAS that we use for big jobs, such as when I had 100Gb of data. I won’t dwell on this because it’s been covered above, but in general SAS is industrial-grade stuff. One big difference between SAS and other programs is that SAS will try to keep working when something goes wrong. If you *need* numbers for the next morning, you go to bed, the next morning you come and Stata has stopped working because of a mistake. SAS hasn’t, and perhaps your numbers are garbage, but if you are able to tell that they are simply 0.00001% off then you are in perfectly good shape to make a decision.

    Occasionally I use Matlab or Gauss (yes, Gauss!) because I need to put the data through some black box written in that language and it would take too long to understand it and rewrite it.

    That’s all folks. Thanks for the attention.

  78. Mattia says:

    No that was not all, I forgot one thing. Stata can map data using a free user-written add-in (spmap), so you can save yourself the time of learning some brainy GIS package. Does anyone know whether R, SAS, SPSS or other programs can do it?

  79. brendano says:

    R has some packages for plotting geo data, including “maps”, “mapdata”, and also some ggplot2 routines. Now I just saw an entire “R-GIS” project, so I’m sure there’s a lot more related stuff for R…

  80. Pingback: مقایسه بسته‌های تحلیل داده (R, Matlab, SciPy, Excel, SAS, SPSS, Stata) « دنیای پیرامون

  81. Tao Wu says:

    Hi, all. I think I should mention a C++ framework-based software package named ROOT. See http://root.cern.ch

    You will see ROOT is definitely better than R.

  82. Tao Wu says:

    As far as I can see, the syntax and grammar of R are really stupid. I cannot imagine that R, S, S+ have been widely used by financial bodies. Furthermore, they are trying to claim they are very professional and very good at financial data analysis. I can predict that if they shift to ROOT (a real language with C++), they will see the power of data analysis.

  83. xin (April 19) writes:
    > the majority of ‘R’ers on this thread act like a bunch of rebellious teens …

    Well spotted — I’ve been a rebellious teen for decades now.

  84. Wei Zhang says:

    People in my work place, an economic research trust, love STATA. Economists love STATA and they ask newcomers to use STATA as well. R is discouraged in my work place with excuses like it being for statisticians. Sigh~~~~

    But!!! I keep using it and keep discovering new ways of using it. Now, I use ‘dmsend’ function from the ‘twitteR’ package to inform me the status of my time-consuming simulations while I am not in office. It is just awesome that using R makes me feel bounded by nothing.

    BTW, does anyone know how to use R to send emails (on various OSes: Win, Mac, Unix, Linux)? I googled a bit and the results were not very promising. Any plans to develop a package?

    If we had the package, we could just hit ‘paste to console’ (RWinEdt) or C-c C-c (ESS+Emacs) and let R estimate, simulate and send results to co-authors automatically. What a beautiful world!!

    I use Matlab and STATA as well but R completely owns me. Being a bad boy naturally, I start to encourage newcomers to use R in my work place.

  85. ynte says:

    I happened to hit this page, and I am impressed by the pros and cons.
    Been using SPSS for over 30 years and I’ve been appreciating the steep increase in usability from punch card syntax to pull down menus. I only ran into R today because it can handle Zero Inflated Poisson Regression and SPSS can’t or won’t.
    I think it is Great to find open source statistical software. I guess it requires a special mental framework to actually enjoy struggling through the command structure, but if I were 25 years younger………
    It really is a bugger to find that SPSS (or whatever they like to be called) and R come up with different parameter estimates on the same dataset [at least in the negative binomial model I compared].
    Is there anyone out there with experience in comparing two or more of these packages on one and the same dataset?

  86. Wei says:

    @ynte
    Why don’t you join the R mailing list? If you ask questions properly there, you will get answers.

    I would suggest a place to start: http://www.r-project.org/mail.html

    Have fun.

  87. peng says:

    hi friends,
    I am new to R. I would like to know R-PLUS. Does anyone know where I can get free training for R-PLUS?

    Regards,
    Peng.

  88. Wayne says:

    I use R.

    I’ve looked at Matlab, but the primitive nature of its language turns my stomach. (I mean, here’s a language that uses alternating strings and values to imitate named parameters? A language where it’s not unusual to have a half page of code in a routine dedicated to filling in parameters based on the number of supplied arguments.) And the Matlab culture seems to favor Perlesque obfuscation of code as a value. Plus it’s expensive. It’s really an engineer’s tool, not a statistician’s tool.

    SAS creeps me out: it was obviously designed for punched cards and it’s an inconsistent mix of 1950s and 1960s languages and batch command systems. I’m sure it’s powerful, and from what I’ve read the other statistics packages actually bend their results to match SAS’s, even when SAS’s results are arguably not good. So it’s the Gold Standard of Statistics ™, literally, but it’s not flexible and won’t be comfortable for someone expecting a well-designed language.

    R’s language has a good design that has aged well. But it’s definitely open source: you have two graphical languages that come in the box (base and lattice), with a third that’s a real contender (ggplot2). Which to choose? There are over 2,000 packages and it takes a bit of analysis just to decide which of the four Wavelet packages you want to use for your project — not just current features, but how well maintained the package appears to be, etc.

    There are really three questions to answer here: 1) What field are you working in, 2) How focused are your needs, and 3) What’s your budget?

    In engineering (and Machine Learning and Computer Vision), 95% of the example code you find in articles, online, and in repositories, will be Matlab. I’ve done two graduate classes using R where Matlab was the “no brainer” choice, but I just can’t stomach Matlab “programming”. Python might’ve been a good choice as well, but with R I got an incredible range of graphics combined with a huge variety of statistical and learning techniques. You can get some of that in Python, but it’s really more of a general-purpose tool where you definitely have to roll your own.

  89. Pingback: Bookmarks for February 12th from 15:49 to 15:54 « Johnny Logic

  90. Jay says:

    Yeah, quite the odd list. If *Py stuff is in there, then PDL definitely should be too.

  91. Pingback: Statistical functions in Excel — The Endeavour

  92. stat_stuff says:

    i like what you wrote to describe spss, clear and concise….nuf said :-)

  93. forkandwait says:

    I would like to comment on SAS versus R versus Matlab/ Octave.

    SAS seems to excel at data handling, both with large datasets and with wacked proprietary formats (how else can you read a 60GB text file and merge it with an access database from 1998). It is really ugly though, not interactive/ exploratory, and graphics aren’t great.

    R is awesome because it is a fully featured language (things like named parameters, object orientation, typing) etc, and because every new data analysis algorithm probably gets implemented in it first these days. I rather like the graphics. However, it is a mess, with bad naming conventions that have evolved badly over time, conflicting types, etc.

    Matlab is awesome in its niche, which is NOT data analysis, but rather math modeling with scripts between 10 and 1000 lines. It is really easy to get up and running if you have a math (ie linear algebra) background, the function file system is great for a medium level of software engineering, plotting is awesome and simpler than R, the datatypes (structs) are complex enough but don’t involve the headaches of a “well developed” type system. If you are doing data management, gui interaction, or dealing with categorical data, it might be best to use SQL/ SAS or something else and export your data into matrices of numbers.

    I would like numpy and friends, but ZERO BASED INDEXING IS NOT MATHEMATICAL.

    Just my 2c

  94. anlaystenheini says:

    This is a great compilation, thank you.
    After working as an econometrics analyst for a while mainly using stata, I can tell the following about STATA:
    Stata is relatively easy to get started with and to produce some graphics quickly (that’s what all the business people want, click click here’s your powerpoint presentation with lots of colourful graphics and no real content).
    BUT if you want to automate things and if you want to make stata to do things it isn’t capable of out of the box, it is pure pain!

    The big problem is: On one hand Stata has a scripting/command interface, which is not very powerful and very very inconsistent. On the other hand, Stata has a fully featured matrix-oriented programming language with c-like syntax, which is therefore not very handy (c is old and not made for mathematics, the matlab language is much more convenient), and which doesn’t work well with the rest of stata (you have a superfluous level for interchanging data from one part to the other).

    All together programming STATA feels like persuading STATA:
    Error messages are almost useless, the macro text expansion used in the scripting language is not very suitable for things that have to do with mathematics (texts can’t calculate), and many other little things.
    It is very inconsistent, sometimes very clumsy to handle, and has silly limitations like string expressions limited to 254 chars like in the early 20th century.

    So go with stata for a little ad hoc statistics but do not use it for more sophisticated stuff, in that case learn R!

  95. George Wolfe says:

    I’ve used Mathematica as a general purpose programming language for the past couple of years. I’ve built a portfolio optimizer, various tools to manipulate data and databases, and a lot of statistics and graphs routines. People who use commercial portfolio optimizers are always surprised at how fast the Mathematica optimizations run – faster than their own optimizers. Based on my experience, I can say that Mathematica is great for numerical and ordinary computational tasks.

    I did have to spend a lot of time learning how to think in Mathematica – it’s most powerful when used as a functional language, and I was a procedural programmer. However, if you want to use a procedural programming approach, Mathematica supports that.

    Regarding some of the other topics discussed above: (1) Mathematica has built-in support for parallel computing, and can be run on supercomputing clusters (Wolfram Alpha is written in Mathematica). (2) The language is highly evolved and is being actively extended and improved every year. It seems to be in an exponential phase of development currently – Stephen Wolfram outlines the development plans every year at the annual user conference – and his expectations seem to be pretty much on target. (3) Wolfram has a stated goal of making Mathematica a universal computing platform which smoothly integrates theoretical and applied mathematics with general purpose, graphics, and computation. I admit to a major case of hero worship, but I think he is achieving this goal.

    I’m going on and on about Mathematica because, in spite of its wonderfulness, it doesn’t seem to have taken its rightful place in these discussions. Maybe Mathematica users drop out of the “what’s the best language for x” after they start using it. I don’t know, really. But anyway, that’s the way I see it.

  96. Dale says:

    I am amazed that nobody has mentioned JMP. It is essentially equivalent to SPSS or STATA in capabilities but far easier to use (certainly to teach or learn). The main reason why it is not so well known is that it is a SAS product and they don’t want to market it well for fear that nobody will want SAS any more.

  97. ad says:

    In the comparison I did not see Freemat. This is an open source tool that follows along the lines of MATLAB. It would be interesting to see how the community compares Freemat to Matlab

    • Carl Witthoft says:

      Freemat is OK. The two drawbacks I see are:
      1) certain functions look just like Matlab functions but take different arguments or parse differently. This means your scripts are not perfectly portable — in either direction.
      2) It doesn’t appear to get much in the way of maintenance and updates.

      FWIW, I fully agree with other comments that Matlab’s syntax is abysmal. I’m a confirmed R user.

  98. bupka’s online offers good-quality, genuine used books
    at discount prices, including many engineering books. please visit
    http://bupka.wordpress.com

    the MATLAB book discussed above is currently in stock.
    feel free to browse the rest as well.

  99. Farhat says:

    @Wolfe: I have used Mathematica a lot over the past 8 years and still use it for testing ideas as small pieces of code can do fairly sophisticated stuff, I’ve found it poor for large datasets and longer code development. It even lacked things like support for a code versioning system until recently. The cost is also a major detractor. Mathematica costs like $2500 or so last time I checked. Also, some of the newer features like Manipulate seem to create issues, I had a small piece of code using that for interactivity which sent the CPU usage to 100% regardless of whether any change was happening or not.

    Also, SAGE ( http://www.sagemath.org ), the open source alternative to Mathematica has gotten quite powerful in the last few years.

  100. I just wanted to mention that Maple, which has not been commented on yet in this post or in the subsequent thread, generates beautiful visuals and I used to program in it all the time (as an alternative to Mathematica which was used by the “other camp” and I wouldn’t touch).

    Also, I’m starting to use Matlab now and loving how intuitive it is (for someone with programming experience anyway).

  101. Jason says:

    let me quote some of Ross Ihaka’s reflection on R’s efficiency….

    “I’m one of the two originators of R. After reading Jan’s
    paper I wrote to him and said I thought it was interesting
    that he was choosing to jump from Lisp to R at the same
    time I was jumping from R to Common Lisp……

    We started work on R in the early ’90s. At the time
    decent Lisp implementations required much more resources
    than our target machines had. We therefore wrote a small
    scheme-like interpreter and implemented over that.
    Being rank amateurs we didn’t do a great job of the
    implementation and the semantics of the S language which
    we borrowed also don’t lead to efficiency (there is a
    lot of copying of big objects).
    R is now being applied to much bigger problems than we
    ever anticipated and efficiency is a real issue. What
    we’re looking at now is implementing a thin syntax over
    Common Lisp. The reason for this is that while Lisp is
    great for programming it is not good for carrying out
    interactive data analysis. That requires a mindset better
    expressed by standard math notation. We do plan to make
    the syntax thin enough that it is possible to still work
    at the Lisp level. (I believe that the use of Lisp syntax
    was partially responsible for why XLispStat failed to gain
    a large user community).
    The payoff (we hope) will be much greater flexibility and
    a big boost in performance (we are working with SBCL so
    we gain from compilation). For some simple calculations
    we are seeing orders of magnitude increases in performance
    over R, and quite big gains over Python…..”

    the full post is here:
    http://r.789695.n4.nabble.com/Ross-Ihaka-s-reflections-on-Common-Lisp-and-R-td920197.html#a920197

    it is quite interesting to note that such a “provocative” post from one of R’s originators got 0 response from R-dev list………..

  102. Pingback: Business Intelligence Tools: looking at R as a platform for big BI. - SkriptFounders

  103. Sam says:

    I came across this thread and I’m finding the comments very useful. Thanks to all!

    I’m trying to decide which software package to use. I’m a researcher working with clinical (patient-related) data. I have data sets with <10,000 rows (usually just a few thousand). I need software that will generate multivariate and logistic regression, and Kaplan-Meier survival curves. Visualization is very important.

    Of note, I’m an avid programmer as a hobby (C++, assembly, most anything), so I’m very comfortable with a more complex package, but I need something that just works. I’ve been using SPSS, which works, but is clunky.

    Any suggestions? Stata? Systat? S-Plus? Maple?
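
    As a reference point for what any of these packages computes when it draws a Kaplan-Meier curve, here is a minimal sketch of the product-limit estimator in plain Python. The function name and the five-patient data set are made up for illustration; a real analysis would use the package's own survival routines, which also provide confidence intervals and plotting.

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival function.

    durations: follow-up time for each patient
    observed:  True if the event occurred at that time, False if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) that happen exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Made-up example: five patients, two of them censored.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

    With these numbers the estimate steps down to 0.80, 0.60 and 0.30 at times 2, 3 and 5; the censored patients only shrink the at-risk count.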

  104. brendano says:

    I still haven’t used Stata, but its users have very strong praise for it, for situations that sound like yours. That might be the best option to start with.

    R might be worth trying too.

  105. Rashad says:

    I am working on my undergraduate degree in statistics in the SAS direction, which has surprised people in the field I meet. The choice was somewhat arbitrary; I just wanted something applied to complement a pure mathematics degree. This post has opened many (…..many) options to consider. Thanks for the great discussion.

  106. Donovan says:

    So my question here is simple:

    After you peel back all the layers and look at the solution that would require the least effort, the most power, with the greatest flexibility, why would anyone choose anything other than RPy first, and then the language du jour that your employer would be using second as a backup, and scrap the code war?

    I mean for my money, you make sure you can build a model in Excel, learn RPy & C# and search for APIs if you need to use other languages or just plain partner with someone who can code C++ {if you can’t} and simply inject it.

    I mean I plan on learning Java, PHP and SAS as well, but that is really a personal choice. Coming from IT within Finance, not knowing Java and SAS means you either won’t get in the door or reach a glass ceiling pretty quickly unless you play corporate politics really, really well. So for me, it is a necessity. But the flip side is, wanting to make the leap into Financial Engineering after completing a doctorate in Engineering, RPy has also become a near Realistically, unless you just like coding, I have to say that what I have suggested makes the most sense for the average analysis pro. But then a lot of this is based upon whether you’re a Quant Research, Quant Developer, Analyst, etc.: different tools for different functions.

    Just thought

  107. Mark Smith says:

    Sas and r

    1. there is a book out on the topic (http://www.amazon.com/gp/product/1420070576?ie=UTF8&tag=sasandrblog-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1420070576)

    2. R interface available in SAS 9.2

    “While SAS is committed to providing the new statistical methodologies that the marketplace demands and will deliver new work more quickly with a recent decoupling of the analytical product releases from Base SAS, a commercial software vendor can only put out new work so fast. And never as fast as a professor and a grad student writing an academic implementation of brand-new methodology.
    Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.
    “We know a lot of our users have both R and SAS in their toolkit, and we decided to make it easier for them to access R by making it available in the SAS 9.2 environment,” said Rodriguez.
    The SAS/IML Studio interface allows you to integrate R functionality with SAS/IML or SAS programs. You can also exchange data between SAS and R as data sets or matrices.
    “This is just the first step,” said Radhika Kulkarni, Vice President of Advanced Analytics. “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure.”

    http://support.sas.com/rnd/app/studio/Rinterface2.html

    While this is probably more for SAS users than R, I thought both camps might be interested in case you get coerced into using SAS one day… doesn’t mean you have to give up your experience with R.

  108. Iskander says:

    I am also amazed how few people here have said anything about StatSoft Statistica. I’ve been using it for close to 6 years and don’t see any shortcomings at all. Consider this:
    - full support of R
    - fully scriptable, which means you can call DLLs written in whatever programming language possible and implementing things which you didn’t find inbuilt in Statistica (which doesn’t mean it’s not there)
    - the Statistica solver / engine can be called externally from Excel and other applications via the COM/OLE interface
    - untrammelled graphics of virtually any complexity — extremely flexible and customizable (and scriptable)
    - the Data Miner (with its brand new ‘Data Miner Recipes’) is another extremely powerful tool that leaves only your imagination to limit you
    ….it would be tedious to list all its advantages (again, the Statistica Neural Networks and the Six Sigma modules are IMO very professionally implemented).

  109. ZZ says:

    No package other than SAS can load unstructured data such as the blog comments posted here, then analyze it and extract the sentiment (positive, negative, neutral) about each of the packages debated here with pretty decent precision. SAS gained this capability when it bought Teragram a few years ago.

  110. Pingback: links for 2010-09-04 : Web Data Mining & Data Visualisation

  111. Pingback: Interesting Comparison of data analysis packages - CCPR Computing

  112. Pingback: some links about statistical tools « 西瓜,桃子,坚果岛

  113. John says:

    A post above commented: “sas is handy you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good to manage data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?” A good relational database is good at supporting online transactional processing and will in most organizations come with a bureaucracy of gatekeepers whose role is to ensure the integrity of the database to support mission critical transactional applications. In other words, it takes a mountain of paperwork to merely add one field to a table. The paradigm assumes a business area of ‘users’ who have their requirements spelled out before anyone even thinks of designing, let alone programming, anything. It just kills analysis. Where SAS is used, data must be extracted from such systems and loaded into text files for SAS to read, or SAS/Access used. Generally DBAs are loath to install the latter as it is difficult to optimize in the sense of minimizing the drain on operational systems.

    On IBM mainframes the choice of languages to use is limited and by default this will usually be SAS. Most large organisations have SAS, at least Base SAS, installed by default because the Merrill MXG capacity planning software uses it. Hence cost is sort of irrelevant. It then tends to be used for anything requiring processing of text files even in production applications and this often means processing text as text, e.g. JCL with date dependent parameters, rather than as preparing data for loading into SAS datasets for statistical analysis.

    I know nothing about R, but seeing a few code samples, it struck me how much it resembled APL, to which we were introduced in our stats course in college in the early 70s. That is not surprising, as both are matrix oriented.

  114. James says:

    Just in case anyone thinks there is still anything more to add to this topic!

    I think I’ve used most of these over the last ten years or so now as an academic researcher (except the Python based stuff). Definitely not a programmer, and currently working in health research, where the choice locally is largely SAS or Stata, and I think this choice is generally driven by what package is used by the people you work with. Personally I use:

    Stata for “quick” stuff that needs a quick answer;

    SAS for anything with a big dataset (I’m typing this with 13 GB of data sticking itself together in the background as I finish up for the day) and for most in-depth analyses/projects (I like being able to keep the data management and the analysis within the same programme,)

    and then R for pretty much all of the graphics output (pretty rare for me), complex survey analysis (seems to do a great job and doesn’t cost huge amounts of money like SUDAAN) and also for odd little “one off” unique analyses that people have written code for out in the community (thanks! Stata has good package availability in this respect too.) I’m trying to do more and more in R to expand my skills and take advantage of the extra flexibility.

    Then I use SPSS and EpiInfo a little for teaching or giving to students who are doing one-off projects (and so e.g. data analysis might only last three days or so.)

    There does come a point when it’s time to say “no more programming languages,” and for me that point came right before I decided R was going to be useful :-( … but I’ve been back on the wagon since…

  115. Ben Gallarda says:

    Igor Pro (www.wavemetrics.com), a brilliant combination of many of the above points. I highly recommend anyone interested give it a try (30 day demo). Advantages – easy to learn, powerful built-in functions, publication quality graphing (much easier than MATLAB), stats functions included, unbelievable developer and community support; Disadvantages – not as powerful or fast with matrices; Open source – no; Typical Users – scientists, engineers.

  116. Victor R says:

    We should include some structured programming, i.e. LISP and Haskell. It would be interesting to hear others’ opinions on these languages.

  117. Pingback: Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata | Honglang Wang's Blog

  118. gah789 says:

    I have used all of these programmes – and quite a few more – over the last 30 odd years. What one uses tends to reflect personal history, intellectual communities, cost, etc but there are various points not highlighted in the discussion.
    1. SPSS & SAS were designed as data management packages in the days when memory & CPU were expensive. They have evolved into tools for analysing corporate databases, but they are ferociously expensive for ordinary users and dealing with academic licences is a pain. Increasingly the corporate focus means that they lag behind the state of the art in statistical methods, but there is no other choice when dealing with massive datasets – oh the days when such data sets had to be read from magnetic tapes!
    2. Stata’s initial USP was graphics combined with reasonable data management and statistics, but it developed an active user community which has greatly expanded its statistical capabilities if your interests match those of the community. In my view, its scripting language is not as bad as suggested by other comments and there is lots of support for, say, writing your own maximum likelihood routine or Monte Carlo analysis.
    3. R (& S-Plus), Matlab (& Octave), Gauss, … are essentially developments of matrix programming languages, but they are useless for any kind of complex data management. R has a horrible learning curve but a very active research community, so it is useful for implementations of new statistical techniques not available in pre-packaged form. For many casual users what matters is the existence of a front-end – Rcmdr, GaussX, etc – that takes away the complexity of the underlying program.
    4. Excel should never be used for any kind of serious statistical analysis. It is very useful for organising data or writing/testing simple models, but the key problem is that you cannot document what has been done and it is so easy to make small but vital errors – mis-copying rows for example. Actually, Statistica, JMP, and similar menu-driven programs fall into the same category: they are very good for data exploration but very poor for setting up analyses that can be checked and replicated in a reliable manner.
    5. Many of us have used a variety of programming languages for data management and analysis in the past, but that is daft today – unless you are dealing with massive datasets of the SAS type and can’t afford SAS. In such cases their primary use will be the extraction and manipulation of data that is voluminous and frequently updated, but not for data analysis.

    For anyone thinking about what to use, the key questions to consider are:
    A. Are you primarily concerned with data management or data analysis? If data management, then steer clear of matrix-oriented languages which assume that your datasets are small(ish) and reasonably well organised. On the other hand, R or Matlab are essential if you want to analyse financial options using data extracted from Bloomberg.
    B. Are your statistical needs routine – or, at least, standard within a research community? If so, go for a standard package with a convenient interface and easy learning curve or the one most commonly used in your community. The vast majority of users can rely upon whatever is the standard package within their discipline or work environment – from econometrics to epidemiology – and they will get much better support if they stick with the standard choice.
    C. How large an initial commitment of time and money do you expect to make? A researcher developing new statistical tools or someone analysing massive databases must expect to make a substantially larger initial investment in learning and/or developing software than someone who simply wants to deploy data analysis in the course of other work.
    D. Are you a student or a professional researcher? Partly this is a matter of cost and partly a matter of the reproducibility of research results. Open source and other low cost programs are great for students, but if you are producing research for publication or repeated replication it is essential to have a chain of evidence. R programs can be checked and reproduced for standard datasets, but even here there is a problem with documenting the ways in which more complex datasets have been manipulated.
    I am primarily an applied econometrician, but even within this field there is a substantial range of packages with large groups of users – from R, Matlab & Gauss through Stata to RATS & E-Views – according to the interests of the users and types of data examined. Personally, I use Stata much of the time but ultimately the choice of package is less important than good practice in managing and analysing data. That is the one thing about the older packages – they force you to document how your data was constructed and analysed which is as or more important than the statistical techniques that are used unless you are purely interested in statistical methods.

  119. ramesh says:

    In a nutshell, if I have to use open source software, and moreover without writing any syntax, which is the most capable one, for everything from basic statistics to neural calculations?

  120. The comments here have interesting comparisons between Stata and R, especially for economic analysis. “Stata Resources,” at Marginal Revolution

  121. jingju11 says:

    I have been using (only) SAS for four years. The reason is simply that they needed a SAS user rather than an R user when I arrived here, at a research institute in a hospital. In fact, I had some experience with R in school (at that time I knew it better than SAS). Now I feel very comfortable with SAS’s coding syntax, output, and graphics. At the beginning I thought SAS was very weird, but now I think R is weird. I know that is because I have learned SAS but not R. Forgive my ignorance, but compared to SAS, R is not well organized, from packages to documentation, unless you know it very well. Since R is free, it may not get the chance to play a bigger role in the market before it is completely outdated.
    I want to say that SAS is fast. Take an example: with about 75-85 lines of SAS code, I generated data for 500 samples of size 500, ran 500 parametric survival models, summarized the original data sets and the model results (running proc means and ttest 500 times each), compared the bias, and finally reported the results externally. In total, it takes 3 or 4 minutes on my own PC (2.6 GHz CPU, 3 GB of memory). Of course, all final results were written to a nicely formatted Word file without any manual work.
    The logic of SAS is elegant as well. I believe most of the ambiguity in SAS syntax results from our own unfamiliarity and ignorance. As professional software, it is supposed to work in a very reliable way.
    As an applied statistician, I appreciate some very delicate and well-written procedures in SAS, including proc logistic, genmod, phreg, nlmixed, optmodel, and mcmc. I like how they are coded and how well they are documented.
    I realize that exactly the same words could be uttered by an experienced user of R, SPSS, Stata…
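For readers who want to compare, the simulation workflow described above (generate many samples, fit a parametric survival model to each, then summarize the bias) can be sketched in Python. This is only a minimal sketch under stated assumptions, not the commenter's SAS program: it assumes an exponential survival model with no censoring, for which the maximum likelihood estimate of the rate has the closed form n / sum(t). The names `TRUE_RATE`, `fit_exponential_rate`, `simulate_bias` and the seed are all made up for illustration.

```python
import random
import statistics

TRUE_RATE = 0.5      # assumed true hazard rate (illustrative)
N_SAMPLES = 500      # number of simulated data sets
SAMPLE_SIZE = 500    # observations per data set


def fit_exponential_rate(times):
    """Closed-form MLE of the exponential rate for uncensored survival times."""
    return len(times) / sum(times)


def simulate_bias(n_samples=N_SAMPLES, sample_size=SAMPLE_SIZE,
                  true_rate=TRUE_RATE, seed=12345):
    """Simulate data sets, fit each one, and summarize the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Draw uncensored survival times from the assumed exponential model.
        times = [rng.expovariate(true_rate) for _ in range(sample_size)]
        estimates.append(fit_exponential_rate(times))
    mean_estimate = statistics.mean(estimates)
    return {
        "mean_estimate": mean_estimate,
        "bias": mean_estimate - true_rate,
        "std_dev": statistics.stdev(estimates),
    }


if __name__ == "__main__":
    summary = simulate_bias()
    for name, value in summary.items():
        print(f"{name}: {value:.4f}")
```

A realistic comparison would need censoring and a richer model (for example Weibull regression), which is where a dedicated survival analysis library or procedure earns its keep.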

  122. Pingback: Software tools for data analysis – an overview | R User Groups

  123. sampsa says:

    Nice discussion. As obvious as it is, I can’t resist pointing out that tools should be chosen according to the task. “Statistical analysis” is not a narrow scope. Quite a few of the commentators come from an academic background (as do I), where the requirements are different from those on the business side. In research most of the things done are new and are done once, whereas in commerce the point is usually to automate certain repeated tasks.

    Anyway, what I wanted to bring into the discussion is that for lightweight analytics some database systems, like PostgreSQL, seem to provide built-in tools quite comparable to Excel. At least certain commercial extensions of PostgreSQL have somewhat more advanced features, such as linear regression, built in. Most likely Oracle and some others have all this and more, but I am not familiar with them.
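As a rough illustration of doing lightweight analytics inside the database, a simple linear regression can be computed entirely from SQL aggregates. The sketch below uses Python's built-in `sqlite3` module as a stand-in so that it runs without a server; the table `obs` and its data are made up. PostgreSQL itself provides the aggregates `regr_slope(Y, X)` and `regr_intercept(Y, X)`, which return the same two quantities directly.

```python
import sqlite3

# In-memory database with a small, made-up table of (x, y) observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs (x, y) VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],  # exactly y = 2x + 1
)

# Ordinary least squares from aggregates only:
#   slope     = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
#   intercept = (Sy - slope*Sx) / n
query = """
WITH s AS (
    SELECT COUNT(*)   AS n,
           SUM(x)     AS sx,
           SUM(y)     AS sy,
           SUM(x * y) AS sxy,
           SUM(x * x) AS sxx
    FROM obs
)
SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope,
       (sy - (n * sxy - sx * sy) / (n * sxx - sx * sx) * sx) / n AS intercept
FROM s
"""
slope, intercept = conn.execute(query).fetchone()
print(slope, intercept)
conn.close()
```

The data never leaves the database until the two coefficients come back, which is the appeal for large tables; anything beyond simple summaries is probably still easier in a statistics package.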

  124. Will says:

    Hey thank you very much for this awesome post!

    In my studies I’m supposed to analyse some statistical parameters in a huge database (roughly 16 GB) and I’m researching which software to use. Considering that R is free and its features beat the others in almost all comparisons, I’ll go with this one!

    Thank you, dude, for this information, it helped me a lot!

    • Michael Tuchman says:

      So, it’s three months in. How are you doing with it? R has more than enough power, particularly in the single user setting you are in.

  126. Steven says:

    I’m a biologist, and I use R; every biologist and psychologist I know uses R; I just got my post-doc based in part on my knowledge of R. At the last ISBE (behavioural ecology) conference in Perth, the statistical symposium that followed the conference focused entirely on R, including (to my memory) some nice new routines for MCMC. If you asked me, I would say that R is alive and kicking in science. Well, at least in my corner of it.

  127. Jon says:

    I am teaching a high school course in introductory statistics. I also want the students to use a relevant tool – meaning something that can be used into college and maybe beyond. The three choices I am currently thinking about are:
    1) Excel
    2) R
    and
    3) Mathematica
    Any comments would be great. Thanks.

    • Sean says:

      I am speechless that you are even putting R and Mathematica there for high school students. Of course Excel. It is the tool your students will be most likely to use in their future career.
      In this world, not everyone will become a statistician…..

      • Michael Tuchman says:

        I couldn’t disagree more. Mathematica is a lot more than statistics, and can be very affordable for educational applications. The math it can do is extremely powerful, and letting students solve problems and visualize lets them get a taste of what “real mathematics” is about.

        If you’re teaching a statistics course, then you have to take a positive attitude about the value of your subject. You have to teach it as though everybody is going to become a statistician. Otherwise, you’re essentially giving up before you start.

        The goal of a good education is to show people the heights and inspire them to continue on their own, not to teach them what average bureaucrats do with their boring 9-5 jobs.

        Do you start a music class by saying, “Not all of you are going to be musicians (true enough), so you should all use crappy instruments and never really have a good musical experience”? In that case NOBODY in that class will become a musician.

        Set your sights higher, man.

    • Michael Tuchman says:

      In this order, I would say

      Mathematica
      R
      and way down the list, Excel

      Mathematica has huge numbers of teaching modules, as well as effective ways (notebooks) for the students to communicate their results. Students can even create workbooks to teach a concept to their classmates by writing neat Mathematica widgetry that does not require programming knowledge.

      Second, R: it is a real programming language, and it can help them develop logical thinking skills in addition to their stats skills.

      Finally, having your teeth pulled out without anesthesia, and somewhere down below that, Excel.

    • Skywalker says:

      Again, it is strange to me that no one has mentioned SAS On Demand for Academics: http://www.sas.com/govedu/edu/programs/od_academics.html.
      SAS offers completely free access to its flagship products (Enterprise Guide, Enterprise Miner, Forecast Server) for academics, both professors and students. It can be used free of charge not only for teaching but also for research purposes.

  128. Michael Tuchman says:

    I was almost not going to comment, but seeing as there are some recent comments, here goes.

    I have been using SAS for more time than I care to admit, but still only one digit when you write it in Hexadecimal. SAS, in many ways, is not a real programming language. This frustrated me to no end when I started with it, and that was before the era of ODS Statistical Graphics that make producing graphics as easy as producing any other kind of tabular data.

    SAS is designed to give a huge amount of flexibility, while keeping the “real programmers” out of the shop. After all, if you are in a data analysis and reporting position, and you have a real programming language, you also need incredible discipline to keep everybody rowing in the same direction. SAS is designed to keep you from programming, by using pre-written procedures with lots-and-lots-o-options, as was pointed out earlier.

    Yet for a data analysis system, it has several extremely coherent ways of carrying results from one step of the analysis to the next, and it is restrictive enough that people can generally pick up SAS quickly. There are no “obfuscated SAS coding contests” for a reason.

    Perhaps SAS doesn’t do any one thing the best, but it does the most things well. For example, I have rarely seen a package that does what SAS’s TRANSPOSE procedure does, and with such concise syntax. Hint: it does a lot more than just transpose a single matrix.

    I am confused about what makes the SAS programming language “outdated”. Something is not outdated just because you say it is. What features, or lack thereof, make it outdated? Remember, again, that SAS was never intended to be a full programming language in the sense that most professional programmers think of the term. Therefore, it should not be compared with things like Python, or even with your greatest programming language ever invented.

    One-on-one, R is more facile than SAS because of its Scheme background, but if I had to get something done with data in 2 weeks, I’d much rather have a team of 5 SAS programmers than a team of 5 R programmers, precisely because SAS is NOT a real programming language and it does restrict you to what it does well.

    Also, the pricing for SAS is a bit of a mystery, and they like to keep it that way just to keep people from shopping solely on price, but for small shops running SAS on a local PC (non-server environment) you can get in for a lot less than the $100,000 quoted. It’s still expensive per-seat, I won’t deny that. But it is misleading just to throw out a figure without stating what it is based upon.

    • Skywalker says:

      Strange for me:

      “Also, the pricing for SAS is a bit of a mystery,[]”

      SAS and JMP prices are generally available at http://www.sas.com and http://www.jmp.com

      • Jared says:

        Obviously ‘the force’ is not with Skywalker. Prices for SAS and JMP are NOT available on the websites. From my 7+ years of using SAS, they have never advertised prices.

        Last time I checked, IBM took prices off their website too (for SPSS).

  129. Pmb cana says:

    At this time, R and Python used together gives the most power and possibilities. We need both at this time.

    Excel (with VBA macros) is a necessity at the lower stratum. For this stratum, I imagine the day when I could replace Excel+VBA with LibreOffice’s Calc + Python.

    Dear R and Python folks! Please give LibreOffice a chance.

  130. Pingback: Comparison of data analysis packages « Homologus

  131. Pingback: Software Tutorial Information « Computational Economics

  132. Joe says:

    I am a SAS user, so my comments only relate to SAS. I’m not as familiar with R, SPSS, Stata, etc.

    I think there is a lot of misinformation about SAS. For example, SAS has many tools that enable you to never have to use their programming language (JMP, Enterprise Guide, Enterprise Miner, etc..) Our firm has used these and many of the folks aren’t programmers and use these SAS tools quite effectively.

    Also, SAS has really focused a lot of their efforts on industry solutions. We are a healthcare firm and have implemented solutions like their “Healthcare Condition Management” and “Customer Intelligence” solutions that are pre-built for our industry. It is more of a point and click environment for the different types of analytics needed.

    I’ve noticed they also have been releasing solutions around Social Media Analytics, Fraud Prevention and Detection, High-Performance Computing, etc..

    Now, I might be biased because we are a SAS shop, but SAS is definitely not as archaic and unusable as some people make it sound.

  133. matlab says:

    very Interesting….!!!!! I used Matlab over the years.

  134. reechar says:

    I only have extensive experience with MATLAB, but it is very nice for the analyses that I typically do. Most of my projects last a few months at a time and subsequent projects are usually too different to make much code reuse practical. Also, my datasets are typically not terribly large (10,000 elements or fewer, a few GB of data or less). I’m willing to take about a factor of 10 reduction in execution speed if it means I can write, edit, or debug code more quickly. The MATLAB error messages and help files are excellent. As for the cost, a week or two shaved off of analysis SW development due to the good documentation and strong user community pretty much pays for the license.

    Here are the key features that I use:
    -Ability to handle, search, collapse, and reshape multidimensional matrices
    -Linear and non-linear filtering methods for images
    -Ability to make simple GUIs quickly
    -Nice handling of complex numbers, necessary for Fourier analysis

    One day, I hope to become a real programmer with a pony tail and a deep-seated disdain for a handful of shameful programming practices, but until then, MATLAB will keep helping me get work done quickly.

  135. David Marso says:

    Sounds like you don’t know anything about SPSS ;-(

  136. Pascal Létourneau says:

    I’m doing my Ph.D. thesis.
    I use STATA for most of my statistical estimations, but now I need to do something that is not implemented yet in STATA. So I plan to do it in MATLAB.

    One of the steps involves estimating a VAR(p) model.
    I can do it in STATA and get results. (very easy)
    I can do it in MATLAB (using vgxset and vgxvarx). (very easy)

    Results are similar, but statistically different!
    That is, the coefficients given by STATA are outside the confidence intervals provided by MATLAB,
    and the coefficients given by MATLAB are outside the confidence intervals provided by STATA.

    I can’t find the source of the problem.

    Can anyone help?

    Thanks in advance for your precious time.

  137. Chavoux Luyt says:

    One place where I (as a non-statistician working in biology/ecology) got results fairly quickly (and in a way I could actually understand) was using the Resampling Stats add-in for Excel (written in VBA, I think). Is there anything similar for Gnumeric or LibreOffice Calc? Just wanted to say that at least for these kinds of non-parametric tests (and teaching) Excel might have a role to play.

    I used Statistica for (the few) parametric tests (and to test for a normal distribution) and while it seemed much easier to use and get going with than SPSS, it didn’t have proper support for resampling/permutation tests (at least at that stage). I am not sure that learning R (especially how to get the data into it in the right format for resampling statistics) is worthwhile for somebody like me who will probably be doing statistical analysis only about once a year (at the end of the data sampling/observation period). Any suggestions for good non-parametric / permutation test software (preferably open source) that a non-statistician can use?
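For what it is worth, the core of a permutation test is short enough to write by hand. The following is a minimal sketch in plain Python (standard library only), assuming a two-group comparison of means; the function name `permutation_test` and the measurements are made up for illustration, and this is not what the Resampling Stats add-in does internally.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random by shuffling the pooled data.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Made-up measurements from two sites, for illustration only.
    site_a = [12.1, 14.3, 11.8, 15.2, 13.9, 14.8]
    site_b = [10.2, 11.1, 9.8, 12.0, 10.9, 11.5]
    diff, p = permutation_test(site_a, site_b)
    print(f"observed difference: {diff:.3f}, p-value: {p:.4f}")
```

The same loop works for any statistic (medians, correlations) by swapping out the two `statistics.mean` calls, which is much of the appeal of resampling methods for occasional users.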

    @Pascal: I cannot speak for STATA or MATLAB, but I do remember that in order to appear faster/more responsive to the user, Statistica would “cut short” its calculations and use an estimate if too much time has passed. Unfortunately, this kind of “optimization” can not be checked if the source code of the program is not available. Disclaimer: this was many years ago (2004), so things might have changed since then.

  138. sparklemotion says:

    What about JMP? Wonderful software.

  139. Adam K says:

    You can connect to R from Matlab and do anything from Matlab that you can do in R.

  140. David P says:

    I have used R, Octave and SAS. While SAS has a lot of weaknesses (like a patchwork of language syntaxes) it can easily handle large data sets. It gives a choice for data handling with SQL pass through language (so if you don’t feel like sorting data that some types of programs require, you can just switch into SQL mode with a few lines of code). Here is an example of where SAS is king:
    Last week, on a slow server shared by dozens of research teams, I merged and processed a billion+ records and produced dozens of tabulations on those records. This was all messy data (i.e. missing fields/values, non-normalized, different data sets requiring different rules, etc.). I was able to quickly switch between SAS data steps and PROC SQL to get things working. SAS has the flexibility to force things together that I think is hard to find elsewhere. This week I designed a sample with thousands of strata, fuzzed the cells with fewer than 100 obs, and set up 3 multiclass (one vs. all, with 32 classifications each) logistic regressions on 11 million data points (the result of the 5% sample), all in around 45 minutes (counting the time I took for a coffee break while waiting for the sample data to run). Finally I clicked run and went home for the weekend. If there is a problem with the code, SAS will email me. Do that in R.
    So SAS is powerful; however, it is prohibitively expensive. Even my company (a government contractor) cannot afford it; we use the clients’ copies instead. The programming language sucks, I hate the development environment, and the list can go on. I think the main advantages SAS has are its ability to handle big datasets, its hundreds of built-in functions, and its SQL pass-through language for manipulating said large data and analyzing it in one place. Other, more open source solutions are eroding these advantages. R teaming up with MySQL and Hadoop is a sign that unless SAS starts working on its issues, its days are numbered.

    SAS Pros:
    Flexible, Handles Lots of data in almost any format on almost any machine, Well-built Functions, Macros that are easy to set up and deploy, PROC SQL!, tons of online documentation.
    Cons:
    Confused-on-Multi-thread processing, expensive, developer environment sucks.

  141. pybokeh says:

    Are we talking about doing statistics or are we talking about data analysis? I think some people need to let go of their elitist attitude and be a little more open-minded. The title of the blog alludes to data analysis, which doesn’t necessarily mean one will be doing hard-core statistics, and as a general purpose data analysis environment Excel isn’t bad, especially for throwing something together quickly and easily. If you need to automate things, there’s of course VBA. I used mainframe SAS back in the day and recall it being very fast and good at making canned outputs. But times have changed. I am in a more dynamic environment. I have learned that I need a more flexible environment due to the following scenarios: the data from its source may not be formatted correctly, or I may have to retrieve/scrape the data from web sites, or I have to consolidate data from multiple sources. What do I do then? Call my IT department? Haha yeah right! Unfortunately, scenarios like these call for a lower-level programming language. But who wants to go that route?! So I thought… But I discovered Python and I am glad that I did. It is a relatively simple language to learn. I was hesitant at first, but I kept at it and now I don’t regret it at all. Python is truly a hidden gem. With Python coupled with Pandas and Matplotlib, I have the best of all worlds. I can choose the right tool for the right job, and yes, that includes using Excel too if I have to. If you have an inclination to learn programming, I would definitely give Python a try and check out Pandas and Matplotlib. Otherwise, I would stick with domain specific languages like MATLAB, SAS, SPSS, Stata, etc.
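To make the "consolidate data from multiple sources" scenario concrete, here is a small sketch of what that kind of clean-up can look like. It is an illustration only: the two sources, their column names, and the helper names are made up, and it uses just the standard library (`csv`, `datetime`) to stay self-contained, although Pandas would be the more natural tool for real work.

```python
import csv
import io
from datetime import datetime

# Two made-up sources with different delimiters, column names and date formats.
SOURCE_A = """date,region,units
2012-01-05,North,10
2012-01-06,South,7
"""
SOURCE_B = """Sale Date;Region;Qty
05/01/2012;north ;3
07/01/2012;East;n/a
"""


def parse_date(text, fmt):
    """Convert a date string in the given format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text.strip(), fmt).date().isoformat()


def read_source(text, delimiter, columns, date_format):
    """Yield cleaned rows; `columns` maps source column names to ours."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for raw in reader:
        row = {ours: raw[theirs] for theirs, ours in columns.items()}
        try:
            units = int(row["units"])
        except ValueError:
            continue  # skip rows whose quantity is not a number
        yield {
            "date": parse_date(row["date"], date_format),
            "region": row["region"].strip().title(),
            "units": units,
        }


def consolidate():
    """Merge both sources into one list and total the units per region."""
    rows = list(read_source(
        SOURCE_A, ",",
        {"date": "date", "region": "region", "units": "units"},
        "%Y-%m-%d"))
    rows += read_source(
        SOURCE_B, ";",
        {"Sale Date": "date", "Region": "region", "Qty": "units"},
        "%d/%m/%Y")
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["units"]
    return rows, totals


if __name__ == "__main__":
    all_rows, region_totals = consolidate()
    print(region_totals)
```

Silently skipping bad rows is a design choice made here for brevity; in practice it is usually safer to log or count the rejected rows so the clean-up stays auditable.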

  142. Pingback: Quora

  143. Pingback: Quora

  144. dm215 says:

    I use C++ for almost all of my numerical analysis projects, because they’re generally computationally intensive. But I use MATLAB for what I’ll call pre- and post-processing, which has grown to include more and more statistical analysis. I am wondering if the Statistics Toolbox is going to be sufficient for the things I need to do with it, or whether I am going to have to take the plunge with R. The work I’m doing is sort of basic Bayesian analysis, nonparametric regression, sensitivity analysis and DOE-type stuff.

  145. Harish says:

    I restrict my comparison to SAS vs. R. I’ve worked with Matlab a lot, and I love it.

    10 points for R on a scale of 1 to 10 (10 being the best).
    9.0 for Matlab (I’m giving a 9 only because it is not free).
    6 points for SAS. (Hereafter, I refer to SAS for SAS Base/EG tool, not the company)

    1. SAS is for rich firms; R is for ALL firms (if smart), because it is free.
    2. SAS is outdated; R is updated.
    3. SAS needs bulky set-up with servers and a whole lot of stuff; R is easy to be installed in any machine (Linux/UNIX; MacOS, Windows, Servers)
    4. SAS grows slowly, with annual updates (which are also not free); R grows exponentially with several newly tested and validated packages (all FREE).
    5. SAS needs SAS PROGRAMMERS who are statisticians/analysts; But R only needs Statisticians (programming is absolutely easy to pick up).
    6. SAS runs off of an RDBMS for BIG DATA (we are talking TB/PB/ZB); R Hadoop integration does the job to run off of nosql dbs, and for RDBMS, R has several parallel computation packages (snow, multicore, Rmpi) etc).

    Did I miss any good points about R?

    OKAY, I shouldn’t say only bad things about SAS, because there are smart people working at SAS, and I’ve heard it is one of the good companies to work for. For some reason, they can’t OVERHAUL the product easily. But SAS did two smart things in their product strategy: 1) SAS ENTERPRISE MINER, to do ETL as a flow diagram. 2) SAS PROC IML, to allow R programmers to run their code.

    • JP says:

      1. If a firm is smart, then they use the right tool for the job. Smart firms NEVER base decisions solely on whether something is free or not. They use multiple criteria in decision making.

      2. SAS is not outdated. Usually people who make these statements have NO clue what SAS is, have never used it, or are biased. I used both SAS and R and I even run R code through SAS/IML. I do this because I use the right code for the job.

      3. If you have the SAS server to deal with, yes it will be much more complex to install. But standalone SAS Base and EG are a breeze to install. R could actually be more challenging because you need to download packages once it’s installed. With SAS, it’s all there.

      4. SAS releases updates/fixes every week.

      5. SAS EG doesn’t require one to be a programmer or a fully qualified statistician. SAS BASE and R are comparable: both require knowing the SAS or R language and having some statistics background. There are GUIs for R that would be equivalent to SAS EG.

      6. You can run SAS as a standalone desktop version. If you have big data or high performance analytics needs, then you’d want to run a dedicated SAS server and install the software clients on desktop machines to connect to that server. R has only recently been able to handle the amounts of large data that SAS has been able to handle for years.

      You have a serious bias against SAS. Most of your claims are incorrect.

      By the way, the support you get from SAS the company is hands down the best support you will ever get from any company. With R, you rely on yourself or the community. You could get support from companies like Revolution, which offer commercial R support, but then you aren’t much further ahead using R over SAS because now you are paying $.

  146. brendano says:

    dm215, what is DOE?

  147. Have you guys heard about IGOR Pro? I am by no means an expert user, and I am not sure if the following list is true for most people. Anyways, here are the pros and cons in my personal opinion:

    Pros:
    Publication quality graphs
    Extremely easy to turn data analysis routines into GUIs
    Clear programming language
    Excellent documentation
    Runs on Mac and Win

    Cons:
    Small userbase

    This program is used extensively in an experiment I used to work for. Older grad students had, over time, written lots of data analysis routines and turned them into GUIs, so that undergrads and visitors (such as me) could easily pick up the very specialized data analysis without having to go into programming. As I looked into it, it turned out that building GUIs was much easier than in any other language I use (TCL/Tk, Python with Tkinter, C using LabWindowsCVI, Matlab with GUIDE, …), and it’s unfortunate we don’t use this program in the experiment I now work for!

  148. David says:

    Let me see if I can summarize.

    First, avoid SPSS and Stata at all cost. Nuff said. Use SAS if you are absolutely forced to. No other reasons to.

    Excel is fine for extremely simple things. Want to add two columns of data that are already in Excel? Want to sum a column? Sure do it in Excel. Excel is only useful in that it allows you to look at the data. If you add a new column or define a column based on a formula then it is fairly useful because these things persist. Anyone can look at your spreadsheet and tell what you are up to. Excel is therefore useful for calculations that often occur in business especially when the amount of data is small. Beyond that … move on.

    R, Matlab and Python.

    First ask yourself: Do you need a real general-purpose programming language with an enormous set of libraries to analyze your data? That is, is it important that your analysis is integrated into a larger system that needs to do a lot more than data analysis? If so, you want something like Python.

    Then ask yourself whether you are willing to pay a lot of money to develop on a platform that is closed source and expensive and therefore has a small community. If you are not, then avoid Matlab or IDL or similar big-ticket packages. You will want to avoid these anyway unless you are working closely with people who are tied to them.

    Python is a powerful and fairly easy to learn general-purpose programming language. If you only knew one language, Python would probably be the best one to know. The Python data analysis stack (numpy, scipy, matplotlib and related) is mature enough and quite powerful, and it will likely become even more so over the next few years.

    R is also free. While it is true that you can program anything in R (or javascript or php), it is not really used as a general-purpose programming language. It is used for data analysis and statistics and it is very good for that. In that sense it feels more like Matlab, IDL, Mathematica etc.

    In short, programmers are going to prefer Python, and scientists, engineers and business folk will probably prefer R. R is just simpler to use. Not simple like Excel, but simpler than Python. Installing packages is faster. The platform is not changing rapidly. There are more good books and free tutorials on getting started. There are more data analysis, machine learning and statistics packages for R. In short, R is just made to be usable by people who may not be real programmers. Yeah, you can program and may have done so for years, but your focus is not really on software development. If you are a real developer, you are likely to find R and certainly Matlab pretty awkward languages.

    However R has many flaws. It handles memory poorly. It is rather awkward and old fashioned in many respects. Integrating into other programming frameworks is not easy. People do not often deploy R software to clients. R feels a bit old.

    So I think the choice is simple. Do you want to create and deploy real software? If so, go with Python. If you are just analyzing data and not deploying anything, and you are greatly attracted to the number of libraries already written for R, you might prefer to go that way. Either way, you are not paying any money, and learning either of them is very useful and won’t be time wasted. In the longer run, I think R will be replaced with something else, perhaps Python, perhaps something else. But R still feels easier to work with and I think most people who are not excellent programmers will prefer it.

  149. SL says:

    I will be glad to learn R if it is in demand. Actually, SPSS and Statistica are the most used programs for statistical analysis in the SMB sector. I see no reason to learn another needless stat software.

  150. honfui says:

    There is a new kid on the block: Julia (julialang.org). I know it is at an infant stage, but how do its concept and design compare with its more mature big brothers, R, MatLab, SAS, SPSS or even Excel? FYI, I’m new in this area and looking for an alternative to Excel. Thanks.

  151. Pingback: Demystifying Big Data: Skytree Brings Machine Learning to the Masses | Greg Emmerich

  152. Pingback: Take care with those units… | Lonely Joe Parker

  153. Pingback: Comparison of data analysis packages - Homologus

  154. Scott U says:

    We use SPSS on large data-sets in production, terabytes of data and billions of records. It works well and has proven very stable and cost effective running on large servers over the years. The user base using the program in this way is small in comparison to SAS. The R and Python add-ins make SPSS a good choice if you have analysts who want to program in something else. SAS, historically in my mind, is less a statistical program than it is a career.

  155. kx001 says:

    You don’t know Matlab. Except for its expensive price, Matlab is the best one among them. SAS is just a shit.

  156. Ram says:

    Hi there,

    I would like to know from the specialists here if “Matlab” is the best tool to perform data analysis and modeling throughout drug discovery, development, clinical trials, and manufacturing in the biotech & pharma industries. Is there an alternative, as Matlab is expensive? Please do give a comparative analysis if you have one.

    Thanks
    Ram

  157. Rikimura says:

    I think PDL (pdl.perl.org) is faster in data analysis as it has super fast number crunching abilities with various modules for stats and graphics. R is also better and no doubt soaring in popularity.

  158. QuantWizard says:

    I am a Financial Engineer and have used VBA, SAS, R & Matlab extensively in my work. I have found R to be the most robust and amenable tool, and learned it well before the books were written on the subject using the free manuals off Google.

    It is no issue to write an R script that reads from a database and writes out to a csv/hooks up to a VBA script which is able to farm out the results to a PowerPoint on the fly. Debugging is easy in R if you have a good design philosophy and spend more time planning and less time coding. In my previous job I had a very basic filter with SAS and pitched it to the SAS consultant who we were paying big bucks. Not surprisingly he never got back to me. But naturally it didn’t matter – I simply rolled up my sleeves and built a solution in R in 2 hours. What it really comes down to is that R is for people who think outside the box and do things that the SAS guys can’t do or are unable to even think of. R frees the hamster from its wheel and leaves the SAS guys in the dust. SAS’s only defense against R is that “It’s free so you must be getting an inferior product”. R’s learning curve is based upon your discipline, willpower, and mental fortitude. With packages like swirl, etc, there really is no excuse as to not being able to jump into R within 2 hours or less if you have a basic knowledge of object oriented programming. Furthermore, you don’t even need to read a book on R since every answer you could possibly want has already been asked by someone on Google and most of the R books are simply a compilation of the information gleaned from the internet forums.
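
    The "read from a database and write out to a csv" step mentioned above can be sketched roughly as follows. The commenter does this in R; this sketch swaps in Python's standard `sqlite3` and `csv` modules instead, and the in-memory `trades` table, its columns, and the helper name `query_to_csv` are hypothetical stand-ins for a real database.

```python
import csv
import io
import sqlite3


def query_to_csv(conn, sql, out):
    """Run a query and write a header row plus all result rows as CSV."""
    cursor = conn.execute(sql)
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, qty INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?)", [("ABC", 10), ("XYZ", 5)])

# Write to a string buffer here; an open file object works the same way.
buffer = io.StringIO()
query_to_csv(conn, "SELECT ticker, qty FROM trades ORDER BY ticker", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

    The resulting CSV file is what a downstream VBA script or spreadsheet would then pick up.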

  159. James L says:

    Hi. Did anyone see Doronix Math Toolbox? I’ve accidentally found it. Check this out

    http://www.doronix.com/statistics
    http://www.doronix.com

    You can download it there. It’s only 4.5 MB

  160. Pingback: Lesson 1: Data | Do Thy Math

  161. Pingback: Stock market live today - binary option contractor comparison