Statistics vs. Machine Learning, fight!

10/1/09 update — well, it’s been nearly a year, and I should say not everything in this rant is totally true, and I certainly believe much less of it now. Current take: Statistics, not machine learning, is the real deal, but unfortunately suffers from bad marketing. On the other hand, to the extent that bad marketing includes misguided undergraduate curriculums, there’s plenty of room to improve for everyone.


So it’s pretty clear by now that statistics and machine learning aren’t very different fields. I was recently pointed to a very amusing comparison by the excellent statistician (and machine learning expert) Robert Tibshirani. Reproduced here:

Glossary

Machine learning                                          | Statistics
network, graphs                                           | model
weights                                                   | parameters
learning                                                  | fitting
generalization                                            | test set performance
supervised learning                                       | regression/classification
unsupervised learning                                     | density estimation, clustering
large grant = $1,000,000                                  | large grant = $50,000
nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August

Hah. Or rather, ouch! I had two thoughts reading this. (1) Poor statisticians. Machine learners invent annoying new terms, sound cooler, and have all the fun. (2) What’s wrong with statistics? They have way less funding and influence than it seems they might deserve.

There are several issues going on here, both substantive and cultural:

There might be too much re-making-up of terms on the ML side. But lots of these are useful. “Weights” is a great, intuitive term for the parameters of a linear model. I use it all the time to explain classifiers and regressions to non-experts. I was surprised to see “test set” on the statistics side; I’m used to thinking of held-out test set accuracy as an extremely common ML technique, while in statistics model fit is assessed with parametric assumptions for standard errors and such. I really like cross-validation and bootstrapping as ways of thinking about generalization. Again, they are far easier to grasp than the sampling and hypothesis testing approaches to parameter inference, which keep getting taught to and misunderstood by generations of confused Introduction to Statistics students. For example, how many times has it been explained that no, a p-value is NOT the probability your model is wrong? Yet scientific papers regularly treat significance levels in that manner (look how many stars are on this result!). On the other hand, cross-validation accuracy *is* something you can interpret as being related to the probability your model is right.
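
To make the held-out idea concrete, here is a minimal sketch of k-fold cross-validation in plain Python. Everything in it is made up for illustration: the synthetic two-class data, the midpoint-threshold "classifier", and the function names (make_data, fit_threshold, cross_val_accuracy) are assumptions of mine, not anything from a library or from the courses discussed here.

```python
import random
import statistics


def make_data(n, seed=0):
    """Synthetic 1-D data (an assumption): class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label, 1.0)
        data.append((x, label))
    return data


def fit_threshold(train):
    """'Fit' a toy classifier: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, test):
    """Fraction of held-out points on the correct side of the threshold."""
    correct = sum(1 for x, y in test if (x > threshold) == (y == 1))
    return correct / len(test)


def cross_val_accuracy(data, k=5):
    """Hold out each of k folds in turn; fit on the rest, score on the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


data = make_data(500)
scores = cross_val_accuracy(data, k=5)
mean_acc = statistics.mean(scores)
print("fold accuracies:", [round(s, 2) for s in scores])
print("mean held-out accuracy:", round(mean_acc, 2))
```

The mean held-out accuracy is an estimate of how often the fitted rule is right on new data from the same source, which is the kind of number a non-expert can read directly.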

I’ll also note that there are definitely a number of topics in ML that aren’t very related to statistics or probability. Max-margin methods: if all we care about is prediction, why bother using a probability model at all? Why not just optimize the spatial geometry instead? SVMs don’t require a lick of probability theory to understand. (Of course probability-based approaches are huge in ML, but it’s important to remember they’re not the only game in town, and there is no necessary reason they must be.) And then there are non-traditional settings such as online learning, reinforcement learning, and active learning, where the structure of access to information is in play. Conversely, there are certainly plenty of things in statistics that aren’t considered part of ML, such as regression diagnostics and significance testing. Finally, many ML problems involve large, high dimensional data and models, where computational issues are very important. For example, in statistical machine translation, alignment models are described with probability theory and fit to data, but their structure is complex enough that optimal inference is intractable, and how you do approximate inference (EM, Viterbi, beam search, etc.) is a major issue.

But the most interesting differences between stats and ML are institutional.

I’ve been hearing lots of friends compare two dueling courses at Stanford: CS229, the CS department’s “machine learning” course taught by Andrew Ng; and Stat 315 A/B, the Statistics department’s “statistical learning” sequence taught by some combination of Tibshirani, Jerome Friedman, and Trevor Hastie. These people are all top-of-the-line researchers in the field. Their courses’ contents are extremely similar; I’d bet any of them could teach most of the material from the other side.

What differs most is the teaching style. CS has far better lecture notes. Of course, the stats people wrote a very good book; but better lecture notes win because I can access them later and send them to people for free. CS students I’ve talked to think the CS course is better taught; I can’t find stats students who take the CS course. (My sample is biased, though I know people in both.) Finally, the CS course has a big, open-ended project component; the Stats course follows more of a traditional problem set and tests format.

I think this is reflective of the differences in institutional culture between CS and Stats. There’s an interesting John Langford post on part of the issue, which he calls “The Stats Handicap”. He points out that stats Ph.D.’s have a big disadvantage in the job market because statistics has an old-school journal-oriented publishing culture, so students publish much less and have less experience engaging with a research community. CS is conference-oriented — certain conferences have a higher prestige than many journals (e.g. NIPS in ML, CHI in HCI) — and this results in faster turnaround, dissemination, and collaboration. (I’ve heard others make similar comparisons between CS and psychology.) I’d expect any discipline with a larger conference emphasis to have better courses since they should reward presentation/teaching skills — or at least encourage practice — more than in journal world.

ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does not.

Is marketing a problem? Machine learning terms definitely sound pretty cool. Maybe the perspective of computational intelligence lends itself to cool names. Though the Stanford statisticians certainly know how to play this game — for example, they made up their own names for variants of L1 and L2-regularized regression, leaving annoyed people like me forever googling “lasso” and “ridge” trying to remember which is which. (On the other hand, perhaps that’s child’s play compared to the true original sin of ML nomenclature: tossing around the highly deceptive term “neural network” for a stack of linear functions paired with a wonky, overhyped training algorithm; the combination of which, many years later, still causes confusion. Definitely blame CS for that one.)
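
For anyone else stuck googling, a small sketch might help: lasso is the L1 penalty and ridge is the L2 penalty. The code below assumes the special case of an orthonormal design (X^T X = I), where both estimators have closed forms in terms of the ordinary least-squares weights. The penalty scalings (lam * |w| for lasso and (lam / 2) * w^2 for ridge, against a 1/2 squared-error loss) and the function names are my own choices for illustration.

```python
def ridge_shrink(w_ols, lam):
    """Ridge (L2) under an orthonormal design: scale every weight by 1 / (1 + lam)."""
    return [w / (1.0 + lam) for w in w_ols]


def lasso_shrink(w_ols, lam):
    """Lasso (L1) under an orthonormal design: soft-threshold every weight by lam."""
    def soft(w):
        if abs(w) <= lam:
            return 0.0  # small weights are set exactly to zero
        return w - lam if w > 0 else w + lam
    return [soft(w) for w in w_ols]


w_ols = [3.0, -0.5, 0.2]  # hypothetical least-squares weights
lam = 1.0

ridge = ridge_shrink(w_ols, lam)
lasso = lasso_shrink(w_ols, lam)
print("ridge:", ridge)
print("lasso:", lasso)
```

Ridge shrinks every weight toward zero but leaves them all nonzero; lasso zeroes out the small ones entirely, which is why it gets used for variable selection.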

Another issue is the definition of statistics itself. In 1997, Jerome Friedman wrote an extremely interesting analysis of the situation: “Data Mining and Statistics: What’s the Connection?”. He points out, quite correctly, the statistical impoverishment of some common approaches to data mining. You can certainly blame statistics for not marketing its ideas well enough, or blame CS for ignoring statistics. For example, there’s a good case that a lot of genetic algorithm and neural network research was much ado about nothing: over-complicated, cool-sounding hammers looking for nails when all you needed were some time-honored statistical and optimization techniques. (E.g. why NN when you haven’t tried a straight-up GLM? Why GA when you haven’t tried Nelder-Mead?) But this problem has been rectified somewhat. For example, NLP has seen a big move to simple linear models as the default technique, and NNs and GAs have fallen from grace in mainstream ML.

Friedman argues part of the problem is in how statisticians approach problems and the world:

One can catalog a long history of Statistics (as a field) ignoring useful methodology developed in other data related fields. Here are some of them that had seminal beginnings in Statistics but for the most part were subsequently ignored in our field: Pattern Recognition, Neural Networks, Machine Learning, Graphical Models, Chemometrics, Data Visualization.

That is not to say statistics is not important; it’s incredibly important. He quotes Efron as saying “Statistics has been the most successful information science.” However, information science is becoming bigger and broader and more exciting, thanks to computation and ever-increasing amounts of data. What should statisticians do? Friedman continues (light editing and emphasis are mine):

One view says that our field should concentrate on that small part of information science that we do best, namely probabilistic inference based on mathematics. If this view is adopted, we should become resigned to the fact that the role of Statistics as a player in the “information revolution” will steadily diminish over time.

Another point of view holds that statistics ought to be concerned with data analysis. The field should be defined in terms of a set of problems — rather than a set of tools — that pertain to data. Should this point of view ever become the dominant one, a big change would be required in our practice and academic programs.

First and foremost, we would have to make peace with computing. It’s here to stay; that’s where the data is. This has been one of the most glaring omissions in the set of tools that have so far defined Statistics. Had we incorporated computing methodology from its inception as a fundamental statistical tool (as opposed to simply a convenient way to apply our existing tools) many of the other data related fields would not have needed to exist. They would have been part of our field.

Friedman wrote this article more than 10 years ago. All his observations about the importance and increasing prevalence of data and computing power are even more true today than back then. Has the field of statistics changed? Not clear. (I’d appreciate seeing evidence to the contrary.)

On the other hand a world of data *has* to be increasingly statistical. The positive spin from Efron:

A new generation of scientific devices, typified by microarrays, produce data on a gargantuan scale – with millions of data points and thousands of parameters to consider at the same time. These experiments are “deeply statistical”. Common sense, and even good scientific intuition, won’t do the job by themselves. Careful statistical reasoning is the only way to see through the haze of randomness to the structure underneath. Massive data collection, in astronomy, psychology, biology, medicine, and commerce, is a fact of 21st Century science, and a good reason to buy statistics futures if they are ever offered on the NASDAQ.

I know that I’m interested in quantitative information science, including statistics and data analysis. Machine learning has many strengths, but it is definitely an odd way to go about analysis. But there’s a good case that statistics, as traditionally defined, is only going to have a smaller role in the future. “Data mining” sounds more relevant, but does it even exist as a coherent subject? Maybe it’s time to study a more applied statistical field like econometrics.

74 Responses to Statistics vs. Machine Learning, fight!

  1. Pingback: Machine learning — The Endeavour

  2. Pingback: Brendan O’Connor puts machine learning and statistics in a jar and shakes the jar « Mike Love’s blog

  3. Carlos says:

    Somewhat related to this:
    http://tinyurl.com/breiman2001
    Leo Breiman, Statistical Modeling: The Two Cultures.
    From the abstract
    “The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets.”

  4. brendano says:

    Very interesting, very related paper; thanks.

  5. ekzept says:

    To paraphrase Feynman, something (some problem, some theorem, some assessment, some analysis, some method) isn’t new just because it has been given a new name.

  6. miked98 says:

    Cool post. It seems to be a general phenomenon what happens when applied sciences (psychology, biology, economics) adopt methods of the [relatively more] pure sciences (physics, statistics, mathematics), or when newer disciplines (synthetic biology) venture into space occupied by an older one (electrical engineering).

    In the long run, things do get cleaned up. But it may take a younger generation of thinkers, with a stronger grasp of the pure sciences (think of the mathematicians who’ve ventured into economics, or the physicists who’ve gotten into synthetic biology) to sort out the mess of nomenclature and integrate the fields.

  7. Pingback: Machinel Learning İstatistiğe Karşı: Dövüş Başlasın! | FZ Blogs

  8. Tibshirani’s graph should really have included two additional factors: (1) average number of courses taught/year, and (2) median student/post-doc/faculty stipends and salaries. I think it’s part of the explanation of grant size, since most CS folks I knew at CMU simply bought themselves out of teaching by having grants pay their salaries. This makes CS departments much more highly leveraged (grant money needed per person to sustain the operation; at CMU our budget from the university didn’t even cover tenured faculty salaries, much less T.A.s). Even so, a $1M grant in CS is going to support a lot more people than a $50K grant in stats. You see an even stronger form of this effect in medical research, which has huge salaries and huge grants with armies of cheap post-docs.

    Once you put average courses taught/year and see the statisticians with 3 or 4 and the machine learning people with 1 or maybe 2

    Maybe we just need larger grants so we can go to the Alps and pay expensive graduate students and get away without having to actually teach for a living. I think the big tell would be

  9. Blame the physicists for the term “max entropy”, which is just plain old logistic regression (as are one-layer neural networks with sigmoid activations or softmax). But the non-Bayesian statisticians get the blame for “regularization”, and “lasso”/“ridge”; they’re just priors to the Bayesians.

    Don’t diss back-prop! It’s having a renaissance in stochastic gradient methods all over machine learning.

    Folks in machine learning are discovering Bayesian methods of dealing with uncertainty, whereas the Bayesian statisticians have been using graphical models in custom and general-purpose systems like BUGS for decades.

    Dan Jurafsky and I were just discussing ML vs. stats, because we’ve both been doing more social science type stats. We were both surprised, like Brendan, that the statisticians don’t use cross-validation. I speculated it’s largely because the statistical paradigm of evaluating fit leads them to build models focused on analyzing existing data sets rather than to doing forward-looking predictions, but there are lots of counterexamples, such as FiveThirtyEight, which predicted the 2008 U.S. presidential election very accurately using Bayesian methods over polls.

    The other issue Dan and I discussed is that statisticians care deeply about their coefficients (weights, parameters, whatever), whereas machine learning folks tend to toss them all into a bin and let priors and cross-validation sort them out. Sure, we might look at the feature weights to make sure the algorithm’s doing something sensible, but we don’t write papers where the point is to explore the effect of the word “the” (a feature) on estimation (there actually should be more of these papers in ML, in my opinion). For instance, statisticians very much want to explore the effect of a person’s weight on their chance of diabetes and aren’t going to be very happy giving a doctor an SVM and saying “trust it, it worked well on cross-validation”. And they want to examine the role of income or church attendance on voting. The goal is to explore the parameters (“effects”) as much as predict which way a state’s going to vote in the next election.

    Finally, let me point out that the main systems used for microarrays in practice are simple linear factor models that any statistician would recognize, like dChip. What’s the justification for Efron’s comment that “Careful statistical reasoning is the only way to see through the haze of randomness to the structure underneath.”? Does statistics imply probability? If not, what about SVMs, as Brendan asks? Maybe all we need is room-sized 3D visualization coupled to human brain power.

  10. brendano says:

    On descriptive statistics by attention to coefficients — lots of social science empirical work involves small, limited situations where they’re trying to find out if certain effects are in play; extrapolation to other situations is usually done with reasoning by analogy. If your reasoning and decision making in future situations is going to be qualitative, a trained-up SVM from a different situation isn’t useful; but knowing the top 3 coefficients from a linear model there *is* useful qualitative information.

    I think this is the point of that bit about the Jeff Hammerbacher talk we were discussing at http://anyall.org/blog/2008/07/the-macgyver-of-data-analysis/ where he’s assuming the domain of analyzing web behavior logs and figuring out how to make a website better. You could worry about automated decision making for what content to show people (ranking, recommendations etc.); but probably the most productive thing to do is extract qualitative insights from the data to inform the design process. This is a pretty social science-y domain; t-tests and linear regressions are going to be the tools of choice.

  11. brendano says:

    Another response, from Andrew Gelman — on a rather pro-CS note: http://www.stat.columbia.edu/~cook/movabletype/archives/2008/12/machine-learnin.html

    I wish a statistician would come here and aggressively defend their discipline. At the very least, what about experimental design? Or tricky low-evidence situations: don’t you want a statistician, not an MLer, to testify at a trial about whether an event was a coincidence?

  12. Ethan Bauley says:

    Have you ever studied game theory?

  13. Pingback: Statistics vs. Machine Learning vs. Data Mining, fight! » No Random Walking!

  14. “…the highly deceptive term “neural network” for a stack of linear functions paired with a wonky, overhyped training algorithm; …”

    The term “neural network” covers a broad range of techniques, but I don’t think the above description accurately describes any of them. For one thing, any “stack of linear functions” reduces algebraically to a single linear function. I imagine that you are referring to a multi-layer perceptron, but that is built of a stack of non-linear functions.

    -Will Dwinnell
    Data Mining in MATLAB

  15. brendano says:

    I did mean a multilayer perceptron. Individual units are “linear” in the generalized linear models sense: the response is a function of a linear combination of the inputs. (The same way a logistic regression is a linear model; you stack them up to get a multilayer NN.) Trained with backpropagation this is theoretically very powerful, but unfortunately is tricky to use in practice. Thus “overhyped.”

    Sorry for any confusion. And nice blog, by the way.

  16. Ethan Bauley says:

    @brendano

    I was just asking because of your comment about econometrics. I just discovered game theory a couple of months ago and have been reading some books about applications to business strategy, like Nalebuff/Brandenburger’s “Co-opetition.” Great stuff; seems applicable to the things you’re interested in.

    I met a Caltech PhD on a flight back from San Jose a couple of weeks ago who just finished his degree in theoretical computer science (emphasis on game theory). He interviewed at YHOO and they were looking for algorithms that identify Nash equilibria in huge data sets of user interactions.

    fwiw

    ;-)

  17. lilly says:

    On the conference vs journal oriented cultures, one frustration I have with conference oriented cultures is that they still feel too slow and competitive for getting feedback on work but so fast that they encourage the publication of a lot of low hanging fruit work so you can be up on a podium every year (or multiple times a year, depending on how many prestigious conferences are in your field).

    How do people decide what is a valid contribution for presentation at NIPS? I can’t seem to make sense of it at CHI, except for a particularly traditional sort of “build system – evaluate on 10 research lab mates – summarize results” that ends up not creating very powerful or synthetic new knowledge.

  18. brendano says:

    Hey Lilly, I don’t really know what the NIPS criteria are. They do both theory and applied papers. I do know that John Langford has a bunch of interesting things to say about it and in general about review criteria.

    http://hunch.net/?p=499
    http://hunch.net/?p=191
    http://hunch.net/?p=223

  19. brendano says:

    Ah, he has a number of interesting posts here: http://hunch.net/?cat=33

  21. Pingback: Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata - Brendan O'Connor's Blog

  22. Pingback: Statistics vs Machine Learning « Data Mining, a Course by Blog

  23. So, the “stack of linear functions” I took issue with too. A neural network with any modeling power has the crucial ingredient of nonlinearities at each hidden layer. The way I think about a single layer neural network is as a logistic regression model operating on a set of features where the feature extraction is learned as well. As for “overhyped” = tricky to use, yes, that’s true. There are entire books written on effectively training the damned things. Unlike SVMs, you need to know a little something about both your domain problem and the model in order to get good results. There’s been a recent renaissance, as Bob points out, in neural networks research with the advent of methods for training deep networks (which is nearly impossible with gradient descent + backprop alone, unless you tie a lot of the weights together i.e. convolutional networks).

    The basic dichotomy between statistics and machine learning that I see is in academic lineage. Yes, they’ve invented new names for lots of things, but that’s mostly because the machine learning community grew out of computer scientists, engineers, physicists (I am often taken aback at just how many physicists seem to pop up), and yes, theoretical neuroscientists back in the 1980s, with very little crosstalk with statistics. There’s a strong difference in problem focus, as you mention.

    I also think your discussion centers somewhat unfairly on classification and regression. There’s plenty of interesting work being done in unsupervised learning of complex, generative models of data, both with prior knowledge built in and without. Pleasantly, the two communities have converged on graphical models as a common parlance for describing probabilistic models of data.

  24. Abhijit says:

    Just discovered this post. Fantastic!! It aligns with many of my thoughts, especially since I’m a biostatistician interested in high-dimensional problems where ML techniques seem to be “easier”. Still learning about ML methods, though. You’re right about the cross-validation bit, though. Statisticians aren’t necessarily trained in predictive modeling and their techniques, including CV, model averaging, bagging, … I’ve recently felt the need for learning these areas since they’re apropos of some problems I’m consulting on. There NEEDS to be more cross-fertilization of the two fields, since we keep re-inventing wheels.

  25. Hanif says:

    A great blog and interesting discussion – I’m an engineer who’s spent the last decade in biotech and pharma. Whenever someone non-analytical asks what I do, I say “data mining” which is not far from the truth. However, more recently I’ve spent more time with statistical modeling and the associated community in biostatistics.

    I definitely see the differences Brendan and Bob mention: the focus on understanding the factor parameters/effects has a lot to do with the fact that the same analysts helped design the study, which includes specifying which data to collect and contrasts to select. Many of the machine learning folks seem to be more contract mercenaries / collaborating scientists who came on post-study.

    The Netflix contest seems to be an interesting context to compare the approaches and insights gained. The highest performing [most?] groups seem to be dominated by ML. My guess is that Netflix will favor the white-box modelers to help them decide how to modify their actual suggestion engine, as opposed to wrapping around the best performer’s algorithm.

    My experience is in biological data, which is almost always underspecified and overdetermined, full of correlated variables (the correlation itself being informative), and answers used as a starting point of another study/experiment/analysis. An ML-informed statistical (or statistically-informed ML) modeling approach ends up being the most useful. I use R mostly, with a lot of other domain-specific tools, and am starting to use MATLAB more as well (deployment of visualizations).

    Keep up the great articles…

  26. brendano says:

    Abhijit –

    > Statisticians aren’t necessarily trained in predictive modeling
    > and their techniques, including CV, model averaging, bagging…

    I agree and have thought the same thing myself, but it’s still funny to read that sentence given that there are so many papers from stats journals on those topics. I think they were all, or most of them, invented by statisticians too. Maybe this is showing a difference between statistics proper versus applied stats in traditional biology and social science.

    But sometimes the ideas are there just named differently. Economists know about the held-out accuracy method of evaluation (e.g. CV & friends). They call it “out of sample predictions”. Economists, of course, broke away hard from mainline stats a while ago, calling it “econometrics” and reinventing names for EVERYTHING, plus throwing in a bizarre obsession with the method of moments. In terms of intellectual arrogance and needless renaming/duplication, economists are much worse than computer scientists and engineers. Maybe as bad as physicists.

  27. Ping Li says:

    Very nice post and interesting discussions. However, your statement of “I can’t find stats students who take the CS course” is indeed biased.

    I am a junior faculty in Statistics, with a Ph.D. in Statistics (advised by Prof. Hastie). I have master’s degrees in CS, EE (both from Stanford), and other fields. I interned as software engineer (RealNetworks Server Team and Microsoft Visual Studio) and later spent many summers at MSR.

    I totally do not like to use small datasets as I (and possibly many others) believe results on small datasets could often be fairly easily tuned and one can hardly test the significance of results from small datasets. However, I sometimes find I must also use small datasets, since they were used in many (CS) machine learning papers.

    I am just one example of “statistician”. There are much better examples. The statistics folks at ATT and Google are doing wonderful adorable things.

    There are probably some non-statisticians who are used to viewing “statisticians” as “statisticians”. My research proposals often received comments like “this is just a statistician’s view of …”, “this work is not inter-disciplinary”, etc.

    I really believe there should not be any “statisticians’ view” or “Computer Scientists’ view”. As long as the algorithms work on the real data, then it is a good view. Why should we bother putting a label?

  28. Pingback: My daily readings 09/29/2009 « Strange Kite

  29. Great book. I posted that to a class mailing list earlier this week then today suddenly it’s all over programming websites like reddit. I always wonder…

  30. Pingback: “Statistics vs. Machine Learning, fight!” « Trying to Make Sense of Data

  31. Bastian says:

    Hi guys,
    you seem to be experts. I am a student in economics and I am working with Markov Logic Networks.
    But now I have a question: in machine learning, there are no parameters but weights. In econometrics an important thing is to look at whether there is a significant influence, how strong it is and in which direction it goes. From weights (at least in Markov Logic Networks) I can’t get this information. So how can this problem be solved? By estimating the probability distribution and comparing the result with a random probability distribution (as is done in graph theory)?
    For examining empirical results and for testing theories I’m not sure if machine learning is superior to statistics.
    What is your opinion?

    With regards.

  32. brendano says:

    Hi Bastian,

    I, at least, am not an expert at all!

    Weights are exactly the same thing as parameters. In fact, weights in an MLN are very similar to parameters in a logit regression. (Similarly with other log-linear structured models like CRFs and MRFs.) But their interpretation might be a little more complex given the structuredness of MLNs.

    MLNs have their own mailing list that might be useful to try your question on; “alchemy-discuss”, I think it’s called. Somewhere on the UWashington website.

    You’re right that techniques developed in ML-land are usually less focused on making accurate descriptive inferences. I’ve never seen anyone try to do significance testing for MLNs, for example. I think this is a big weakness of ML, at least as it’s usually conceived. All this stuff will be merged together eventually, but in the meantime, there’s still confusion.

  33. Bastian says:

    Hello Brendan,
    thank you for the answer! Right, I know “alchemy-discuss”, and I posed some theoretical questions there. But nobody answered, I don’t know why.

    Bye bye.

  34. Rhiannon Weaver says:

    At Carnegie Mellon stats, we’ve been aware of this for quite some time. I started there in 2000 and one of our first semester courses was stat computing. With the prevalence of Bayesian methods (whereby you CAN figure out the ‘probability your model is right’), and practical ways of estimating complex hierarchical models, you have to take a very problem-oriented approach. See this letter to the American Statistician by Kass and Brown:

    http://pubs.amstat.org/doi/abs/10.1198/tast.2009.0019

    I would also argue, however, that ML doesn’t say very much at all about experimental design and/or controlling for multiple sources of error in experiments. You mention above a lot of things that ML has that stats doesn’t; there’s one thing at least that stats has that ML doesn’t.

  35. pm says:

    Hi Brendan,

    I stumbled across your blog about a year ago and peeked back in
    every now and then since. Considering myself a probabilist who
    comes from the theoretical side, I would not really call myself
    a statistician, but I am certainly more open to statistical
    methods than to machine learning.

    I got in touch with machine learners after leaving university, and
    the one thing that puzzled me most and is most critical to me
    is at the very heart of what modeling means. To me, ML is more
    focused on methods and techniques, and less on concepts that
    are suited to the problem at hand. The culture of ML is METHOD
    oriented, not PROBLEM ORIENTED, as it seems to me.

    To me, a model is anything that describes the important parts of
    a real world phenomenon I am interested in. Networks or graphs
    are only certain instances, or examples, of what a model might
    constitute. Model choice is super-critical in any data analysis you
    carry out, and any statistical inference and/or prediction you carry
    out is only valid within the model you chose. Moreover, by using
    a ‘tool’, you choose a model implicitly, always. There are no
    exceptions to this rule.

    A model might be given by a graph, a stochastic differential
    equation, a specification of distributional assumptions etc.
    The set of statistical methods that are suitable when observing
    data which are supposed to be generated by the model dynamics
    follows from the model assumptions. Many statistical methods
    are standard nowadays, and they are often employed without
    really asking whether the underlying assumptions are true.
    For example, even when using standard software and doing such
    trivial things as calculating sample means, you make assumptions
    about your data. (In this case, you assume the data are sufficiently
    independent and identically distributed.)

    And this is what people from the ML community seem not to be aware of.
    By choosing a neural network, an SVM, or any other kind of super-flexible
    mechanism and fitting that to your data, they make the assumption
    that the data are generated by the dynamics the tool implies.
    The model is implied by the tool, the tool _replaces_ the model.
    Depending on the application, the consequences of this approach
    are more or less serious. And sometimes they are very serious…
    e.g. in financial engineering when heavy tailed phenomena are of
    paramount importance, but ML techniques are mostly based on the
    assumption that noise follows a Gaussian law…

    In my opinion, this is what really makes up the different cultures
    between ML and statistics. Good statisticians are well aware of the
    limitations of their tools, MLers aren’t… what do you think?
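
The point above about sample means can be checked with a small simulation. This is a sketch under stated assumptions, not part of the comment: the standard Cauchy distribution stands in for a heavy-tailed data source, and the helper names (`gaussian_draw`, `cauchy_draw`, `estimator_iqr`) are made up for illustration. It measures how much the sample mean and the sample median vary across repeated samples of a given size.

```python
import math
import random
import statistics


def gaussian_draw(rng):
    """One draw from a standard normal distribution."""
    return rng.gauss(0.0, 1.0)


def cauchy_draw(rng):
    """One draw from a standard Cauchy distribution via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def estimator_iqr(draw, estimator, n, trials=400, seed=0):
    """Interquartile range of `estimator` over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = [
        estimator([draw(rng) for _ in range(n)]) for _ in range(trials)
    ]
    q1, _, q3 = statistics.quantiles(estimates, n=4)
    return q3 - q1


if __name__ == "__main__":
    # Columns: sample size, spread of Gaussian mean, Cauchy mean, Cauchy median.
    for n in (10, 500):
        print(
            n,
            round(estimator_iqr(gaussian_draw, statistics.fmean, n), 3),
            round(estimator_iqr(cauchy_draw, statistics.fmean, n), 3),
            round(estimator_iqr(cauchy_draw, statistics.median, n), 3),
        )
```

With Gaussian data the spread of the sample mean shrinks as the sample grows. The mean of n independent standard Cauchy draws is itself standard Cauchy, so its spread does not shrink at all, while the median still concentrates. The same tool behaves very differently depending on an assumption that is never written down.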

  36. brendano says:

    @pm, that sounds about right to me. I think the best ML theory and practices are turning into more of the statistical-style approach, of understanding both the power and limits of the techniques in question.

  37. Ping Li says:

    @pm.

    Regarding your comments on “ML techniques are mostly based on the
    assumption that noise follows a Gaussian law…”

    Successful (for example, in industry) ML methods such as trees (together with boosting) are not affected much by the heavy-tailed nature of the data.

    You mentioned SVM, which has rich and beautiful theories. My limited experience is that SVM works extremely well (and very fast) when the data are “nice” (such as MNIST). As soon as the data become “difficult”, the performance of SVM may drop dramatically (can more experienced folks correct me on this?). We academic researchers love SVM-type algorithms because we have the time and passion to carefully tune the parameters, design kernels (if one kernel does not work, use several), clean the data (remove “outliers”), normalize the data, etc.

    (I should add that linear SVM seems to be the right tool when the data are extremely high-dimensional, sparse, and more or less binary, for example text data.)

    My guess is that both ML and statistics folks are well aware of the limitations, but one is often under pressure (for example, publish or perish) to develop sophisticated algorithms that may work well only on a few (and often small or even contrived) datasets but may not work well in general. The current publication model in ML seems to favor sophisticated stuff. Just my humble opinion.
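
One commonly given reason that tree-based methods cope with heavy-tailed inputs (an explanation added here, not one the comment spells out) is that a split depends only on the ordering of a feature's values, so any strictly increasing transform of the feature leaves the chosen partition unchanged. The sketch below is a hypothetical single-split regression stump in plain Python; `best_split`, `make_data`, and the synthetic Pareto-distributed feature are illustrative assumptions. It says nothing about heavy-tailed noise in the response, which is a separate issue.

```python
import math
import random


def best_split(xs, ys):
    """Indices sent to the left child by the squared-error-minimizing stump."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(ys)
    total_sq = sum(y * y for y in ys)
    left_sum = left_sq = 0.0
    best_sse, best_k = math.inf, None
    for k in range(1, n):
        y = ys[order[k - 1]]
        left_sum += y
        left_sq += y * y
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate tied feature values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors around each child's own mean.
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        if sse < best_sse:
            best_sse, best_k = sse, k
    return frozenset(order[:best_k])


def make_data(n=200, seed=0):
    """Synthetic data: heavy-tailed feature, step-shaped response plus noise."""
    rng = random.Random(seed)
    xs = [rng.paretovariate(1.0) for _ in range(n)]
    ys = [(1.0 if x > 2.0 else 0.0) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data()
    raw = best_split(xs, ys)
    logged = best_split([math.log(x) for x in xs], ys)
    # Prints the left-child size and whether both fits chose the same partition.
    print(len(raw), raw == logged)
```

The same invariance does not hold for a least-squares line, where a single extreme feature value can have high leverage on the fit.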

  38. Akshay Bhat says:

    A post written in 2008 bashing ANNs is really pointless. Why don’t you talk about Probabilistic Graphical Models, Support Vector Machines, or other areas of learning such as Unsupervised, Online, or Reinforcement Learning? Plus, you don’t discuss Vapnik’s statistical learning theory or PAC theory, nor do you mention deep learning architectures like restricted Boltzmann machines. What about emerging problems in networks and link prediction? Or even the whole subfield of Recommendation Systems [Collaborative Filtering, as some call it]?

    If all those terms sound like too much, then what tools [not invented by CS researchers] does statistics currently possess to deal with problems such as Reinforcement Learning?

    Bashing Machine Learning in 2008 by using ANNs (popular 1995 – 2005) and back-propagation (popular circa the 1990s) isn’t convincing.

    Discounting Machine Learning by calling it Statistics is like saying all Biology is Chemistry and all Chemistry is Physics.

    • cttet says:

      Graphical Models and Support Vector Machines are, in my view, quite stats.
      But I wonder why everyone’s definitions of the terms are different.
      For people I know, those are called statistical learning.
      But NN, GA and other methods are different.

  39. Akshay Bhat says:

    I take back my comment; I didn’t read the text properly.

  40. Pingback: Machine learning hay Statistics « MFEPE

  41. Pingback: Learning about Machine Learning | Honglang Wang's Blog

  42. Pingback: Statistics vs. Machine Learning, fight! | Honglang Wang's Blog

  44. Pingback: Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata | Honglang Wang's Blog

  45. Pingback: Quora

  46. Pingback: Applying social psychology | Ready-to-hand

  47. Arthur says:

    Hello!
    I’ve just landed on this blog post and found it very interesting.
    It’s been nearly two years since your last update. Has your position evolved? Statistics VS ML, what’s the score?
    Thanks,
    Arthur

  48. I’m more into stats these days. But I think the gap between the disciplines keeps narrowing anyways.

  49. Pingback: Quora

  50. Me Me Me says:

    Is there any textbook that you would recommend to CS students who have been exposed to ML techniques, but not classical stats? I would like to read more of that, but I wouldn’t know where to start, or whether it is approachable for the average CS student ;-)

  51. @”Me Me Me”: Get “All of Statistics” by Larry Wasserman. It’s basically written exactly for this use case.

    • Pumbaa says:

      @Brendan Thx for that time you explained Gibbs sampling to me. This post is really interesting!
      @ Me Me Me: I highly recommend “All of Statistics”, too. If by any chance you are from CMU, I would recommend Larry’s “10705 : Intermediate Statistics”, too. I have a CS background like you and didn’t take many stats courses, but after taking that I feel more equipped for a lot of ML stuff.

  52. Pingback: How do I become a data scientist? | spider's space

  53. Pingback: How do I become a data scientist? « Victor Fang's Computing Space

  54. Pingback: Statistical Modeling versus Machine Learning « Data Meaning…

  55. Pingback: Bombarded with big data,big science and big learning « Big AI Dream

  56. Well, here at the coal-face, I never really saw much distinction between the two. Machine Learning? Statistics + Algorithms, as far as I am concerned. I pull in what I need at the time that I need it, irrespective of where it originates from in the academic sphere. Having said that, statisticians (and mathematicians too, for that matter) need to pull their finger out when it comes to communicating. Neither I nor my colleagues have the time or spare mental capacity for navel-gazing “look-how-clever-I-am” papers. Save the proofs for the appendices. We need clearly presented advice, written in a well-developed tutorial style, wherever possible pitched at an intelligent-early-postgraduate level.

    • plancherel says:

      haha… sorry to come across this comment so long after it was posted.

      As a mathematician in a research lab, I cannot tell you how many times I come across people like you. You try an algorithm, that you don’t understand, on data, that you don’t understand, and you get a bad result, that you don’t understand. Then, you ask someone like me or a statistician, to explain why your approach doesn’t work.

      I usually refer you guys to some paper, that you won’t understand, and then put my finger back in it. :- )

  57. cttet says:

    Statistical learning is of course related to statistics.
    But you cannot disregard NN and GA in two sentences. Machine learning is far more general!

  58. Pingback: Data Science | ModrnWiki (Pre-Alpha)

  59. Anders Nielsen says:

    I have a good foundation in applied stats and mathematical statistics.

    Could you suggest a book about machine learning for people who know statistics?

    Thanks!

    Kind regards,

    Anders

  60. I like this book generally, the new Murphy textbook. http://www.cs.ubc.ca/~murphyk/MLbook/index.html

    Also great is Hastie et al, and it is free! http://www-stat.stanford.edu/~tibs/ElemStatLearn/

  61. Pingback: How do I become a data scientist? | i4igeeks

  62. Ron Kenett says:

    My solution to this is to encourage the development of a Theory of Applied Statistics
    See: http://ssrn.com/abstract=2171179

  63. Al DeLosSantos says:

    Thanks for a great post Brendan, very helpful. I took Andrew Ng’s ML course on Coursera last fall and can highly recommend him and the course if anyone wants to learn some fundamental ML methods. Great lectures and well prepared assignments that introduce the ML methods using the Octave environment. All throughout the course I kept asking myself (should have posted this in the discussion group!) how I could reconcile his material with what I had previously studied in my few Statistics courses. Your blog discussion has helped…I just have to keep learning and participating in the discussion. :^)

  64. Pingback: What is Machine Learning | Machine Learning Mastery

  66. Pingback: Unrelated to all that, 6/26 | neuroecology

  67. Pingback: 機械学習とは何か? – 機械学習の定義と、使える言い回し | POSTD

  68. Pingback: 유전자 프로그래밍과 트레이딩 첫째

  69. Pingback: How do I become a data scientist? | lordtomriddle