10/1/09 update — well, it’s been nearly a year, and I should say not everything in this rant is totally true, and I certainly believe much less of it now. Current take: Statistics, not machine learning, is the real deal, but unfortunately suffers from bad marketing. On the other hand, to the extent that bad marketing includes misguided undergraduate curriculums, there’s plenty of room to improve for everyone.
So it’s pretty clear by now that statistics and machine learning aren’t very different fields. I was recently pointed to a very amusing comparison by the excellent statistician (and machine learning expert) Robert Tibshirani. Reproduced here:
|Machine learning|Statistics|
|---|---|
||test set performance|
||density estimation, clustering|
|large grant = $1,000,000|large grant = $50,000|
|nice place to have a meeting: Snowbird, Utah, French Alps|nice place to have a meeting: Las Vegas in August|
Hah. Or rather, ouch! I had two thoughts reading this. (1) Poor statisticians. Machine learners invent annoying new terms, sound cooler, and have all the fun. (2) What’s wrong with statistics? Statisticians have way less funding and influence than it seems they might deserve.
There are several issues going on here, both substantive and cultural:
There might be too much re-making-up of terms on the ML side. But lots of these are useful. “Weights” is a great, intuitive term for the parameters of a linear model. I use it all the time to explain classifiers and regressions to non-experts.
I was surprised to see “test set” on the statistics side; I’m used to thinking of held-out test set accuracy as an extremely common ML technique, while in statistics, model fit is assessed with parametric assumptions for standard errors and such. I really like cross-validation and bootstrapping as ways of thinking about generalization. Again, they are far easier to grasp than the sampling and hypothesis testing approaches to parameter inference that keep getting taught to, and misunderstood by, generations of confused Introduction to Statistics students. For example, how many times has it been explained that no, a p-value is NOT the probability your model is wrong? But scientific papers regularly treat significance levels in that manner (look how many stars are on this result!). On the other hand, cross-validation accuracy *is* something you can interpret as being related to the probability your model is right.
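To make the cross-validation idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the synthetic two-class data, the toy nearest-class-mean classifier, and the helper names (`k_fold_indices`, `cross_val_accuracy`) are made up and not taken from any particular library. The point is only that the number it produces is an estimate of how often the fitted model predicts a new, unseen point correctly.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_class_means(xs, ys):
    """Toy 1-D classifier: remember the mean of x for each class label."""
    means = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict(means, x):
    """Predict the label whose class mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def cross_val_accuracy(xs, ys, k=5, seed=0):
    """Average held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training folds, then score on the held-out fold.
        model = fit_class_means([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

# Synthetic data (made up for illustration): two overlapping Gaussian classes.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

cv_acc = cross_val_accuracy(xs, ys, k=5)
print(f"5-fold CV accuracy: {cv_acc:.2f}")
```

No parametric assumption about standard errors is needed to read the result: it is just the fraction of held-out points the model got right, averaged over folds.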
I’ll also note that there are definitely a number of topics in ML that aren’t very related to statistics or probability. Max-margin methods: if all we care about is prediction, why bother using a probability model at all? Why not just optimize the spatial geometry instead? SVMs don’t require a lick of probability theory to understand. (Of course probability-based approaches are huge in ML, but it’s important to remember they’re not the only game in town, and there is no necessary reason they must be.) And then there are non-traditional settings such as online learning, reinforcement learning, and active learning, where the structure of access to information is in play. Conversely, there are certainly plenty of things in statistics that aren’t considered part of ML: say, regression diagnostics and significance testing. Finally, many ML problems involve large, high dimensional data and models, where computational issues are very important. For example, in statistical machine translation, alignment models are described with probability theory and fit to data, but their structure is complex enough that optimal inference is intractable, and how you do approximate inference (EM, Viterbi, beam search, etc.) is a major issue.
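As a rough illustration of the max-margin point, the sketch below fits a linear classifier by minimizing a regularized hinge loss, with no likelihood or probability model anywhere. It is only a sketch under stated assumptions: the six 2-D points are made up, the function names are hypothetical, and plain subgradient descent is a simple stand-in for the quadratic-programming solvers that SVM packages typically use.

```python
def score(w, b, x):
    """Signed distance-like score of point x under the linear rule (w, b)."""
    return w[0] * x[0] + w[1] * x[1] + b

def hinge_loss(w, b, points, labels, lam=0.01):
    """Average hinge loss plus an L2 penalty; labels are +1 or -1."""
    data_term = sum(max(0.0, 1.0 - y * score(w, b, x))
                    for x, y in zip(points, labels)) / len(points)
    return data_term + 0.5 * lam * (w[0] ** 2 + w[1] ** 2)

def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the regularized hinge loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(points)
    for _ in range(epochs):
        gw = [lam * w[0], lam * w[1]]  # gradient of the L2 penalty
        gb = 0.0
        for x, y in zip(points, labels):
            # Only points inside the margin (or misclassified) contribute.
            if y * score(w, b, x) < 1.0:
                gw[0] -= y * x[0] / n
                gw[1] -= y * x[1] / n
                gb -= y / n
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

# Toy, linearly separable data (made up for illustration).
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [1 if score(w, b, x) >= 0 else -1 for x in points]
print(predictions)
```

The only quantities involved are distances to a separating line, which is what “optimize the spatial geometry” amounts to here.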
But the most interesting differences between stats and ML are institutional.
I’ve been hearing lots of friends compare two dueling courses at Stanford: CS229, the CS department’s “machine learning” course taught by Andrew Ng; and Stat 315 A/B, the Statistics department’s “statistical learning” sequence taught by some combination of Tibshirani, Jerome Friedman, and Trevor Hastie. These people are all top-of-the-line researchers in the field. Their courses’ contents are extremely similar; I’d bet any of them could teach most of the material from the other side.
What differs most is the teaching style. CS has far better lecture notes. Of course, the stats people wrote a very good book; but better lecture notes win because I can access them later and send them to people for free. CS students I’ve talked to think the CS course is better taught; I can’t find stats students who take the CS course. (My sample is biased, though I know people in both.) Finally, the CS course has a big, open-ended project component; the Stats course follows more of a traditional problem set and tests format.
I think this is reflective of the differences in institutional culture between CS and Stats. There’s an interesting John Langford post on part of the issue, which he calls “The Stats Handicap”. He points out that stats Ph.D.s have a big disadvantage in the job market because statistics has an old-school journal-oriented publishing culture, so students publish much less and have less experience engaging with a research community. CS is conference-oriented (certain conferences have higher prestige than many journals, e.g. NIPS in ML, CHI in HCI), and this results in faster turnaround, dissemination, and collaboration. (I’ve heard others make similar comparisons between CS and psychology.) I’d expect any discipline with a larger conference emphasis to have better courses, since conferences should reward presentation/teaching skills, or at least encourage practice, more than the journal world does.
ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does not.
Is marketing a problem? Machine learning terms definitely sound pretty cool. Maybe the perspective of computational intelligence lends itself to cool names. Though the Stanford statisticians certainly know how to play this game: for example, they made up their own names for variants of L1- and L2-regularized regression, leaving annoyed people like me forever googling “lasso” and “ridge” trying to remember which is which. (On the other hand, perhaps that’s child’s play compared to the true original sin of ML nomenclature: tossing around the highly deceptive term “neural network” for a stack of linear functions paired with a wonky, overhyped training algorithm; the combination of which, many years later, still causes confusion. Definitely blame CS for that one.)
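For the lasso-versus-ridge question, a small sketch may help as a mnemonic: the lasso is the L1 penalty and ridge is the L2 penalty. The code assumes the simplest possible setting, a single coefficient whose unpenalized least-squares estimate is `z` (equivalently, an orthonormal design handled one coefficient at a time). The function names and the example numbers are made up for illustration.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + 0.5*lam*b**2 (L2 penalty): rescale toward 0."""
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*abs(b) (L1 penalty): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero

# Hypothetical least-squares estimates for four coefficients.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = [ridge_shrink(z, lam) for z in ols]
lasso = [lasso_shrink(z, lam) for z in ols]
print("ridge:", ridge)
print("lasso:", lasso)
```

In this setting ridge shrinks every coefficient but leaves none exactly at zero, while the lasso zeroes out the small ones, which is why the lasso is the one associated with sparse models.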
Another issue is the definition of statistics itself. In 1997, Jerome Friedman wrote an extremely interesting analysis of the situation: “Data Mining and Statistics: What’s the Connection?”. He points out, quite correctly, the statistical impoverishment of some common approaches to data mining. You can certainly blame statistics for not marketing its ideas well enough, or blame CS for ignoring statistics. For example, there’s a good case that a lot of genetic algorithm and neural network research was much ado about nothing; that is, over-complicated cool-sounding hammers looking for nails when all you needed were some time-honored statistical and optimization techniques. (E.g. why NN when you haven’t tried a straight-up GLM? Why GA when you haven’t tried Nelder-Mead?) But this problem has been rectified somewhat: for example, NLP has seen a big move to simple linear models as the default technique, and NNs and GAs have fallen from grace in mainstream ML.
Friedman argues part of the problem is in how statisticians approach problems and the world:
One can catalog a long history of Statistics (as a field) ignoring useful methodology developed in other data related fields. Here are some of them that had seminal beginnings in Statistics but for the most part were subsequently ignored in our field: Pattern Recognition, Neural Networks, Machine Learning, Graphical Models, Chemometrics, Data Visualization.
That is not to say statistics is not important; it’s incredibly important. He quotes Efron as saying “Statistics has been the most successful information science.” However, information science is becoming bigger and broader and more exciting, thanks to computation and ever-increasing amounts of data. What should statisticians do? Friedman continues (light editing and emphasis are mine):
One view says that our field should concentrate on that small part of information science that we do best, namely probabilistic inference based on mathematics. If this view is adopted, we should become resigned to the fact that the role of Statistics as a player in the “information revolution” will steadily diminish over time.
Another point of view holds that statistics ought to be concerned with data analysis. The field should be defined in terms of a set of problems — rather than a set of tools — that pertain to data. Should this point of view ever become the dominant one, a big change would be required in our practice and academic programs.
First and foremost, we would have to make peace with computing. It’s here to stay; that’s where the data is. This has been one of the most glaring omissions in the set of tools that have so far defined Statistics. Had we incorporated computing methodology from its inception as a fundamental statistical tool (as opposed to simply a convenient way to apply our existing tools) many of the other data related fields would not have needed to exist. They would have been part of our field.
Friedman wrote this article more than 10 years ago. All his observations about the importance and increasing prevalence of data and computing power are even more true today than back then. Has the field of statistics changed? Not clear. (I’d appreciate seeing evidence to the contrary.)
On the other hand a world of data *has* to be increasingly statistical. The positive spin from Efron:
A new generation of scientific devices, typified by microarrays, produce data on a gargantuan scale – with millions of data points and thousands of parameters to consider at the same time. These experiments are “deeply statistical”. Common sense, and even good scientific intuition, won’t do the job by themselves. Careful statistical reasoning is the only way to see through the haze of randomness to the structure underneath. Massive data collection, in astronomy, psychology, biology, medicine, and commerce, is a fact of 21st Century science, and a good reason to buy statistics futures if they are ever offered on the NASDAQ.
I know that I’m interested in quantitative information science, including statistics and data analysis. Machine learning has many strengths, but it is definitely an odd way to go about analysis. But there’s a good case that statistics, as traditionally defined, is only going to have a smaller role in the future. “Data mining” sounds more relevant, but does it even exist as a coherent subject? Maybe it’s time to study a more applied statistical field like econometrics.