I got an email from a promoter for Predictive Analytics World, a very expensive conference next month in San Francisco for business applications of data mining / machine learning / predictive analytics. I’m not going because I don’t want to spend $1600 of my own money, but it looks like it has a good lineup and all (Andreas Weigend, Netflix BellKor folks, case studies from interesting companies like Linden Labs, etc.). If you’re a cs/statistics person and want a job, this is probably a good place to meet people. If you’re a businessman and want to hire one, this is probably a bad event since it’s too damn expensive for grad school types. I am supposed to have access to a promotional code for a 15% discount, so email me if you want such a thing.
John Langford posted a very interesting email interview with one of the organizers of the event, about how machine learning gets applied in the real world. He seemed to think that data integration (getting all the data out of an organization's different information systems and into one place) is the most critical and hardest step. This matches my experience. What machine learning people actually study, the algorithms and models, is often the 2nd or 3rd or lower priority in an applied setting, at least when building a new system. (Similar points came up in that Jeff Hammerbacher video: the most important thing for Facebook's internal analytics efforts was data integration, e.g. clever combinations of Scribe and Hadoop.) An important exception is when the research creates a new domain that didn't exist before. But knowing how to squeeze another 2% of F-score out of document classification isn't going to matter much unless you already have a very mature system.
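For concreteness, here's a toy sketch of what that integration step often reduces to: pulling exports from two systems that never agreed on a schema and joining them on whatever key they happen to share. The file names, fields, and keys below are invented for illustration, not from any real system.

```python
import csv
from collections import defaultdict

def load_csv(path, key_field):
    """Read a CSV export and index its rows by a (normalized) key field."""
    rows = defaultdict(list)
    with open(path) as f:
        for row in csv.DictReader(f):
            # Most of the real work is discovering that a shared key exists
            # at all and normalizing it (case, padding, legacy formats).
            rows[row[key_field].strip().lower()].append(row)
    return rows

def join(left, right):
    """Inner join two keyed tables, yielding merged records."""
    for key in set(left) & set(right):
        for l in left[key]:
            for r in right[key]:
                merged = dict(l)
                merged.update(r)
                yield merged

# Hypothetical usage: a web-log export and a CRM dump that call the same
# key by different names.
# weblog = load_csv('weblog_export.csv', key_field='user_id')
# crm = load_csv('crm_dump.csv', key_field='USERID')
# unified = list(join(weblog, crm))
```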
Brendan –
Nice catch on this conference. Eric Siegel, the organizer, appears to have focused the program heavily on real-world case studies, which should be interesting.
As you know, the Bay Area R UseRs group is doing a free, co-located event on the Wednesday evening of the conference, so if you're interested in mingling with some PAW folks as well as some R users, you can sign up at: http://ia.meetup.com/67/calendar/9573566/
Mike
Amen to the data munging being most of the work. We're currently working on a customer project and I've already written two different parsers for their CSV data. The customer's programmers' efforts to piece together the data from their web logs have been Herculean. And still we're not 100% sure of what we have. There are all kinds of fields with numeric codes that are just plain hard to figure out, even with the business people and some of the coders present; much of it is legacy codes whose meanings have to be extracted from comments in their .h header files.
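That extraction step itself ends up as throwaway code. Here's a rough sketch of the kind of thing I mean, with the header format, field names, and file names all made up for illustration: scrape code meanings out of the comments in a legacy C header, then substitute them while reading the CSV dump.

```python
import csv
import re

# Matches lines like: #define STATUS_RETRY 7 /* upstream timed out */
DEFINE_RE = re.compile(r'#define\s+(\w+)\s+(\d+)\s*/\*\s*(.*?)\s*\*/')

def load_code_table(header_path):
    """Map numeric code -> description scraped from header comments."""
    table = {}
    with open(header_path) as f:
        for line in f:
            m = DEFINE_RE.match(line.strip())
            if m:
                name, value, description = m.groups()
                table[int(value)] = description or name
    return table

def decode_rows(csv_path, code_table):
    """Yield CSV rows with the opaque 'status' code replaced by its description."""
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            code = int(row['status'])
            row['status'] = code_table.get(code, 'UNKNOWN(%d)' % code)
            yield row
```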
We were getting good results with a feature that turned out to be cheating: it made sense to use it, but the value in the logs didn't reflect the value in the incoming request. It reflected the value in the outgoing response, which indirectly encoded the category we were trying to classify.
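A crude sanity check for that kind of leak is to ask how well each field predicts the label all by itself; anything that nearly determines the label deserves suspicion, since it may have been recorded on the response side rather than at request time. The sketch below is hypothetical (field names invented), not the check we actually ran.

```python
from collections import Counter, defaultdict

def leakage_report(rows, label_field, candidate_fields):
    """For each candidate field, estimate how often its most common label
    per value would be correct on its own (a crude leak detector)."""
    rows = list(rows)
    report = {}
    for field in candidate_fields:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[field]][row[label_field]] += 1
        correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
        report[field] = correct / float(len(rows))
    return report

# A field scoring close to 1.0 is either a genuinely great feature or,
# as in our case, something that leaked in from the outgoing response.
```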
The kicker is that this data’s only an approximation of the real problem. But it’s the best we have, and while more data’s better than more learning, some data’s better than nothing.
Other customers have wanted us to find things for them (e.g. forward earnings statements in 10-Q footnotes, opinions of cars in blogs, recording artists in news), but there was no existing data, so we had to (help them) create it. That's when they run into problems like whether "Bob Dylans" is a person mention in The Six Bob Dylans: More Photos From Todd Haynes' "I'm Not There" Movie. It turns out the customers are not semantic grad students or ontology boffins, so they usually just don't care.
But the real problem with all this tuning of a system to within a percent of its life is that it's usually just overfitting once you go out into the wild. For instance, the customer mentioned above plans to change the overall organization and the instruction text on their site, so none of our training data will exactly match the runtime environment.
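When that happens, about the best we can do is a drift check: compare feature distributions on the training set against a fresh sample from production and flag the features whose distributions have moved. A minimal sketch, with invented names and an arbitrary threshold; it's a sanity check, not a fix.

```python
import math
from collections import Counter

def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence between two empirical distributions."""
    keys = set(counts_a) | set(counts_b)
    total_a = float(sum(counts_a.values())) or 1.0
    total_b = float(sum(counts_b.values())) or 1.0
    p = {k: counts_a.get(k, 0) / total_a for k in keys}
    q = {k: counts_b.get(k, 0) / total_b for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    def kl(x, y):
        return sum(x[k] * math.log(x[k] / y[k]) for k in keys if x[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drift_report(train_rows, live_rows, fields, threshold=0.1):
    """Flag fields whose value distributions diverge between training and live data."""
    train_rows, live_rows = list(train_rows), list(live_rows)
    flagged = []
    for field in fields:
        train_counts = Counter(row[field] for row in train_rows)
        live_counts = Counter(row[field] for row in live_rows)
        if js_divergence(train_counts, live_counts) > threshold:
            flagged.append(field)
    return flagged
```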