<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: SF conference for data mining mercenaries</title>
	<atom:link href="https://brenocon.com/blog/2009/01/sf-conference-for-data-mining-mercenaries/feed/" rel="self" type="application/rss+xml" />
	<link>https://brenocon.com/blog/2009/01/sf-conference-for-data-mining-mercenaries/</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Tue, 25 Nov 2025 13:11:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: Bob Carpenter</title>
		<link>https://brenocon.com/blog/2009/01/sf-conference-for-data-mining-mercenaries/#comment-2089</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Fri, 23 Jan 2009 21:47:48 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=415#comment-2089</guid>
		<description><![CDATA[Amen to the data munging being most of the work.  We&#039;re currently working on a customer project and I&#039;ve already written two different parsers for their CSV data.  The customer&#039;s programmers&#039; efforts to piece together the data from their web logs have been Herculean.  And still we&#039;re not 100% sure of what we have.  There are all kinds of fields with numeric codes that are just plain hard to figure out, even with the business people and some of the coders present -- much of it&#039;s legacy codes that have to be extracted from comments in their .h header files.  

We were getting good results with a feature that turned out to be cheating because while it made sense to use it, the value in the logs didn&#039;t reflect the value in the incoming request, but rather the value in the outgoing response, which indirectly coded the category for classification.  

The kicker is that this data&#039;s only an approximation of the real problem.  But it&#039;s the best we have, and while more data&#039;s better than more learning, some data&#039;s better than nothing.

Other customers have wanted us to find things for them (e.g. forward earnings statements in 10Q footnotes, opinions of cars in blogs, recording artists in news), but there was no existing data, so we had to (help them) create it.  That&#039;s when they run into the problems like whether &quot;Bob Dylans&quot; is a person mention in &lt;a href=&quot;http://www.rollingstone.com/rockdaily/index.php/2007/08/20/the-six-bob-dylans-more-photos-from-todd-haynes-im-not-there-movie/&quot; rel=&quot;nofollow&quot;&gt;The Six Bob Dylans: More Photos From Todd Haynes’ “I’m Not There” Movie&lt;/a&gt;.  It turns out the customers are not semantic grad students or ontology boffins, so they usually just don&#039;t care.

But the real problem with all of this tuning to within a percent of a system&#039;s life is that it&#039;s usually just overfitting when you go out into the wild.  For instance, the customer mentioned above plans to change the overall organization and the instruction text on their site, so that none of our training data will exactly replicate the runtime environment.]]></description>
		<content:encoded><![CDATA[<p>Amen to the data munging being most of the work.  We&#8217;re currently working on a customer project and I&#8217;ve already written two different parsers for their CSV data.  The customer&#8217;s programmers&#8217; efforts to piece together the data from their web logs have been Herculean.  And still we&#8217;re not 100% sure of what we have.  There are all kinds of fields with numeric codes that are just plain hard to figure out, even with the business people and some of the coders present &#8212; much of it&#8217;s legacy codes that have to be extracted from comments in their .h header files.  </p>
<p>We were getting good results with a feature that turned out to be cheating because while it made sense to use it, the value in the logs didn&#8217;t reflect the value in the incoming request, but rather the value in the outgoing response, which indirectly coded the category for classification.  </p>
<p>The kicker is that this data&#8217;s only an approximation of the real problem.  But it&#8217;s the best we have, and while more data&#8217;s better than more learning, some data&#8217;s better than nothing.</p>
<p>Other customers have wanted us to find things for them (e.g. forward earnings statements in 10Q footnotes, opinions of cars in blogs, recording artists in news), but there was no existing data, so we had to (help them) create it.  That&#8217;s when they run into the problems like whether &#8220;Bob Dylans&#8221; is a person mention in <a href="http://www.rollingstone.com/rockdaily/index.php/2007/08/20/the-six-bob-dylans-more-photos-from-todd-haynes-im-not-there-movie/" rel="nofollow">The Six Bob Dylans: More Photos From Todd Haynes’ “I’m Not There” Movie</a>.  It turns out the customers are not semantic grad students or ontology boffins, so they usually just don&#8217;t care.</p>
<p>But the real problem with all of this tuning to within a percent of a system&#8217;s life is that it&#8217;s usually just overfitting when you go out into the wild.  For instance, the customer mentioned above plans to change the overall organization and the instruction text on their site, so that none of our training data will exactly replicate the runtime environment.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael E. Driscoll</title>
		<link>https://brenocon.com/blog/2009/01/sf-conference-for-data-mining-mercenaries/#comment-2088</link>
		<dc:creator>Michael E. Driscoll</dc:creator>
		<pubDate>Fri, 23 Jan 2009 21:26:32 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=415#comment-2088</guid>
		<description><![CDATA[Brendan - 

Nice catch on this conference - Eric Siegel, the organizer, appears to be heavily focused on real-world case studies -- which should be interesting.

As you know, the Bay Area R useRs group is doing a free, co-located event on the Wed evening of the conference -- so if you&#039;re interested in mingling with some PAW folks as well as some R users, you can sign up at:  http://ia.meetup.com/67/calendar/9573566/ 

Mike]]></description>
		<content:encoded><![CDATA[<p>Brendan &#8211; </p>
<p>Nice catch on this conference &#8211; Eric Siegel, the organizer, appears to be heavily focused on real-world case studies &#8212; which should be interesting.</p>
<p>As you know, the Bay Area R useRs group is doing a free, co-located event on the Wed evening of the conference &#8212; so if you&#8217;re interested in mingling with some PAW folks as well as some R users, you can sign up at:  <a href="http://ia.meetup.com/67/calendar/9573566/" rel="nofollow">http://ia.meetup.com/67/calendar/9573566/</a> </p>
<p>Mike</p>
]]></content:encoded>
	</item>
</channel>
</rss>