My iPhone auto-corrected “Harvard” to “Garbage”. Well played, Apple engineers.
I was wondering how this could happen, and then noticed that each aligned character pair is only 0 to 2 keys apart on the QWERTY keyboard. Perhaps their model is eager to allow QWERTY-local character substitutions.
[('h', 'g'), ('a', 'a'), ('r', 'r'), ('v', 'b'), ('a', 'a'), ('r', 'g'), ('d', 'e')]
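To check that claim, here is a minimal sketch using an illustrative grid of the QWERTY letter rows and Chebyshev distance between keys; the coordinates and the metric are my own rough stand-in, not Apple's actual keyboard model:

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

def key_distance(a, b):
    # Chebyshev distance on the rough key grid above.
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return max(abs(r1 - r2), abs(c1 - c2))

print([(a, b, key_distance(a, b)) for a, b in zip("harvard", "garbage")])
# Every aligned pair comes out 0-2 keys apart, e.g. ('h','g') -> 1, ('d','e') -> 1.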
And then just about any language model thinks p(“garbage”) > p(“harvard”), at the very least a unigram model trained on a broad-domain corpus. So if it’s a noisy-channel-style model, they’re underpenalizing the edit distance relative to the LM prior. (Reference: Norvig’s noisy channel spelling correction article.)
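Concretely, a Norvig-style corrector picks the candidate w maximizing P(w) * P(typed | w); if the channel term charges too little per edit, the LM prior takes over. Here is a toy sketch of that tradeoff, with made-up unigram probabilities and a made-up per-edit penalty (nothing here is Apple's or Norvig's actual model):

import math

# Made-up unigram log-probs for two candidate corrections.
UNIGRAM_LOGPROB = {"garbage": math.log(1e-4), "harvard": math.log(1e-7)}

def channel_logprob(typed, candidate, per_edit_penalty):
    # Crude channel model: charge a fixed log-prob penalty per mismatched
    # aligned character (a stand-in for a real edit/keyboard model).
    edits = sum(a != b for a, b in zip(typed, candidate))
    return -per_edit_penalty * edits

def correct(typed, per_edit_penalty):
    return max(UNIGRAM_LOGPROB, key=lambda w:
               UNIGRAM_LOGPROB[w] + channel_logprob(typed, w, per_edit_penalty))

print(correct("harvard", per_edit_penalty=0.5))  # weak edit penalty: "garbage" wins
print(correct("harvard", per_edit_penalty=5.0))  # strong edit penalty: "harvard" survives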
On the other hand, given how insane iPhone autocorrections are, and given the number of times I’ve seen it delete a perfectly reasonable word I wrote, I’d bet “harvard” isn’t even in their LM. (In which case the LM is more like just a dictionary; call it quantizing probabilities to 1 bit if you like.) I think Hal mentioned once he would gladly give up gigabytes of storage for a better language model to make iPhone autocorrect not suck. That sounds like the right tradeoff to me.
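To spell out the 1-bit quip: a dictionary is the degenerate LM where the only thing stored per word is in-or-out. A hypothetical illustration, not Apple's code:

import math

DICTIONARY = {"garbage", "the", "of", "and"}  # pretend "harvard" didn't make the cut

def one_bit_logprob(word):
    # Uniform over dictionary words, impossible otherwise: all that survives
    # of the LM is one bit of membership per word.
    return -math.log(len(DICTIONARY)) if word in DICTIONARY else float("-inf")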
Language models with high coverage are important, as illustrated in e.g. one of those Google MT papers. Wish Apple would figure this out too.