Where can I find some real world typo statistics?

I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.

There are 2 kinds of mistakes:

Typos - "Helllo" instead of "Hello", "Satudray" instead of "Saturday", etc.

Spelling - "Shikago" instead of "Chicago"

I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).

I want to focus on the Damerau-Levenshtein distance (or simply edit distance). Textbook implementations always use a weight of 1 for deletions, insertions, substitutions, and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / "real-world probabilities".
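To make the idea of non-unit weights concrete, here is a minimal sketch of an edit distance with configurable costs. It implements the optimal string alignment variant (the restricted form of Damerau-Levenshtein, where a transposed pair is not edited again), not the unrestricted algorithm. The function name `weighted_osa_distance`, its parameters, and the example cost values are illustrative assumptions, not measured statistics.

```python
from typing import Callable, Optional


def weighted_osa_distance(
    source: str,
    target: str,
    insert_cost: float = 1.0,
    delete_cost: float = 1.0,
    transpose_cost: float = 1.0,
    substitute_cost: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Optimal string alignment distance with configurable operation costs.

    All cost values are placeholders; real ones would come from typo statistics.
    """
    if substitute_cost is None:
        # Default: every substitution costs 1, as in the textbook version.
        substitute_cost = lambda a, b: 1.0

    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # Turning a prefix of source into "" takes only deletions, and vice versa.
    for i in range(1, rows):
        dist[i][0] = i * delete_cost
    for j in range(1, cols):
        dist[0][j] = j * insert_cost

    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            best = min(
                dist[i - 1][j] + delete_cost,
                dist[i][j - 1] + insert_cost,
                dist[i - 1][j - 1] + (0.0 if a == b else substitute_cost(a, b)),
            )
            # Adjacent transposition: "dr" in source versus "rd" in target.
            if i > 1 and j > 1 and a == target[j - 2] and source[i - 2] == b:
                best = min(best, dist[i - 2][j - 2] + transpose_cost)
            dist[i][j] = best

    return dist[-1][-1]


if __name__ == "__main__":
    print(weighted_osa_distance("Helllo", "Hello"))
    print(weighted_osa_distance("Satudray", "Saturday", transpose_cost=0.5))
```

With unit costs this reduces to the textbook behaviour. Passing a `substitute_cost` function is one way to make, say, keyboard-adjacent substitutions cheaper than arbitrary ones once real statistics are available.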

1 Answer


You can check out some possible sources given below:

One source of real-world typo statistics is Wikipedia's complete edit history:

http://download.wikimedia.org/
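Once you have pairs of (typo, correction), however you extract them, one rough way to turn them into statistics is to tally which edits separate each pair. The sketch below uses Python's standard `difflib.SequenceMatcher` for that. The function name `count_edit_operations` and the sample pairs (taken from the question) are illustrative assumptions, and parsing the Wikipedia dumps themselves is not shown. Note that `difflib` reports only insert, delete, and replace spans, so a transposition shows up as a combination of those rather than as its own operation.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_edit_operations(pairs):
    """Tally the edits that turn each correct word into its observed typo.

    pairs: iterable of (typo, correct) string tuples.
    Returns a Counter keyed by (operation, correct_text, typo_text).
    """
    counts = Counter()
    for typo, correct in pairs:
        matcher = SequenceMatcher(None, correct, typo, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            # tag is "insert", "delete", or "replace", relative to the correct word.
            counts[(tag, correct[i1:i2], typo[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data; real pairs would come from a corpus.
    sample = [("Helllo", "Hello"), ("Shikago", "Chicago")]
    for operation, count in count_edit_operations(sample).items():
        print(operation, count)
```

Dividing each count by the total number of observed edits would give relative frequencies, which could then be converted into costs for a weighted edit distance.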

You can also try the RegExTypoFix list from AWB (AutoWikiBrowser):

http://en.wikipedia.org/wiki/Wikipedia:AWB/T

Hope this answer helps.
