in Machine Learning by (19k points)

Where can I find some real-world typo statistics?

I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.

There are 2 kinds of mistakes:

Typos - "Helllo" instead of "Hello", "Satudray" instead of "Saturday", etc.

Spelling errors - "Shikago" instead of "Chicago"

I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).

I want to focus on Damerau-Levenshtein (or simply edit distance). The textbook implementations always use a weight of 1 for deletions, insertions, substitutions and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / real-world probabilities.
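To make the idea concrete, here is a minimal sketch of an edit distance with configurable per-operation weights. It implements the restricted variant (optimal string alignment), not the full Damerau-Levenshtein algorithm, and the function name and the `w_*` parameters are illustrative assumptions rather than anything from the linked implementations.

```python
def weighted_osa_distance(a, b, w_del=1.0, w_ins=1.0, w_sub=1.0, w_trans=1.0):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    from a to b, with a separate weight for each operation type.
    The weight names are hypothetical; with all weights at 1.0 this is
    the textbook unit-cost version."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]

    # Turning a prefix of a into the empty string costs only deletions,
    # and building a prefix of b from the empty string only insertions.
    for i in range(1, rows):
        d[i][0] = i * w_del
    for j in range(1, cols):
        d[0][j] = j * w_ins

    for i in range(1, rows):
        for j in range(1, cols):
            sub_cost = 0.0 if a[i - 1] == b[j - 1] else w_sub
            best = min(
                d[i - 1][j] + w_del,         # delete a[i-1]
                d[i][j - 1] + w_ins,         # insert b[j-1]
                d[i - 1][j - 1] + sub_cost,  # match or substitute
            )
            # Swap of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + w_trans)
            d[i][j] = best
    return d[-1][-1]


if __name__ == "__main__":
    print(weighted_osa_distance("Helllo", "Hello"))
    print(weighted_osa_distance("Satudray", "Saturday", w_trans=0.5))
```

With real typo statistics, the scalar weights could be replaced by per-character costs (for example, cheaper substitutions between adjacent keys), which is exactly the data being asked for.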

1 Answer

by (33.1k points)

You can check out some possible sources given below:

One source of real-world typo statistics is Wikipedia's complete edit history:

http://download.wikimedia.org/

You can also try RegExTypoFix, the typo-correction rule list used by AWB (AutoWikiBrowser):

http://en.wikipedia.org/wiki/Wikipedia:AWB/T
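Once you have extracted (typo, correction) pairs from a source like these, a rough tally of which edits occur could look like the sketch below. It assumes the pairs are already available as a list of tuples (pulling them out of a Wikipedia dump is not shown), and `count_edit_ops` is a hypothetical helper name. It uses `difflib.SequenceMatcher` from the standard library, which does not recognise transpositions, so a swap shows up as an insert plus a delete.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_edit_ops(pairs):
    """Tally (operation, typed_text, intended_text) over (typo, correction) pairs.

    `pairs` is assumed to be an iterable of (typo, correction) strings
    that has already been extracted from some corpus.
    """
    counts = Counter()
    for typo, correct in pairs:
        matcher = SequenceMatcher(None, typo, correct, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            # tag is one of "replace", "delete", "insert"
            counts[(tag, typo[i1:i2], correct[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Example pairs taken from the question above.
    sample = [("Helllo", "Hello"), ("Satudray", "Saturday")]
    for op, n in count_edit_ops(sample).most_common():
        print(op, n)
```

Normalising these counts into probabilities (and taking something like a negative log) is one possible way to turn them into edit-distance weights, but that step depends on how much data you have.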

Hope this answer helps.
