Removing non-English words from text using Python

Question

1 Answer

Anurag · Answer 1 · 2019-07-06T13:39:15+0000

You can use Python’s NLTK library for this. The Natural Language Toolkit (NLTK) is a suite of libraries and programs, written in Python, for symbolic and statistical natural language processing of English.

You can filter the tokens against the English word list in NLTK's words corpus:

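import nltk

nltk.download("words")  # one-time download of the word list used below

words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."

# Keep tokens found in the word list; non-alphabetic tokens (punctuation, digits) pass through
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'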

As the output shows, "Io" is kept because it happens to appear in the English word list. In general, it can be hard to decide whether a word is English or not.
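
If you want to reuse the filter or try it with a different word list, the same logic can be wrapped in a function. This is only a minimal sketch: the name keep_known_words, the vocabulary parameter, and the tiny sample vocabulary are made up for illustration, and the regular expression only approximates what nltk.wordpunct_tokenize does, so the block runs without downloading any corpus.

```python
import re

# Approximates nltk.wordpunct_tokenize: runs of word characters, or runs of punctuation
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def keep_known_words(text, vocabulary):
    """Drop alphabetic tokens that are missing from vocabulary; keep everything else."""
    vocab = {word.lower() for word in vocabulary}  # case-insensitive lookup
    tokens = TOKEN_PATTERN.findall(text)
    kept = [t for t in tokens if t.lower() in vocab or not t.isalpha()]
    return " ".join(kept)


# Hypothetical hand-made vocabulary standing in for nltk.corpus.words.words()
sample_vocabulary = {"io", "to", "the", "beach", "with", "my"}
result = keep_known_words("Io andiamo to the beach with my amico.", sample_vocabulary)
print(result)
```

In real use you would pass nltk.corpus.words.words() as the vocabulary. Note that this sketch lowercases the vocabulary as well as the tokens, so capitalized entries in the word list also match.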

Hope this answer helps.

