I am doing a data cleaning exercise in Python, and the text that I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether I can do this in Python with a toolkit like NLTK.

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my" 

Does anyone know how this could be done? Any help would be much appreciated.

1 Answer


You can do this with Python's NLTK library. The Natural Language Toolkit (NLTK) is a suite of libraries and programs, written in Python, for symbolic and statistical natural language processing of English text.

You can use NLTK's words corpus, which is an English word list, and keep only the tokens that appear in it:

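import nltk

# nltk.download("words")  # one-time download of the English word list

words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."

# Keep tokens found in the word list; non-alphabetic tokens such as "." pass the isalpha() check and are kept too.
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'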

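As the output above shows, "Io" is kept because it happens to be in the English word list. In general, it can be hard to decide whether a word is English or not, and the result is only as good as the word list you check against.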
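
If you want exactly the output from the question, one option is to drop punctuation and to exclude known false positives such as "Io" by hand. The following is only a minimal sketch of that idea. The names keep_english, demo_vocab and exclude are made up for illustration and are not part of NLTK. The small demo_vocab set stands in for set(nltk.corpus.words.words()) so that the example runs without downloading the corpus.

```python
import re


def keep_english(text, vocabulary, exclude=frozenset()):
    """Return the alphabetic tokens of text that are in vocabulary and not in exclude."""
    # Runs of letters only (Unicode-aware), so digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text)
    kept = [t for t in tokens
            if t.lower() in vocabulary and t.lower() not in exclude]
    return " ".join(kept)


# Tiny stand-in vocabulary (assumption); in practice pass the NLTK word list as a set.
demo_vocab = {"io", "to", "the", "beach", "with", "my"}

result = keep_english("Io andiamo to the beach with my amico.",
                      demo_vocab, exclude={"io"})
print(result)  # to the beach with my
```

The exclude set has to be maintained by hand, so this only helps when the number of ambiguous words in your data is small.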

Hope this answer helps.

