+1 vote
in Machine Learning by (19k points)

I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether I can do this in Python with a toolkit like NLTK.

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my" 

Does anyone know how this could be done? Any help would be much appreciated.

1 Answer

+1 vote
by (33.1k points)

You can use Python’s NLTK library for this. The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language.

You can use the words corpus from NLTK to keep only the tokens that appear in its English word list:

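import nltk

# nltk.download("words")  # run once if the words corpus is not installed yet

words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."

# Keep tokens found in the English word list, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())

# 'Io to the beach with my .'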

As the output shows, "Io" is kept because it happens to be an English word. In general, it may be hard to decide whether a word is English or not.
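To get closer to the output asked for in the question, one option is to drop punctuation as well and to list known false positives such as "io" explicitly. The following is only a minimal sketch under a few assumptions: keep_english, its exclude parameter, and the tiny vocab set are hypothetical names made up for illustration, and the regular expression stands in for nltk.wordpunct_tokenize so that the example runs without downloading any corpus.

```python
import re

# Tokenizer pattern in the style of NLTK's wordpunct_tokenize:
# runs of word characters, or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def keep_english(text, vocabulary, exclude=frozenset()):
    """Hypothetical helper: keep only alphabetic tokens found in vocabulary.

    Tokens whose lowercase form is in `exclude` are dropped even if they
    are in the vocabulary (useful for ambiguous words such as "io").
    """
    kept = [
        tok for tok in TOKEN_RE.findall(text)
        if tok.isalpha()
        and tok.lower() in vocabulary
        and tok.lower() not in exclude
    ]
    return " ".join(kept)

# Tiny stand-in vocabulary for illustration only (an assumption); in
# practice it could be built from nltk.corpus.words.words(), lowercased.
vocab = {"io", "to", "the", "beach", "with", "my"}

result = keep_english("Io andiamo to the beach with my amico.", vocab, exclude={"io"})
print(result)  # to the beach with my
```

The exclude set has to be maintained by hand, so this only works when the ambiguous words are known in advance.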

Hope this answer helps.

