Natural language is language that has developed naturally in humans. In this blog on “What is Natural Language Processing?” we will learn the major concepts of NLP and work with packages such as NLTK and spaCy.
Introduction to Natural Language Processing
Consider any of these languages, say, English, Hindi, or French: any of the numerous languages that have evolved naturally in humans through use and repetition, without conscious planning, can be termed a natural language.
Humans can easily comprehend these languages. But if the same languages are fed to a machine, will it be able to comprehend them? This is where Natural Language Processing plays a key role.
Natural Language Processing is the ability of a computer program to understand human language as it is spoken or written.
In other words, Natural Language Processing is used to gain knowledge from the raw textual data at our disposal.
You can go through this video on “What is Natural Language Processing?” to get more insights into the concepts.
Now we will go ahead and see, ‘What is Natural Language Processing used for?’
What is Natural Language Processing used for?
To answer the question, ‘What is Natural Language Processing used for?,’ here are a few of the applications of Natural Language Processing:
- Google Translate
- Personal assistant applications like Cortana, Siri, and OK Google
- Word processors such as Microsoft Word
- Editing platforms like Grammarly
Now, going ahead with this blog on “What is Natural Language Processing?”, we will look into a use case.
Use Case: Google Translate
Here is a demonstration of Google Translate. We type the English phrase ‘What is Natural Language Processing?’ and ask for its German version, which Google Translate produces instantly.
Similarly, we can translate the same phrase into any other language.
Now that we have answered ‘What is Natural Language Processing?’ and ‘What is Natural Language Processing used for?’, let’s look at one of the components of Natural Language Processing: Natural Language Understanding.
What is Natural Language Understanding?
Natural Language Understanding, as the name states, deals with understanding inputs given in the form of sentences in text or speech. This is where the machine analyzes the different aspects of a language.
Now, in this blog on “What is Natural Language Processing?” we will implement the NLP concepts, and for that we need some tailor-made packages. In this tutorial, we will study two such packages: NLTK and spaCy.
Generally, the first step in the NLP process is tokenization. In tokenization, we basically split up our text into individual units and each individual unit should have a value associated with it.
Let’s look at an example:
We have this sentence ‘What is Natural Language Processing?’
Here, each word in this sentence is taken as a separate token, including the question mark.
We can use these tokens for other processes like parsing or text mining.
Tokenizing a Sentence Using the NLTK Package
import nltk
import nltk.corpus
from nltk.tokenize import word_tokenize
#note: the tokenizer models may need a one-time download: nltk.download('punkt')
#string
cricket = "After surprising the host in the first test, Sri Lanka made a positive start to the Test as well by bowling South Africa out for 222 before slightly losing their advantage toward the end of the day's play."
#tokenizing
cricket_tokens=word_tokenize(cricket)
cricket_tokens
#checking the type and number of tokens
type(cricket_tokens), len(cricket_tokens)
#frequency of tokens
from nltk.probability import FreqDist
fdist = FreqDist()
for i in cricket_tokens:
    fdist[i] = fdist[i] + 1
fdist
#10 most common tokens
top_10=fdist.most_common(10)
top_10
We are doing this Natural Language Processing using the Python programming language. As mentioned earlier, human beings can understand linguistic structures and their meanings easily, but machines are not yet good enough at natural language comprehension. So, when a machine parses one word at a time, it may not be able to fully understand the semantics of a sentence. Words taken one at a time like this are known as uni-grams.
For example, as uni-grams, the sentence ‘What is Natural Language Processing?’ is parsed one word at a time.
In a bi-gram, the same sentence is parsed two words at a time.
Finally, in a tri-gram, it is parsed three words at a time, as the short sketch below shows.
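Here is a minimal sketch of this idea (assuming NLTK and its punkt tokenizer models are available) that prints the uni-grams, bi-grams, and tri-grams of the example sentence:
#uni-grams, bi-grams, and tri-grams of the example sentence
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
sentence = "What is Natural Language Processing?"
tokens = word_tokenize(sentence)
print(list(ngrams(tokens, 1)))
print(list(ngrams(tokens, 2)))
print(list(ngrams(tokens, 3)))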
#bigrams, ngrams
black_smoke = "Did you know, there was a tower, where they look out to the land, to see the people quickly passing by"
black_smoke_token= word_tokenize(black_smoke)
black_smoke_token
list(nltk.bigrams(black_smoke_token))
list(nltk.trigrams(black_smoke_token))
list(nltk.ngrams(black_smoke_token,4))
Stemming
Stemming is the process of reducing a word to its base form, and this is done by cutting off the beginning or end of the word. This indiscriminate cutting will be successful on some occasions and will fail on others.
Let us look at an example using the PorterStemmer output shown below: stemming turns ‘winning’ into ‘win’ and ‘buying’ into ‘buy’, which are meaningful words, but it turns ‘studies’ into ‘studi’, which is not. So, when we are implementing stemming, it is not always necessary that the final stemmed word we get has a meaning associated with it.
Now, there are many stemming algorithms available and one such algorithm is PorterStemmer.
#stemming
from nltk.stem import PorterStemmer
pst=PorterStemmer()
pst.stem("winning"), pst.stem("studies"), pst.stem("buying")
Output:
('win', 'studi', 'buy')
Lemmatization
Lemmatization is the process of reducing words into their lemma (the dictionary form of the word). Here, the base form into which the word is converted should definitely have a meaning associated with it.
For example, as the output below shows, ‘cats’, ‘cacti’, and ‘geese’ are reduced to ‘cat’, ‘cactus’, and ‘goose’, and each of these lemmas is a meaningful word.
#Lemmatization
from nltk.stem import WordNetLemmatizer
#note: the WordNet corpus may need a one-time download: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words_to_stem = ["cats", "cacti", "geese"]
for i in words_to_stem:
    print(i + ":" + lemmatizer.lemmatize(i))
Output:
cats:cat
cacti:cactus
geese:goose
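To contrast the two approaches, here is a small illustrative sketch that runs the same words through both PorterStemmer and WordNetLemmatizer; note that the lemmatizer treats every word as a noun unless a part of speech is passed:
#comparing stemming and lemmatization on the same words
from nltk.stem import PorterStemmer, WordNetLemmatizer
pst = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "winning", "geese"]:
    print(word, "->", pst.stem(word), "|", lemmatizer.lemmatize(word))
#the lemmatizer assumes nouns by default; pass pos='v' for verbs such as 'winning'
print(lemmatizer.lemmatize("winning", pos="v"))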
Now, another important concept in Natural Language Processing is Parts of Speech Tagging (POS Tagging).
In the English language, words are considered the smallest elements that have distinctive meaning, and based on their use or function, words are categorized into different classes known as parts of speech. POS tagging is the process of marking or tagging every word in a sentence with its corresponding part of speech.
Example:
Take the sentence ‘What is Natural Language Processing?’; POS tagging marks each of its words with a tag such as pronoun, verb, adjective, or noun.
#pos tagging
NLP = "What is Natural Language Processing? I am a professional on this."
#tokenizing
NLP_tokenize = word_tokenize(NLP)
Now, we will start off with a for loop which will iterate through all of the tokens, and for each token we will add a POS tag with the help of the pos_tag function. That is, we will use nltk.pos_tag to tag each of the individual tokens from this NLP_tokenize list.
for i in NLP_tokenize:
    print(nltk.pos_tag([i]))
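Tagging tokens one at a time works, but it throws away sentence context; as a small variation, the whole token list can also be passed to nltk.pos_tag in a single call, which usually gives more accurate tags:
#tagging the whole token list at once so the tagger can use the surrounding context
print(nltk.pos_tag(NLP_tokenize))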
Now, in this blog on “What is Natural Language Processing?”, we will look at Named Entity Recognition and implement it using the NLTK package and the spaCy package.
Named Entity Recognition
It is the process of taking a string of text as input and identifying the relevant nouns such as people, places, or organizations that are mentioned in that string.
Example: ‘Apple is looking to buy UK startup for $1 billion.’
So, in this sentence, ‘Apple’ is recognized as an ‘organization,’ ‘UK’ is recognized as a ‘geopolitical entity,’ and ‘$1 billion’ is recognized as ‘money.’ One thing to be noted here is that all three of these are nouns; Named Entity Recognition is done on nouns only.
For Named Entity Recognition, we will require the ne_chunk function.
#named entity recognition
from nltk import ne_chunk
john = 'John lives in New York'
john_token=word_tokenize(john)
After tokenizing these words, we will add POS tags with respect to each of these tokens by using the function nltk.pos_tag and passing these tokens, and then we will store them in john_tags.
john_tags=nltk.pos_tag(john_token)
john_tags
#named entity recognition for pos tags
john_ner=ne_chunk(john_tags)
print(john_ner)
Till now, all the tasks were done using NLTK. We also have spaCy, which is a relatively new framework in the Python natural language processing ecosystem.
spaCy is written in Cython, a C extension of Python that provides C-like performance to Python programs.
Importing the spaCy package:
The spaCy package ships several pretrained models; here we will load the small English model, en_core_web_sm (it can be fetched once with python -m spacy download en_core_web_sm), and store it in the nlp object.
import spacy
nlp = spacy.load('en_core_web_sm')
Note that using this nlp object, we can create documents: each string we pass to nlp becomes a document.
doc = nlp("This is sparta!!")
Now, tokenization is very simple when it comes to the spaCy package. What we have to do is start a for loop and print all of the tokens present in that document.
#tokenization
for token in doc:
    print(token.text)
token = doc[2]
token
Output:
sparta
#span covering the tokens from index 2 up to (but not including) index 5
span=doc[2:5]
span
Output:
sparta!!
#printing token index and token text together
doc = nlp("This is Sparta!!")
for token in doc:
    print(token.i, token.text)
Now, we will iterate through all of the tokens in this document. First, we will print the index of the token, then the text of the token, and finally we will print the pos_tag of this token.
#pos tagging
for token in doc:
    print(token.i, token.text, token.pos_)
#Named Entity Recognition
doc = nlp("Apple is looking to buy UK startup for $1 billion")
#this will print out each entity's text and the corresponding entity label
for ent in doc.ents:
    print(ent.text, ent.label_)
doc = nlp("Barack Obama the former president of United States will be vacating the White House today")
for ent in doc.ents:
    print(ent.text, ent.label_)
Now we will look at the Matcher. It helps in recognizing patterns in a string based on different criteria, such as the lemma of a word, whether a token is a digit, whether it is a punctuation mark, etc.
from spacy.matcher import Matcher
doc = nlp("Barack Obama the former president of United States will be vacating the White House today")
First, we will create a pattern. A pattern is written as a list of dictionaries, and each dictionary describes one token. Here, the first token should have the lemma ‘vacate’, the next token should be the literal word ‘the’, and the one after that should be the literal word ‘White’.
So, basically, we want to extract from this doc the three-word substring ‘vacating the White’: the lemma variation of vacate followed by ‘the’ and ‘White’.
pattern = [{'LEMMA': 'vacate'}, {'ORTH': 'the'}, {'ORTH': 'White'}]
matcher = Matcher(nlp.vocab)
#note: in older spaCy 2.x versions the signature was matcher.add('white_pattern', None, pattern)
matcher.add('white_pattern', [pattern])
matches=matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
Going ahead in this blog on “What is Natural Language Processing?”, we will implement sentiment analysis using the NLTK package.
Sentiment Analysis Using the NLTK Package
For doing sentiment analysis using the NLTK package, we will import the required package first.
import nltk
import random
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize
import re
import os
We will be performing sentiment analysis on the IMDB reviews dataset. The training set consists of 25,000 reviews; 12,500 of them are positive and the rest are negative. The two sets are stored in separate folders: pos for positive reviews and neg for negative reviews.
files_pos = os.listdir('D:/sentiment/aclImdb/train/pos')
files_pos = [open('D:/sentiment/aclImdb/train/pos/' + f, 'r', encoding='utf8').read() for f in files_pos]
files_neg = os.listdir('D:/sentiment/aclImdb/train/neg')
files_neg = [open('D:/sentiment/aclImdb/train/neg/' + f, 'r', encoding='utf8').read() for f in files_neg]
len(files_pos), len(files_neg)
Output:
(12500, 12500)
Now, we will take only the first 1,000 reviews from each folder.
files_pos=files_pos[0:1000]
files_neg=files_neg[0:1000]
len(files_pos), len(files_neg)
To preprocess all of these reviews, we will start off by creating two empty lists:
all_words = [ ]
documents = [ ]
The first task in preprocessing is to remove stopwords. Let’s see how to do that.
from nltk.corpus import stopwords
import re
#note: the stopword list may need a one-time download: nltk.download('stopwords')
stop_words = list(set(stopwords.words('english')))
Now, what we want is a bag of words or a bag of adjectives (because using adjectives is a better way to understand the sentiment of a review).
# J is adjective, R is adverb, and V is verb
#allowed_word_types = ["J", "R", "V"]
allowed_word_types = ["J"]
The following for loop iterates over the 1,000 positive reviews.
for p in files_pos:
    #create a list of tuples where the first element of each tuple is a review and the second element is a label
    documents.append((p, "pos"))
    #remove punctuation
    cleaned = re.sub(r'[^(a-zA-Z)\s]', ' ', p)
    #tokenize the cleaned review (the one with no punctuation)
    tokenized = word_tokenize(cleaned)
    #keep only those tokens which are not present in stop_words
    stopped = [w for w in tokenized if not w in stop_words]
    #pos tagging
    pos = nltk.pos_tag(stopped)
Inside this loop, we keep only those tokens whose POS tag starts with one of the allowed_word_types, in other words, only the adjectives, and store them in all_words.
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())
#Now, all_words will have all of the positive adjectives in it
#For negative reviews
for p in files_neg:
    #create a list of tuples where the first element of each tuple is a review and the second element is a label
    documents.append((p, "neg"))
    #remove punctuation
    cleaned = re.sub(r'[^(a-zA-Z)\s]', ' ', p)
    #tokenize
    tokenized = word_tokenize(cleaned)
    #remove stopwords
    stopped = [w for w in tokenized if not w in stop_words]
    #pos tagging
    neg = nltk.pos_tag(stopped)
    #make a list of all adjectives identified by the allowed word types listed above
    for w in neg:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())
Now, this all_words will have the combination of all of the negative and positive adjectives.
#creating a frequency distribution of the adjectives
all_words=nltk.FreqDist(all_words)
all_words
import matplotlib.pyplot as plt
all_words.plot(30,cumulative=False)
plt.show()
#listing the 1,000 most frequent words
word_features = [w for (w, c) in all_words.most_common(1000)]
word_features
#function to create a dictionary of features for each review in the list documents
#the keys are the words in word_features
#the value of each key is either True or False, denoting whether that feature appears in the review or not
def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
#creating features for each review
featuresets= [(find_features(rev),category) for (rev,category) in documents]
#shuffling the documents
random.shuffle(featuresets)
training_set=featuresets[:800]
testing_set=featuresets[800:]
featuresets[1]
Here, False means that the corresponding adjective is not present in the review and True means that it is present.
Each of these feature dictionaries is paired with the label of its review, ‘pos’ or ‘neg’, and the first 800 of these pairs form the training set.
Then, we will build classifiers on top of this training set, and once the learning is done, we will predict the labels of the test set and check whether the predictions are correct.
We will start off by implementing the Naïve Bayes classifier, training it on the training set, and checking its accuracy on the test set.
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_set)) * 100)
classifier.show_most_informative_features(15)
Classifier accuracy percent: 75.5
Here, the ratio of pos:neg is 11.3:1.0 for the first feature. This means that the word ‘powerful’ is 11.3 times more likely to occur in a positive review than in a negative review. The interpretation is similar for all the other features.
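As a quick sanity check, the trained classifier can also label a brand-new review by reusing the find_features function defined earlier (the review text below is just a made-up example):
#classifying an unseen review with the trained Naive Bayes model
sample_review = "A powerful movie with a brilliant cast and a wonderful story"
print(classifier.classify(find_features(sample_review)))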
Implementing Other Classifiers
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn import metrics
MNB_clf=SklearnClassifier(MultinomialNB())
mnb_cls=MNB_clf.train(training_set)
print("classifier accuracy percent:", (nltk.classify.accuracy(mnb_cls, testing_set)) * 100)
Accuracy: 77.5
BernoulliNB:
BNB_clf=SklearnClassifier(BernoulliNB())
bnb_cls=BNB_clf.train(training_set)
print("classifier accuracy percent:", (nltk.classify.accuracy(bnb_cls, testing_set)) * 100)
Accuracy: 75.9166
Logistic Regression:
LogReg_clf=SklearnClassifier(LogisticRegression())
log_cls=LogReg_clf.train(training_set)
print("classifier accuracy percent:", (nltk.classify.accuracy(log_cls, testing_set)) * 100)
Accuracy: 72.833
SGD Classifier:
SGD_clf=SklearnClassifier(SGDClassifier())
sgd_cls=SGD_clf.train(training_set)
print("classifier accuracy percent:", (nltk.classify.accuracy(sgd_cls, testing_set)) * 100)
Accuracy: 68.4166
SVC Classifier:
SVC_clf=SklearnClassifier(SVC())
svc_cls=SVC_clf.train(training_set)
print("classifier accuracy percent:", (nltk.classify.accuracy(svc_cls, testing_set)) * 100)
Accuracy: 49.5
So, we can see that the accuracy decreases as we go from the Multinomial Naïve Bayes classifier to the SVC classifier.
Here, the highest accuracy is that of the Multinomial NB classifier.
Therefore, we can choose this model over the others when trying to ascertain the sentiment of reviews.
This is how we can do sentiment analysis using the NLTK package together with scikit-learn.
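As a closing aside, the ClassifierI and mode imports at the top of this section hint at combining these models; here is a minimal sketch of a majority-vote ensemble (an assumption on our part, not shown in the walkthrough above) that labels each review with the vote of three of the trained classifiers:
#a simple majority-vote ensemble built from three of the classifiers trained above
from nltk.classify import ClassifierI
from statistics import mode

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    def classify(self, features):
        #an odd number of voters avoids ties between the two labels
        votes = [c.classify(features) for c in self._classifiers]
        return mode(votes)

voted_clf = VoteClassifier(classifier, mnb_cls, bnb_cls)
print("voted classifier accuracy percent:", nltk.classify.accuracy(voted_clf, testing_set) * 100)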
In this blog, we have seen what Natural Language Processing is and what it is used for, and we have also implemented some of the NLP techniques using Python.
Frequently Asked Questions (FAQs)
What is NLP used for in data science?
NLP (Natural Language Processing) is used for analyzing, understanding, and generating human language data, aiding in sentiment analysis, chatbots, translation, and other language-related tasks in data science.
What does NLP mean in data?
NLP stands for Natural Language Processing, a field at the intersection of computer science, artificial intelligence, and linguistics, aiming to enable computers to understand and process human language.
Is NLP required for data science?
NLP is a specialized area within data science. It’s essential for projects involving text analysis, sentiment analysis, or language-based predictive modeling.
What is NLP with example?
NLP includes tasks like sentiment analysis (determining sentiment from text), machine translation (translating text between languages), and named entity recognition (identifying proper nouns in text).
What is the main use of NLP?
NLP’s main use is to enable machines to understand, interpret, and generate human language, facilitating human-computer interaction and analysis of text data.
What is the salary of an NLP scientist?
Salaries vary by region and experience. NLP scientists are highly specialized and can command high salaries, often comparable to or exceeding those of other data scientists.
What skills do you need to be an NLP data scientist?
Skills include programming (Python, Java), machine learning, deep learning, linguistic knowledge, and familiarity with NLP libraries like NLTK or spaCy.
Which type of data is used by NLP?
NLP primarily uses text data, but can also work with speech data when combined with speech processing techniques.
Is NLP machine learning?
NLP often employs machine learning techniques to learn from and make predictions or decisions based on text data.
Is coding required for NLP?
Yes, coding is essential for implementing NLP algorithms, processing text data, and utilizing NLP libraries and frameworks.