
What is Natural Language Processing?

Natural language is a language that has developed naturally in humans. So, in this blog on “What is Natural Language Processing?”, we will learn all the major concepts of NLP and work with packages such as NLTK and spaCy.

Introduction to Natural Language Processing

Consider any of these languages, say, English, Hindi, or French: any of the numerous languages that have evolved naturally in humans through use and repetition, without conscious planning, can be termed a natural language.

Humans can easily comprehend these languages. But if the same languages are fed to a machine, will it be able to comprehend them? This is where Natural Language Processing plays a key role.

Natural Language Processing is the ability of a computer program to understand human language as it is spoken.

In other words, Natural Language Processing is used to gain knowledge from the raw textual data at our disposal.


Now we will go ahead and see, ‘What is Natural Language Processing used for?’

What is Natural Language Processing used for?

To answer the question, ‘What is Natural Language Processing used for?,’ here are a few of the applications of Natural Language Processing:

  • Google Translate
  • Personal assistant applications like Cortana, Siri, and OK Google
  • Word processors such as Microsoft Word
  • Editing platforms like Grammarly

Now, going ahead with this blog on “What is Natural Language Processing?” we will look into a use case.

Use Case: Google Translate

Here is a demonstration of Google Translate. We will type ‘What is Natural Language Processing?’ in English, and if we want a German version of this, it would be as follows:

[Image: Google Translate rendering ‘What is Natural Language Processing?’ in German]

Similarly, we can translate the same phrase into any other language.

Now that we have answered the questions ‘What is Natural Language Processing?’ and ‘What is Natural Language Processing used for?’, let’s move ahead and look at one of the components of Natural Language Processing: Natural Language Understanding.

What is Natural Language Understanding?

Natural Language Understanding, as the name states, deals with understanding inputs given in the form of sentences in text or speech. This is where the machine analyzes the different aspects of a language.

Now, in this blog on “What is Natural Language Processing?”, we will implement the NLP concepts, and for that we need some tailor-made packages. In this Natural Language Processing tutorial, we will study two packages: NLTK and spaCy.


Generally, the first step in the NLP process is tokenization. In tokenization, we basically split up our text into individual units and each individual unit should have a value associated with it.

Let’s look at an example:

We have this sentence ‘What is Natural Language Processing?’

Here, each word in this sentence is taken as a separate token including the question mark.

We can use these tokens for other processes like parsing or text mining.
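To make this concrete, here is a minimal sketch using NLTK’s word_tokenize (the tokenizer data may need to be downloaded first with nltk.download('punkt')); it splits the example sentence into six tokens, with the question mark as a separate token:

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # uncomment if the tokenizer data is not yet available
tokens = word_tokenize("What is Natural Language Processing?")
print(tokens)  # ['What', 'is', 'Natural', 'Language', 'Processing', '?']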

Tokenizing a Sentence Using the NLTK Package

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt') may be needed the first time to get the tokenizer data

# string
cricket = "After surprising the host in the first test, Sri Lanka made a positive start to the Test as well by bowling South Africa out for 222 before slightly losing their advantage toward the end of the day's play."

# tokenizing
cricket_tokens = word_tokenize(cricket)
cricket_tokens

# checking the type and number of tokens
type(cricket_tokens), len(cricket_tokens)

# frequency of tokens
from nltk.probability import FreqDist

fdist = FreqDist()
for i in cricket_tokens:
    fdist[i] = fdist[i] + 1
fdist

# 10 most common tokens
top_10 = fdist.most_common(10)
top_10
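As a side note, the same frequency distribution can usually be built in one step by passing the token list straight to FreqDist instead of filling it in a loop; a small sketch:

# equivalent shortcut: count the tokens directly
fdist_direct = FreqDist(cricket_tokens)
fdist_direct.most_common(10)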

We are doing this Natural Language Processing using the Python programming language. As mentioned earlier, human beings can understand linguistic structures and their meanings easily, but machines are not yet successful enough at natural language comprehension. So, when a machine has to parse through one word at a time, it may not be able to fully understand the semantics of a sentence. These words, when given one at a time, are known as uni-grams.

Like in the below image, the sentence ‘What is Natural Language Processing?’ is parsed one word at a time.

[Image: the sentence ‘What is Natural Language Processing?’ parsed as uni-grams]

Then in a bi-gram, the sentence ‘What is Natural Language Processing?’ is parsed two words at a time.

Finally, in a tri-gram, the sentence ‘What is Natural Language Processing?’ is parsed three words at a time.

# bigrams, trigrams, ngrams
black_smoke = "Did you know, there was a tower, where they look out to the land, to see the people quickly passing by"

black_smoke_token = word_tokenize(black_smoke)
black_smoke_token

list(nltk.bigrams(black_smoke_token))

list(nltk.trigrams(black_smoke_token))

list(nltk.ngrams(black_smoke_token, 4))

Stemming

Stemming is the process of reducing a word to its base form, and this is done by cutting off the beginning or end of the word. This indiscriminate cutting will be successful on some occasions and will fail on others.

Let us look at an example:

[Image: three stemming examples]

In all these three cases, we can see that only in the third case do we have a word that makes sense. So, when we are implementing stemming, it is not always necessary that the final stemmed word we get has a meaning associated with it.

Now, there are many stemming algorithms available and one such algorithm is PorterStemmer.

# stemming
from nltk.stem import PorterStemmer

pst = PorterStemmer()

pst.stem("winning"), pst.stem("studies"), pst.stem("buying")

Output:

('win', 'studi', 'buy')

Lemmatization

Lemmatization is the process of reducing words into their lemma (the dictionary form of the word). Here, the base form into which the word is converted should definitely have a meaning associated with it.

[Image: three lemmatization examples]

So, here we can see that these three individual words have a meaning associated with them.

# Lemmatization
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet') may be needed the first time
lemmatizer = WordNetLemmatizer()

words_to_stem = ["cats", "cacti", "geese"]

for i in words_to_stem:
    print(i + ":" + lemmatizer.lemmatize(i))

Output:

cats:cat

cacti:cactus

geese:goose
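To see the difference between stemming and lemmatization side by side, here is a small sketch that runs the same words through PorterStemmer and WordNetLemmatizer; the stemmer simply chops off suffixes, while the lemmatizer looks the word up in WordNet:

from nltk.stem import PorterStemmer, WordNetLemmatizer

pst = PorterStemmer()
lem = WordNetLemmatizer()

for word in ["studies", "cacti", "winning"]:
    # print the stemmed form and the lemmatized form next to each other
    print(word, "->", pst.stem(word), "|", lem.lemmatize(word))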

Now, another important concept in Natural Language Processing is Parts of Speech Tagging (POS Tagging).

In the English language, words are considered the smallest elements that have a distinctive meaning, and based on their use or function, words are categorized into different classes known as parts of speech. POS tagging is the process of marking or tagging every word in a sentence with its corresponding part of speech.

Example:

[Image: a sentence with a POS tag assigned to each word]

Here, we marked each of these words with a POS tag.

# pos tagging
NLP = "What is Natural Language Processing? I am a professional on this."

# tokenizing
NLP_tokens = word_tokenize(NLP)

Now, we will start off with a for loop that iterates through all of the tokens, and for each token we will add a POS tag with the help of the pos_tag function. So, we will use nltk.pos_tag to add a POS tag to each of the individual tokens from this NLP_tokens list.

# nltk.download('averaged_perceptron_tagger') may be needed the first time
for i in NLP_tokens:
    print(nltk.pos_tag([i]))
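Note that tagging each token in isolation, as above, throws away the context of the sentence. Passing the whole token list to nltk.pos_tag in a single call generally gives better tags, since the tagger can use the neighbouring words; a minimal sketch:

# tag the entire token list at once so the tagger can use context
print(nltk.pos_tag(NLP_tokens))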

Now, in this blog on “What is Natural Language Processing?”, we will look at Named Entity Recognition and implement it using the NLTK package and the spaCy package.

Named Entity Recognition

It is the process of taking a string of text as input and identifying the relevant nouns such as people, places, or organizations that are mentioned in that string.

Example:

[Image: Named Entity Recognition example]

So, in this sentence, ‘Apple’ is recognized as an ‘Organization’; ‘UK’ is recognized as a ‘Geo-Political Entity’; and ‘1$ billion’ is recognized as ‘Money.’ One thing to be noted here is that all three of these are nouns; Named Entity Recognition is done on nouns only.

For Named Entity Recognition, we will require the ne_chunk function.

# named entity recognition
from nltk import ne_chunk

john = 'John lives in New York'

john_token = word_tokenize(john)

After tokenizing these words, we will add POS tags with respect to each of these tokens by using the function nltk.pos_tag and passing these tokens, and then we will store them in john_tags.

john_tags = nltk.pos_tag(john_token)
john_tags

# named entity recognition for the pos-tagged tokens
# nltk.download('maxent_ne_chunker') and nltk.download('words') may be needed the first time
john_ner = ne_chunk(john_tags)

print(john_ner)
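With the standard NLTK chunker data in place, the printed tree typically looks something like the following, with ‘John’ tagged as a person and ‘New York’ as a geopolitical entity (the exact labels depend on the model):

(S (PERSON John/NNP) lives/VBZ in/IN (GPE New/NNP York/NNP))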

Till now, all the tasks were done using NLTK. We also have spaCy, which is a relatively new framework in the Python natural language processing environment.

spaCy is written in Cython, a superset of Python that compiles to C, which gives it C-like performance.

Importing the spaCy package:

This spaCy package has different models, so we will load the en_core_web_sm model and store it in the nlp object.

import spacy

# the model may need to be downloaded first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

Note that using this nlp object, we can create different documents. Each string we pass to nlp becomes a document object that we can then work with.

doc=nlp(“This is sparta!!”)

Now, tokenization is very simple when it comes to the spaCy package. What we have to do is start a for loop and print all of the tokens present in that document.

# tokenization
for token in doc:
    print(token.text)

token = doc[2]
token

Output:

sparta

# span starting at index 2 and ending before index 5
span = doc[2:5]
span

Output:

sparta!!

# printing token index and token text together
doc = nlp("This is Sparta!!")
for token in doc:
    print(token.i, token.text)

Now, we will iterate through all of the tokens in this document. First, we will print the index of the token, then the text of the token, and finally we will print the pos_tag of this token.

#pos tagging

for token in doc:

     print(token.i, token.text, token.pos_)

#Named Entity Recognition

doc=nlp(“Apple is looking to buy UK startup for 1$ billion”)

#this will print out the entities’ text and the corresponding entity label.

for ent in doc.ents:

      print(ent.text, ent.label_)
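Based on the entities described earlier (‘Apple’ as an organization, ‘UK’ as a geopolitical entity, and the amount as money), the output of this loop typically looks something like the following, though the exact labels depend on the model version:

Apple ORG
UK GPE
1$ billion MONEY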

doc = nlp("Barack Obama the former president of United States will be vacating the White House today")

for ent in doc.ents:

      print(ent.text, ent.label_)

Now, we will look at the Matcher. It helps in recognizing patterns in a string based on different criteria, such as the lemma of a word, whether the word is a digit, whether it is a punctuation mark, etc.

from spacy.matcher import Matcher

doc = nlp(“Barack Obama the former president of United States will be vacating the White House today”)

First, we will create a pattern. A pattern is created as a list of dictionaries, and each dictionary describes one token. Here, we want the first token’s lemma to be ‘vacate’; in the sentence above, it is followed by the word ‘the’ and then the word ‘white’ (matched in lowercase so that ‘White’ is also caught).

So, basically, we need to extract a substring from this entire doc which is three tokens long: the lemma variation of ‘vacate’, then ‘the’, and then ‘white’.

pattern = [{'LEMMA': 'vacate'}, {'LOWER': 'the'}, {'LOWER': 'white'}]

matcher = Matcher(nlp.vocab)

# spaCy 2.x signature; with spaCy 3.x the call is matcher.add('white_pattern', [pattern])
matcher.add('white_pattern', None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

Going ahead in this blog on “What is Natural Language Processing?”, we will implement sentiment analysis using the NLTK package.

Sentiment Analysis Using the NLTK Package

For doing sentiment analysis using the NLTK package, we will import the required package first.

import nltk

import random

from nltk.classify.scikitlearn import SklearnClassifier

import pickle

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

from sklearn.linear_model import LogisticRegression, SGDClassifier

from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI

from statistics import mode

from nltk.tokenize import word_tokenize

import re

import os

We will be performing sentiment analysis on the IMDB reviews dataset. This basically consists of 25,000 reviews, and out of these reviews 12,500 are positive and the rest are negative. So, these two are stored in separate folders, pos for positive and neg for negative reviews.

files_pos = os.listdir('D:/sentiment/aclImdb/train/pos')

files_pos = [open('D:/sentiment/aclImdb/train/pos/' + f, 'r', encoding='utf8').read() for f in files_pos]

files_neg = os.listdir('D:/sentiment/aclImdb/train/neg')

files_neg = [open('D:/sentiment/aclImdb/train/neg/' + f, 'r', encoding='utf8').read() for f in files_neg]

len(files_pos), len(files_neg)

Output:

(12500, 12500)

Now, we will take the first 1,000 reviews from each set.

files_pos=files_pos[0:1000]

files_neg=files_neg[0:1000]

len(files_pos), len(files_neg)

To preprocess all of these reviews, we will start off by creating two empty lists:

all_words = [ ]

documents = [ ]

The first task in preprocessing is to remove stopwords. Let’s see how to do that.

from nltk.corpus import stopwords

import re

# nltk.download('stopwords') may be needed the first time
stop_words = list(set(stopwords.words('english')))

Now, what we want is a bag of words or a bag of adjectives (because using adjectives is a better way to understand the sentiment of a review).

# J is adjective, R is adverb, and V is verb
# allowed_word_types = ["J", "R", "V"]

allowed_word_types = ["J"]
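To make the single-letter check concrete, here is a small illustration with a hypothetical tagged token: nltk.pos_tag returns (word, tag) tuples, and taking the first character of the tag groups related tags such as JJ, JJR, and JJS together as ‘J’:

w = ("powerful", "JJ")  # hypothetical tagged token of the kind nltk.pos_tag returns
print(w[1][0])                        # 'J'
print(w[1][0] in allowed_word_types)  # True, so 'powerful' would be kept as an adjective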

This for loop goes from the 1st positive review to the 1,000th positive review.

for p in files_pos:

    # create a list of tuples where the first element of each tuple is a review and the second element is a label
    documents.append((p, "pos"))

    # remove punctuation
    cleaned = re.sub(r'[^(a-zA-Z)\s]', ' ', p)

    # tokenizing the words by passing the cleaned object into word_tokenize
    # here the cleaned object is the one that has no punctuation
    tokenized = word_tokenize(cleaned)

    # if a token is not present in stop_words, store it in stopped; this keeps only the tokens which are not stopwords
    stopped = [w for w in tokenized if not w in stop_words]

    # pos tagging
    pos = nltk.pos_tag(stopped)

Inside this same loop, we will select only those tokens whose POS tag starts with a letter present in allowed_word_types or, in other words, we will select only those words that are adjectives and store them in all_words.

    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

# now, all_words will have all of the positive adjectives in it

# for negative reviews
for p in files_neg:

    # create a list of tuples where the first element of each tuple is a review and the second element is a label
    documents.append((p, "neg"))

    # remove punctuation
    cleaned = re.sub(r'[^(a-zA-Z)\s]', ' ', p)

    # tokenize
    tokenized = word_tokenize(cleaned)

    # remove stopwords
    stopped = [w for w in tokenized if not w in stop_words]

    # pos tagging
    neg = nltk.pos_tag(stopped)

    # make a list of all adjectives identified by the allowed word types listed above
    for w in neg:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

Now, this all_words will have the combination of all of the negative and positive adjectives.

# creating a frequency distribution of each adjective
all_words = nltk.FreqDist(all_words)
all_words

import matplotlib.pyplot as plt

all_words.plot(30, cumulative=False)
plt.show()

[Image: frequency plot of the 30 most common words]

# listing the 1,000 most frequent words
word_features = [w for (w, count) in all_words.most_common(1000)]
word_features
# function to create a dictionary of features for each review in the list documents
# the keys are the words in word_features
# the value of each key is either True or False, denoting whether that feature appears in the review or not

def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

#creating features for each review

featuresets= [(find_features(rev),category) for (rev,category) in documents]

#shuffling the labeled feature sets

random.shuffle(featuresets)

training_set=featuresets[:800]

testing_set=featuresets[800:]

featuresets[1]

Here, False means that the adjective is not present in the review, and True means that the adjective is present in the review.

Each of these feature dictionaries is paired with its label, for example pos for a positive review, and these labeled feature sets make up the training set.

Then, we would build classifiers on top of this training set, and once the learning is done we will try to predict the values on top of the test set and see whether the prediction is correct.

Now, we will start off by implementing the Naïve Bayes classifier and train it on top of the training set and check for the accuracy on top of the test set.

classifier = nltk.NaiveBayesClassifier.train(training_set)

print("classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_set)) * 100)

classifier.show_most_informative_features(15)

Classifier accuracy percent: 75.5

Here, the ratio of pos:neg is 11.3:1.0 for the first feature. This means that the word ‘powerful’ is 11.3 times more likely to occur in a positive review than in a negative review. It is similar for all the other features.

Implementing Other Classifiers

from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

from sklearn.linear_model import LogisticRegression, SGDClassifier

from sklearn.svm import SVC

from sklearn import metrics

MNB_clf = SklearnClassifier(MultinomialNB())

mnb_cls = MNB_clf.train(training_set)

print("classifier accuracy percent:", (nltk.classify.accuracy(mnb_cls, testing_set)) * 100)

Accuracy: 77.5

BernoulliNB:

BNB_clf = SklearnClassifier(BernoulliNB())

bnb_cls = BNB_clf.train(training_set)

print("classifier accuracy percent:", (nltk.classify.accuracy(bnb_cls, testing_set)) * 100)

Accuracy: 75.9166

Logistic Regression:

LogReg_clf = SklearnClassifier(LogisticRegression())

log_cls = LogReg_clf.train(training_set)

print("classifier accuracy percent:", (nltk.classify.accuracy(log_cls, testing_set)) * 100)

Accuracy: 72.833

SGD Classifier:

SGD_clf = SklearnClassifier(SGDClassifier())

sgd_cls = SGD_clf.train(training_set)

print("classifier accuracy percent:", (nltk.classify.accuracy(sgd_cls, testing_set)) * 100)

Accuracy: 68.4166

SVC Classifier:

SVC_clf = SklearnClassifier(SVC())

svc_cls = SVC_clf.train(training_set)

print("classifier accuracy percent:", (nltk.classify.accuracy(svc_cls, testing_set)) * 100)

Accuracy: 49.5

So, we can see that the accuracy decreases as we go from Multinomial Naïve Bayes to SVC classifier.

Here, the highest accuracy is of the Multinomial NB classifier.

Therefore, we can choose this model over all the other models when we are trying to ascertain the sentiment of a review.

This is how we can do sentiment analysis using the NLTK package and the scikit-learn package.
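As a final, hypothetical usage sketch (the review text below is made up purely for illustration), the chosen Multinomial Naïve Bayes model can be combined with the find_features function defined earlier to score a new, unseen review; the prediction it prints depends on the trained model:

# a made-up review used only to illustrate the call
new_review = "A powerful and wonderful film with brilliant performances."

# build the same kind of feature dictionary used during training, then classify it
new_features = find_features(new_review)
print(mnb_cls.classify(new_features))  # prints 'pos' or 'neg'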

In this blog on “What is Natural Language Processing?”, we have seen “What is Natural Language Processing?”, “What is Natural Language Processing used for?”, and we also implemented some of the NLP techniques using Python.

Frequently Asked Questions (FAQs)

What is NLP used for in data science?

NLP (Natural Language Processing) is used for analyzing, understanding, and generating human language data, aiding in sentiment analysis, chatbots, translation, and other language-related tasks in data science.

What does NLP mean in data?

NLP stands for Natural Language Processing, a field at the intersection of computer science, artificial intelligence, and linguistics, aiming to enable computers to understand and process human language.

Is NLP required for data science?

NLP is a specialized area within data science. It’s essential for projects involving text analysis, sentiment analysis, or language-based predictive modeling.

What is NLP with example?

NLP includes tasks like sentiment analysis (determining sentiment from text), machine translation (translating text between languages), and named entity recognition (identifying proper nouns in text).

What is the main use of NLP?

NLP’s main use is to enable machines to understand, interpret, and generate human language, facilitating human-computer interaction and analysis of text data.

What is the salary of an NLP scientist?

Salaries vary by region and experience. NLP scientists are highly specialized and can command high salaries, often comparable to or exceeding those of other data scientists.

What skills do you need to become an NLP data scientist?

Skills include programming (Python, Java), machine learning, deep learning, linguistic knowledge, and familiarity with NLP libraries like NLTK or spaCy.

Which type of data is used by NLP?

NLP primarily uses text data, but can also work with speech data when combined with speech processing techniques.

Is NLP machine learning?

NLP often employs machine learning techniques to learn from and make predictions or decisions based on text data.

Is coding required for NLP?

Yes, coding is essential for implementing NLP algorithms, processing text data, and utilizing NLP libraries and frameworks.

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Akash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.