Preprocessing Data in Machine Learning

Data preprocessing converts raw data into a more usable form, i.e., it makes data more meaningful through rescaling, standardizing, binarizing, one hot encoding, and label encoding.
The process of getting raw data ready for a Machine Learning algorithm can be summarized in the steps listed below.
Here’s the list of contents for this module.

  • Rescaling Data
  • Standardizing Data
  • Binarizing Data
  • One Hot Encoding
  • Label Encoding

Alright, let’s get started.


Rescaling Data

As the name suggests, rescaling makes the non-uniform attributes of a dataset uniform. When is a dataset non-uniform? When the scales of its attributes vary widely, which can harm a predictive model.
Rescaling is useful in optimization algorithms such as gradient descent. It is done using the MinMaxScaler class from scikit-learn (also known as sklearn).
Now, let us explore this method with an example. First, we will take a look at the dataset that we are going to perform rescaling on.
Dataset: The ‘winequality-red.csv’ dataset is used for explaining the data preprocessing methods here. This csv dataset looks something like this:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7
7.5;0.5;0.36;6.1;0.071;17;102;0.9978;3.35;0.8;10.5;5

Alright, let us perform rescaling now.

import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

df = pandas.read_csv('winequality-red.csv', sep=';')
array = df.values
# Separating data into input and output components
x = array[:, 0:9]   # the first nine physicochemical attributes
y = array[:, 11]    # 'quality' is the target column
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(x)
numpy.set_printoptions(precision=3)  # Setting precision for the output
rescaledX[0:5, :]

Output:

array([[0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606],
       [0.283, 0.521, 0.   , 0.116, 0.144, 0.338, 0.216, 0.494, 0.362],
       [0.283, 0.438, 0.04 , 0.096, 0.134, 0.197, 0.17 , 0.509, 0.409],
       [0.584, 0.11 , 0.56 , 0.068, 0.105, 0.225, 0.191, 0.582, 0.331],
       [0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606]])

Here, we have rescaled the values from a wide scale into a range that lies between 0 and 1.
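As a quick sanity check, min-max rescaling is just the formula (x − min) / (max − min) applied per column. A minimal sketch on a toy array (hypothetical values, not the wine dataset), assuming only NumPy:

```python
import numpy as np

# Toy feature column (hypothetical values, not from the wine dataset)
x = np.array([7.4, 7.8, 11.2, 7.3], dtype=float)

# The same transform MinMaxScaler applies with feature_range=(0, 1):
# the minimum maps to 0, the maximum maps to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # all values now lie in [0, 1]
```

This makes it clear why a single extreme value can compress the rest of the column toward 0.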
Alright, the next method of data preprocessing is standardizing.

Standardizing Data

Standardizing transforms attributes that follow Gaussian distributions with differing means and differing standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. Standardization is done using scikit-learn's StandardScaler class.

from sklearn.preprocessing import StandardScaler

# x is the attribute matrix prepared in the rescaling example
scaler = StandardScaler().fit(x)
rescaledX = scaler.transform(x)
rescaledX[0:5, :]
Output: 
array([[-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558,  1.289],
       [-0.299,  1.967, -1.391,  0.043,  0.224,  0.873,  0.624,  0.028, -0.72 ],
       [-0.299,  1.297, -1.186, -0.169,  0.096, -0.084,  0.229,  0.134, -0.331],
       [ 1.655, -1.384,  1.484, -0.453, -0.265,  0.108,  0.412,  0.664, -0.979],
       [-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558,  1.289]])
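To see why the transform is called standardization, we can apply the z-score formula (x − mean) / std by hand and confirm each column ends up with mean 0 and standard deviation 1. A minimal sketch on a toy matrix (hypothetical values standing in for the wine features), assuming only NumPy:

```python
import numpy as np

# Toy feature matrix (hypothetical values, not from the wine dataset)
x = np.array([[7.4, 0.70],
              [7.8, 0.88],
              [11.2, 0.28],
              [7.3, 0.65]])

# Z-score standardization, the same per-column transform StandardScaler applies
z = (x - x.mean(axis=0)) / x.std(axis=0)

# Each standardized column now has mean ~0 and standard deviation ~1
print(np.allclose(z.mean(axis=0), 0), np.allclose(z.std(axis=0), 1))
```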


Binarizing Data

In this method, all values above the threshold are transformed into 1, and those equal to or below the threshold are transformed into 0. This is useful when we deal with probabilities and need to convert the data into crisp 0/1 values. Binarizing is done using the Binarizer class.

from sklearn.preprocessing import Binarizer

# Values > 0.0 become 1; values <= 0.0 become 0
binarizer = Binarizer(threshold=0.0).fit(x)
binary_X = binarizer.transform(x)
binary_X[0:5, :]

Output:

array([[1., 1., 0., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1., 1.]])
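Under the hood, binarizing with a threshold is just an element-wise comparison. A minimal sketch on a toy array (hypothetical values, not the wine dataset), assuming only NumPy:

```python
import numpy as np

# Toy values around the threshold (hypothetical, not from the wine dataset)
x = np.array([[0.04, 0.0, 1.9],
              [0.0, 0.56, -0.3]])

# Equivalent to Binarizer(threshold=0.0): strictly greater than the
# threshold becomes 1.0, everything else becomes 0.0
binary = (x > 0.0).astype(float)
print(binary)  # [[1. 0. 1.], [0. 1. 0.]]
```

Note that a value exactly equal to the threshold (0.0 here) maps to 0, which is why the zero citric acid entries in the wine output above became 0.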


One Hot Encoding

While dealing with categorical data, one hot encoding replaces each categorical value with a binary vector that has a 1 in the position of that value's category and 0s everywhere else. In scikit-learn, it is performed using the OneHotEncoder class.

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit
# as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
print(encoder.fit([[0,1,6,2],[1,5,3,5],[2,4,2,7],[1,0,4,2]]))

Output:

OneHotEncoder(handle_unknown='ignore')
encoder.transform([[2,4,3,4]]).toarray()

Output:

array([[0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0.]])
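The width of the encoded vector, 14, comes from the number of distinct values the encoder learned per column. A quick check in plain Python, mirroring the fit data above:

```python
# The same fit data passed to OneHotEncoder above, one inner list per sample
fit_data = [[0, 1, 6, 2], [1, 5, 3, 5], [2, 4, 2, 7], [1, 0, 4, 2]]

# The encoder learns the sorted unique values of each column;
# the encoded width is the total category count across all columns
categories = [sorted({row[i] for row in fit_data}) for i in range(4)]
print(categories)                        # [[0, 1, 2], [0, 1, 4, 5], [2, 3, 4, 6], [2, 5, 7]]
print(sum(len(c) for c in categories))   # 14 -- the width of the encoded output
```

This also explains the trailing zeros in the output above: the last input value, 4, is not among the fourth column's learned categories [2, 5, 7], so its three-slot block is all zeros.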



Label Encoding

Labels can be words or numbers. Training data is usually labeled with words to keep it readable. Label encoding converts such word labels into numbers so that algorithms can work on them.

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
input_classes = ['A','B','C','D','E']
label_encoder.fit(input_classes)

Output:

LabelEncoder()
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

Output:

A --> 0
B --> 1
C --> 2
D --> 3
E --> 4
labels=['B','C','D']
label_encoder.transform(labels)

Output:

array([1, 2, 3], dtype=int64)
label_encoder.inverse_transform(label_encoder.transform(labels))

Output:

array(['B', 'C', 'D'], dtype='<U1')
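Conceptually, LabelEncoder just maps the sorted unique labels to their indices, and inverse_transform maps the indices back. A minimal sketch in plain Python, using the same toy labels as above:

```python
# LabelEncoder assigns 0, 1, 2, ... to the sorted unique labels
classes = sorted(['A', 'B', 'C', 'D', 'E'])
to_index = {label: i for i, label in enumerate(classes)}

labels = ['B', 'C', 'D']
encoded = [to_index[label] for label in labels]
print(encoded)                        # [1, 2, 3]

# The inverse mapping recovers the original word labels
print([classes[i] for i in encoded])  # ['B', 'C', 'D']
```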


What Did We Learn?

In this module, we discussed various data preprocessing methods for Machine Learning: rescaling, standardizing, binarizing, one hot encoding, and label encoding. In the next module, we will dive into training, validation, and testing datasets. Let’s meet there!
