Preprocessing Data in Machine Learning
Data preprocessing is the practice of converting data from its raw form into a more usable form, i.e., making the data more meaningful by rescaling, standardizing, binarizing, one hot encoding, and label encoding.
The process of getting raw data ready for a Machine Learning algorithm can be summarized in the following steps, which are also the contents of this module:
- Rescaling Data
- Standardizing Data
- Binarizing Data
- One Hot Encoding
- Label Encoding
Alright, let’s get started.
Rescaling Data
As the name suggests, rescaling is the process of making non-uniform attributes of a dataset uniform. When is a dataset non-uniform? When the scales of its attributes vary widely, which can be harmful to a predictive model.
Rescaling is useful in optimization algorithms such as gradient descent. It is done using the MinMaxScaler class from scikit-learn (also known as sklearn).
Now, let us explore this method with an example. First, we will take a look at the dataset that we are going to perform rescaling on.
Dataset: The ‘winequality-red.csv’ dataset is used for explaining the data preprocessing methods here. This csv dataset looks something like this:
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7
7.5;0.5;0.36;6.1;0.071;17;102;0.9978;3.35;0.8;10.5;5
Alright, let us perform rescaling now.
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

df = pandas.read_csv('winequality-red.csv', sep=';')
array = df.values
# Separating data into input and output components
x = array[:, 0:9]   # the first nine feature columns (matching the output shown below)
y = array[:, 11]    # 'quality' is the target column
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledX = scaler.fit_transform(x)
numpy.set_printoptions(precision = 3) #Setting precision for the output
print(rescaledX[0:5,:])
Output:
array([[0.248, 0.397, 0. , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606],
[0.283, 0.521, 0. , 0.116, 0.144, 0.338, 0.216, 0.494, 0.362],
[0.283, 0.438, 0.04, 0.096, 0.134, 0.197, 0.17 , 0.509, 0.409],
[0.584, 0.11 , 0.56 , 0.068, 0.105, 0.225, 0.191, 0.582, 0.331],
[0.248, 0.397, 0. , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606]])
Here, we have rescaled the values from a wide scale into a range that lies between 0 and 1.
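MinMaxScaler applies the min-max formula x' = (x - min) / (max - min) to each column. As a quick sanity check, here is a minimal sketch that reproduces the scaler by hand on a made-up toy column (the values are illustrative, not from the wine dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature column with a wide scale
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual min-max formula: (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MinMaxScaler applies the same formula per column
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

print(manual.ravel())               # [0.   0.25 0.5  1.  ]
print(np.allclose(manual, scaled))  # True
```

Both approaches agree, which is exactly why rescaled values always land in the chosen feature_range.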
Alright, the next method of data preprocessing is standardizing.
Standardizing Data
Standardizing data transforms attributes that follow Gaussian distributions with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. Standardization is done using scikit-learn's StandardScaler class.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x)
rescaledX = scaler.transform(x)
rescaledX[0:5, :]
Output:
array([[-0.528, 0.962, -1.391, -0.453, -0.244, -0.466, -0.379, 0.558,
1.289],
[-0.299, 1.967, -1.391, 0.043, 0.224, 0.873, 0.624, 0.028,
-0.72 ],
[-0.299, 1.297, -1.186, -0.169, 0.096, -0.084, 0.229, 0.134,
-0.331],
[ 1.655, -1.384, 1.484, -0.453, -0.265, 0.108, 0.412, 0.664,
-0.979],
[-0.528, 0.962, -1.391, -0.453, -0.244, -0.466, -0.379, 0.558,
1.289]])
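Under the hood, StandardScaler computes the z-score (x - mean) / std for each column. A minimal sketch on a toy column (made-up values) verifying that equivalence:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual z-score: subtract the column mean, divide by the (population) standard deviation
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # True
```

After standardizing, the column has mean 0 and standard deviation 1, which is what puts differently-scaled attributes on common footing.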
Binarizing Data
In this method, all values above the threshold are transformed into 1, and those equal to or below the threshold are transformed into 0. This is useful when we deal with probabilities and need to convert them into crisp values. Binarizing is done using the Binarizer class.
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(x)
binary_X = binarizer.transform(x)
binary_X[0:5, :]
Output:
array([[1., 1., 0., 1., 1., 1., 1., 1., 1.],
[1., 1., 0., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 0., 1., 1., 1., 1., 1., 1.]])
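The threshold does not have to be 0.0. When a column holds probabilities, a cutoff such as 0.5 turns them into crisp 0/1 labels. A small sketch (the probability values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities from some classifier
probs = np.array([[0.12, 0.85],
                  [0.50, 0.49],
                  [0.93, 0.07]])

# Values strictly above 0.5 become 1; values equal to or below 0.5 become 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)
print(crisp)
# [[0. 1.]
#  [0. 0.]
#  [1. 0.]]
```

Note that 0.50 itself maps to 0, since only values strictly above the threshold become 1.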
One Hot Encoding
One hot encoding turns each categorical value into its own binary column, with a 1 marking the rows where that category occurs. While dealing with categorical data, one hot encoding is performed using the OneHotEncoder class.
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
# instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
print(encoder.fit([[0,1,6,2],[1,5,3,5],[2,4,2,7],[1,0,4,2]]))
Output:
OneHotEncoder(handle_unknown='ignore')
encoder.transform([[2,4,3,4]]).toarray()
Output:
array([[0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0.]])
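The 14 output columns come from the categories the encoder learned per input column; inspecting the fitted `categories_` attribute makes the layout clear:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([[0, 1, 6, 2], [1, 5, 3, 5], [2, 4, 2, 7], [1, 0, 4, 2]])

# One sorted category array per input column: 3 + 4 + 4 + 3 = 14 one-hot columns
for cats in encoder.categories_:
    print(cats)
```

Column 3 learned only the categories {2, 5, 7}, which is why transforming the unseen value 4 produced all zeros in its last three positions.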
Label Encoding
Labels can be words or numbers. Usually, training data is labeled with words to keep it readable. Label encoding converts word labels into numbers so that algorithms can work on them.
from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
input_classes=['A','B','C','D','E']
label_encoder.fit(input_classes)
Output:
LabelEncoder()
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)
Output:
A --> 0
B --> 1
C --> 2
D --> 3
E --> 4
labels=['B','C','D']
label_encoder.transform(labels)
Output:
array([1, 2, 3], dtype=int64)
label_encoder.inverse_transform(label_encoder.transform(labels))
Output:
array(['B', 'C', 'D'], dtype='<U1')
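In practice, fitting and transforming are often collapsed into a single fit_transform call on the raw label column. A short sketch with hypothetical wine-type labels (not from the dataset above):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels as they might appear in raw training data
labels = ['red', 'white', 'red', 'rose', 'white']

le = LabelEncoder()
encoded = le.fit_transform(labels)  # classes are assigned numbers in sorted order

print(list(le.classes_))  # ['red', 'rose', 'white']
print(encoded)            # [0 2 0 1 2]
```

Note that LabelEncoder always numbers the classes in sorted order, so 'red' gets 0, 'rose' gets 1, and 'white' gets 2 regardless of the order they appear in the data.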
What Did We Learn?
In this module, we have discussed various data preprocessing methods for Machine Learning: rescaling, standardizing, binarizing, one hot encoding, and label encoding. In the next module, we will dive into training, validation, and testing datasets. Let's meet there!