Preprocessing Data

Data preprocessing converts data from its raw form into a more usable and meaningful form through techniques such as rescaling, standardizing, binarizing, one hot encoding, and label encoding.
The process of getting raw data ready for a Machine Learning algorithm can be summarized in the below steps:
Here’s the list of contents for this module.

  • Rescaling Data
  • Standardizing Data
  • Binarizing Data
  • One Hot Encoding
  • Label Encoding

Alright, let’s get started.


Rescaling Data

As the name suggests, rescaling brings the attributes of a dataset onto a uniform scale. How do we know whether a dataset is uniform or not? When the scales of the attributes vary widely, which can be harmful to our predictive model, we call it a non-uniform dataset.
Rescaling is useful for optimization algorithms such as gradient descent. It is done using the MinMaxScaler class, which comes from scikit-learn, also known as sklearn.
Now, let us explore this method with an example. First, let's take a look at the dataset that we are going to perform rescaling on.
Dataset: The ‘winequality-red.csv’ dataset is used for explaining the data preprocessing methods here. The CSV file looks something like this:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7
7.5;0.5;0.36;6.1;0.071;17;102;0.9978;3.35;0.8;10.5;5

Alright, let us perform rescaling now.

import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

df = pandas.read_csv('winequality-red.csv', sep=';')
array = df.values

# Separating data into input (the first nine attributes) and output (quality) components
x = array[:, 0:9]
y = array[:, 11]

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(x)

numpy.set_printoptions(precision=3)  # Setting precision for the output
rescaledX[0:5, :]

Output:

array([[0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606],
       [0.283, 0.521, 0.   , 0.116, 0.144, 0.338, 0.216, 0.494, 0.362],
       [0.283, 0.438, 0.04 , 0.096, 0.134, 0.197, 0.17 , 0.509, 0.409],
       [0.584, 0.11 , 0.56 , 0.068, 0.105, 0.225, 0.191, 0.582, 0.331],
       [0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606]])

Here, we have rescaled the values from a wide scale into a range that lies between 0 and 1.
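Under the hood, MinMaxScaler with feature_range=(0, 1) applies x_scaled = (x - min) / (max - min) to each column. Here is a minimal numpy sketch of that formula on a single hypothetical feature (the numbers are illustrative, taken from the 'total sulfur dioxide' column shown earlier):

```python
import numpy as np

# Illustrative values from one wide-ranging feature
col = np.array([34.0, 67.0, 54.0, 60.0, 21.0])

# The same formula MinMaxScaler applies with feature_range=(0, 1):
# x_scaled = (x - min) / (max - min)
scaled = (col - col.min()) / (col.max() - col.min())

print(scaled)  # every value now lies in [0, 1]
```

The smallest value maps to 0, the largest to 1, and everything else falls in between.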
Alright, the next method of data preprocessing is standardizing.

Standardizing Data

Standardizing transforms attributes that follow a Gaussian distribution with differing means and differing standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. Standardization is done using scikit-learn's StandardScaler class. Note that x below is the input array prepared in the rescaling example.

from sklearn.preprocessing import StandardScaler
scaler=StandardScaler().fit(x)
rescaledX = scaler.transform(x)
rescaledX[0:5,:]
Output: 
array([[-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558,  1.289],
       [-0.299,  1.967, -1.391,  0.043,  0.224,  0.873,  0.624,  0.028, -0.72 ],
       [-0.299,  1.297, -1.186, -0.169,  0.096, -0.084,  0.229,  0.134, -0.331],
       [ 1.655, -1.384,  1.484, -0.453, -0.265,  0.108,  0.412,  0.664, -0.979],
       [-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558,  1.289]])
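StandardScaler's transform is equivalent to z = (x - mean) / std, applied per column. A minimal numpy sketch (with made-up numbers, not the wine data) that verifies the resulting mean and standard deviation:

```python
import numpy as np

# Made-up feature matrix; any numeric columns behave the same way
data = np.array([[ 7.4, 0.70],
                 [ 7.8, 0.88],
                 [11.2, 0.28],
                 [ 7.9, 0.60]])

# The same formula StandardScaler applies: z = (x - mean) / std, per column
z = (data - data.mean(axis=0)) / data.std(axis=0)

print(z.mean(axis=0))  # ~0 for every column
print(z.std(axis=0))   # ~1 for every column
```

This is why standardized features are said to have zero mean and unit variance.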

Binarizing Data

In this method, all values above the threshold are transformed into 1, and those equal to or below the threshold are transformed into 0. This is useful when we deal with probabilities and need to convert them into crisp binary values. Binarizing is done using the Binarizer class.

from sklearn.preprocessing import Binarizer
binarizer=Binarizer(threshold=0.0).fit(x)
binary_X=binarizer.transform(x)
binary_X[0:5,:]

Output:

array([[1., 1., 0., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1., 1.]])
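The threshold matters: with threshold=0.0, almost every wine attribute is positive, so nearly everything becomes 1. Binarizer's rule is simply (x > threshold), so we can sketch it in plain numpy with a more meaningful cutoff (7.0 here is an illustrative value, not from the tutorial):

```python
import numpy as np

# Illustrative 'fixed acidity' values
fixed_acidity = np.array([7.4, 7.8, 11.2, 7.3, 6.9])
threshold = 7.0  # assumption: an illustrative cutoff

# The same rule Binarizer applies: 1 above the threshold, 0 at or below it
binary = (fixed_acidity > threshold).astype(float)
print(binary)  # → [1. 1. 1. 1. 0.]
```

Choosing a threshold near the middle of the data's range produces a far more informative split than 0.0 does here.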


One Hot Encoding

One hot encoding represents each categorical value as a binary vector in which only the position corresponding to that value is 1. While dealing with categorical data, one hot encoding is performed using the OneHotEncoder class.

from sklearn.preprocessing import OneHotEncoder
# handle_unknown='ignore' encodes category values unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
print(encoder.fit([[0,1,6,2],[1,5,3,5],[2,4,2,7],[1,0,4,2]]))

Output:

OneHotEncoder(handle_unknown='ignore')

encoder.transform([[2,4,3,4]]).toarray()

Output:

array([[0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0.]])
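Why 14 columns? Each feature contributes one column per distinct value it takes in the training matrix: 3 + 4 + 4 + 3 = 14. A quick numpy check of those counts:

```python
import numpy as np

# The training matrix from the example above
X = np.array([[0, 1, 6, 2],
              [1, 5, 3, 5],
              [2, 4, 2, 7],
              [1, 0, 4, 2]])

# One-hot output width = total number of distinct values across all four features
widths = [len(np.unique(X[:, j])) for j in range(X.shape[1])]
print(widths)       # [3, 4, 4, 3]
print(sum(widths))  # 14, matching the 14-column encoded vector above
```

Also note that the last feature value, 4, never appeared during fit, which is why its three-column block in the output is all zeros.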


Label Encoding

Labels can be words or numbers. Usually, training data is labeled with words to keep it readable. Label encoding converts word labels into numbers so that algorithms can operate on them.

from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
input_classes=['A','B','C','D','E']
label_encoder.fit(input_classes)

Output:

LabelEncoder()

for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

Output:

A --> 0
B --> 1
C --> 2
D --> 3
E --> 4
labels=['B','C','D']
label_encoder.transform(labels)

Output:

array([1, 2, 3], dtype=int64)
label_encoder.inverse_transform(label_encoder.transform(labels))

Output:

array(['B', 'C', 'D'], dtype='<U1')
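LabelEncoder assigns integers according to the sorted order of the distinct labels. A plain-Python sketch of that mapping, using illustrative wine-color labels (an assumption, not from the dataset):

```python
# Distinct labels in sorted order, mirroring label_encoder.classes_
classes = sorted(set(['red', 'white', 'rose', 'white', 'red']))
to_int = {label: i for i, label in enumerate(classes)}

print(classes)                                 # ['red', 'rose', 'white']
print([to_int[l] for l in ['white', 'red']])   # [2, 0]
```

Because the mapping is just sorted order, the same labels always encode to the same integers regardless of how often or in what order they appear.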


What Did We Learn?

In this module, we discussed various data preprocessing methods for Machine Learning: rescaling, standardizing, binarizing, one hot encoding, and label encoding. In the next module, we will dive into training, validation, and testing datasets. Let's meet there!
