Data Preprocessing in Machine Learning: A Comprehensive Guide

Preprocessing Data in Machine Learning

Data preprocessing converts raw data into a form that a Machine Learning algorithm can use more effectively, through steps such as rescaling, standardizing, binarizing, one hot encoding, and label encoding.
The process of getting raw data ready for a Machine Learning algorithm can be summarized in the steps below:
[Figure: Preprocessing Data]
Here’s the list of contents for this module.

  • Rescaling Data
  • Standardizing Data
  • Binarizing Data
  • One Hot Encoding
  • Label Encoding

Alright, let’s get started.

Rescaling Data

As the name suggests, rescaling brings the attributes of a dataset onto a common scale. So, how do we know whether a dataset is uniform or not? When the scales of its attributes vary widely, which can be rather harmful to our predictive model, we call it a non-uniform dataset.
Rescaling is useful for optimization algorithms such as gradient descent. It is done using the MinMaxScaler class from scikit-learn, also known as sklearn.
Now, let us explore this method with an example. First, we will take a look at the dataset that we are going to perform rescaling on.
Dataset: The ‘winequality-red.csv’ dataset is used for explaining the data preprocessing methods here. This csv dataset looks something like this:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7
7.5;0.5;0.36;6.1;0.071;17;102;0.9978;3.35;0.8;10.5;5

Alright, let us perform rescaling now.

import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
#The file is semicolon-separated, so the separator is set explicitly
df = pandas.read_csv('winequality-red.csv', sep=';')
array = df.values
#Separating data into input and output components
x = array[:,0:9] #First nine columns, 'fixed acidity' through 'pH', as in the output below
y = array[:,11] #'quality', the last column, is the output component
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledX = scaler.fit_transform(x)
numpy.set_printoptions(precision = 3) #Setting precision for the output
print(rescaledX[0:5,:])

Output:

array([[0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606],
[0.283, 0.521, 0.   , 0.116, 0.144, 0.338, 0.216, 0.494, 0.362],
[0.283, 0.438, 0.04, 0.096, 0.134, 0.197, 0.17 , 0.509, 0.409],
[0.584, 0.11 , 0.56 , 0.068, 0.105, 0.225, 0.191, 0.582, 0.331],
[0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568, 0.606]])

Here, we have rescaled the values from a wide scale into a range that lies between 0 and 1.
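To see what MinMaxScaler computes, here is a minimal sketch that applies the min-max formula, (x - min) / (max - min), by hand and compares it with the class. The small `sample` array is an assumption for illustration: it holds only three columns from the first three rows shown earlier, so its scaled values differ from the output above, where the minimum and maximum come from the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative subset: fixed acidity, volatile acidity, citric acid
# taken from the first three rows of the sample shown earlier.
sample = np.array([
    [7.4, 0.70, 0.00],
    [7.8, 0.88, 0.00],
    [7.8, 0.76, 0.04],
])

# Min-max formula applied column by column: (x - min) / (max - min)
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
manual = (sample - col_min) / (col_max - col_min)

# The same computation through scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(sample)

print(np.allclose(manual, scaled))  # True
```

Each column now has a minimum of 0 and a maximum of 1, whatever its original scale was.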
Alright, the next method of data preprocessing is standardizing.

Standardizing Data

Standardizing transforms attributes that follow Gaussian distributions with differing means and standard deviations so that each attribute has a mean of 0 and a standard deviation of 1. Standardization is done in scikit-learn with the StandardScaler class.

from sklearn.preprocessing import StandardScaler
scaler=StandardScaler().fit(x)
rescaledX = scaler.transform(x)
rescaledX[0:5,:]

Output:

array([[-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558,
1.289],
[-0.299,  1.967, -1.391,  0.043,  0.224,  0.873,  0.624,  0.028,
-0.72 ],
[-0.299,  1.297, -1.186, -0.169,  0.096, -0.084,  0.229,  0.134,
-0.331],
[ 1.655, -1.384,  1.484, -0.453, -0.265,  0.108,  0.412,  0.664,
-0.979],
[-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558,
1.289]])
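The following sketch checks the idea behind standardization on a small array: subtract each column's mean and divide by its standard deviation. The `sample` array is an assumption for illustration (fixed acidity, volatile acidity, and residual sugar from the first four rows shown earlier), so its values will not match the output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative subset: fixed acidity, volatile acidity, residual sugar
sample = np.array([
    [7.4, 0.70, 1.9],
    [7.8, 0.88, 2.6],
    [7.8, 0.76, 2.3],
    [11.2, 0.28, 1.9],
])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

standardized = StandardScaler().fit_transform(sample)

print(np.allclose(manual, standardized))            # True
print(np.allclose(standardized.mean(axis=0), 0.0))  # True
print(np.allclose(standardized.std(axis=0), 1.0))   # True
```

After the transformation, every column has a mean of 0 and a standard deviation of 1.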

Binarizing Data

In this method, all the values that are above the threshold are transformed into 1 and those equal to or below the threshold are transformed into 0. This method is useful when we deal with probabilities and need to convert the data into crisp values. Binarizing is done using the Binarizer class.

from sklearn.preprocessing import Binarizer
binarizer=Binarizer(threshold=0.0).fit(x)
binary_X=binarizer.transform(x)
binary_X[0:5,:]

Output:

array([[1., 1., 0., 1., 1., 1., 1., 1., 1.],
[1., 1., 0., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 0., 1., 1., 1., 1., 1., 1.]])
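For the probability use case mentioned above, here is a small sketch with a threshold of 0.5. The probability values are made up for illustration. Note that a value exactly equal to the threshold becomes 0, since only values strictly above it become 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities (one row, four values)
probabilities = np.array([[0.10, 0.50, 0.51, 0.90]])

# Values strictly greater than 0.5 become 1, all others become 0
binarizer = Binarizer(threshold=0.5)
crisp = binarizer.fit_transform(probabilities)

print(crisp)  # [[0. 0. 1. 1.]]
```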

One Hot Encoding

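When dealing with categorical data, one hot encoding represents each category as its own binary column: the column for the observed category is set to 1 and all the others are set to 0. It is performed using the OneHotEncoder class.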

from sklearn.preprocessing import OneHotEncoder
#handle_unknown='ignore' encodes a value not seen during fit as all zeros;
#without it, recent scikit-learn versions raise an error for the value 4
#in the last column of the row transformed below
encoder=OneHotEncoder(handle_unknown='ignore')
print(encoder.fit([[0,1,6,2],[1,5,3,5],[2,4,2,7],[1,0,4,2]]))

Output:

OneHotEncoder(handle_unknown='ignore')

encoder.transform([[2,4,3,4]]).toarray()

Output:

array([[0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0.]])
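One hot encoding is often easier to follow with word categories. The sketch below uses a hypothetical single-column `colors` list. It inspects the learned categories, which are stored in sorted order, and shows that an unseen category is encoded as all zeros when `handle_unknown="ignore"` is set.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct values
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors).toarray()

# Learned categories, one array per input column, in sorted order
print(encoder.categories_[0].tolist())  # ['blue', 'green', 'red']

# One row per sample, one column per category
print(encoded.tolist())
# [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# A category not seen during fit becomes a row of zeros
unknown = encoder.transform([["purple"]]).toarray()
print(unknown.tolist())  # [[0.0, 0.0, 0.0]]
```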

Label Encoding

Labels can be words or numbers. Usually, the training data is labeled with words to make it readable. Label encoding converts word labels into numbers to let algorithms work on them.

from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
input_classes=['A','B','C','D','E']
label_encoder.fit(input_classes)

Output:

LabelEncoder()

for i,item in enumerate(label_encoder.classes_):
    print(item,'-->',i)

Output:

A --> 0
B --> 1
C --> 2
D --> 3
E --> 4

labels=['B','C','D']
label_encoder.transform(labels)

Output:

array([1, 2, 3], dtype=int64)

label_encoder.inverse_transform(label_encoder.transform(labels))

Output:

array(['B', 'C', 'D'], dtype='<U1')
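Two details of LabelEncoder are worth seeing in a short sketch: the numbers follow the sorted order of the labels rather than the order in which they appear, and a label that was not present during fit cannot be transformed. The word labels below are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Hypothetical word labels, given in no particular order and with a repeat
label_encoder.fit(["medium", "low", "high", "low"])

# Classes are stored sorted, so 'high' -> 0, 'low' -> 1, 'medium' -> 2
print(label_encoder.classes_.tolist())  # ['high', 'low', 'medium']

encoded = label_encoder.transform(["low", "high"])
print(encoded.tolist())  # [1, 0]

# A label that was not seen during fit raises a ValueError
try:
    label_encoder.transform(["unknown"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)  # True
```

Because the numbering is alphabetical, the encoded values carry no meaningful order: here 'high' gets a smaller number than 'low'.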

 

What Did We Learn?

In this module, we have discussed various data preprocessing methods for Machine Learning: rescaling, standardizing, binarizing, one hot encoding, and label encoding. In the next module, we will be diving into training, validation, and testing datasets. Let’s meet there!

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.