Feature Selection in Machine Learning


Feature selection techniques are important in machine learning because they help reduce overfitting, improve model performance, and shorten training time. Not every feature in a dataset is useful; some are irrelevant and contribute nothing to meaningful prediction, so selecting only the most relevant and useful features is an essential step, and a few common feature selection methods exist for this purpose. In this article, we will discuss what feature selection in machine learning is, why it is important, the types of feature selection techniques with examples, and how to choose the right feature selection method for your dataset.


What is Feature Selection in Machine Learning?

Feature selection in machine learning is the process of selecting a subset of relevant and significant features from the original dataset. It improves model performance by reducing overfitting, increasing accuracy, and decreasing training time. By removing irrelevant or redundant features, models become simpler and more interpretable. Feature selection can be done using techniques such as filter methods, wrapper methods, and embedded methods. It is also an important step in the data preprocessing phase, ensuring that the model focuses on the most informative inputs.

Importance of Feature Selection in Machine Learning

Here are a few points that help you to understand the importance of feature selection in machine learning:

  1. By removing irrelevant or noisy features, feature selection improves the accuracy and generalizability of the model.
  2. Feature selection reduces overfitting because it keeps only the important features from the original dataset, so the model learns from less noisy data.
  3. It speeds up training because a smaller set of features requires less computation time and fewer resources.
  4. With fewer features, it is easier to understand how the model makes decisions.
  5. Feature selection also reduces data storage and complexity because there is less data to handle.
  6. Identifying the key features helps in understanding the underlying patterns in the data.

Here is a comparison table that shows the impact of feature selection in machine learning, comparing model performance with and without it.

Metric | Without Feature Selection | With Feature Selection
Accuracy | 82% | 88%
Training Time (seconds) | 12.5 | 6.8
Number of Features | 50 | 15
Overfitting Risk | High | Low
Model Interpretability | Low | High

The above table clearly shows that with the help of feature selection, both the efficiency and effectiveness of the machine learning model have improved.

Types of Feature Selection Techniques in Machine Learning

Feature selection techniques in machine learning are classified into two main categories: Supervised and Unsupervised Feature Selection.

1. Supervised Feature Selection: Supervised feature selection uses labeled data to select features based on their relationship with the target variable. It identifies the most relevant features that contribute to predicting the output. The common supervised feature selection methods are filter methods, wrapper methods, and embedded methods. These techniques improve the accuracy of the model and reduce overfitting.

2. Unsupervised Feature Selection: Unsupervised feature selection is used when the data lacks labels. It selects features based on patterns, structure, or statistical properties within the input data itself. The common unsupervised feature selection techniques are variance thresholding, correlation analysis, and clustering-based methods. The main goal of these feature selection methods is to retain the informative features while reducing redundancy and noise.

Supervised Feature Selection Methods

Now that we have covered the two broad types of feature selection, let us discuss the three main supervised feature selection techniques in more detail.

1. Filter Methods

Filter methods evaluate the significance of features using statistical measures, independently of any machine learning model. They calculate a score for each feature based on its association with the target variable, and only the top-scoring features are used to train the model. Filter methods are simple, computationally inexpensive, and model-agnostic. Their drawback is that they do not account for interactions between features. Common filter methods include Information Gain, the Chi-square test, Fisher’s Score, and the missing value ratio.

Let us discuss in detail the typical filter methods with an example.

a) Information Gain

This feature selection method measures the reduction in uncertainty (entropy) about the target variable obtained by splitting the dataset on a feature. It is a common measure in decision trees. Information gain is widely used for selecting features in text classification and binary classification problems.

Example: In a spam detection model, the word “free” would have high information gain if its presence strongly indicated spam.

b) Chi-square Test

The Chi-square test measures the statistical dependence between categorical features and the target variable. A higher Chi-square value indicates a stronger relationship with the target and therefore a more important feature. Chi-square tests are typically used in classification problems with categorical features.

Example: In a loan approval dataset, you can check the association of “marital status” and “loan approved” using a Chi-squared analysis.


c) Fisher’s Score

This feature selection method ranks features based on the ratio of inter-class separation to intra-class variance. A higher Fisher’s score means better separation between the classes. This method is applied in biometrics, image processing, and other domains where class separability is important.

Example: In a face recognition model, the pixel intensities with the highest Fisher’s scores are the ones that best distinguish individual faces.

d) Missing Value Ratio

This technique removes features with a large percentage of missing values, because such features provide little useful information to the model. The missing value ratio is typically applied during data cleaning or preprocessing, before model training.

Example: If a feature named “Middle Name” has about 95% missing values, it will be removed because its missing value ratio exceeds the chosen threshold.

Here is a Python example that will help us understand how these methods work.
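Since the original listing is not reproduced here, the snippet below is a minimal sketch of the approach described, assuming scikit-learn, pandas, and NumPy. The 10% of injected missing values, the affected column, and the random seed are illustrative choices, not the article’s exact settings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif, chi2, f_classif

# Load the Iris dataset as a DataFrame
iris = load_iris(as_frame=True)
X, y = iris.data.copy(), iris.target

# Simulate missing values in one column to illustrate the missing value ratio
rng = np.random.default_rng(42)
X.loc[rng.random(len(X)) < 0.10, "sepal width (cm)"] = np.nan

# Missing value ratio per feature
print("Missing value ratio:\n", X.isna().mean(), "\n")

# Impute missing values with the mean before computing the scores
X_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                         columns=X.columns)

# Information gain (mutual information between each feature and the target)
info_gain = pd.Series(mutual_info_classif(X_imputed, y, random_state=42), index=X.columns)
print("Information gain:\n", info_gain.sort_values(ascending=False), "\n")

# Chi-square test (features scaled to [0, 1] first, since chi2 needs non-negative values)
chi_scores, _ = chi2(MinMaxScaler().fit_transform(X_imputed), y)
print("Chi-square scores:\n", pd.Series(chi_scores, index=X.columns).sort_values(ascending=False), "\n")

# Fisher-style score via the ANOVA F-test
f_scores, _ = f_classif(X_imputed, y)
print("ANOVA F-scores:\n", pd.Series(f_scores, index=X.columns).sort_values(ascending=False))
```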

Output: (figure: Filter Methods)

The code above demonstrates these filter methods on the Iris dataset. It first loads the data and injects missing values to show how the missing value ratio is computed and handled. It then scores the features with Information Gain, the Chi-square test, and Fisher’s Score (via the ANOVA F-test), ranking them against the target variable. The features are scaled to a non-negative range just before the Chi-square test, and missing values are imputed using the mean strategy.

2. Wrapper Methods

Wrapper methods evaluate subsets of features based on model performance. A machine learning model is trained and tested on different feature combinations to discover the subset that gives the best results. Wrapper methods are useful when accuracy matters more than runtime, but they can be computationally expensive, especially for large feature sets. Recursive Feature Elimination (RFE), Forward Selection, and Backward Elimination are examples of wrapper methods.

Now, we will discuss the common wrapper methods with an example.

1. Forward Selection

Forward selection begins with no features and adds, one at a time, the feature that most improves model performance. This method is used with small datasets or when new candidate features need to be tested. In each iteration, the feature with the greatest positive impact on the model (for example, the best accuracy or the lowest RMSE) is added to the current set. Once no remaining feature meaningfully improves performance, the process stops.

Example: In a diabetes prediction dataset, the “Glucose” feature might be added first, followed by “BMI” if it further improves accuracy.

2. Backward Elimination

Backward elimination is a wrapper method that starts with all features and removes the least useful one at each step, based on a criterion such as the p-value, accuracy, or AUC. After each removal, the model is re-evaluated to check that it performs at least as well as the model with the previous feature set. The process continues until only the relevant features remain.

Example: In a house price prediction model, if “Zip Code” turns out not to be useful after evaluation, it would be one of the first features to go.

3. Recursive Feature Elimination (RFE)

Recursive feature elimination is also a wrapper method and works similarly to backward elimination: the least important features are removed iteratively based on the model’s coefficients or importance ranking. RFE trains the model multiple times, eliminating the weakest predictor in each iteration until the desired number of features is reached. RFE works well with models that expose coefficients or feature importances, such as logistic regression, SVMs, and random forests.

Example: In a cancer classification task, RFE might keep only the top 10 genes for diagnosis.

For a better understanding, let’s see their implementation with the help of an example in Python.
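As the original listing is not shown here, the following is a minimal sketch of the workflow described, assuming scikit-learn. The choice of LogisticRegression as the base estimator, the 0.9 correlation threshold, and selecting two features per method are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Binary classification: keep only two Iris classes
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X, y = X[y < 2], y[y < 2]

model = LogisticRegression(max_iter=1000)

# Forward Selection: start empty and add the most helpful feature at each step
forward = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward")
forward.fit(X, y)
print("Forward Selection:", list(X.columns[forward.get_support()]))

# Drop one of each pair of highly correlated features before Backward Elimination
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Backward Elimination: start with all remaining features and remove the weakest
backward = SequentialFeatureSelector(model, n_features_to_select=2, direction="backward")
backward.fit(X_reduced, y)
print("Backward Elimination:", list(X_reduced.columns[backward.get_support()]))

# Recursive Feature Elimination: repeatedly drop the lowest-weight feature
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X, y)
rfe_features = list(X.columns[rfe.get_support()])
print("RFE:", rfe_features)

# Accuracy of a model trained only on the RFE-selected features
X_train, X_test, y_train, y_test = train_test_split(
    X[rfe_features], y, test_size=0.3, random_state=42, stratify=y)
model.fit(X_train, y_train)
print("Accuracy with RFE features:", accuracy_score(y_test, model.predict(X_test)))
```

Here, scikit-learn’s SequentialFeatureSelector provides both the forward and backward searches, while RFE removes the lowest-weight feature at each round.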

Output: (figure: Wrapper Methods)

The code above applies three wrapper methods, Forward Selection, Backward Elimination, and Recursive Feature Elimination (RFE), to a binary version of the Iris dataset. It selects features with each wrapper method and prints the resulting feature sets. Before Backward Elimination, highly correlated features are dropped so that the elimination step behaves reliably. Finally, it evaluates the accuracy of a model trained on the features selected by RFE.

3. Embedded Methods

Embedded methods perform feature selection during the model training process, coupling feature selection with the learning algorithm itself. They offer the performance benefits of filter and wrapper methods while being more efficient, because the model’s own metrics guide the selection. They automatically keep or discard features based on model-specific measures such as coefficients or feature importances. Examples of embedded methods include regularization and Random Forest feature importance.

Now, we will discuss the common embedded methods with an example.

1. Regularization

Regularization is an embedded method that adds a penalty to the model’s loss function to mitigate overfitting and shrink the coefficients of less important features. Lasso (L1) regularization can shrink the coefficients of unimportant features all the way to zero, effectively removing them, while Ridge (L2) regularization shrinks coefficients without eliminating features entirely.

Example: Lasso regression may keep features such as “Age” or “Income” while driving the coefficients of irrelevant features to zero during training.

2. Random Forest

The random forest algorithm is an ensemble of decision trees that scores each feature by how much it improves the splits across the trees. Features with low importance scores can be dropped.

Example: In a credit scoring model, a Random Forest might rank “Credit History” as far more important than “ZIP Code”, indicating that only the meaningful predictors should be kept.

For a better understanding, let’s see their implementation with the help of an example in Python.
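Because the original listing is not reproduced here, the snippet below is a minimal sketch of the embedded methods described, assuming scikit-learn. The Lasso penalty strength (alpha=0.1) and the number of trees are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_scaled = StandardScaler().fit_transform(X)

# Lasso: L1 regularization shrinks unhelpful coefficients all the way to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
lasso_coef = pd.Series(lasso.coef_, index=X.columns)
print("Lasso coefficients:\n", lasso_coef)
print("Features kept by Lasso:", list(lasso_coef[lasso_coef != 0].index), "\n")

# Random Forest: rank features by how much they improve the tree splits
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Random Forest importances:\n", importances)
print("Top features:", list(importances.head(2).index))
```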

Output: (figure: Embedded Methods)

The program above shows how embedded methods can be used for feature selection on the Iris dataset. Lasso selects features by shrinking some coefficients to zero through L1 regularization, and a Random Forest ranks features by how useful they are for splitting decisions. The script prints the selected features along with their coefficients or importance scores.


Unsupervised Feature Selection Techniques

Unsupervised feature selection methods are typically used when working with data where outputs are not labeled. The objective of these methods is to select those features that best represent the underlying structure or variance of the data. Let’s take a closer look at the more common unsupervised feature selection methods.

1. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique. It converts the original features into a new set of uncorrelated variables called the principal components. These components are generally ranked on the basis of how much variance they can capture. Also, this method helps to simplify data while retaining the most important information. 

Example: In image compression, PCA can reduce thousands of pixel values to a few key components while preserving image structure. 

Here is a Python example that helps you to understand the Principal Component Analysis (PCA) using the Iris dataset.
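The snippet below is a minimal sketch of the described workflow, assuming scikit-learn and matplotlib; plot styling choices such as the colormap are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

# Standardize the features so each contributes equally to the variance
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 features onto 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the two components, coloured by class
plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA on the Iris dataset")
plt.legend(*scatter.legend_elements(), title="Class")
plt.show()
```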

Output: (figure: Principal Component Analysis (PCA) Example)

The example above applies Principal Component Analysis (PCA) to the Iris dataset to reduce the 4-dimensional feature space to 2 dimensions for visualization. It first standardizes the features, then PCA projects the data onto the principal components, and finally a scatter plot shows the class separation in the reduced space.

2. Independent Component Analysis (ICA)

ICA is another method of dimensionality reduction that separates a multivariate signal into independent non-Gaussian components. It is mostly used when the goal is to identify the underlying source or signal from the observed data.

Example: ICA can be used to separate mixed audio recordings into individual speaker voices in audio processing.

The following Python code demonstrates how Independent Component Analysis (ICA) is applied to the Iris dataset.
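As before, the snippet below is a minimal sketch, assuming scikit-learn’s FastICA and matplotlib; the random seed is an illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FastICA

iris = load_iris()
X, y = iris.data, iris.target
X_scaled = StandardScaler().fit_transform(X)

# Extract 2 statistically independent components
ica = FastICA(n_components=2, random_state=42)
X_ica = ica.fit_transform(X_scaled)

# Scatter plot of the two independent components, coloured by class
plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_ica[:, 0], X_ica[:, 1], c=y, cmap="viridis")
plt.xlabel("Independent Component 1")
plt.ylabel("Independent Component 2")
plt.title("ICA on the Iris dataset")
plt.legend(*scatter.legend_elements(), title="Class")
plt.show()
```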

Output: (figure: Independent Component Analysis (ICA) Example)

The code above applies ICA to the Iris dataset to reduce its 4 features to 2 statistically independent components. It first standardizes the data, then applies FastICA to transform the features, and finally draws a scatter plot that shows how the classes separate.

3. Non-negative Matrix Factorization (NMF)

Non-negative matrix factorization is another dimensionality reduction technique that decomposes a data matrix into two matrices under a non-negativity constraint. It is useful when an interpretable, “parts-based” representation is expected and when the data values cannot be negative, as with images or text counts.

Example: NMF works as an extractor for topics in document-clustering models by factoring the document-term matrix and generating a topic-word matrix and a document-topic matrix.

Here is a Python example that illustrates Non-negative Matrix Factorization (NMF) on the Iris dataset. Since NMF expects all inputs to be non-negative, we avoid scaling methods that can produce negative values and instead scale the features into the [0, 1] range.
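The snippet below is a minimal sketch of that idea, assuming scikit-learn and matplotlib; the initialization method and iteration limit are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import NMF

iris = load_iris()
X, y = iris.data, iris.target

# NMF requires non-negative inputs, so scale every feature into [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

# Factorize the data into 2 non-negative components
nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
X_nmf = nmf.fit_transform(X_scaled)

# Scatter plot of the two NMF components, coloured by class
plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_nmf[:, 0], X_nmf[:, 1], c=y, cmap="viridis")
plt.xlabel("NMF Component 1")
plt.ylabel("NMF Component 2")
plt.title("NMF on the Iris dataset")
plt.legend(*scatter.legend_elements(), title="Class")
plt.show()
```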

Output: (figure: Non-negative Matrix Factorization (NMF) Example)

The above code applies Non-negative Matrix Factorization (NMF) to the Iris dataset by first scaling all values to a [0, 1] range. It reduces the 4 original features to 2 non-negative components and then visualizes class separation using a scatter plot.

4. T-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a nonlinear dimensionality reduction technique used primarily to visualize high-dimensional data by reducing it to 2 or 3 dimensions. It converts the similarities between data points into joint probabilities and mainly preserves local relationships between neighboring points.

Example: t-SNE can be used in machine learning to visualize how a model clusters similar images or similar word embeddings.

Here is a Python example that uses t-SNE on the Iris dataset to reduce it down to 2 dimensions and visualize the separation between classes.
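The snippet below is a minimal sketch of that visualization, assuming scikit-learn and matplotlib; the perplexity value and random seed are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

iris = load_iris()
X, y = iris.data, iris.target
X_scaled = StandardScaler().fit_transform(X)

# Non-linear reduction to 2 dimensions; perplexity controls the neighbourhood size
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Scatter plot of the 2D embedding, coloured by class
plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="viridis")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.title("t-SNE on the Iris dataset")
plt.legend(*scatter.legend_elements(), title="Class")
plt.show()
```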

Output: (figure: T-distributed Stochastic Neighbor Embedding (t-SNE) Example)

The above code uses t-SNE to reduce the Iris dataset from 4 features to 2 dimensions for visualization. It first standardizes the data, applies t-SNE for non-linear dimensionality reduction, and then plots the results to show how well the different flower classes are separated in 2D space.


5. Autoencoder

An autoencoder is an unsupervised neural network that performs non-linear dimensionality reduction by learning to compress (encode) the input data into a lower-dimensional representation and then reconstruct (decode) it back, retaining the key patterns and structure of the data.

For example, in image compression, an autoencoder can reduce the dimensionality of high-quality images while still preserving their core visual characteristics.

Below is a simple Autoencoder example written in Python using the Iris dataset, reducing the dimensionality from 4 to 2, using TensorFlow/Keras.
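The snippet below is a minimal sketch of such an autoencoder, assuming TensorFlow/Keras is installed; the layer sizes, activations, and number of training epochs are illustrative choices rather than tuned settings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from tensorflow import keras
from tensorflow.keras import layers

iris = load_iris()
X, y = iris.data, iris.target
X_scaled = MinMaxScaler().fit_transform(X)

# Encoder: 4 inputs -> 2-dimensional bottleneck
inputs = keras.Input(shape=(4,))
encoded = layers.Dense(2, activation="relu")(inputs)
# Decoder: reconstruct the 4 original features from the bottleneck
decoded = layers.Dense(4, activation="sigmoid")(encoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)

# Train the autoencoder to reproduce its own inputs
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_scaled, X_scaled, epochs=100, batch_size=16, verbose=0)

# Use the trained encoder as a 2D feature extractor and plot the classes
X_encoded = encoder.predict(X_scaled, verbose=0)
plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_encoded[:, 0], X_encoded[:, 1], c=y, cmap="viridis")
plt.xlabel("Encoded Dimension 1")
plt.ylabel("Encoded Dimension 2")
plt.title("Autoencoder features for the Iris dataset")
plt.legend(*scatter.legend_elements(), title="Class")
plt.show()
```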

Output: (figure: Autoencoder Example)

The code above builds and trains an autoencoder on the Iris dataset to compress the four features into two. The encoder part of the network compresses the input data, and a 2D plot of the encoded features is then shown for each of the three classes. This demonstrates how a neural network can perform unsupervised feature extraction and dimensionality reduction.

Choosing the Right Feature Selection Method

When choosing a feature selection method, consider the following points:

  1. Supervised vs Unsupervised: If you have labels, use supervised methods, and if you don’t have labels, use unsupervised methods like PCA or Autoencoders.
  2. Data Size & Complexity: Filter methods are best for large datasets because they are fast, while wrapper methods suit smaller datasets where precision matters more.
  3. Model Type: Use Lasso for linear models, and the feature importances from Random Forest for tree-based models.
  4. Interpretability vs Performance: Fewer features improve interpretability, while dimensionality reduction techniques can improve performance at the cost of explainability.
  5. Redundancy & Noise: Feature selection eliminates irrelevant or redundant features, which improves generalization and reduces overfitting.

Conclusion

Feature selection is a crucial step in creating efficient and accurate machine learning models. Machine learning has a variety of different feature selection techniques that can help to improve performance, mitigate overfitting, and improve interpretability by removing the irrelevant or redundant features. So, whether you are using filter, wrapper, embedded, or unsupervised methods, the importance of the technique selected should not be disregarded. The right technique is based on the dataset you are using, the task at hand, and the type of model.

Feature Selection Techniques in Machine Learning – FAQs

Q1. Why is feature selection beneficial?

Feature selection eliminates irrelevant features, which reduces model complexity, training time, and overfitting, and increases accuracy and interpretability.

Q2. What is the difference between filter methods and wrapper methods?

Unlike filter methods, which evaluate features independently of any model, wrapper methods evaluate feature subsets using a predictive model.

Q3. Can feature selection methods be used with unsupervised learning?

Yes, feature selection and dimensionality reduction can be used with unsupervised learning through techniques such as PCA, ICA, and autoencoders, which do not require class labels.

Q4. What are embedded methods?

Embedded methods select features while the model is being trained, for example Lasso (L1) regularization or feature importance from decision trees.

Q5. Do I have to use feature selection methods all of the time?

No, you don’t have to use feature selection methods all of the time; however, they are highly recommended for high-dimensional data or when you want to improve model generalization.

About the Author

Principal Data Scientist, Accenture

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Akash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.
