Random Forest Algorithm in Python: Classification and Regression

Machine learning sits at the intersection of computer science and statistics, and Random Forest is one of its most widely used algorithms. Random Forest, also known as a random decision forest, is a collection of decision trees that collectively produce a single output. It was introduced by Leo Breiman and has since become a fundamental tool for machine learning practitioners.

Table of contents

Random Forest Example
Random Forest Algorithm vs. Decision Tree Algorithm
How does the Random Forest algorithm work?
Build Random Forest Regression Model in Machine Learning Using Python and Sklearn
Build Random Forest Classification Model in Machine Learning Using Python and Sklearn
Feature Selection in Random Forest Algorithm Model

What is a Random Forest Algorithm in Machine Learning?

As the name suggests, a random forest is a collection of multiple decision tree models. It is a supervised machine learning algorithm that builds each decision tree from a randomly selected subset of the training set and collects a prediction from every tree. The forest then combines these predictions, by voting in classification or by averaging in regression, to produce the final result.

Random Forest Example

Let us understand the concept of a random forest with the help of a pictorial example.

Say, we have four samples as shown below:

The random forest algorithm will create four decision trees, each taking its input from a subset of the samples, for example:

The random forest algorithm works well because it aggregates many decision trees, which reduces the effect of noisy results, whereas the predictions of a single decision tree may be prone to noise.

Random forest algorithms can be applied to build both classification and regression models.

In the case of a random forest classification model, each decision tree votes; then to get the final result, the most popular prediction class is chosen.
In the case of the random forest regression model, the mean of all decision tree results is considered as the final result.
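
The two aggregation rules above can be sketched with plain NumPy. This is only an illustration: the tree outputs below are made-up numbers, not results from a trained model, and the helper name majority_vote is hypothetical.

```python
import numpy as np

# Hypothetical class predictions from five trees (rows) for three samples (columns)
tree_votes = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
])

def majority_vote(votes):
    """Return the most frequent class label in each column (one column per sample)."""
    return np.array([np.bincount(column).argmax() for column in votes.T])

# Classification: the most popular class wins
forest_class = majority_vote(tree_votes)
print(forest_class)  # [1 0 1]

# Hypothetical numeric predictions from three trees (rows) for two samples (columns)
tree_values = np.array([
    [20.0, 31.0],
    [22.0, 29.0],
    [21.0, 30.0],
])

# Regression: the mean of the tree outputs is the final prediction
forest_value = tree_values.mean(axis=0)
print(forest_value)  # [21. 30.]
```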

Advantages of Random Forest Algorithm

The random forest algorithm is generally highly accurate because it combines the results of multiple decision trees.
Overfitting is much less of a concern than with a single decision tree, since averaging the outputs of many trees reduces the variance of the predictions.
It can be used to build both classification and regression models.
It provides feature importance scores, which help us select the important features when building the desired model.

Disadvantages of Random Forest Algorithm

The random forest algorithm is comparatively slow at generating predictions because every decision tree in the forest has to be evaluated.

Random Forest Algorithm vs. Decision Tree Algorithm

Decision trees are prone to overfitting, whereas a random forest reduces overfitting by combining many trees.
The random forest algorithm is comparatively time-consuming, whereas a single decision tree gives fast results.
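
A rough way to see this difference is to train both models on the same split and compare training and test accuracy. The sketch below assumes scikit-learn's built-in copy of the breast cancer data (load_breast_cancer) rather than a CSV file, and the exact numbers depend on the split and the random seeds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Built-in dataset used here only so that the example runs without external files
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# A single fully grown tree versus a forest of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

scores = {}
for name, model in [("decision tree", tree), ("random forest", forest)]:
    # score() returns the mean accuracy on the given data
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train accuracy = {scores[name][0]:.3f}, test accuracy = {scores[name][1]:.3f}")
```

A large gap between training and test accuracy is a sign of overfitting, so comparing the two gaps gives a quick, informal check of the claim above.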

How does the Random Forest algorithm work?

Step 1: It selects random data samples (with replacement) from a given dataset.

Step 2: Then, it constructs a decision tree for each sample and collects the predicted outputs of all those decision trees.

Step 3: For classification, it picks the most voted result of those decision trees; for regression, it takes the mean of their results.
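
The three steps can be sketched by hand with ordinary decision trees. This is a simplified illustration, not how scikit-learn implements RandomForestClassifier internally: it assumes a synthetic dataset from make_classification, 25 trees, and a random subset of features at each split (max_features="sqrt"), all of which are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for this sketch) so that the example is self-contained
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a two-class vote cannot end in a tie
trees = []

for i in range(n_trees):
    # Step 1: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit one decision tree on that sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: collect every tree's prediction and take the majority vote per test sample
all_votes = np.array([tree.predict(X_test) for tree in trees])  # shape: (n_trees, n_test)
forest_pred = np.array([np.bincount(column).argmax() for column in all_votes.T])

accuracy = (forest_pred == y_test).mean()
print("Accuracy of the hand-built forest:", accuracy)
```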

Building a Random Forest Regression Model in Machine Learning Using Python and Sklearn

Problem Statement: Use Machine Learning to predict the selling prices of houses based on some economic factors. Build a random forest regression model in Python and Sklearn.

Dataset: Boston House Prices Dataset

Let us have a quick look at the dataset:

Regression Model Building: Random Forest in Python

Let us build the regression model with the help of the random forest algorithm.

Step 1: Load the required packages and the Boston dataset

import numpy as np
import pandas as pd
df = pd.read_csv('BostonHousing.csv')

Step 2: Define the features and the target

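# All columns except the last one are the features
x = df.iloc[:, :-1]
x

# The last column is the target (kept as a Series so that sklearn receives a 1-D array)
y = df.iloc[:, -1]
y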

Step 3: Split the dataset into train and test sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

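Step 4: Build the random forest regression model and predict values

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
# np.ravel passes the target as a 1-D array, which is the shape sklearn expects
regressor.fit(x_train, np.ravel(y_train))
y_pred = regressor.predict(x_test)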

Step 5: Evaluate the random forest regression model

from sklearn import metrics
print("Mean absolute error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean squared error:", metrics.mean_squared_error(y_test, y_pred))
print("Root mean squared error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
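
If the BostonHousing.csv file is not at hand, the same workflow can be tried on synthetic data. The sketch below is an assumption-laden stand-in: it uses make_regression with 13 features in place of the real dataset, so the error values it prints say nothing about house prices. It also checks the earlier statement that the forest's prediction is the mean of the individual tree predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption for this sketch)
X, y = make_regression(n_samples=500, n_features=13, n_informative=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE, in the same units as the target
print("Mean absolute error:", mae)
print("Mean squared error:", mse)
print("Root mean squared error:", rmse)

# The fitted trees are available in estimators_; averaging their outputs
# should reproduce the forest's own prediction
tree_mean = np.mean([tree.predict(X_test) for tree in regressor.estimators_], axis=0)
print("Forest prediction equals mean of trees:", np.allclose(tree_mean, y_pred))
```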

Creating and Visualizing a Random Forest Classification Model in Machine Learning Using Python

Problem Statement: Use Machine Learning to predict cases of breast cancer using patient treatment history and health data

Dataset: Breast Cancer Wisconsin (Diagnostic) Dataset

Let us have a quick look at the dataset:

Classification Model Building: Random Forest in Python

Let us build the classification model with the help of a random forest algorithm.

Step 1: Load Pandas library and the dataset using Pandas

import pandas as pd
dataset = pd.read_csv('Cancer_data.csv')
dataset
dataset.head()

Step 2: Define the features and the target

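# Features: every column from index 2 up to, but not including, the last one
X = dataset.iloc[:, 2:-1]
# Target: the column at index 1, kept as a Series so that sklearn receives a 1-D array
y = dataset.iloc[:, 1]
X
y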

Step 3: Split the dataset into train and test sets using sklearn

from sklearn.model_selection import train_test_split
# random_state fixes the split so that the results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Step 4: Import the random forest classifier function from the sklearn ensemble module. Build the random forest classifier model with the help of the random forest classifier function

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion='gini',random_state=1,max_depth=3)
classifier.fit(X_train, y_train)

Step 5: Predict values using the random forest classifier model

y_pred = classifier.predict(X_test)

Step 6: Evaluate the random forest classifier model

from sklearn.metrics import classification_report , confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
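
To read these reports, it helps to see how the scores follow from the confusion matrix. The labels below are hypothetical (1 = malignant, 0 = benign) and are not taken from the dataset; they only illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions for ten patients
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[5 1]
#  [1 3]]

# For a binary problem, ravel() gives the four cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of true positives that were found

print(accuracy, precision, recall)  # 0.8 0.75 0.75

# The hand-computed values match sklearn's own metric functions
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```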

Feature Selection in Random Forest Algorithm Model

With the help of Scikit-Learn, we can select the important features used to build the random forest model, which reduces model complexity and can help avoid overfitting. There are two ways to do this:

Visualize the feature importances to see which features are not adding any value to the model
Use the built-in SelectFromModel class, which lets us set a threshold value and drops the features whose importance falls below it.

Let us see if selecting features makes any difference in the accuracy score of the model.

Step 7: Let us find out important features and visualize them using Seaborn

import pandas as pd
feature_imp = pd.Series(classifier.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp

import matplotlib.pyplot as plt
import seaborn as sns
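# Jupyter notebook magic; remove the next line when running as a plain script
%matplotlib inline
# Create a bar plot of the importance scores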
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

Now let us see how ‘SelectFromModel’ helps in building a random forest classifier model with only the important features.

Step 8: Import SelectFromModel. We will pass the classifier object we’ve created above and set a threshold value of 0.1, so that only features with an importance of at least 0.1 are kept

from sklearn.feature_selection import SelectFromModel
feat_sel = SelectFromModel(classifier,threshold=0.1)
feat_sel.fit(X_train, y_train)

Step 9: With the help of the ‘transform’ method, we will pick the important features and store them in new train and test objects

X_imp_train = feat_sel.transform(X_train)
X_imp_test = feat_sel.transform(X_test)
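
Since ‘transform’ returns the reduced data without column names, it can be useful to check which features were actually kept. The sketch below uses the selector's get_support method for that. It assumes scikit-learn's built-in copy of the breast cancer data instead of Cancer_data.csv, so the selected features may differ from those in the steps above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Built-in dataset loaded as a DataFrame so that the column names are available
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Same settings as the classifier and threshold used in the tutorial steps
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1, max_depth=3),
    threshold=0.1,
)
selector.fit(X_train, y_train)

# get_support() returns a boolean mask with one entry per input feature
mask = selector.get_support()
selected = list(X_train.columns[mask])
print("Selected features:", selected)

# The fitted forest is stored in estimator_, so its importances can be inspected
importances = selector.estimator_.feature_importances_
print("Importances of the selected features:", importances[mask])

# transform keeps exactly the columns marked True in the mask
X_imp_train = selector.transform(X_train)
print("Shape before:", X_train.shape, "after:", X_imp_train.shape)
```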

Step 10: Let us now build a new random forest classifier model (so that we can compare the results of this model with the old one)

clf_imp = RandomForestClassifier(n_estimators=20, criterion='gini',random_state=1,max_depth=7)
clf_imp.fit(X_imp_train, y_train)

Step 11: Let us see the accuracy result of the old model

y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)

Step 12: Let us see the accuracy result of the new model after feature selection

y_imp_pred = clf_imp.predict(X_imp_test)
accuracy_score(y_test, y_imp_pred)

Note: After the feature selection process, the accuracy score decreased, but we have successfully picked out the important features at a small cost in accuracy. Keep in mind that the new model also uses different n_estimators and max_depth values than the old one, so the difference is not due to feature selection alone.

Also, automatic feature selection reduces the complexity of the model but does not necessarily increase the accuracy. To reach the desired accuracy, we may have to adjust the threshold or choose the features manually.
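
One way to explore this trade-off is to try several thresholds and compare the cross-validated accuracy of each. The sketch below is only one possible setup: it assumes scikit-learn's built-in breast cancer data, 5-fold cross-validation, forests of 50 trees, and the string thresholds "median", "mean", and "1.5*mean" that SelectFromModel accepts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

results = {}
for threshold in ["median", "mean", "1.5*mean"]:
    # Putting selection and model in one pipeline means the features are
    # re-selected inside every cross-validation fold, using training data only
    pipeline = make_pipeline(
        SelectFromModel(
            RandomForestClassifier(n_estimators=50, random_state=1),
            threshold=threshold,
        ),
        RandomForestClassifier(n_estimators=50, random_state=1),
    )
    scores = cross_val_score(pipeline, X, y, cv=5)
    results[threshold] = scores.mean()
    print(f"threshold={threshold}: mean accuracy = {results[threshold]:.3f}")
```

A higher threshold keeps fewer features, so this loop shows how much accuracy is traded for a simpler model on this particular dataset.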

Conclusion

Random Forest is a powerful, versatile machine learning algorithm that improves accuracy by combining multiple decision trees. It reduces overfitting, can handle missing data, and captures complex patterns, which makes it suitable for both classification and regression tasks. Its ability to provide insights into feature importance and to scale to large datasets makes it a reliable choice for a wide range of applications. Overall, Random Forest is a strong and effective tool for achieving good performance in machine learning models.

