Machine learning sits at the intersection of computer science and statistics, and Random Forest is one of its most widely used algorithms. A Random Forest, also known as a random decision forest, is a collection of decision trees that collectively produce a single output. Leo Breiman introduced the algorithm, and it has since become a fundamental tool for machine learning practitioners.
What is a Random Forest Algorithm in Machine Learning?
As the name suggests, a random forest is a collection of multiple decision tree models. It is a supervised machine learning algorithm that builds a set of decision trees, each from a randomly selected subset of the training set, and collects a prediction from each tree. Then, by means of voting, the random forest selects the most popular prediction as its final output.
Random Forest Example
Let us understand the concept of a random forest with the help of a pictorial example.
Say, we have four samples as shown below:
The random forest algorithm will then create four decision trees, each taking its input from a subset of these samples.
The random forest algorithm works well because it aggregates many decision trees, which reduces the effect of noisy results; the predictions of a single decision tree, by contrast, may be prone to noise.
Random forest algorithms can be applied to build both classification and regression models.
- In the case of a random forest classification model, each decision tree votes; then to get the final result, the most popular prediction class is chosen.
- In the case of the random forest regression model, the mean of all decision tree results is considered as the final result.
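The two aggregation rules above can be sketched in a few lines of Python. The per-tree predictions below are made-up values for a single sample, used only for illustration; they do not come from a trained model.

```python
from collections import Counter

import numpy as np

# Hypothetical predictions from five trees for one sample (made-up values).
class_votes = ["benign", "malignant", "benign", "benign", "malignant"]
regression_outputs = np.array([21.5, 23.0, 22.1, 20.9, 22.5])

# Classification: the class with the most votes wins.
majority_class, n_votes = Counter(class_votes).most_common(1)[0]

# Regression: the final prediction is the mean of the tree outputs.
mean_prediction = float(regression_outputs.mean())

print(majority_class, n_votes)     # benign 3
print(round(mean_prediction, 2))   # 22.0
```

Note that scikit-learn's RandomForestClassifier averages the class probabilities predicted by each tree rather than counting hard votes, so its result can differ slightly from a plain majority vote.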
Advantages of Random Forest Algorithm
- The random forest algorithm is generally highly accurate because it combines the results of multiple decision trees.
- Overfitting is much less of a concern than with a single decision tree, since combining the outputs of many trees averages out the errors of individual trees.
- It can be used to build both classification and regression models.
- It provides feature importance scores, which help in selecting the features used to build the desired model.
Disadvantages of Random Forest Algorithm
- Random forest algorithm is comparatively slow in generating predictions because it has multiple decision trees.
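As a rough illustration of this cost, the sketch below times predictions for forests of different sizes. The synthetic dataset, the helper name predict_seconds, and the tree counts are assumptions chosen for illustration, and the measured times will vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def predict_seconds(n_trees):
    """Fit a forest with n_trees trees and return the time taken by predict()."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    return time.perf_counter() - start


for n_trees in (1, 10, 100):
    print(n_trees, "trees:", round(predict_seconds(n_trees), 4), "seconds")
```

Because the trees are independent of one another, scikit-learn can evaluate them in parallel; passing n_jobs=-1 to RandomForestClassifier uses all available CPU cores.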
Random Forest Algorithm vs. Decision Tree Algorithm
- Decision trees are prone to overfitting, whereas a random forest reduces overfitting by combining many trees.
- The random forest algorithm is comparatively time-consuming, whereas a single decision tree gives fast results.
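One way to see the overfitting difference is to compare training and test accuracy for both models. This is a minimal sketch on synthetic, deliberately noisy data (an assumption for illustration); the exact scores depend on the data and the random seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y assigns random labels to about 10% of the samples.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A fully grown single tree versus a forest of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, model in [("Decision tree", tree), ("Random forest", forest)]:
    print(
        name,
        "train accuracy:", round(model.score(X_train, y_train), 3),
        "test accuracy:", round(model.score(X_test, y_test), 3),
    )
```

A large gap between training and test accuracy is the usual sign of overfitting, so the gap is the number to compare between the two models.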
How does the Random Forest algorithm work?
Step 1: It draws random data samples, with replacement, from the given dataset.
Step 2: Then, it constructs a decision tree for each sample and collects the predicted output of every tree.
Step 3: For classification, it picks the result with the most votes among the trees; for regression, it takes the mean of their outputs.
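The three steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is a simplified illustration, not how RandomForestClassifier is implemented internally; the synthetic dataset, the number of trees, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (labels 0 and 1), for illustration only.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a two-class vote cannot tie
trees = []

for i in range(n_trees):
    # Step 1: draw a bootstrap sample (same size as the training set, with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit one tree per sample; max_features="sqrt" makes each split
    # consider only a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: collect every tree's prediction and take the majority vote.
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
votes_for_class_1 = all_preds.sum(axis=0)
forest_pred = (votes_for_class_1 > n_trees / 2).astype(int)

print("Ensemble accuracy:", (forest_pred == y_test).mean())
```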
Building a random forest Regression Model in Machine Learning Using Python and Sklearn
Problem Statement: Use machine learning to predict the selling prices of houses based on some economic factors. Build a random forest regression model in Python using Sklearn.
Dataset: Boston House Prices Dataset
Let us have a quick look at the dataset:
Regression Model Building: Random Forest in Python
Let us build the regression model with the help of the random forest algorithm.
Step 1: Load the required packages and the Boston dataset
import numpy as np
import pandas as pd
df = pd.read_csv('BostonHousing.csv')
Step 2: Define the features and the target
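# Features: every column except the last
x = df.iloc[:, :-1]
x
# Target: the last column, kept as a 1-D Series so that fit() receives the shape it expects
y = df.iloc[:, -1]
y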
Step 3: Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Step 4: Build the random forest regression model and predict on the test set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
Step 5: Evaluate the random forest regression model
from sklearn import metrics
print("Mean absolute error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean squared error:", metrics.mean_squared_error(y_test, y_pred))
print("Root mean squared error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
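The CSV file used above is not bundled with this article, so the following self-contained sketch swaps in synthetic data from make_regression (an assumption for illustration) and runs the same steps end to end. It also reports the R² score through r2_score, which the steps above do not compute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 13 features, one numeric target.
X, y = make_regression(
    n_samples=500, n_features=13, n_informative=8, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Same model settings as in the walkthrough above.
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean absolute error:", mae)
print("Mean squared error:", mse)
print("Root mean squared error:", rmse)
print("R^2 score:", r2)
```

The root mean squared error is in the same units as the target, which usually makes it easier to interpret than the mean squared error.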
Creating and Visualizing a Random Forest Classification Model in Machine Learning Using Python
Problem Statement: Use machine learning to predict whether a breast tumor is malignant or benign from measurements of cell nuclei taken from breast mass samples
Dataset: Breast Cancer Wisconsin (Diagnostic) Dataset
Let us have a quick look at the dataset:
Classification Model Building: Random Forest in Python
Let us build the classification model with the help of a random forest algorithm.
Step 1: Load Pandas library and the dataset using Pandas
import pandas as pd
dataset = pd.read_csv('Cancer_data.csv')
dataset
dataset.head()
Step 2: Define the features and the target
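# Features: columns from index 2 up to, but not including, the last column
X = dataset.iloc[:, 2:-1]
# Target: column 1, kept as a 1-D Series so that fit() receives the shape it expects
y = dataset.iloc[:, 1]
X
y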
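Step 3: Split the dataset into train and test sets using sklearn
from sklearn.model_selection import train_test_split
# A fixed random_state makes the split, and therefore the scores below, reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)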
Step 4: Import the RandomForestClassifier class from the sklearn ensemble module and use it to build the random forest classifier model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=1, max_depth=3)
classifier.fit(X_train, y_train)
Step 5: Predict values using the random forest classifier model
y_pred = classifier.predict(X_test)
Step 6: Evaluate the random forest classifier model
from sklearn.metrics import classification_report , confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
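For readers without the CSV file, the same dataset ships with scikit-learn as load_breast_cancer. The sketch below repeats the steps above on that bundled copy; the stratified split and the fixed random_state are additions made here for reproducibility, so the scores will not match a run on the CSV exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset: 569 rows, 30 numeric features.
# In this copy the target is encoded as 0 = malignant, 1 = benign.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Same hyperparameters as the classifier built above.
classifier = RandomForestClassifier(
    n_estimators=100, criterion="gini", random_state=1, max_depth=3
)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print("Accuracy:", acc)
```

In the confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal holds the correctly classified samples.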
Feature Selection in Random Forest Algorithm Model
With the help of Scikit-Learn, we can select the important features before building the random forest model, which reduces model complexity and helps avoid overfitting. There are two ways to do this:
- Visualize the feature importances to see which features add little or no value to the model
- Use the built-in SelectFromModel class, which takes a threshold value and drops features whose importance falls below it.
Let us see if selecting features makes any difference in the accuracy score of the model.
Step 7: Let us find out important features and visualize them using Seaborn
import pandas as pd
feature_imp = pd.Series(classifier.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp
import matplotlib.pyplot as plt
import seaborn as sns
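# Jupyter notebook magic; remove the next line when running as a plain Python script
%matplotlib inline
# Create a horizontal bar plot of the importance scores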
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
Now let us see how the ‘SelectFromModel’ class helps in building a random forest classifier model with only the important features.
Step 8: Import the SelectFromModel class. We will pass it the classifier object we’ve created above, along with a threshold value of 0.1
from sklearn.feature_selection import SelectFromModel
feat_sel = SelectFromModel(classifier,threshold=0.1)
feat_sel.fit(X_train, y_train)
Step 9: With the help of the ‘transform’ method, we will pick the important features and store them in new train and test objects
X_imp_train = feat_sel.transform(X_train)
X_imp_test = feat_sel.transform(X_test)
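Step 10: Let us now build a new random forest classifier model on the selected features so that we can compare its results with the old one. Note that this model also uses different hyperparameters (20 trees, maximum depth 7), so any difference in accuracy reflects both the feature selection and the new settings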
clf_imp = RandomForestClassifier(n_estimators=20, criterion='gini',random_state=1,max_depth=7)
clf_imp.fit(X_imp_train, y_train)
Step 11: Let us see the accuracy result of the old model
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)
Step 12: Let us see the accuracy result of the new model after feature selection
y_imp_pred = clf_imp.predict(X_imp_test)
accuracy_score(y_test, y_imp_pred)
Note: In this run, the accuracy score decreased after the feature selection process. Still, we have successfully picked out the important features at a small cost in accuracy.
Also, automatic feature selection reduces the complexity of the model but does not necessarily increase its accuracy. If accuracy is the priority, the threshold and the set of selected features may need to be tuned manually.
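To inspect which features survive the selection, SelectFromModel exposes a boolean mask through get_support(). The sketch below is a self-contained variant of Steps 8 to 12 on scikit-learn's bundled copy of the dataset. Two choices here are assumptions that differ from the walkthrough: the threshold is "median" instead of 0.1, so that a reasonable number of features is always kept, and both models share the same hyperparameters so that only the feature set changes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Shared hyperparameters, so the comparison isolates the effect of feature selection.
params = dict(n_estimators=100, random_state=1, max_depth=3)

# Baseline model trained on all 30 features.
full_model = RandomForestClassifier(**params).fit(X_train, y_train)

# Keep the features whose importance is at least the median importance.
selector = SelectFromModel(RandomForestClassifier(**params), threshold="median")
selector.fit(X_train, y_train)
selected = list(X.columns[selector.get_support()])
print("Selected features:", selected)

# Retrain on the reduced feature set and compare accuracy.
reduced_model = RandomForestClassifier(**params).fit(selector.transform(X_train), y_train)
acc_full = accuracy_score(y_test, full_model.predict(X_test))
acc_reduced = accuracy_score(y_test, reduced_model.predict(selector.transform(X_test)))
print("Accuracy with all features:", acc_full)
print("Accuracy with selected features:", acc_reduced)
```

A single train/test split gives a noisy estimate, so a small difference between the two accuracy scores should not be read as strong evidence either way.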
Conclusion
Random Forest is a powerful, versatile machine learning algorithm that improves accuracy by combining multiple decision trees. It reduces overfitting compared with a single decision tree and captures complex patterns, which makes it suitable for both classification and regression tasks. Its ability to provide insights into feature importance, together with its applicability to large datasets, makes it a reliable choice for a wide range of applications. Overall, Random Forest is a strong and effective tool for achieving good performance in machine learning models.