Intellipaat

Building Random Forest Algorithm Models in Python and Sklearn

In this random forest tutorial blog, we will learn what the random forest algorithm is and how to build random forest models with Scikit-Learn's random forest classifier and random forest regressor. The blog walks through the implementation of random forest in Python and Sklearn.

 26th Jun, 2019

What is Random Forest Algorithm in Machine Learning?

As the name suggests, a random forest is a collection of multiple decision tree models. Random forest is a supervised Machine Learning algorithm. It builds a set of decision trees, each trained on a randomly selected subset of the training set, and collects a prediction from each tree. Then, by means of voting, the random forest algorithm selects the final prediction.

Random Forest Example

Let us understand the concept of random forest with the help of a small example.

Say we have four samples in the training set.

The random forest algorithm will create four decision trees, each taking its inputs from a randomly chosen subset of these samples.

The random forest algorithm works well because it aggregates many decision trees, which reduces the effect of noisy results, whereas the predictions of a single decision tree may be prone to noise.

The random forest algorithm can be applied to build both classification and regression models.

  • In the case of a random forest classification model, each decision tree votes; then to get the final result, the most popular prediction class is chosen.
  • In the case of a random forest regression model, the mean of all decision tree results is taken as the final result.
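
The two aggregation rules above can be sketched in a few lines of plain Python. This is only an illustration: the helper names and the per-tree outputs below are made up for the example and do not come from any library.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of four trees for one input sample
class_votes = ["benign", "malignant", "benign", "benign"]
price_estimates = [21.0, 23.5, 22.0, 21.5]

print(aggregate_classification(class_votes))  # benign
print(aggregate_regression(price_estimates))  # 22.0
```

Note that Scikit-Learn's own random forest classifier averages the class probabilities predicted by the trees rather than counting hard votes, so the sketch shows the idea, not the library's exact procedure.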

Advantages of Random Forest Algorithm

  • The random forest algorithm generally achieves high accuracy because it combines the predictions of multiple decision trees.
  • Overfitting is much less of a concern than with a single decision tree, since averaging the outputs of many trees reduces the variance of the final prediction.
  • It can be used to build both random forest classification and random forest regression models.
  • It provides feature importance scores, which help us select the important features for the desired model.

Disadvantages of Random Forest Algorithm

  • The random forest algorithm is comparatively slow in generating predictions because every input has to be evaluated by multiple decision trees.

Random Forest Algorithm vs. Decision Tree Algorithm

  • Decision trees are prone to overfitting, whereas the random forest algorithm reduces overfitting by averaging many trees.
  • The random forest algorithm is comparatively time-consuming, whereas a single decision tree gives fast results.
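
One way to check the overfitting claim is to compare train and test accuracy for both models. The sketch below assumes Scikit-Learn's bundled copy of the breast cancer dataset and an arbitrary 70/30 split; the exact scores depend on the library version and the random seed, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Store (train accuracy, test accuracy) for each model
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

A large gap between the train and test accuracy of a model is the usual sign of overfitting.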

How does the Random Forest algorithm work?

Step 1: It selects random data samples (subsets of rows, typically drawn with replacement) from a given dataset.

Step 2: Then, it constructs a decision tree for each of these samples and collects the predicted output of every tree.

Step 3: For classification, it picks the most voted result of those decision trees; for regression, it averages their outputs.
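
These three steps can be sketched by hand with Scikit-Learn's DecisionTreeClassifier. This is a simplified illustration on synthetic data, not how RandomForestClassifier is implemented internally; the dataset, the number of trees and the seeds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data with binary labels 0/1 (illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

n_trees = 5  # odd, so a binary majority vote can never tie
trees = []
for i in range(n_trees):
    # Step 1: draw a bootstrap sample (random row indices, with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree on that sample; max_features="sqrt"
    # makes every split consider only a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: collect every tree's prediction and take the majority vote
all_preds = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
votes = (all_preds.mean(axis=0) > 0.5).astype(int)

print("Agreement with training labels:", (votes == y).mean())
```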

Building a Random Forest Regression Model in Machine Learning Using Python and Sklearn

Problem Statement: Use Machine Learning to predict the selling prices of houses based on some economic factors. Build a random forest regression model in Python and Sklearn.

Dataset: Boston House Prices Dataset

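Let us have a quick look at the dataset: it contains 506 records, each with 13 feature columns (such as the per-capita crime rate and the average number of rooms per dwelling) and a target column, MEDV, the median value of owner-occupied homes in thousands of dollars.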

Regression Model Building: Random Forest in Python

Let us build the regression model with the help of the random forest algorithm.

Step 1: Load required packages and the Boston dataset

Step 2: Define the features and the target

Step 3: Split the dataset into train and test sets

Step 4: Build the random forest regression model with the RandomForestRegressor class

Step 5: Evaluate the random forest regression model
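
The five steps can be sketched as follows. One caveat: the `load_boston` loader was removed from Scikit-Learn in version 1.2, so this sketch substitutes a synthetic dataset of the same shape (506 rows, 13 features) generated with `make_regression`. The split ratio, the number of trees and the seeds are assumptions too; to reproduce the tutorial's task, replace `X` and `y` with your own copy of the Boston data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Steps 1-2: load the packages and define features X and target y
# (synthetic stand-in for the Boston data: 506 rows, 13 features)
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=8, noise=10.0, random_state=42
)

# Step 3: split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: build the random forest regression model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 5: evaluate the model on the held-out test set
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

RMSE is expressed in the units of the target, while R^2 reports the share of the target's variance that the model explains.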

Creating and Visualizing a Random Forest Classification Model in Machine Learning Using Python

Problem Statement: Use Machine Learning to predict whether a breast tumor is malignant or benign from diagnostic measurements.

Dataset: Breast Cancer Wisconsin (Diagnostic) Dataset

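Let us have a quick look at the dataset: it contains 569 records, each with 30 numeric features computed from images of cell nuclei and a diagnosis label (malignant or benign).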

Classification Model Building: Random Forest in Python

Let us build the classification model with the help of a random forest algorithm.

Step 1: Load the Pandas library and read the dataset with it

Step 2: Define the features and the target

Step 3: Split the dataset into train and test sets

Step 4: Import RandomForestClassifier from the sklearn.ensemble module and build the random forest classifier model with it

Step 5: Predict values using the random forest classifier model

Step 6: Evaluate the random forest classifier model
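
A minimal sketch of Steps 1 to 6 is given below. It assumes Scikit-Learn's bundled copy of the dataset, loaded as a Pandas DataFrame with `as_frame=True`, in place of reading a CSV file with Pandas. The 70/30 split, the number of trees and the seeds are likewise assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Step 1: load the dataset as a Pandas DataFrame
# (30 feature columns plus a "target" column; 0 = malignant, 1 = benign)
df = load_breast_cancer(as_frame=True).frame

# Step 2: define the features and the target
X = df.drop(columns="target")
y = df["target"]

# Step 3: split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 4: build the random forest classifier model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 5: predict values for the test set
y_pred = clf.predict(X_test)

# Step 6: evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(confusion_matrix(y_test, y_pred))
```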

Feature Selection in Random Forest Algorithm Model

With the help of Scikit-Learn, we can select the important features before building the random forest model, which helps reduce overfitting. There are two ways to do this:

  • Visualize the feature importances to see which features are not adding any value to the model
  • Use the built-in SelectFromModel class, which lets us set a threshold value and discards the features whose importance falls below it

Let us see if selecting features makes any difference in the accuracy score of the model.

Step 7: Let us find out important features and visualize them using Seaborn
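
A sketch of this step, assuming the bundled breast cancer data and a freshly fitted classifier so that the snippet runs on its own. It ranks the importances in a Pandas Series and prints them; the Seaborn call is left as a comment so the example does not require a plotting backend.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# feature_importances_ holds one score per feature; the scores sum to 1
importances = pd.Series(
    clf.feature_importances_, index=data.feature_names
).sort_values(ascending=False)
print(importances.head(10))

# To draw the ranking with Seaborn (optional):
# import seaborn as sns
# sns.barplot(x=importances.values, y=importances.index)
```

Features at the bottom of the ranking contribute little to the trees' splits and are the natural candidates for removal.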

Now let us see how SelectFromModel helps in building a random forest classifier model with only the important features.

Step 8: Import the SelectFromModel class. We will pass the classifier object we’ve created above and set a threshold value of 0.1

Step 9: With the help of the ‘transform’ method, we will pick the important features and store them in new train and test objects

Step 10: Let us now build a new random forest classifier model (so that we can compare the results of this model with the old one)

Step 11: Let us see the accuracy result of the old model

Step 12: Let us see the accuracy result of the new model after feature selection
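
Steps 8 to 12 can be sketched as below, again on the bundled breast cancer data with an assumed split and seeds. The snippet rebuilds the original classifier first so that it runs on its own. The resulting accuracy values vary with the library version and the random seed, so they may differ from the ones this tutorial reports.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The original model, trained on all 30 features
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 8: keep only the features whose importance is at least 0.1
selector = SelectFromModel(clf, threshold=0.1)
selector.fit(X_train, y_train)

# Step 9: reduce the train and test sets to the selected features
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print("Features kept:", X_train_sel.shape[1])

# Step 10: build a new classifier on the reduced feature set
clf_sel = RandomForestClassifier(n_estimators=100, random_state=42)
clf_sel.fit(X_train_sel, y_train)

# Steps 11-12: compare the accuracy of the old and the new model
acc_full = accuracy_score(y_test, clf.predict(X_test))
acc_sel = accuracy_score(y_test, clf_sel.predict(X_test_sel))
print(f"All features:      {acc_full:.3f}")
print(f"Selected features: {acc_sel:.3f}")
```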

Note: After the feature selection process, the accuracy score decreased. However, we have successfully picked out the important features at a small cost in accuracy.

Also, automatic feature selection reduces the complexity of the model but does not necessarily increase the accuracy. To reach a desired accuracy, we may have to tune the feature selection manually, for example by trying different threshold values.

What did we learn so far?

In this random forest tutorial blog, we answered the question, ‘what is the random forest algorithm?’ We also learned how to build random forest models with the help of the random forest classifier and random forest regressor. Additionally, we talked about the implementation of the random forest algorithm in Python and Scikit-Learn.

 
