Decision Tree Algorithm in Machine Learning Using Sklearn

A Decision Tree is a supervised learning method used for both classification and regression tasks. Its main aim is to create a model that can predict the value of a target variable by learning simple decision rules inferred from the data features.

In this blog, we will guide you through the fundamentals of decision trees, how they work, and how to implement them in Python using the Scikit-Learn library. This will help you gain a strong grasp of decision trees so that you can apply them to your own datasets. So let’s get started!

Introduction to Decision Trees

A decision tree is a non-parametric supervised learning technique used for classification and regression tasks. It models decisions and their possible outcomes in a tree-like format, which makes it easy for users to understand. Each internal node of the tree represents a “test” or decision rule on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label or a continuous value. This hierarchical structure allows the model to capture complex decision boundaries in the data.

Some of the important terminologies related to Decision trees are given below.

  • Root Node: The node at the top of the decision tree, where the first split happens.
  • Leaf Node: The nodes that give the final output or prediction.
  • Internal Node: Any node that splits the data further based on a condition.
  • Splitting: The process of dividing a node into child nodes based on a feature.
  • Branch/Sub-tree: A section of the tree that represents part of the decision process.
  • Information Gain: A measure used to choose the feature that splits the data most effectively.
  • Gini Impurity: A metric that measures how often a randomly chosen element would be labeled incorrectly.
  • Entropy: A concept from information theory that measures the randomness or disorder in the data (a short sketch that computes Gini impurity and entropy appears after this list).
  • Pruning: A technique for cutting away parts of the tree to avoid overfitting and improve generalization.
  • Overfitting: The situation where the tree performs well on training data but poorly on unseen data.
  • Depth of the Tree: The length of the longest path from the root node to a leaf node.
  • Recursive Partitioning: The process of repeatedly splitting the dataset into subsets using selected features.
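
The impurity measures mentioned above can be made concrete with a few lines of NumPy. The snippet below is a minimal illustrative sketch: the helper functions gini and entropy are written for this example and are not part of Scikit-Learn.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([0, 0, 0, 1, 1, 2])
print(gini(labels))               # impurity of a mixed node
print(entropy(labels))            # disorder of the same node
print(gini(np.array([1, 1, 1])))  # a pure node has an impurity of 0.0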

Working of Decision Trees

The working process of a decision tree involves the following steps:

  • First, the decision tree repeatedly splits the dataset based on feature values, creating subsets that are as pure as possible (ideally containing only one class).
  • At each split, the best feature is chosen using a criterion such as Gini Impurity or Information Gain, which measures how well the feature separates the classes (see the split-scoring sketch after this list).
  • This splitting continues recursively, creating new branches and nodes, until a stopping condition is reached.
  • After the tree is built, predictions are made by following a path from the root down to a leaf node, using the feature values of the input data to guide the way.
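
To make the splitting criterion more concrete, the sketch below scores one hypothetical split by comparing the parent node's Gini impurity with the weighted impurity of the two child nodes; the drop between them is the gain that the tree tries to maximize. The gini helper and the example labels are illustrative and not part of Scikit-Learn.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A hypothetical parent node split into two children by some feature threshold
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])   # samples falling below the threshold
right = np.array([0, 1, 1, 1])  # samples falling above the threshold

weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
gain = gini(parent) - weighted_children
print(f'Parent impurity: {gini(parent):.3f}')       # 0.500
print(f'After the split: {weighted_children:.3f}')  # 0.375
print(f'Impurity decrease: {gain:.3f}')             # 0.125; the tree picks the split with the largest decrease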

Implementing Decision Trees with Scikit-Learn

Now, let’s look at the step-by-step implementation of Decision Trees in Python using the Scikit-Learn library.

Step 1: Importing Necessary Libraries

In the first step, you have to import all the necessary libraries, as given below.

Example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

Explanation: 

The above code only imports the libraries needed to load the Iris dataset, train a decision tree model with Scikit-Learn, and visualize it with Matplotlib. It does not generate any output, because nothing beyond the imports is executed: no training, prediction, or plotting happens yet.

Step 2: Loading the Dataset

In the second step, we load the Iris dataset, which consists of measurements of Iris flowers along with their species.

iris = load_iris()
X = iris.data
y = iris.target

Explanation:

The above code only loads the iris dataset. It then assigns the feature values to X and the target labels (species of the flowers) to y. Although the dataset is loaded and the variables are assigned, this code does not generate an output because no print or display function is used.
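
If you want to confirm what was loaded, you can optionally print the shapes and names yourself; this quick check is an addition for illustration and is not part of the original steps.

print(X.shape)             # (150, 4): 150 flowers with 4 measurements each
print(iris.feature_names)  # the names of the 4 features
print(iris.target_names)   # the 3 species: setosa, versicolor, virginica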

Step 3: Splitting the Dataset

In this step, we are going to split the dataset into training and testing sets. This will help in evaluating the performance of the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation:

In the above code, the dataset is split so that 80% of the data is used for training and 20% is reserved for testing. This code does not generate an output because there is no print or display statement.
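
As an optional check (not part of the original steps), you can print the shapes of the resulting arrays to confirm the 80/20 split.

print(X_train.shape, X_test.shape)  # (120, 4) and (30, 4): 120 training samples, 30 test samples
print(y_train.shape, y_test.shape)  # (120,) and (30,)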

Step 4: Training the Decision Tree Classifier

In this step, we will create and train the decision tree classifier.

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

Explanation:

The above code builds and trains the DecisionTreeClassifier by selecting the best splits based on the training data.
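
Once the classifier is fitted, you can optionally query a few properties of the learned tree; these inspection calls are an addition for illustration and are not part of the original steps.

print(clf.get_depth())       # depth of the learned tree
print(clf.get_n_leaves())    # number of leaf nodes
print(clf.tree_.node_count)  # total number of nodes (internal nodes plus leaves)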

Step 5: Making Predictions

In this step, we will make predictions with the trained model on the test set.

Example:

y_pred = clf.predict(X_test)

Explanation:

The above line uses the trained decision tree model clf to predict the class labels for the test dataset X_test and stores the predictions in y_pred. The predictions are computed, but no output is shown because y_pred is never printed or displayed.
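
If you do want to see the predictions, an optional addition is to print the first few predicted labels next to the true labels from the test set.

print(y_pred[:5])                     # first five predicted class labels
print(y_test[:5])                     # corresponding true class labels
print(iris.target_names[y_pred[:5]])  # the same predictions shown as species names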

Step 6: Model Evaluation

In this step, we calculate the accuracy of the model by comparing the predictions with the actual labels.

Example:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Explanation:

In the above code, the accuracy of the model is calculated by comparing the predicted labels (y_pred) with the actual test labels (y_test). After that, the result is printed.
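
Accuracy is a single number, so as an optional extra you can also look at Scikit-Learn's confusion_matrix and classification_report, which give a per-class view of the same predictions.

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))  # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred, target_names=iris.target_names))  # precision, recall, and F1 per species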

Step 7: Visualizing the Decision Tree

In this step, we are going to visualize the decision tree, which will help you to understand the decision rules learned by the model.

Example:

plt.figure(figsize=(20,10))
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

Explanation:

The above visualization displays the structure of the decision tree. It shows how a decision is made at each node based on the feature values.
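
If a plot window is not convenient, the same decision rules can also be printed as plain text with export_text; this is an optional alternative to plot_tree.

from sklearn.tree import export_text

rules = export_text(clf, feature_names=iris.feature_names)
print(rules)  # the learned decision rules as indented if/else style text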

Hyperparameter Tuning in Decision Trees

To improve the performance of the decision tree model and to prevent overfitting, you can tune the model’s hyperparameters, such as the ones below.

  • max_depth: It sets the maximum depth of the tree. Limiting the depth prevents the model from becoming too complex and helps it avoid overfitting the training data.
  • min_samples_split: It sets the minimum number of samples required to split an internal node. Higher values prevent the model from learning overly specific patterns.
  • min_samples_leaf: It sets the minimum number of samples required at a leaf node. Setting this parameter can improve the model’s performance and reduce variance.

Example: Creating a tree with a maximum depth of 3.

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

Explanation:

By adjusting these parameters, you can control the complexity of the tree and help the model generalize better to unseen data.
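
Instead of choosing these values by hand, you can also search over several combinations with cross-validation. The sketch below is one possible approach using GridSearchCV, continuing from the earlier steps; the parameter grid shown is an illustrative choice, not a recommended set of values.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # best combination found by cross-validation
print(grid.best_score_)            # mean cross-validated accuracy of that combination
print(grid.score(X_test, y_test))  # accuracy of the refitted best model on the test set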

Advantages of Decision Trees

Some of the advantages of decision trees are given below:

  • Easy to Understand and Interpret: The flowchart-like structure of a decision tree makes it easy to follow the decision-making process.
  • Requires Little Data Preparation: Decision trees do not require feature scaling or normalization.
  • Handles Both Numerical and Categorical Data: Decision trees are versatile and can manage both numerical and categorical data.

Disadvantages of Decision Trees

Some of the disadvantages of decision trees include:

  • Prone to Overfitting: Decision trees are prone to overfitting, especially with small datasets.
  • Unstable: Small variations in the data can result in very different tree structures.
  • Biased with Imbalanced Data: Decision trees tend to favor classes that have more instances.

Conclusion

In this blog, you have learned the basics and the implementation of decision trees using the Scikit-learn library. You have also learned how decision trees work, the terminology related to them, and their step-by-step implementation in Python. Decision trees are simple and interpretable, which makes them a good choice for many classification and regression problems. Understanding decision trees will give you a strong foundation in the world of Machine Learning. If you want to learn more about this technology, then check out our Comprehensive Data Science Course.

FAQs

  1. Can decision trees handle both classification and regression tasks?

Yes, decision trees can be used for both classification and regression tasks. DecisionTreeClassifier is used for classification, and DecisionTreeRegressor is used for regression.
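
For regression, the workflow looks the same but uses DecisionTreeRegressor. The minimal sketch below fits it on the Diabetes dataset purely as an illustrative example; any dataset with a continuous target would work.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)  # a regression dataset with a continuous target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 score of the regression tree on the test set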

  2. How can I improve the performance of a Decision Tree model?

To improve the performance of a decision tree model, you can use techniques like pruning and hyperparameter tuning, or you can use ensemble methods such as Random Forests.

  3. Are decision trees affected by missing values in the data?

Yes, decision trees are affected by missing values in the data, because the basic decision tree implementation in Scikit-learn does not handle missing values well. You should always preprocess the data and handle missing values before training the model.
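
One common way to handle missing values before training is imputation. The sketch below fills missing entries with the column mean using SimpleImputer; the small array containing np.nan values is purely illustrative.

import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 2.0],
                      [np.nan, 3.0],
                      [4.0, np.nan]])

imputer = SimpleImputer(strategy='mean')  # replace each NaN with the mean of its column
X_filled = imputer.fit_transform(X_missing)
print(X_filled)  # the imputed data can now be used to train a decision tree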

  4. What are some common disadvantages of Decision Trees?

Some common disadvantages are that decision trees are prone to overfitting and are sensitive to small changes in the data, which can lead to very different tree structures.

  5. Can I visualize the tree structure after training?

Yes. You can use sklearn.tree.plot_tree() to visualize the tree structure, or export it as text using export_text().


About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.