What is Reinforcement Learning?

Understanding Reinforcement Learning

We are living in the 21st century, the era of automation, and Machine Learning has been a star performer in this field. The automated machines that we create using Machine Learning techniques carry out repetitive tasks to reduce human effort and time.

However, real-world tasks are often too complex to program by hand: it is impractical to spell out every course of action for a machine in advance. This creates the need for a technique that enables the machine to learn and improve by itself. This Machine Learning technique is called reinforcement learning.

Reinforcement learning in Machine Learning is a technique where a machine learns to determine the right step based on the results of the previous steps in similar circumstances.

Mechanism of Reinforcement Learning

  • Reinforcement learning works on the principle of feedback and improvement.
  • In reinforcement learning, we do not train the model on a pre-collected, labeled dataset.
  • Instead, the machine takes certain steps on its own, analyzes the feedback, and then tries to improve its next step to get the best outcome.

Reinforcement Learning Process

Reinforcement learning is the practice of learning to make optimal decisions from experience. Broken down further, the method of reinforcement learning includes the following steps:

  1. Observing the current situation (the state)
  2. Deciding on an action by applying some strategy
  3. Performing the action
  4. Receiving a reward or a punishment
  5. Exploring new options with the help of past experience and improving the strategy
  6. Repeating the cycle until the machine has learned to act well
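
The six steps above can be sketched as a small feedback loop. This is only an illustration under stated assumptions: the two-action "slot machine" task, its win probabilities, and the names run_feedback_loop, win_prob, estimates, and counts are all hypothetical and do not come from any library.

```python
import random


def run_feedback_loop(num_steps=2000, epsilon=0.1, seed=0):
    """Toy agent that learns which of two actions pays off more often."""
    rng = random.Random(seed)
    win_prob = [0.2, 0.8]    # hypothetical environment, hidden from the agent
    estimates = [0.0, 0.0]   # the agent's running estimate of each action's reward
    counts = [0, 0]          # how often each action has been tried

    for _ in range(num_steps):
        # Step 1: observe the situation (this toy task has a single, fixed state).
        # Step 2: decide on an action; mostly the best known one, sometimes a random one.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: estimates[a])

        # Steps 3 and 4: perform the action and receive a reward (+1) or punishment (-1).
        reward = 1.0 if rng.random() < win_prob[action] else -1.0

        # Step 5: improve the estimate for that action using the new experience.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    # Step 6 is the loop itself: the cycle repeats until the estimates settle.
    return estimates, counts


estimates, counts = run_feedback_loop()
print("estimated rewards:", estimates)
print("times each action was chosen:", counts)
```

After enough iterations, the agent should end up choosing the better-paying action far more often than the other one, without ever being told the win probabilities.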

Let’s now understand the theory behind reinforcement learning with the help of a use case to make the picture clearer.

You have a chessboard in front of you, and you have no idea how to play chess. The game has started, and you have to make a move. You randomly pick up a Bishop (the RL agent) and move it in a straight line, as shown in the image below:

[Figure: chessboard showing the Bishop's straight move]

But that's a wrong move! A Bishop can only move diagonally, forward or backward, staying on squares of a single color, and only if the path is clear. The lesson from this move is that next time you would probably try a different, legal move. In the same way, you would keep learning from the feedback you receive after each move until you know which moves are right.

This is nothing but reinforcement learning. With the help of this reinforcement learning example, we have understood the theory behind it. Now, we will look into the algorithm that is used to implement reinforcement learning.

How do we implement Reinforcement Learning?

So far, we have discussed the theoretical aspects of reinforcement learning. But, the question that arises is, how do we implement reinforcement learning on a model? Is there any method or a reinforcement learning algorithm to do so?

Yes! There is an algorithm named Q-learning that helps the RL (reinforcement learning) agent decide the actions it needs to take in different circumstances.

How does Q-learning work?

The Q-learning technique acts as a crib sheet for the reinforcement learning agent. It enables the RL agent to use the feedback of the environment to learn the best actions it can take in different circumstances.

Q-learning makes use of Q-values to track and improve the performance of the RL agent. Initially, the Q-values are set to any arbitrary value. When the RL agent performs different actions and receives the feedback (a reward or a punishment) for the actions, the Q-values are updated.

To update the Q-values, we use the following Bellman equation:

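$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \right]$$

The above equation can also be written as follows:

$$Q(S, A) \leftarrow (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') \right]$$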

Here,

S: The present condition (state) of the RL agent

A: The present action to be performed

S′: The next state, in which the agent ends up after performing the action

A′: The best next action, chosen using the current Q-value estimates

R: The immediate reward received from the environment in response to the action performed

α: The learning rate. Its value is greater than 0 and less than or equal to 1 (0 < α ≤ 1). It controls how strongly the Q-values are updated in each iteration

γ: The discount factor. Its value lies between 0 and 1 (0 ≤ γ ≤ 1). It determines the significance of future rewards. A value of γ close to 1 makes the RL agent weigh long-term rewards heavily, while a value of 0 means that the agent considers only immediate rewards
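
Using the symbols defined above, a single step of the standard Q-learning update moves Q(S, A) a fraction α of the way toward the target R + γ · max Q(S′, A′). The sketch below works through one such step with made-up numbers; the function name q_update and all input values are hypothetical.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Return the updated Q(S, A) after one Q-learning step."""
    # Target: immediate reward plus the discounted best Q-value of the next state.
    td_target = reward + gamma * max_q_next
    # Move the old estimate a fraction alpha of the way toward the target.
    return q_sa + alpha * (td_target - q_sa)


# Made-up numbers: Q(S, A) = 2.0, R = 1.0, best Q-value in S' = 4.0.
new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=0.5, gamma=0.9)
print(new_q)  # target is 1.0 + 0.9 * 4.0 = 4.6, so the result is about 3.3

# With alpha = 1, the old estimate is replaced by the target entirely.
full_step = q_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=1.0, gamma=0.9)
```

A smaller α makes learning slower but less sensitive to a single noisy reward.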

In other words, the Bellman equation states that the Q-value of being in state S and performing action A is the immediate reward R(S, A) plus the discounted highest Q-value attainable from the next state S′.

Also, Q(S′, A′) in turn depends on Q(S″, A″), and so on, as shown in the equation below:

$$Q(S, A) = R(S, A) + \gamma \max_{A'} \left[ R(S', A') + \gamma \max_{A''} Q(S'', A'') \right]$$

Adjusting the value of γ decreases or increases the contribution of future rewards.
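
To make the effect of γ concrete, the sketch below computes the discounted sum of one made-up reward sequence for three values of γ. The function name discounted_return and the reward values are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma ** k."""
    total = 0.0
    # Work backwards so each earlier step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Made-up episode: three small rewards followed by one large, delayed reward.
rewards = [1.0, 1.0, 1.0, 10.0]

myopic = discounted_return(rewards, gamma=0.0)       # 1.0: only the first reward counts
balanced = discounted_return(rewards, gamma=0.5)     # 3.0: the delayed reward is shrunk by 0.5 ** 3
farsighted = discounted_return(rewards, gamma=1.0)   # 13.0: all rewards count equally
print(myopic, balanced, farsighted)
```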

Since the Bellman equation is recursive, we can start from arbitrary initial guesses for all the Q-values. As the agent gains experience, the estimates converge toward the optimal strategy.

Practically, it is implemented as follows:

$$Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{A} Q_t(S_{t+1}, A) - Q_t(S_t, A_t) \right]$$

where t denotes the iteration.

We can also use an ε-greedy policy to choose the action, based on the current Q-values.

With probability 1 − ε, the agent chooses the action with the largest Q-value. With probability ε, it chooses an action at random.
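
The ε-greedy rule can be sketched in a few lines. This is a minimal illustration, not the implementation used later in this article; the function name epsilon_greedy_action and the example Q-values are hypothetical.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an action index from a list of Q-values using the epsilon-greedy rule."""
    # Explore: with probability epsilon, pick any action uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: otherwise pick the action with the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]  # made-up Q-values for three actions in one state
picks = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)  # a little above 0.9, since random picks also hit the best action
```

Note that the best action is chosen slightly more often than 1 − ε, because the random branch can also land on it.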

So far, we have covered the theoretical concepts. Now, in this blog on ‘What is Reinforcement Learning?’, we will implement Q-learning in Python.

Implementing Q-learning for Reinforcement Learning in Python

For implementing algorithms of reinforcement learning such as Q-learning, we use the OpenAI Gym environment available in Python.

Now, let’s look at the steps to implement Q-learning:

Step 1: Importing Libraries

import itertools
from collections import defaultdict

import gym
import matplotlib
import matplotlib.style
import numpy as np

# windy_gridworld and plotting are assumed to be local helper modules
# available on the import path; they are not part of the gym package.
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')

Step 2: Creating the Gym Environment

env = WindyGridworldEnv()

Step 3: Defining the Epsilon-Greedy Policy

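def createEpsilonGreedyPolicy(Q, epsilon, n_action):
    """Return a function that maps a state to epsilon-greedy action probabilities."""
    def policyFunction(state):
        # Spread epsilon evenly over all actions (exploration).
        action_probabilities = np.ones(n_action,
                dtype=float) * epsilon / n_action

        # Give the remaining 1 - epsilon to the best-known action (exploitation).
        best_step = np.argmax(Q[state])
        action_probabilities[best_step] += (1.0 - epsilon)
        return action_probabilities

    return policyFunction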

Step 4: Building the Q-learning Model

def qLearning(env, num_episodes, discount_factor=1.0,
              alpha=0.6, epsilon=0.1):
    # Q-table: maps each state to an array holding one Q-value per action
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Tracking the important statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Creating function for an epsilon greedy policy
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    for ith_episode in range(num_episodes):
        # Reset the environment and get the starting state
        state = env.reset()

        for t in itertools.count():
            # Get the action probabilities for the current state
            action_probabilities = policy(state)

            # Sample an action according to those probabilities
            action = np.random.choice(
                np.arange(len(action_probabilities)),
                p=action_probabilities)

            # Perform the action (assumes step() returns the classic 4-tuple)
            next_state, reward, done, _ = env.step(action)

            # Update the statistics for this episode
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # Q-learning update toward reward + discounted best next Q-value
            best_next_step = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_step]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            if done:
                break

            state = next_state

    return Q, stats

Step 5: Training the Model

Q, stats = qLearning(env, 1000)
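
If the windy_gridworld and plotting helper modules are not at hand, the same training loop can be tried on a tiny hand-written task. The following is a sketch under stated assumptions, not the environment used in this article: the six-cell corridor, its rewards, and the names step, train, and greedy_policy are all hypothetical, and only the Python standard library is used.

```python
import random
from collections import defaultdict

N_STATES = 6        # hypothetical corridor with cells 0..5; cell 5 is the goal
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right


def step(state, action_index):
    """Move within the corridor; every step costs -1 except the one reaching the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 0.0 if done else -1.0
    return next_state, reward, done


def train(num_episodes=300, alpha=0.6, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])  # one Q-value per action, initialized to 0
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice (ties are broken in favor of moving right).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if Q[state][0] > Q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Same update as in qLearning above: move toward the TD target.
            td_target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q


Q_corridor = train()
# Read off the greedy action (0 = left, 1 = right) for every non-goal cell.
greedy_policy = [0 if Q_corridor[s][0] > Q_corridor[s][1] else 1
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

Because the goal is at the right end of the corridor, the learned greedy policy should choose "right" in every cell.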

Step 6: Plotting the Visualization Graph

plotting.plot_episode_stats(stats)

[Figures: Episode length over time; Episode time per step; Episode reward over time]

From the graphs above, we can infer that the reward per episode increases as training progresses. The rise of the per-episode reward toward its maximum shows that the RL agent learns to take the right actions by maximizing its total reward.
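
Per-episode rewards tend to be noisy, so trends like this are easier to read after smoothing. The sketch below shows one simple way to do that with a moving average; the function name moving_average and the reward values are made up for illustration and are not output from the training run above.

```python
def moving_average(values, window):
    """Return the average of each consecutive run of `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        # Drop the value that just slid out of the window.
        if i >= window:
            running_sum -= values[i - window]
        # Start emitting averages once the first full window is available.
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


# Made-up per-episode rewards that improve over time.
episode_rewards = [-120.0, -80.0, -60.0, -30.0, -20.0, -16.0]
smoothed = moving_average(episode_rewards, window=3)
print(smoothed)
```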
That is all about reinforcement learning and its implementation.

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.