Understanding Reinforcement Learning
We are living in the 21st century, the era of automation. Machine Learning has become a driving force in the field of automation. The automated machines that we create using Machine Learning techniques carry out repetitive tasks to reduce human effort and save time.
However, many real-world tasks are too complex to be specified step by step. Programming every possible course of action for a machine is tedious and often impractical. This creates the need for a technique that enables the machine to learn and improve on its own. This Machine Learning technique is called reinforcement learning.
Reinforcement learning in Machine Learning is a technique where a machine learns to determine the right step based on the results of the previous steps in similar circumstances.
Mechanism of Reinforcement Learning
- Reinforcement learning works on the principle of feedback and improvement.
- In reinforcement learning, we do not train the model on a prepared dataset.
- Instead, the machine takes certain steps on its own, analyzes the feedback, and then tries to improve its next step to get the best outcome.
Reinforcement Learning Process
Reinforcement learning is the art of devising optimal decisions for a machine using experience. Breaking it down further, the method of reinforcement learning includes the following steps:
- Deciding an action by applying some tactics
- Performing the action
- Obtaining a reward or punishment
- Discovering new areas with the help of past experiences and improving the approach
- Iterating: sticking to the strategy and performing the action until the machine learns the task
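The steps above can be sketched as a small loop. This is a minimal illustration and not part of the original tutorial: the two-lever environment, its payout probabilities, and the names pull_lever and run_trials are all hypothetical.

```python
import random

# Hypothetical environment: two levers that pay a reward of 1 with
# different probabilities (assumed values, for illustration only).
PAYOUT_PROBABILITY = [0.2, 0.8]


def pull_lever(lever, rng):
    """Perform the action and return a reward of 1 or 0."""
    return 1 if rng.random() < PAYOUT_PROBABILITY[lever] else 0


def run_trials(num_trials=2000, explore_rate=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0, 0.0]  # what the machine has learned so far
    counts = [0, 0]         # how often each lever was tried

    for _ in range(num_trials):  # iterate
        # Decide on an action: mostly the best-known lever, sometimes a random one
        if rng.random() < explore_rate:
            lever = rng.randrange(2)
        else:
            lever = max(range(2), key=lambda a: estimates[a])

        # Perform the action and obtain the reward
        reward = pull_lever(lever, rng)

        # Learn from the experience: update the running average for this lever
        counts[lever] += 1
        estimates[lever] += (reward - estimates[lever]) / counts[lever]

    return estimates, counts


estimates, counts = run_trials()
print("Estimated payouts:", estimates)
print("Times each lever was pulled:", counts)
```

With these assumed payouts, the machine should end up pulling the second lever far more often, because its feedback has been better.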
Let’s now understand the theory behind reinforcement learning with the help of a use case to make the concept clearer.
You have a chessboard in front of you, and you have no idea how to play chess. The game has started, and you have to make a move. You randomly pick up a Bishop (the RL agent) and move it in a straight line.
But that’s a wrong move! A Bishop can only move diagonally, forward or backward, staying on squares of a single color, and only if the path is clear. So, the learning outcome from this move is that next time you would probably try a different, correct move. In the same way, you would keep learning the moves from the feedback you receive until you know which moves are right.
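The feedback in this example can be written down as a tiny reward function. The sketch below is illustrative only: the (column, row) square encoding, the function names, and the +1/-1 reward values are assumptions, and it ignores blocking pieces.

```python
def is_legal_bishop_move(start, end):
    """Return True if the move from start to end is diagonal.

    Squares are (column, row) pairs with values 0-7. Blocking pieces are
    ignored to keep the sketch short.
    """
    col_change = abs(end[0] - start[0])
    row_change = abs(end[1] - start[1])
    return col_change == row_change and col_change > 0


def feedback(start, end):
    """Hypothetical reward signal: +1 for a legal move, -1 for an illegal one."""
    return 1 if is_legal_bishop_move(start, end) else -1


print(feedback((2, 0), (2, 3)))  # straight move up the column: prints -1
print(feedback((2, 0), (5, 3)))  # diagonal move: prints 1
```

An agent that keeps track of which moves earned +1 and which earned -1 would gradually stop trying straight moves with the Bishop.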
This is nothing but reinforcement learning. With the help of this reinforcement learning example, we have understood the theory behind it. Now, we will look into the algorithm that is used to implement reinforcement learning.
How do we implement Reinforcement Learning?
So far, we have discussed the theoretical aspects of reinforcement learning. But the question that arises is, how do we implement reinforcement learning on a model? Is there any method or a reinforcement learning algorithm to do so?
Yes! There is an algorithm named Q-learning that helps the RL (reinforcement learning) agent decide the actions it needs to take in different circumstances.
How does Q-learning work?
The Q-learning technique acts as a crib sheet for the reinforcement learning agent. It enables the RL agent to use the feedback of the environment to learn the best actions it can take in different circumstances.
Q-learning makes use of Q-values to track and improve the performance of the RL agent. Initially, the Q-values are set to arbitrary values. As the RL agent performs different actions and receives feedback (a reward or a punishment) for them, the Q-values are updated.
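To update the Q-values, we use the following Bellman equation:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \right]$$
The above equation can also be written as follows:
$$Q(S, A) \leftarrow (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') \right]$$
In these equations: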
S: The present state (condition) of the RL agent
A: The present action to be performed
S′: The next state, where the agent ends up after performing the action
A′: The best next action, chosen using the present Q-values
R: The immediate reward received from the environment in response to the action performed
α: The learning rate. Its value is greater than 0 and less than or equal to 1. It controls how much the Q-values change in each iteration
γ: The discount factor. Its value lies between 0 and 1 (0 ≤ γ ≤ 1). It determines the significance of future rewards. A high value of γ (close to 1) makes the agent value long-term rewards, and a value of 0 means that the RL agent considers only immediate rewards
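To see how α and γ enter a single update, here is a small worked sketch. It assumes the standard Q-learning update, new Q = old Q + α × (R + γ × best next Q − old Q), which is also what the td_target and td_delta lines in Step 4 of this article compute. The function name q_update and all the numbers are made up for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Return the new Q(S, A) after one Q-learning update."""
    td_target = reward + gamma * max_q_next  # R + gamma * best Q-value in S'
    td_error = td_target - q_sa              # how far the old estimate was off
    return q_sa + alpha * td_error           # move part of the way toward the target


# Assumed numbers, for illustration only
new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=0.5, gamma=0.9)
print(new_q)
```

By hand: the target is 1 + 0.9 × 4 = 4.6, the old estimate is off by 2.6, and with α = 0.5 the estimate moves halfway, from 2.0 to about 3.3. With α = 1 it would jump straight to the target.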
The Bellman equation states that the Q-value obtained from being in state S and performing action A is the immediate reward R(S, A) plus the discounted highest Q-value attainable from the next state S′:
$$Q(S, A) = R(S, A) + \gamma \max_{A'} Q(S', A')$$
Also, Q(S′, A′) in turn depends on Q(S″, A″), and so on. Expanding the recursion along the best action in each state gives:
$$Q(S, A) = R(S, A) + \gamma R(S', A') + \gamma^2 R(S'', A'') + \cdots$$
As we adjust the γ value, we decrease or increase the contribution of the future rewards.
Since the Bellman equation is recursive, we can start with arbitrary guesses for all the Q-values. As it gains experience, the model converges to the optimal strategy.
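To make the role of γ concrete, the sketch below sums a reward sequence with each reward weighted by γ raised to the number of steps ahead. The reward sequence and the function name discounted_return are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma ** k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Hypothetical reward sequence: nothing for three steps, then a payoff of 10
rewards = [0, 0, 0, 10]

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With γ = 0 the delayed payoff contributes nothing, with γ = 0.5 it is worth 1.25, and with γ = 1 it counts in full, which is why a γ close to 1 favors long-term rewards.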
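Practically, it is implemented as follows:
$$Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{A} Q_t(S_{t+1}, A) - Q_t(S_t, A_t) \right]$$
where t denotes the iteration.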
We can also use an ε-greedy policy to choose the action, based on the current Q-values. With probability 1 − ε, the action with the largest Q-value is chosen. Otherwise, with probability ε, an action is chosen at random.
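As a rough sketch of this selection rule for a single state, using only the standard library: the function name epsilon_greedy_action and the Q-values below are assumptions, not part of the original tutorial.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for one state."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: the action with the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Assumed Q-values for a single state with four actions
q_values = [0.1, 0.7, 0.3, 0.2]
rng = random.Random(42)
picks = [epsilon_greedy_action(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print("Share of greedy picks:", picks.count(1) / len(picks))
```

Because the random branch can also land on the best action, the greedy action is picked slightly more often than 1 − ε of the time. Step 3 of this article expresses the same rule as a probability vector instead of a sampled action.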
So far, we have looked at all the theoretical concepts. Now, in this blog on ‘What is Reinforcement Learning?’, we will implement Q-learning in Python.
Implementing Q-learning for Reinforcement Learning in Python
For implementing algorithms of reinforcement learning such as Q-learning, we use the OpenAI Gym environment available in Python.
Now, let’s look at the steps to implement Q-learning:
Step 1: Importing Libraries
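import itertools  # unbounded step counter used in Step 4
import numpy as np
import pandas as pd
from collections import defaultdict
from windy_gridworld import WindyGridworldEnv
# Step 4 uses plotting.EpisodeStats; plotting is assumed to be a local
# helper module that sits next to windy_gridworld
import plotting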
Step 2: Creating the Gym Environment
env = WindyGridworldEnv()
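Step 3: Defining the ε-Greedy Policy
def createEpsilonGreedyPolicy(Q, epsilon, n_action):
    # Returns a function that maps a state to action probabilities
    def policyFunction(state):
        # Every action gets an equal share of the exploration probability
        Action_probabilities = np.ones(n_action,
                                       dtype = float) * epsilon / n_action
        # The best-known action gets the remaining probability mass
        best_step = np.argmax(Q[state])
        Action_probabilities[best_step] += (1.0 - epsilon)
        return Action_probabilities
    return policyFunction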
Step 4: Building the Q-learning Model
def qLearning(env, num_episodes, discount_factor = 1.0,
              alpha = 0.6, epsilon = 0.1):
    # Q maps each state to an array of action values, initialized to zero
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Tracking the important statistics
    stats = plotting.EpisodeStats(
        episode_lengths = np.zeros(num_episodes),
        episode_rewards = np.zeros(num_episodes))

    # Creating function for an epsilon greedy policy
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    for i_episode in range(num_episodes):
        state = env.reset()

        for t in itertools.count():
            # Sampling an action from the epsilon greedy policy
            action_probabilities = policy(state)
            action = np.random.choice(np.arange(
                len(action_probabilities)),
                p = action_probabilities)

            # Taking the action and observing the reward and next state
            next_state, reward, done, _ = env.step(action)

            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # Temporal-difference update of the Q-value
            best_next_step = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_step]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            # Ending the episode once the environment reports it is done
            if done:
                break

            state = next_state

    return Q, stats
Step 5: Training the Model
Q, stats = qLearning(env, 1000)
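WindyGridworldEnv is imported from a separate module that is not shown in this article. As a dependency-free alternative, here is a minimal sketch of the same Q-learning update on a hypothetical five-cell corridor. The environment, its rewards, and the function names step and q_learning are all assumptions made for illustration.

```python
import random
from collections import defaultdict

N_CELLS = 5         # hypothetical corridor: cells 0..4, goal at cell 4
ACTIONS = [-1, +1]  # action 0 moves left, action 1 moves right


def step(state, action):
    """Move within the corridor; every step costs -1 until the goal is reached."""
    next_state = min(max(state + ACTIONS[action], 0), N_CELLS - 1)
    done = next_state == N_CELLS - 1
    return next_state, -1.0, done


def q_learning(num_episodes=200, alpha=0.6, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])  # one value per action, per state
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice of the action
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max(range(2), key=lambda a: Q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning update toward reward + gamma * best next Q-value
            td_target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy_path = [max(range(2), key=lambda a: Q[s][a]) for s in range(N_CELLS - 1)]
print("Greedy action per cell (1 = right):", greedy_path)
```

After training, the greedy action in every cell should be "right", and the Q-value of moving right from cell 0 should approach -4, the cost of the four steps needed to reach the goal.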
Step 6: Plotting the Visualization Graph
From the resulting graph, we can infer that the reward per episode increases as training progresses. This shows that the RL agent learns to take the right actions by maximizing its total reward.
This is all about Reinforcement Learning and its implementation.