Understanding Reinforcement Learning
We are living in the 21st century, the era of automation. Machine Learning has been a driving force in this field: the automated systems we build using Machine Learning techniques carry out repetitive tasks to reduce human effort and time.
However, many real-world tasks are too complex for a machine to execute from hard-coded instructions alone; programming every possible course of action is tedious and impractical. This creates the need for a technique that enables a machine to learn and improve on its own. This Machine Learning technique is called reinforcement learning.
Reinforcement learning is a Machine Learning technique in which a machine learns to choose the right action based on the results of its previous actions in similar circumstances.
Mechanism of Reinforcement Learning
- Reinforcement learning works on the principle of feedback and improvement.
- In reinforcement learning, we do not use labeled datasets to train the model.
- Instead, the machine takes certain steps on its own, analyzes the feedback, and then tries to improve its next step to get the best outcome.
Reinforcement Learning Process
Reinforcement learning is the art of making optimal decisions for a machine from its experiences. Breaking it down, the reinforcement learning process involves the following steps (a minimal code sketch of this loop follows the list):
- Observing the current circumstances (the state of the environment)
- Deciding on an action by applying some strategy
- Performing the action
- Receiving a reward or a penalty
- Learning from these experiences and refining the strategy
- Iterating this process until the machine learns the optimal strategy
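The following short Python sketch makes this loop concrete. The two-armed bandit "environment", its reward probabilities, and the exploration rate are made-up values used purely for illustration; they are not part of the example discussed later in this post:
import random

def environment(action):
    # Hypothetical environment: action 1 pays off more often than action 0
    return 1 if random.random() < (0.3 if action == 0 else 0.7) else 0

value_estimates = [0.0, 0.0]   # the machine's current estimate of each action's value
counts = [0, 0]

for step in range(1000):
    # Decide an action: explore occasionally, otherwise exploit the best-looking action
    if random.random() < 0.1:
        action = random.choice([0, 1])
    else:
        action = value_estimates.index(max(value_estimates))

    reward = environment(action)   # perform the action and obtain a reward (or nothing)
    counts[action] += 1
    # Improve the approach using this experience (incremental average of observed rewards)
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print(value_estimates)   # the estimate for action 1 should end up noticeably higher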
Let’s now understand the theory behind reinforcement learning with the help of a use case to make the picture clearer.
You have a chessboard in front of you, and you have no idea how to play chess. The game has started, and you have to make a move. Suppose you randomly pick up a Bishop (the RL agent) and move it straight ahead.
But that's a wrong move! A Bishop can only move diagonally, forward or backward, along squares of a single color, provided the path is empty. So, the learning outcome from this move is that next time you would probably try to make a valid move. In a similar way, you would iteratively keep gaining knowledge of the moves from the feedback you receive and learn to make the right ones.
This is nothing but reinforcement learning. With the help of this reinforcement learning example, we have understood the theory behind it. Now, we will look into the algorithm that is used to implement reinforcement learning.
How do we implement Reinforcement Learning?
So far, we have discussed the theoretical aspects of reinforcement learning. But, the question that arises is, how do we implement reinforcement learning on a model? Is there any method or reinforcement learning algorithm to do so?
Yes! There is an algorithm named Q-learning that helps the RL (reinforcement learning) agent decide the actions it needs to take in different circumstances.
How does Q-learning work?
The Q-learning technique acts as a cheat sheet for the reinforcement learning agent. It enables the RL agent to use the feedback from the environment to learn the best actions it can take in different circumstances.
Q-learning makes use of Q-values to track and improve the performance of the RL agent. Initially, the Q-values are set to any arbitrary value. When the RL agent performs different actions and receives the feedback (a reward or a punishment) for the actions, the Q-values are updated.
To update the Q-values, we use the following Bellman equation:
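Q(S, A) = R(S, A) + γ · max_{A′} Q(S′, A′)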
Here,
S: the current state of the RL agent
A: the current action to be performed
S′: the next state, where the agent lands after performing the action
A′: the best next action, chosen using the current Q-values
R: the immediate reward received from the environment in response to the action performed
α: the learning rate. Its value is greater than 0 and less than or equal to 1 (0 < α ≤ 1), and it controls how strongly the Q-values are updated in each iteration
γ: the discount factor. Its value lies between 0 and 1 (0 ≤ γ ≤ 1), and it determines the significance of future rewards. A value of γ close to 1 gives weight to long-term rewards, while γ = 0 means the RL agent considers only immediate rewards
The above Bellman equation states that the Q-value obtained by being in state S and performing action A is the immediate reward R(S, A) plus the highest Q-value achievable from the next state S′, discounted by γ.
Also, Q(S′, A′) in turn depends on Q(S″, A″), and so on, as shown in the equation below:
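Q(S, A) = R(S, A) + γ R(S′, A′) + γ² R(S″, A″) + …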
Adjusting the value of γ decreases or increases the contribution of these future rewards.
Since the Bellman equation is recursive, we can start with arbitrary guesses for all the Q-values. As the agent gains experience, the Q-values converge to the optimal strategy.
Practically, it is implemented as follows:
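Q_{t+1}(S, A) = Q_t(S, A) + α · [ R(S, A) + γ · max_{A′} Q_t(S′, A′) − Q_t(S, A) ]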
where t denotes the iteration index.
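As a concrete illustration, here is a single application of this update to a tiny Q-table. The number of states and actions, the observed transition, and the values of α and γ are hypothetical numbers chosen only for this example:
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))               # arbitrary initial Q-values

alpha, gamma = 0.5, 0.9                           # hypothetical learning rate and discount factor
state, action, reward, next_state = 2, 1, 5, 3    # one observed transition

# One application of the update rule above
Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
print(Q[state, action])   # 2.5 -- the value moves halfway towards the target of 5.0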
We can also use an ε-greedy policy to choose actions based on the Q-values: with probability 1 − ε, the action with the largest Q-value is chosen, and with probability ε, an action is chosen at random.
Now that we have covered the theoretical concepts, let's implement Q-learning in Python.
Implementing Q-learning for Reinforcement Learning in Python
For implementing reinforcement learning algorithms such as Q-learning, we use the OpenAI Gym environments available in Python (the gym package can typically be installed with pip install gym).
Now, let’s look at the steps to implement Q-learning:
Step 1: Importing Libraries
import gym
import itertools
import matplotlib
import matplotlib.style
import numpy as np
import pandas as pd
import sys
from collections import defaultdict

# Helper scripts that must be available locally: windy_gridworld defines the
# Windy Gridworld environment, and plotting provides the EpisodeStats record
# and plotting utilities used below.
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')
Step 2: Creating the Gym Environment
env = WindyGridworldEnv()
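As a quick sanity check, we can interact with this environment directly before adding any learning. The snippet below assumes the classic Gym interface used throughout this post, where reset() returns the initial state and step() returns a (next_state, reward, done, info) tuple:
state = env.reset()                     # start a new episode
for _ in range(5):
    action = env.action_space.sample()  # pick a random action
    next_state, reward, done, _ = env.step(action)
    print(state, action, reward, next_state)
    state = next_state
    if done:
        break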
Step 3: Creating the Epsilon-Greedy Policy
def createEpsilonGreedyPolicy(Q, epsilon, n_action):
    # Returns a function that maps a state to action probabilities
    # following an epsilon-greedy strategy over the current Q-values.
    def policyFunction(state):
        # Give every action a small exploration probability of epsilon / n_action
        Action_probabilities = np.ones(n_action, dtype=float) * epsilon / n_action
        # Give the action with the highest Q-value the remaining (1 - epsilon) mass
        best_step = np.argmax(Q[state])
        Action_probabilities[best_step] += (1.0 - epsilon)
        return Action_probabilities
    return policyFunction
Step 4: Building the Q-learning Model
def qLearning(env, num_episodes, discount_factor=1.0, alpha=0.6, epsilon=0.1):
    # Q maps every state to an array of Q-values, one per action (initialized to zeros)
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Tracking the important statistics (length and total reward of each episode)
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Creating the epsilon-greedy policy function from the current Q-values
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    for ith_episode in range(num_episodes):
        state = env.reset()

        for t in itertools.count():
            # Choose an action according to the epsilon-greedy probabilities
            action_probabilities = policy(state)
            action = np.random.choice(np.arange(len(action_probabilities)),
                                      p=action_probabilities)

            # Perform the action and observe the feedback from the environment
            next_state, reward, done, _ = env.step(action)

            # Update the statistics for this episode
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # Q-learning update: move Q(state, action) towards the TD target
            best_next_step = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_step]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            if done:
                break
            state = next_state

    return Q, stats
Step 5: Training the Model
Q, stats = qLearning(env, 1000)
Step 6: Plotting the Visualization Graph
plotting.plot_episode_stats(stats)
From the resulting graph, we can infer that the reward per episode increases over time. This rising reward per episode shows that the RL agent learns to take the right actions by maximizing its total reward.
This covers what reinforcement learning is and how it can be implemented with Q-learning in Python.