**Understanding Reinforcement Learning**

We are living in the 21st century, the era of automation. Machine Learning has become a driving force in the field of automation. The automated machines that we create using Machine Learning techniques carry out repetitive tasks to reduce human effort and time.

However, real-world tasks are often too complex for a machine to execute from a fixed script, and it is impractical to program every course of action in advance. This creates the need for a technique that enables the machine to learn and improve by itself. This Machine Learning technique is called **reinforcement learning**.

Reinforcement learning in Machine Learning is a technique where a machine learns to determine the right step based on the results of the previous steps in similar circumstances.

**Mechanism of Reinforcement Learning**

- Reinforcement learning works on the principle of feedback and improvement.
- In reinforcement learning, we do not train the model on a pre-collected, labeled dataset.
- Instead, the machine takes certain steps on its own, analyzes the feedback, and then tries to improve its next step to get the best outcome.

**Reinforcement Learning Process**

Reinforcement learning is the art of making optimal decisions for a machine based on experience. Broken down further, the reinforcement learning process includes the following steps:

- Observing the current situation (the state)
- Deciding on an action by applying some strategy
- Performing the action
- Obtaining a reward or punishment
- Exploring new options with the help of past experience and improving the approach
- Repeating this cycle until the machine has learned properly
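The steps above can be sketched as a small loop. This is only a toy illustration, not a full reinforcement learning algorithm: the two-action environment in `step`, the exploration probability, and all names are assumptions made up for the example.

```python
import random

def step(action):
    """Hypothetical environment: action 1 is rewarded, action 0 is punished."""
    return 1.0 if action == 1 else -1.0

def run(num_steps=200, explore_prob=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0, 0.0]  # running average reward observed for each action
    counts = [0, 0]      # how often each action has been tried
    for _ in range(num_steps):
        # Decide on an action: occasionally explore, otherwise use the best estimate
        if rng.random() < explore_prob:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: values[a])
        # Perform the action and obtain a reward or punishment
        reward = step(action)
        # Improve the estimate for that action using the feedback
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = run()
print(values, counts)
```

After a few steps the estimate for the punished action drops below the other one, so the loop ends up choosing the rewarded action most of the time.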

Let’s now understand the theory behind reinforcement learning with the help of a use case to make the picture clearer.

Suppose you have a chessboard in front of you and you have no idea how to play chess. The game has started, and you have to make a move. You randomly pick up a *Bishop* (the **RL agent**) and move it in a straight line.

But that's a wrong move! A *Bishop* can only move *diagonally*, along either the white or the black squares, backward or forward, provided the way is clear. So, the lesson from this move is that next time you would probably try a different, legal move. In a similar way, you would keep iterating, building up a thorough knowledge of the moves from the feedback you receive.

This is nothing but reinforcement learning. With the help of this reinforcement learning example, we have understood the theory behind it. Now, we will look into the algorithm that is used to implement reinforcement learning.

**How do we implement Reinforcement Learning?**

So far, we have discussed the theoretical aspects of reinforcement learning. But, the question that arises is, how do we implement reinforcement learning on a model? Is there any method or a reinforcement learning algorithm to do so?

Yes! There is an algorithm named **Q-learning** that helps the RL (reinforcement learning) agent decide the actions it needs to take in different circumstances.

**How does Q-learning work?**

The Q-learning technique acts as a crib sheet for the reinforcement learning agent. It enables the RL agent to use the feedback of the environment to learn the best actions it can take in different circumstances.

Q-learning makes use of **Q-values** to track and improve the performance of the RL agent. Initially, the Q-values are set to any arbitrary value. When the RL agent performs different actions and receives the feedback (a reward or a punishment) for the actions, the Q-values are updated.
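As a rough illustration of how Q-values can be stored, the sketch below keeps one list of action values per state in a dictionary. The state name, the number of actions, and the sample value are made up for the example.

```python
from collections import defaultdict

n_actions = 2  # hypothetical number of actions

# Every state seen for the first time starts with arbitrary Q-values (zeros here)
Q = defaultdict(lambda: [0.0] * n_actions)

# Suppose feedback has raised the estimate for action 1 in state "s2" (made-up number)
Q["s2"][1] = 0.5

# The agent's current best guess in "s2" is the action with the largest Q-value
best_action = max(range(n_actions), key=lambda a: Q["s2"][a])
print(dict(Q), best_action)
```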

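To update the Q-values, we use the following update rule, which is based on the Bellman equation:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \right]$$

The above equation can also be written as follows:

$$Q(S, A) \leftarrow (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') \right]$$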

Here,

**S**: The present condition (**state**) of the RL agent

**A**: The present **action** to be performed

**S′**: The subsequent state in which the agent ends up

**A′**: The next most suitable action, chosen using the present Q-values

**R**: The immediate **reward** received from the environment in response to the action performed

**α**: The **learning rate**. Its value is greater than 0 and less than or equal to 1 (0 < α ≤ 1). It controls how strongly the Q-values are updated in each iteration

**γ**: The **discount factor**. Its value lies between 0 and 1 (0 ≤ γ ≤ 1). It determines the significance of future rewards. A high value of γ (close to 1) makes the agent value long-term rewards, while a value of 0 means that the RL agent considers only immediate rewards
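With these symbols in hand, a single update can be worked through numerically. The sketch below assumes the standard Q-learning update, in which the old Q-value moves a fraction α of the way toward the target R + γ · max Q(S′, A′). The function name and all numbers are made up for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Return the new Q(S, A) after one Q-learning update."""
    td_target = reward + gamma * max_q_next   # R + gamma * max Q(S', A')
    return q_sa + alpha * (td_target - q_sa)  # move a fraction alpha toward the target

# Made-up numbers: Q(S, A) = 2.0, R = 1.0, best next Q-value = 4.0
new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=0.5, gamma=0.9)
print(round(new_q, 3))  # 3.3

# The same result written as a weighted average of the old value and the target
alt_q = (1 - 0.5) * 2.0 + 0.5 * (1.0 + 0.9 * 4.0)
```

Here the target is 1.0 + 0.9 × 4.0 = 4.6, and with α = 0.5 the Q-value moves halfway from 2.0 to 4.6.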

The Bellman equation states that the Q-value of being in state S and taking action A is the immediate reward R(S, A) plus the discounted highest Q-value attainable from the next state S′:

$$Q(S, A) = R(S, A) + \gamma \max_{A'} Q(S', A')$$

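Also, Q(S′, A′) in turn depends on Q(S″, A″), and so on, as shown in the equation below:

$$Q(S, A) = R(S, A) + \gamma \max_{A'} \left[ R(S', A') + \gamma \max_{A''} \left[ R(S'', A'') + \cdots \right] \right]$$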

Adjusting the value of γ decreases or increases the contribution of future rewards.
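To make the effect of γ concrete, here is a small sketch that sums the same reward sequence under different discount factors. The reward sequence and the function name are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma ** k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total

rewards = [1.0, 1.0, 1.0, 1.0]  # made-up sequence: a reward of 1 at every step

for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# 0.0 1.0    (only the immediate reward counts)
# 0.5 1.875  (later rewards count progressively less)
# 1.0 4.0    (all rewards count equally)
```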

Since the Bellman equation is recursive, we can start from arbitrary guesses for all the Q-values. As the agent gains experience, the estimates converge toward the optimal strategy.

In practice, it is implemented as follows:

$$Q_{t+1}(S, A) = Q_t(S, A) + \alpha \left[ R + \gamma \max_{A'} Q_t(S', A') - Q_t(S, A) \right]$$

where **t** denotes the iteration.

We can also use an **ε-greedy** policy to choose the action, based on the current Q-values.

With probability **1 − ε**, the action with the **largest Q-value** is chosen. With probability **ε**, an action is chosen at random.

So far, we have looked at the theoretical concepts. Now, in this blog on ‘What is Reinforcement Learning?’, we will implement Q-learning in Python.

**Implementing Q-learning for Reinforcement Learning in Python**

To implement reinforcement learning algorithms such as Q-learning, we use the OpenAI Gym toolkit available in Python.

Now, let’s look at the **steps to implement Q-learning**:

**Step 1:** Importing Libraries

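```
import gym
import itertools
import matplotlib
import matplotlib.style
import numpy as np
import pandas as pd
import sys

from collections import defaultdict

# Local helper modules, not part of gym: windy_gridworld.py and plotting.py
# must be importable from the working directory
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')
```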

**Step 2:** Creating the Gym Environment

```
env = WindyGridworldEnv()
```

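**Step 3:** Defining the ε-Greedy Policy

```
def createEpsilonGreedyPolicy(Q, epsilon, n_action):
    """Return a function that maps a state to epsilon-greedy action probabilities."""
    def policyFunction(state):
        # Spread epsilon evenly across all actions (exploration)
        action_probabilities = np.ones(n_action, dtype=float) * epsilon / n_action
        # Give the remaining 1 - epsilon to the action with the largest Q-value
        best_step = np.argmax(Q[state])
        action_probabilities[best_step] += (1.0 - epsilon)
        return action_probabilities
    return policyFunction
```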
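As a quick check of what this policy function returns, the sketch below calls it on a hand-made Q-table. The function is repeated so that the snippet runs on its own; the state name and the Q-values are made up for illustration.

```python
from collections import defaultdict

import numpy as np

def createEpsilonGreedyPolicy(Q, epsilon, n_action):
    def policyFunction(state):
        action_probabilities = np.ones(n_action, dtype=float) * epsilon / n_action
        best_step = np.argmax(Q[state])
        action_probabilities[best_step] += (1.0 - epsilon)
        return action_probabilities
    return policyFunction

n_action = 4
Q = defaultdict(lambda: np.zeros(n_action))
Q["some_state"][2] = 1.0  # hypothetical: action 2 currently looks best

policy = createEpsilonGreedyPolicy(Q, epsilon=0.1, n_action=n_action)
probs = policy("some_state")
print(probs)
```

With ε = 0.1 and four actions, each action gets 0.1 / 4 = 0.025, and the greedy action gets an extra 0.9, so the probabilities sum to 1. Because the policy function reads from `Q` each time it is called, later updates to the Q-values change the probabilities it returns.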

**Step 4:** Building the Q-learning Model

```
def qLearning(env, num_episodes, discount_factor=1.0,
              alpha=0.6, epsilon=0.1):
    """Q-learning: follow an epsilon-greedy policy while learning the greedy one."""
    # Q maps each state to an array of action values, initialized to zeros
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Tracking the important statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Creating function for an epsilon greedy policy
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    for ith_episode in range(num_episodes):
        # Written against the classic Gym API: reset() returns the state
        # and step() returns a 4-tuple
        state = env.reset()

        for t in itertools.count():
            # Sample an action from the epsilon-greedy distribution
            action_probabilities = policy(state)
            action = np.random.choice(
                np.arange(len(action_probabilities)),
                p=action_probabilities)

            # Take the action and observe the reward and the next state
            next_state, reward, done, _ = env.step(action)

            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # Update Q(S, A) toward the reward plus the best next Q-value
            best_next_step = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_step]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            if done:
                break
            state = next_state

    return Q, stats
```

**Step 5:** Training the Model

```
Q, stats = qLearning(env, 1000)
```

**Step 6:** Plotting the Visualization Graph

```
plotting.plot_episode_stats(stats)
```

From the resulting graph, we can infer that the reward per episode increases as training progresses. The maximum reward per episode shows that the RL agent learns to take the right actions by maximizing its total reward.
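The same update loop can also be tried without Gym or the helper modules. The sketch below is a minimal, self-contained variant on a made-up five-state corridor, where the agent starts at the left end and receives a reward of -1 per step until it reaches the right end. The environment, the function names, and the hyperparameters are assumptions chosen for illustration, not part of the implementation above.

```python
import random

N_STATES = 5            # hypothetical corridor with states 0..4
GOAL = N_STATES - 1     # the episode ends at the right end
MOVES = (-1, +1)        # action 0 moves left, action 1 moves right

def step(state, action):
    """Move along the corridor; every step costs -1 until the goal is reached."""
    next_state = min(max(state + MOVES[action], 0), GOAL)
    return next_state, -1.0, next_state == GOAL

def greedy_action(q_row):
    """Index of the largest Q-value in a row (first one on ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])

def train(num_episodes=200, alpha=0.6, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                action = rng.randrange(len(MOVES))
            else:
                action = greedy_action(Q[state])
            next_state, reward, done = step(state, action)
            # Q-learning update toward reward + gamma * best next Q-value
            td_target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q

Q = train()
learned_policy = [greedy_action(row) for row in Q[:GOAL]]
print(learned_policy)
```

After training, the greedy action in every non-goal state should be 1 (move right), and the Q-value of moving right from state 0 should approach -4, the negative of the number of steps to the goal.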

That covers reinforcement learning and its implementation.
