I know the basics of reinforcement learning, but which terms do I need to understand to be able to read the PPO paper on arXiv?

What is the roadmap to learn and use PPO?

**Proximal Policy Optimization (PPO)** updates the policy conservatively, so that a single policy update does not degrade its performance too much.

We use the **KL divergence** between the updated policy and the old policy to measure how much the policy has changed.
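As a rough illustration of that measurement, here is a minimal sketch that computes the KL divergence between two discrete action distributions. The function name `kl_divergence` and the example probabilities are made up for illustration, and the sketch assumes every action has non-zero probability under the new policy.

```python
import math

def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions.

    Assumes both lists sum to 1 and that p_new has no zero entries.
    """
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0)

# Hypothetical action probabilities for a single state.
old_policy = [0.5, 0.3, 0.2]
new_policy = [0.4, 0.4, 0.2]

kl_same = kl_divergence(old_policy, old_policy)     # identical policies
kl_changed = kl_divergence(old_policy, new_policy)  # policy has moved
print(kl_same, kl_changed)
```

The value is zero when the two policies are identical and grows as the new policy moves away from the old one.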

This is a **constrained optimization problem**: we change the policy to maximize performance, subject to the constraint that the **KL divergence** between the new and old policy cannot exceed some predefined threshold.
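Written out, the constrained problem looks roughly like the following. The notation is an assumption chosen for illustration and does not come from the text above: $\pi_\theta$ is the new policy, $\pi_{\theta_{\text{old}}}$ the old one, $\hat{A}_t$ an advantage estimate, and $\delta$ the predefined threshold.

```latex
\max_{\theta} \;
\hat{\mathbb{E}}_t \left[
  \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t
\right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t \left[
  \mathrm{KL}\!\left[ \pi_{\theta_{\text{old}}}(\cdot \mid s_t), \, \pi_\theta(\cdot \mid s_t) \right]
\right] \le \delta
```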

**Trust Region Policy Optimization (TRPO)** enforces this KL constraint during the update and finds a suitable step size (learning rate) for the problem.
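To give a feel for how a step size can be chosen under a KL constraint, here is a toy sketch of a backtracking search on a one-parameter Bernoulli policy. It is only an illustration under stated assumptions, not TRPO itself: the actual algorithm also computes the update direction and checks that the objective improves. The names `backtracking_step`, `max_kl` and `shrink` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_kl(p_old, p_new):
    """KL divergence between two Bernoulli distributions."""
    return (p_old * math.log(p_old / p_new)
            + (1 - p_old) * math.log((1 - p_old) / (1 - p_new)))

def backtracking_step(theta, direction, max_kl=0.01, shrink=0.5, max_tries=20):
    """Shrink the step until the new policy stays within max_kl of the old one."""
    p_old = sigmoid(theta)
    step = 1.0
    for _ in range(max_tries):
        candidate = theta + step * direction
        if bernoulli_kl(p_old, sigmoid(candidate)) <= max_kl:
            return candidate, step
        step *= shrink  # step was too large, try a smaller one
    return theta, 0.0   # no acceptable step found, keep the old policy

# Toy policy: probability of action 1 is sigmoid(theta).
new_theta, accepted_step = backtracking_step(theta=0.0, direction=2.0)
print(new_theta, accepted_step)
```

The point of the sketch is that the step size is not fixed in advance: it is whatever keeps the new policy inside the allowed KL region.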

PPO simplifies the problem by turning the KL divergence from a hard constraint into a penalty term, much like an L1 or L2 weight penalty. A further variant of PPO removes the need to compute the KL divergence altogether by hard-clipping the policy ratio (the probability of an action under the new policy divided by its probability under the old policy) to a small range around 1.0, where 1.0 means the new policy is the same as the old one.
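The clipping idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a full training loop: the function names are hypothetical, `epsilon=0.2` is just a commonly used clip range, and the ratios and advantages are made-up numbers.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped objective: the smaller of the unclipped and clipped terms."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Average of the per-sample clipped terms over a batch (to be maximized)."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Hypothetical batch: ratio = pi_new(a|s) / pi_old(a|s), plus advantage estimates.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
objective = ppo_clip_objective(ratios, advantages)
print(objective)
```

Taking the minimum means a sample stops contributing extra reward once its ratio leaves the range around 1.0 in the direction that would improve the objective, so there is no incentive to push the new policy far from the old one.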

Hope this answer helps.