Proximal Policy Optimization (PPO) updates the policy conservatively, so that each policy update does not degrade performance.
We use the KL divergence between the updated policy and the old policy to measure how far the policy has moved.
This makes it a constrained optimization problem: we change the policy to maximize performance, subject to the constraint that the KL divergence between the new and old policy cannot exceed some predefined threshold.
Trust Region Policy Optimization (TRPO) enforces this KL constraint explicitly during the update and solves for an appropriate step size.
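In symbols, the constrained problem looks like this (a standard formulation, where $\hat{A}_t$ is the advantage estimate and $\delta$ is the trust-region threshold):

$$\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[D_\text{KL}\!\left(\pi_{\theta_\text{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$

The ratio $\pi_\theta / \pi_{\theta_\text{old}}$ is the same policy ratio that PPO later operates on directly.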
Using PPO, we can simplify the problem by turning the KL divergence from a hard constraint into a penalty term, quite similar to an L1 or L2 weight penalty. The clipped variant of PPO goes further and removes the need to compute the KL divergence altogether: it hard-clips the policy ratio to a small range around 1.0, where a ratio of 1.0 means the new policy is identical to the old one.
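A minimal sketch of that clipped surrogate objective (the function name and the example inputs are illustrative, not from any particular library):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective used by PPO.

    ratio     -- pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage -- advantage estimate for each sampled action
    eps       -- clip range; the ratio is kept within [1 - eps, 1 + eps]
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum makes the objective pessimistic:
    # a large change in the policy ratio cannot increase the objective
    # beyond its clipped value, so there is no incentive to move far
    # from the old policy.
    return np.minimum(unclipped, clipped).mean()

ratios = np.array([1.5, 0.5, 1.0])      # new/old probability ratios
advantages = np.array([1.0, 1.0, 2.0])  # advantage estimates
objective = ppo_clip_objective(ratios, advantages)
```

In this example the first ratio (1.5) is clipped down to 1.2, the second (0.5) keeps its lower unclipped value 0.5 because of the minimum, and the third (1.0) passes through unchanged.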
Hope this answer helps.