Proximal Policy Optimization (PPO) updates the policy conservatively, so that each policy update does not degrade performance.
We use the KL divergence between the updated policy and the old policy to measure how far the policy has moved.
This makes it a constrained optimization problem: we change the policy to maximize performance, subject to the constraint that the KL divergence between the new and old policy cannot exceed some predefined threshold.
Trust Region Policy Optimization (TRPO) enforces this KL constraint explicitly during the update and solves for an appropriate step size.
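In symbols, the constrained problem looks like this (a standard formulation, where $\hat{A}_t$ is the advantage estimate and $\delta$ is the trust-region threshold):

$$\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[D_\text{KL}\!\left(\pi_{\theta_\text{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$

The ratio $\pi_\theta / \pi_{\theta_\text{old}}$ is the same policy ratio that PPO later operates on directly.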
Using PPO, we can simplify the problem by turning the KL divergence from a hard constraint into a penalty term, quite similar to an L1 or L2 weight penalty. The clipped variant of PPO goes further and removes the need to compute the KL divergence altogether: it hard-clips the policy ratio to a small range around 1.0, where a ratio of 1.0 means the new policy is identical to the old one.
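A minimal sketch of that clipped surrogate objective (the function name and the example inputs are illustrative, not from any particular library):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective used by PPO.

    ratio     -- pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage -- advantage estimate for each sampled action
    eps       -- clip range; the ratio is kept within [1 - eps, 1 + eps]
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum makes the objective pessimistic:
    # a large change in the policy ratio cannot increase the objective
    # beyond its clipped value, so there is no incentive to move far
    # from the old policy.
    return np.minimum(unclipped, clipped).mean()

ratios = np.array([1.5, 0.5, 1.0])      # new/old probability ratios
advantages = np.array([1.0, 1.0, 2.0])  # advantage estimates
objective = ppo_clip_objective(ratios, advantages)
```

In this example the first ratio (1.5) is clipped down to 1.2, the second (0.5) keeps its lower unclipped value 0.5 because of the minimum, and the third (1.0) passes through unchanged.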
Hope this answer helps.