
In reinforcement learning, what is the difference between policy iteration and value iteration?

As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the value of that policy.

My doubt is: if you are selecting a random policy π in policy iteration, how is it guaranteed to converge to the optimal policy, even if we choose several random policies?


Policy iteration algorithms: These algorithms manipulate the policy directly, rather than finding it indirectly via the optimal value function. Starting from an arbitrary (even random) policy, policy iteration first computes the value function of that policy (policy evaluation), then derives a new policy that is greedy with respect to that value function (policy improvement), and repeats. Each new policy is guaranteed to be at least as good as the previous one, and strictly better unless the previous one was already optimal, so the random starting point does not matter: the sequence of policies converges to the optimal policy.
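To make the evaluate-then-improve loop concrete, here is a minimal sketch of policy iteration in Python. The tiny two-state MDP (the transition table `P`, discount `gamma`) is made up purely for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP (invented for illustration).
# P[s][a] = list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
states, actions = [0, 1], [0, 1]

def q_value(s, a, V):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_evaluation(policy, tol=1e-8):
    """Iteratively compute V^pi for a fixed deterministic policy."""
    V = np.zeros(len(states))
    while True:
        delta = 0.0
        for s in states:
            v = q_value(s, policy[s], V)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration():
    policy = {s: 0 for s in states}   # arbitrary starting policy
    while True:
        V = policy_evaluation(policy)             # evaluation step
        stable = True
        for s in states:                          # greedy improvement step
            best = max(actions, key=lambda a: q_value(s, a, V))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                # no change => optimal
            return policy, V

policy, V = policy_iteration()
print(policy)
```

Even though the initial policy (always take action 0) is poor, the improvement step replaces it with the greedy policy after a single evaluation, and the loop terminates once no state's action changes.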

Value iteration algorithm: You start with an arbitrary (random) value function and repeatedly apply the Bellman optimality backup to obtain a new, improved value function, iterating until it converges to the optimal value function. The optimal policy is then extracted greedily from that optimal value function.

You can say that both algorithms share the same working principle: they alternate (or interleave) evaluating values and improving behavior. Both methods are special cases of generalized policy iteration.