I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.
Let's consider the standard formulation of Q-Learning:

Q(S, A) = Q(S, A) + alpha * [R + gamma * max_A' Q(S', A') - Q(S, A)]
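For reference, here's a minimal sketch of how I understand that update in a tabular setting (the dict-based Q-table, the variable names, and the alpha/gamma values are just illustrative assumptions, not my actual code):

```python
from collections import defaultdict

# Minimal sketch of a tabular Q-learning update. The dict-based table and the
# alpha/gamma values are illustrative assumptions, not a definitive implementation.
q_table = defaultdict(float)   # maps (state, action) -> estimated Q-value
alpha, gamma = 0.1, 0.9

def q_update(state, action, reward, next_state, next_actions):
    # Q(S, A) = Q(S, A) + alpha * [R + gamma * max_A' Q(S', A') - Q(S, A)]
    best_next = max(q_table[(next_state, a)] for a in next_actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
```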
Let's assume there's a state K with two possible actions, A and A', which award our agent rewards R and R' respectively.
If we follow an almost-totally-greedy approach (say, epsilon = 0.1), I'll first randomly choose one of the actions, for instance A. The next time, I'll probably (90% of the time) choose A again, so Q(K, A) keeps growing and growing. Even if by chance I try A', since its reward is probably of the same magnitude as that of A, we've gotten into a situation where it's practically impossible to "recover" from that first guess for the rest of the learning process.
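To make the scenario concrete, here's roughly the kind of loop I have in mind for state K (epsilon, alpha, and the reward values are made up for illustration):

```python
import random

epsilon = 0.1                          # exploration probability
alpha = 0.1                            # learning rate
rewards = {"A": 1.0, "A_prime": 1.1}   # two actions with rewards of similar magnitude
q_k = {"A": 0.0, "A_prime": 0.0}       # Q-values for state K only

def choose_action():
    # Epsilon-greedy: explore uniformly with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(list(q_k))
    return max(q_k, key=q_k.get)

for step in range(1000):
    a = choose_action()
    # Simplified one-step update for K alone, ignoring any bootstrapped successor term.
    q_k[a] += alpha * (rewards[a] - q_k[a])
```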
I guess this must not be the case; otherwise, the agent would basically not learn, and would just follow a simple recipe: do everything the same way you did it the first time.
Am I missing something? I know I can tweak the alpha value (generally, decreasing it over time), but that doesn't seem to improve the situation at all.
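For completeness, the kind of alpha schedule I mean is something like this (the decay form and the constants are just one common choice, picked for illustration):

```python
def alpha_schedule(step, alpha0=0.5, decay=0.001):
    # Harmonic-style decay: the learning rate shrinks as training progresses.
    # alpha0 and decay are illustrative values, not tuned for any particular task.
    return alpha0 / (1.0 + decay * step)
```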