I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.
Let's consider the standard formulation of Q-Learning:

Q(S, A) = Q(S, A) + alpha * [R + gamma * max_A' Q(S', A') - Q(S, A)]
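For reference, here's a minimal sketch of how I understand that update in a tabular setting (the dict-based Q-table, the variable names, and the alpha/gamma values are just illustrative assumptions, not my actual code):

```python
from collections import defaultdict

# Minimal sketch of a tabular Q-learning update. The dict-based table and the
# alpha/gamma values are illustrative assumptions, not a definitive implementation.
q_table = defaultdict(float)   # maps (state, action) -> estimated Q-value
alpha, gamma = 0.1, 0.9

def q_update(state, action, reward, next_state, next_actions):
    # Q(S, A) = Q(S, A) + alpha * [R + gamma * max_A' Q(S', A') - Q(S, A)]
    best_next = max(q_table[(next_state, a)] for a in next_actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
```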
Let's assume there's a state K with two possible actions, A and A', which award our agent rewards R and R' respectively.
If we follow an almost-totally-greedy approach (say, epsilon = 0.1), I'll first randomly choose one of the actions, for instance A. The next time, I'll probably (90% of the time) choose A again, so Q(K, A) keeps growing and growing. Even if by chance I try A', since its reward is probably of the same magnitude as that of A, we've gotten into a situation where it's practically impossible to "recover" from that first guess for the rest of the learning process.
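To make the scenario concrete, here's roughly the kind of loop I have in mind for state K (epsilon, alpha, and the reward values are made up for illustration):

```python
import random

epsilon = 0.1                          # exploration probability
alpha = 0.1                            # learning rate
rewards = {"A": 1.0, "A_prime": 1.1}   # two actions with rewards of similar magnitude
q_k = {"A": 0.0, "A_prime": 0.0}       # Q-values for state K only

def choose_action():
    # Epsilon-greedy: explore uniformly with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(list(q_k))
    return max(q_k, key=q_k.get)

for step in range(1000):
    a = choose_action()
    # Simplified one-step update for K alone, ignoring any bootstrapped successor term.
    q_k[a] += alpha * (rewards[a] - q_k[a])
```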
I guess this must not be the case; otherwise, the agent would basically not learn, and would just follow a simple recipe: do everything the same way you did it the first time.
Am I missing something? I know I can tweak the alpha value (generally, decreasing it over time), but that doesn't seem to improve the situation at all.
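For completeness, the kind of alpha schedule I mean is something like this (the decay form and the constants are just one common choice, picked for illustration):

```python
def alpha_schedule(step, alpha0=0.5, decay=0.001):
    # Harmonic-style decay: the learning rate shrinks as training progresses.
    # alpha0 and decay are illustrative values, not tuned for any particular task.
    return alpha0 / (1.0 + decay * step)
```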