What difference does a large or small gamma value make to the algorithm? As I see it, as long as it is neither 0 nor 1, it should work exactly the same. On the other hand, whatever gamma I choose, the Q values seem to get very close to zero really quickly (I'm getting values on the order of 10^-300 in just a quick test). Given that problem, how do people usually plot Q values? I'm plotting (x, y, best Q value for that state). I'm trying to get around it with logarithms, but even then it feels kind of awkward.
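To make the gamma part concrete, here is a minimal sketch of how gamma weights a reward that lies some number of steps ahead, plus the kind of log transform I've been trying for plotting. The names `discount_weight` and `log10_q` and the choice of floor value are my own assumptions for illustration. This is not a diagnosis of why my values get so small.

```python
import math
import sys

def discount_weight(gamma, steps):
    """Weight that a reward `steps` transitions ahead receives in the discounted return."""
    return gamma ** steps

def log10_q(q, floor=sys.float_info.min):
    """log10 of a non-negative Q value, clamped at `floor` so that zeros stay plottable.

    The floor (smallest positive normal float) is an arbitrary choice for this sketch.
    """
    return math.log10(max(q, floor))

# A reward 10 steps away under two different gammas
print(discount_weight(0.9, 10))
print(discount_weight(0.5, 10))

# A reward 1000 steps away with gamma = 0.5 is tiny but still representable
print(discount_weight(0.5, 1000))
print(log10_q(discount_weight(0.5, 1000)))
```

Plotting `log10_q(best_q)` instead of the raw value keeps very small Q values distinguishable, but it only works as written when the Q values are non-negative.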
Also, I don't get the reason for having an alpha parameter in the Q-learning update function. It basically sets the magnitude of the update we make to the Q value. My understanding is that it is usually decreased over time. What is the point of decreasing it over time? Why should an update at the beginning carry more weight than one made 1000 episodes later?
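For reference, this is the update I'm talking about, written as a minimal tabular sketch. The dict-based Q table, the function names, and the `alpha0 / (1 + visits)` schedule are assumptions for illustration. That schedule is just one common way to decrease alpha, not necessarily the right one.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning update; returns the new Q value for (state, action)."""
    # Value of the best action available in the next state
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # alpha scales how far the old estimate moves toward the target
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

def decayed_alpha(visits, alpha0=1.0):
    """Hypothetical schedule: alpha shrinks as a state/action pair is visited more often."""
    return alpha0 / (1 + visits)

Q = defaultdict(float)  # unseen entries default to 0.0
actions = ["left", "right"]
print(q_update(Q, 0, "left", 1.0, 1, actions, alpha=0.5, gamma=0.9))
print(q_update(Q, 0, "left", 1.0, 1, actions, alpha=0.5, gamma=0.9))
```

With alpha = 1 the old estimate is replaced by the target outright, and with a small alpha each new sample only nudges it.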
Also, I was thinking that a good way to explore the state space, whenever the agent doesn't take the greedy action, would be to pick any state that still has a zero Q value (which means, at least most of the time, a state that has never been visited), but I don't see that mentioned anywhere in the literature. Are there any downsides to this? I know it can't be used with (at least some) generalization functions.
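In code, what I have in mind is roughly the following. It is a sketch that assumes a dict-based Q table keyed by (state, action), and it treats an entry that is still exactly 0.0 as "never tried". The function name and the `rng` parameter are hypothetical.

```python
import random

def choose_action(Q, state, actions, epsilon, rng=random):
    """Epsilon-greedy, except that exploration prefers actions whose Q value is still 0.0."""
    if rng.random() >= epsilon:
        # Greedy step: pick the action with the highest Q value
        return max(actions, key=lambda a: Q.get((state, a), 0.0))
    # Exploration step: restrict to entries that still look untouched
    untried = [a for a in actions if Q.get((state, a), 0.0) == 0.0]
    # Fall back to a uniformly random action once every entry is non-zero
    return rng.choice(untried or list(actions))

Q = {(0, "left"): 0.3, (0, "right"): 0.0}
print(choose_action(Q, 0, ["left", "right"], epsilon=1.0))  # always explores here
```

One thing the sketch makes visible is that an entry which was updated but happens to land back on exactly 0.0 is indistinguishable from one that was never visited.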
Another idea would be to keep a table of visited state/action pairs and try the actions that have been tried the fewest times in that state. Of course, this can only be done in relatively small state spaces (in my case it is definitely possible).
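A minimal sketch of that visit-count table, assuming states and actions are hashable. The helper names are made up for illustration.

```python
from collections import defaultdict

def least_tried_action(counts, state, actions):
    """Return the action tried the fewest times in `state` (ties go to the first listed)."""
    return min(actions, key=lambda a: counts[(state, a)])

def record(counts, state, action):
    """Note that `action` was taken in `state`."""
    counts[(state, action)] += 1

counts = defaultdict(int)  # unseen (state, action) pairs start at 0
actions = ["left", "right"]
for _ in range(4):
    a = least_tried_action(counts, 0, actions)
    record(counts, 0, a)
print(dict(counts))
```

The table needs one counter per state/action pair, which is why this only seems practical for small state spaces.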
A third idea, for late in the exploration process, would be to choose an action not only by its own Q value but also by looking at the Q values of the actions available in the state it leads to, then at the actions available from those states, and so on.
I know these questions are kind of unrelated, but I'd like to hear the opinions of people who have worked with this before and (probably) struggled with some of the same issues.