
What difference does a big or small gamma value make to the algorithm? In my view, as long as it is neither 0 nor 1, it should work exactly the same. On the other hand, whatever gamma I choose, the Q-values seem to get very close to zero very quickly (in a quick test I'm seeing values on the order of 10^-300). Given that problem, how do people usually plot Q-values? (I'm plotting (x, y, best Q-value for that state).) I'm trying to get around it with logarithms, but even then it feels awkward.
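For plotting values spanning that many orders of magnitude, one common trick is a signed log scale (matplotlib's `SymLogNorm` does something similar for colour maps). As a minimal sketch, you can transform the values yourself before plotting; `log_q` and the `floor` value here are my own illustrative choices, not a standard API:

```python
import numpy as np

def log_q(q, floor=1e-320):
    """Map Q-values spanning many orders of magnitude (e.g. ~1e-300) onto a
    log scale usable as a colour axis: the floor maps to 0, larger magnitudes
    to positive numbers, and the sign of q is preserved."""
    q = np.asarray(q, dtype=float)
    mag = np.log10(np.maximum(np.abs(q), floor))   # e.g. 1e-300 -> about -300
    return np.sign(q) * (mag - np.log10(floor))    # shift so the floor maps to 0
```

You would then plot `log_q(best_q_per_state)` instead of the raw values, so the structure among tiny Q-values stays visible.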

Also, I don't understand the reason for having an alpha parameter in the Q-learning update rule. It basically sets the magnitude of the update we make to the Q-value function. I have the idea that it is usually decreased over time. What is the point of decreasing it over time? Should an update at the beginning carry more weight than one 1000 episodes later?

Also, I was thinking that a good way to explore the state space, whenever the agent doesn't take the greedy action, would be to pick any action in the current state that still has a zero Q-value (which, most of the time, means an action never tried before), but I don't see this mentioned in any literature. Are there any downsides to this? I know it can't be used with (at least some) generalization functions.

Another idea would be to keep a table of visited state/action pairs and prefer the actions that have been tried fewest times in the current state. Of course, this only works in relatively small state spaces (in my case it is definitely feasible).
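The visit-count idea above is easy to sketch; `counts` and `least_tried_action` are hypothetical names for illustration, assuming a small discrete state/action space:

```python
import numpy as np

# counts[s, a] = number of times action a has been taken in state s;
# increment counts[s, a] after each step the agent takes.
def least_tried_action(counts, s):
    """Return the action tried fewest times in state s (ties go to the lowest index)."""
    return int(np.argmin(counts[s]))
```

The exploration step would then call `least_tried_action(counts, s)` instead of picking a uniformly random action.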

A third idea, for late in the exploration process, would be to look not only at the Q-value of the selected action, but also at all the actions available in the resulting state, then at the states those lead to, and so on.

I know these questions are somewhat unrelated, but I'd like to hear from people who have worked with this before and (probably) struggled with some of them too.


Q-learning can be implemented as follows:

Q(s,a) += α⋅[r + γ⋅max_a′ Q(s′,a′) − Q(s,a)]
• s: is the previous state

• a: is the previous action

• Q(): is the Q-value function, i.e. the table of value estimates being learned

• s’: is the current state

• alpha: is the learning rate, generally set between 0 and 1. Setting alpha to 0 means the Q-values are never updated, so nothing is learned. A high value such as 0.9 means that learning can occur quickly.

• gamma: is the discount factor, set between 0 and 1. It models the fact that future rewards are worth less than immediate rewards.

• max: is the maximum Q-value attainable in the state following the current one (the estimated value of taking the best action from there onward).
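With those pieces defined, the update rule can be written as a one-step function. This is a minimal sketch assuming a tabular setup where `Q` is a NumPy array indexed by integer states and actions:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max over next actions
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q
```

For example, with all-zero Q-values, alpha = 0.5 and a reward of 1, the entry moves halfway toward the target of 1.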

The algorithm can be interpreted as:

1. Initialize the Q-values table, Q(s, a).

2. Observe the current state, s.

3. Take an action, a, for that state based on the selection policy.

4. Perform that action, and observe the reward, r, as well as the new state, s’.

5. Update the Q-value for that state-action pair using the observed reward and the maximum Q-value attainable from the next state.

6. Set the state to the new state, and repeat the process until a terminal state is reached.
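The six steps above can be sketched as a single loop. This assumes a hypothetical environment object exposing `reset() -> s` and `step(a) -> (s', r, done)`; the epsilon-greedy choice stands in for "the selection policy" in step 3:

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=1000):
    """Tabular Q-learning with an epsilon-greedy selection policy."""
    Q = np.zeros((n_states, n_actions))            # step 1: initialise the table
    for _ in range(episodes):
        s = env.reset()                            # step 2: observe the current state
        for _ in range(max_steps):
            if random.random() < epsilon:          # step 3: selection policy
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)          # step 4: act, observe r and s'
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])  # step 5
            s = s_next                             # step 6: move to the new state
            if done:                               # stop at a terminal state
                break
    return Q
```

The `max_steps` cap is just a safeguard against episodes that never reach a terminal state.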

Thus, alpha is the learning rate. If the reward or transition function is stochastic, then alpha should decay over time, approaching zero in the limit. This has the effect of averaging out the randomness, so the Q-values approximate the expected outcome when the transitions T, the rewards R, or both are random.
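One common decay schedule, shown here as an illustrative sketch (the `alpha_schedule` name and per-visit counting are my own framing), is roughly 1/n for the n-th visit to a state-action pair:

```python
def alpha_schedule(visit_count, alpha0=1.0):
    """Learning rate of roughly 1/n for the n-th visit to a state-action pair.
    Early updates move Q a lot; later updates move it less, so Q settles on an
    average over the random rewards/transitions rather than chasing the latest sample."""
    return alpha0 / (1.0 + visit_count)
```

This also answers the question above: an early update should not carry more weight because it is more important, but because with 1/n weighting every sample ends up contributing equally to the final average.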

Gamma, by contrast, is the weight given to future rewards. It can change the learning quite a bit and can be a dynamic or a static value. If it is equal to one, the agent values a future reward JUST AS MUCH as a current reward. This means that if an agent does something good ten actions from now, it is JUST AS VALUABLE as doing it immediately. For this reason, learning doesn't work well at gamma values very close to one.

Similarly, a gamma of zero causes the agent to value only immediate rewards, which only works with very detailed reward functions.
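The arithmetic behind both extremes fits in one line; this is just an illustrative helper, not part of the algorithm itself:

```python
def discounted_value(reward, steps, gamma):
    """Present value of a reward received `steps` actions in the future."""
    return reward * gamma ** steps
```

With gamma = 1, a reward ten steps away is worth exactly as much as an immediate one; with gamma = 0.5 it is already worth less than a thousandth; with gamma = 0 anything beyond the next step is worth nothing. This kind of geometric shrinkage over long horizons is also one reason Q-values can collapse toward zero, as in the question above.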