0 votes
2 views
in Machine Learning by (4.2k points)

ϵ-greedy policy

I know the Q-learning algorithm should try to balance between exploration and exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of exploration/exploitation behavior.

Optimal epsilon value

My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding the epsilon value. Should epsilon be decayed based on the number of times the algorithm has visited a given (state, action) pair, or based on the number of iterations performed?

My suggestions:

  1. Lower the epsilon value each time a given (state, action) pair is encountered.
  2. Lower the epsilon value after each complete iteration.
  3. Lower the epsilon value each time a state s is encountered.

Much appreciated!

1 Answer

+1 vote
by (6.8k points)

Although in many simple cases ε_k is kept fixed at a value between 0 and 1, you should know that the exploration usually diminishes over time, so that the policy being used asymptotically becomes greedy and therefore (as Q_k → Q) optimal. This can be achieved by making ε_k approach 0 as k grows. For instance, an ε-greedy exploration schedule of the form ε_k = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., while allowing infinitely many visits to all state-action pairs (Singh et al., 2000).

What I usually do is this: set ε_k = 1/k (starting with k = 1 or 2). As you go trial by trial and k increases, ε decreases, and the convergence guarantee is preserved.
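
For concreteness, here is a minimal Python sketch of ε-greedy action selection with the ε_k = 1/k schedule described above. This is only a sketch under assumptions: the tabular Q array, the state/action counts, and the choose_action helper are illustrative names, not part of the original answer.

import numpy as np

# Minimal sketch (assumed tabular Q-learning setting): epsilon-greedy
# action selection with a decaying schedule eps_k = 1/k.
# n_states, n_actions, Q and choose_action are illustrative names.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))      # tabular Q-values
rng = np.random.default_rng()

def choose_action(state, k):
    """Pick an action epsilon-greedily, with epsilon = 1/k."""
    eps = 1.0 / k                             # exploration decays as k grows
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # explore: random action
    return int(np.argmax(Q[state]))           # exploit: greedy action

# Usage: increment k once per trial/episode (start it at 1 or 2),
# so exploration diminishes over time as described above.
# k = 1
# for episode in range(1000):
#     a = choose_action(state, k)
#     ...  # Q-learning update here
#     k += 1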

