Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Machine Learning by (4.2k points)
ϵ-greedy policy

I know the Q-learning algorithm should try to balance between exploration and exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of exploration/exploitation behavior.

Optimal epsilon value

My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding the epsilon value. Should the epsilon be bounded by the number of times the algorithm have visited a given (state, action) pair, or should it be bounded by the number of iterations performed?

My suggestions:

  1. Lower the epsilon value for each time a given (state, action) pair has been encountered.
  2. Lower the epsilon value after a complete iteration has been performed.
  3. Lower the epsilon value for each time we encounter a state s.

Much appreciated!

1 Answer

+1 vote
by (6.8k points)

Although in many simple cases the εk is kept as a fixed number in range 0 and 1, you should know that: Usually, the exploration diminishes over time, so that the policy used asymptotically becomes greedy and therefore (as Qk → Q) optimal. This can be achieved by making εk approach 0 as k grows. For instance, a ε -greedy exploration schedule of the form εk = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., while allowing infinitely many visits to all the state-action pairs (Singh et al., 2000).

What I do usually is this: set the initial alpha = 1/k (consider the initial k = 1 or 2) after you go trial by trial as k increases the alpha will decrease. it also keeps the convergence guaranteed.

Studying Reinforcement Learning helps in making this course to be revised properly.

31k questions

32.9k answers

507 comments

693 users

...