
I'm studying Reinforcement Learning and reading Sutton's book for a university course. Besides the classic DP, MC, TD and Q-Learning algorithms, I'm reading about policy gradient methods and genetic algorithms for solving decision problems. I have no prior experience in this topic and I'm having trouble understanding when one technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain, or point me to a source that covers, the typical situations where certain methods should be used? As far as I understand:

Dynamic Programming and Linear Programming should be used only when the MDP has few states and actions and the model is known, since they are very expensive. But when is DP better than LP?

Monte Carlo methods are used when I don't have a model of the problem but I can generate samples. They are unbiased but have high variance.

Temporal Difference methods should be used when MC methods need too many samples to achieve low variance. But when should I use TD and when Q-Learning?

Policy Gradient and genetic algorithms are good for continuous MDPs. But when is one better than the other?
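To make the Monte Carlo point above concrete, here is a minimal sketch of estimating a start state's value by averaging sampled discounted returns. The episode generator is a made-up toy (not from the book); the estimate is unbiased, but its variance only shrinks with the number of episodes:

```python
import random

def mc_value_estimate(sample_episode, gamma=0.9, n_episodes=1000):
    """Monte Carlo estimate of the start state's value.

    `sample_episode` is assumed to return the list of rewards for one
    episode sampled from the start state (model-free sampling).
    """
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode()
        g = 0.0
        # Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
        for r in reversed(rewards):
            g = r + gamma * g
        total += g
    # Unbiased estimate; standard error decreases as 1/sqrt(n_episodes)
    return total / n_episodes

# Toy episode generator: two steps with noisy rewards (illustrative only)
def sample_episode():
    return [random.gauss(1.0, 0.5), random.gauss(1.0, 0.5)]
```

With this toy generator the true value is 1 + 0.9 * 1 = 1.9, and the estimate converges toward it as more episodes are averaged.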

More precisely, I think that to choose a learning method a programmer should ask the following questions:

does the agent learn online or offline?

can we separate the exploring and exploiting phases?

can we perform enough exploration?

is the horizon of the MDP finite or infinite?

are states and actions continuous?

But I don't know how these details of the problem affect the choice of a learning method. I hope that some programmer has already had some experience with RL methods and can help me to better understand their applications.

## 1 Answer


The first thing you have to decide is whether your agent will learn online or offline. A closely related distinction is on-policy versus off-policy learning (e.g. SARSA is on-policy, Q-learning is off-policy). Online methods need more attention and computation because they work with real-time data.
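The SARSA/Q-learning contrast mentioned above comes down to one term in the tabular update rule. A minimal sketch, assuming `Q` is a dict of dicts mapping states to per-action values:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy (SARSA): bootstrap from the action a2 the policy
    actually took in the next state s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Off-policy (Q-learning): bootstrap from the greedy (max-value)
    action in the next state s2, regardless of what was actually taken."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
```

The only difference is `Q[s2][a2]` versus `max(Q[s2].values())`: SARSA evaluates the policy it is following, while Q-learning directly estimates the greedy policy even while behaving differently.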

There are exploring and exploiting phases, and balancing the two is central to learning.

For example, in the epsilon-greedy action selection algorithm, we explore with probability epsilon and exploit (choose the greedy action) with probability 1-epsilon. You can also separate the two phases and just explore first (e.g. by choosing random actions) and then exploit what you learned. But this is only possible when you are learning offline, probably with a model of the system's dynamics, and it usually means collecting a lot of sample data in advance.
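Epsilon-greedy selection described above can be sketched in a few lines (the action-value list here is a placeholder for whatever estimates your learner maintains):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index: with probability epsilon explore
    (uniform random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit
```

Setting epsilon to 0 gives a purely greedy policy; decaying epsilon over time is a common way to shift from exploration toward exploitation.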

How much exploration you can afford depends on the problem. For example, if you have a simulation model of the problem in memory, you can explore as much as you want; without one, exploration is limited by the amount of data you can collect.

There are both discrete and continuous algorithms developed for RL. Some "continuous" algorithms internally discretize the state or action spaces.

Hope this answer helps.
