Let's assume we're in a room where our agent can move along the x and y axes. At each point, he can move up, down, right, and left, so our state space can be defined by (x, y) and our actions at each point are (up, down, right, left). Let's assume that whenever our agent takes an action that makes him hit a wall, we give him a negative reward of -1 and put him back in the state he was in before. If he finds the puppet in the center of the room, he wins a +10 reward.
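For concreteness, here is a minimal Python sketch of such a room; the 5x5 size, the goal position, and all names are illustrative assumptions rather than part of the original question.

# Minimal gridworld matching the description above (size/names are assumptions).
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "RIGHT": (1, 0), "LEFT": (-1, 0)}

class Room:
    def __init__(self, size=5, goal=(2, 2)):
        self.size = size      # the room is size x size
        self.goal = goal      # where the puppet sits, worth +10

    def step(self, state, action):
        """Return (next_state, reward) for a deterministic move."""
        dx, dy = ACTIONS[action]
        x, y = state[0] + dx, state[1] + dy
        if not (0 <= x < self.size and 0 <= y < self.size):
            return state, -1.0        # hit a wall: -1 and stay put
        if (x, y) == self.goal:
            return (x, y), 10.0       # found the puppet
        return (x, y), 0.0            # ordinary move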

When we update our Q-value for a given state/action pair, we look at the actions available in the new state and compute the maximum Q-value obtainable there, and use that to update Q(s, a) for our current state/action. This means that if we have a goal state at the point (10, 10), all states around it will have Q-values that get smaller and smaller as they move farther away. With the walls, however, the same does not seem to be true.

When the agent hits a wall (let's say he's in position (0, 0) and takes the action UP), he receives a reward of -1 for that state/action pair, giving it a Q-value of -1.

Now, if I am in state (0, 1), and assuming all the other actions of state (0, 0) are zero, then when calculating the Q-value of (0, 1) for the action LEFT, it is computed the following way:

Q([0,1], LEFT) = 0 + gamma * (max { 0, 0, 0, -1 } ) = 0 + 0 = 0

That is, having hit the wall doesn't propagate to nearby states, contrary to what happens with positive-reward states.
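A small numeric check makes this concrete. The sketch below follows the question's coordinate convention (LEFT from (0, 1) leads to (0, 0)); the learning rate of 1 and gamma = 0.9 are assumptions chosen for clarity.

gamma = 0.9
Q = {(0, 0): {"UP": 0.0, "DOWN": 0.0, "RIGHT": 0.0, "LEFT": 0.0},
     (0, 1): {"UP": 0.0, "DOWN": 0.0, "RIGHT": 0.0, "LEFT": 0.0}}

# At (0, 0) the agent moves UP, hits the wall, gets -1, and stays at (0, 0):
Q[(0, 0)]["UP"] = -1 + gamma * max(Q[(0, 0)].values())   # -> -1.0

# At (0, 1) the agent moves LEFT into (0, 0) with reward 0; the max over
# (0, 0)'s Q-values is still 0, so the -1 never reaches (0, 1):
Q[(0, 1)]["LEFT"] = 0 + gamma * max(Q[(0, 0)].values())  # -> 0.0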

To me, this seems odd. At first, I thought that finding state/action pairs with negative rewards would be just as useful for learning as finding ones with positive rewards, but the example above suggests that isn't the case. The algorithm seems biased toward taking positive rewards into account far more than negative ones.

Is this the expected behavior of Q-learning? Shouldn't bad rewards be just as important as positive ones? What are the workarounds for this?

1 Answer


You can do away with negative rewards by shifting the reward scale: increase the default reward from 0 to 1, the goal reward from 10 to 11, and the penalty from -1 to 0.
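As a rough sketch of that shift (building on the Room sketch above; the wrapper name is an assumption), you simply add +1 to every reward so the wall penalty becomes 0 and nothing is ever negative:

def shifted_step(env, state, action):
    """Same transition as env.step, but rewards shifted: -1 -> 0, 0 -> 1, 10 -> 11."""
    next_state, reward = env.step(state, action)
    return next_state, reward + 1.0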

There are many scientific publications on Q-learning, so I'm sure there are other formulations that would allow for negative feedback.

The reason for your observation is that there is no uncertainty about the outcome of your actions or about the state the agent ends up in, so the agent can always choose the action it considers optimal (i.e., the one with the maximum Q-value over all future actions). This is why your negative feedback doesn't propagate: the agent will simply avoid that action in the future.

If your model includes uncertainty over the outcome of your actions (for instance, there is always a 10% chance of moving in a random direction), your learning rule should integrate over all feasible future rewards, basically replacing the max with a weighted sum. In that case, negative feedback can be propagated too.
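A sketch of that weighted-sum backup is given below, assuming a 10% chance of slipping in a random direction and reusing the Room and Q-table sketches from above; the helper names and probabilities are assumptions for illustration.

SLIP = 0.10  # probability of moving in a random direction instead of the intended one

def successors(env, state, action):
    """Yield (probability, next_state, reward) for every outcome of a slippery move."""
    outcomes = [(1 - SLIP, action)] + [(SLIP / len(ACTIONS), a) for a in ACTIONS]
    for prob, a in outcomes:
        next_state, reward = env.step(state, a)
        yield prob, next_state, reward

def expected_backup(env, Q, state, action, gamma=0.9):
    """Bellman backup that averages over all possible next states
    instead of backing up a single deterministic successor.
    Q is assumed to map every state to its action-value dict."""
    return sum(p * (r + gamma * max(Q[s2].values()))
               for p, s2, r in successors(env, state, action))

With this transition model, a state next to a wall sometimes slips into the wall even when the agent picks a "safe" action, so the -1 enters the weighted sum and the penalty does propagate to neighbouring states.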

