As far as I understand, a look-up table of Q-values (evaluations of state-action pairs) can be replaced by a neural network that estimates those values. I wrote a small library that can propagate forward and backpropagate through a self-built neural network, so that it learns the desired target values for a given input.

There is one output neuron per action, and the activation value of each output unit gives me the estimated Q-value for that action. (One question: is the activation value the same as the "output" of the neuron, or something different?)

I used the standard sigmoid function as the activation function, so its output values lie in the range (0, 1).

So I thought my target values should always lie between 0.0 and 1.0. Question: is that understanding correct, or did I misunderstand something?

If so, the following problem arises. The equation for calculating the new Q-value is: Q(s,a) = Q(s,a) + learning_rate * (reward + discount_factor * max_a' Q(s',a') - Q(s,a)), where s' is the next state and the maximum is taken over the actions a' available there.
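A minimal sketch of a single update step under this rule, reading the next-state term as the best Q-value available in the next state. The function name `q_update`, the parameter names, and all numbers are made up for illustration and do not come from any library:

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Return the updated Q(s, a) after one Q-learning step.

    q_sa          -- current estimate Q(s, a)
    reward        -- reward observed after taking action a in state s
    next_q_values -- current estimates Q(s', a') for every action a' in s'
    alpha, gamma  -- learning rate and discount factor (illustrative values)
    """
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + gamma * max(next_q_values)
    # Move the old estimate a fraction alpha of the way toward the target.
    return q_sa + alpha * (target - q_sa)


# Illustrative numbers only.
new_q = q_update(q_sa=0.5, reward=1.0, next_q_values=[0.2, 0.8])
```

With these numbers the target is about 1.72, which is already larger than 1.0, so an output unit restricted to (0, 1) could not represent it as is.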

So how do I apply this equation to get the right target for the neural network if targets should lie between 0.0 and 1.0? How do I choose good reward values? Should moving toward the goal be worth more than moving away from it costs (a larger positive reward for getting closer than the negative reward for getting farther away)?

I think I have misunderstood a few things here. I hope you can help me answer these questions. Thank you very much!

1 Answer

Rather than beginning with a complex and heavy deep neural network, we will begin by implementing a simple lookup-table version of the algorithm, and then show how to implement a neural-network equivalent using TensorFlow.

In its purest implementation, Q-Learning is a table of values for every state (row) and action (column) possible in the environment. Within each cell of the table, we learn a value for how good it is to take a given action within a given state. In the case of the FrozenLake environment, we have 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of Q-values. We start by initializing the table uniformly (all zeros), and then, as we observe the rewards we obtain for various actions, we update the table accordingly.
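The table-based approach described above can be sketched in plain Python. This is only an illustration under stated assumptions: `step` is a hand-written, deterministic stand-in for a 4x4 FrozenLake-style grid (not the actual Gym environment), the hole positions, the reward of 1.0 at the goal, and the hyperparameters are all assumed, and the names `step` and `train` are hypothetical.

```python
import random

N_STATES, N_ACTIONS = 16, 4        # 4x4 grid; actions: 0=left, 1=down, 2=right, 3=up
HOLES, GOAL = {5, 7, 11, 12}, 15   # assumed layout, for illustration only


def step(state, action):
    """Deterministic move on the grid; returns (next_state, reward, done)."""
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    else:
        row = max(row - 1, 0)
    nxt = row * 4 + col
    if nxt == GOAL:
        return nxt, 1.0, True       # assumed reward: 1.0 only at the goal
    return nxt, 0.0, nxt in HOLES   # falling into a hole ends the episode


def train(episodes=2000, alpha=0.5, gamma=0.95, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    # 16x4 table of Q-values, initialized to all zeros.
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):        # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                # Greedy choice, breaking ties at random.
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(N_ACTIONS) if q[state][a] == best]
                )
            nxt, reward, done = step(state, action)
            # No bootstrapping from a terminal state.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
```

Because the only reward in this sketch is 1.0 at the goal and the discount factor is below 1, every entry of `q_table` stays between 0.0 and 1.0. That is a property of this particular reward choice, not of Q-learning in general.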

