
As far as I understand, it is possible to replace the look-up table for Q-values (state-action pair evaluations) with a neural network that estimates them. I wrote a small library that can propagate and backpropagate through a self-built neural network to learn the desired target values for a given input.

There is one output neuron per action, and the activation value of each output unit gives me the estimated Q-value for that action. (One question: is the activation value the same as the "output" of the neuron, or something different?)

I used the standard sigmoid function as the activation function, so the range of the function values x is

0<x<1

So I thought my target values should always lie between 0.0 and 1.0. Question: is that understanding correct, or did I misunderstand something?

If so, the following problem arises. The equation for calculating the new Q-value is:

q(s,a) = q(s,a) + learning_rate * (reward + discount_factor * max_a' q(s',a') - q(s,a))

where s' is the next state and a' ranges over the actions available there.

So how do I apply this equation to get the right target for the neural network if targets have to lie between 0.0 and 1.0? How do I choose good reward values? Should moving toward the goal count for more than moving away from it (a larger positive reward for getting closer than the negative reward for increasing the distance)?

I think I have misunderstood a few things. I hope you can help me answer these questions. Thank you very much!

by (108k points)

Rather than starting with a complex, heavy deep neural network, it helps to start with a simple lookup-table version of the algorithm and only then move to a neural-network equivalent in TensorFlow, as the article linked below does.

In its purest implementation, Q-Learning is a table of values for every state (row) and action (column) possible in the environment. Within each cell of the table, we learn a value for how good it is to take a given action within a given state. In the case of the FrozenLake environment, we have 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of Q-values. We start by initializing the table to be uniform (all zeros), and then, as we observe the rewards we obtain for various actions, we update the table accordingly.
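The table-based update described above can be sketched in plain Python. This is only an illustration under stated assumptions: instead of the real FrozenLake environment it uses a hypothetical deterministic 4x4 grid (the `step` function, the goal in cell 15, and the reward of 1.0 on reaching it are all made up for this example), and the hyperparameters `alpha`, `gamma`, and `epsilon` are arbitrary choices.

```python
import random

N_STATES, N_ACTIONS = 16, 4                  # 4x4 grid, four movement directions
GOAL = 15                                    # assumption: bottom-right cell is the goal
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right as (row, col) offsets


def step(state, action):
    """Hypothetical deterministic grid: move if possible, reward 1.0 at the goal."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)        # walls: stay inside the grid
    col = min(max(col + d_col, 0), 3)
    next_state = row * 4 + col
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.8, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # The 16x4 table of Q-values, initialized to all zeros.
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                 # cap on steps per episode
            if rng.random() < epsilon:       # explore with probability epsilon
                action = rng.randrange(N_ACTIONS)
            else:                            # otherwise act greedily, ties broken at random
                best = max(q[state])
                action = rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Target: reward plus discounted best value of the next state
            # (no bootstrap term once the episode has ended).
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
```

Because every reward in this toy grid lies in [0, 1] and the discount factor is below 1, the learned table entries also stay in [0, 1]; with other reward choices that would not hold.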

For more information, refer to the following link: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
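As a rough sketch of the neural-network counterpart of the table, here is a single linear layer written in plain Python rather than TensorFlow. Everything in it is an assumption made for illustration: the one-hot state encoding, the names `forward` and `update`, the learning rate, and the example transition with a reward of 5.0. The output layer is deliberately linear instead of sigmoid, so the network's outputs are not confined to the range 0 to 1 and the target `reward + gamma * max Q(next state)` can be used directly as the regression target for the chosen action.

```python
N_STATES, N_ACTIONS = 16, 4


def one_hot(state):
    """Encode a state index as a 16-element input vector."""
    return [1.0 if i == state else 0.0 for i in range(N_STATES)]


def forward(weights, x):
    """Linear output layer: one unbounded Q-value estimate per action."""
    return [sum(x[i] * weights[i][a] for i in range(N_STATES))
            for a in range(N_ACTIONS)]


def update(weights, state, action, reward, next_state, done, lr=0.1, gamma=0.99):
    """One gradient step on the squared error of the chosen action's output."""
    x = one_hot(state)
    q_now = forward(weights, x)
    q_next = 0.0 if done else max(forward(weights, one_hot(next_state)))
    target = reward + gamma * q_next
    td_error = target - q_now[action]
    # Gradient of 0.5 * td_error**2 w.r.t. weights[i][action] is -td_error * x[i],
    # so gradient descent adds lr * td_error * x[i].
    for i in range(N_STATES):
        weights[i][action] += lr * td_error * x[i]
    return td_error


weights = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

# Repeatedly fit one hypothetical terminal transition:
# state 14, action 3 (right), reward 5.0, next state 15.
for _ in range(200):
    update(weights, 14, 3, 5.0, 15, True)

estimate = forward(weights, one_hot(14))[3]   # approaches the target of 5.0
```

Only the output for the action actually taken is trained on each step; the other outputs are left untouched because no target is observed for them. If a sigmoid output were kept instead, the rewards would have to be scaled so that every target falls inside (0, 1).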