As far as I understand, it is possible to replace a lookup table of Q-values (state-action values) with a neural network that estimates them. I programmed a small library that can forward-propagate and backpropagate through a self-built neural network in order to learn target values for a given input/output pair.

For each action there is a separate output neuron, and the activation value of one of these output units gives the estimated Q-value. (One question: is the activation value the same thing as the "output" of the neuron, or something different?)

I used the standard sigmoid function as the activation function, so the range of the function values x is

0 < x < 1

So I thought my target values should always lie between 0.0 and 1.0. Question: is that understanding correct, or have I misunderstood something here?

If yes, the following problem arises. The update equation for the new Q-value (the target) is: q(s,a) = q(s,a) + learning_rate * (reward + discount_factor * max_a' q(s',a') - q(s,a)), where s' is the next state and max_a' q(s',a') is the highest estimated Q-value over the actions available there.
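For concreteness, here is a minimal sketch of how the target for the network's output could be computed from that equation. The function name `td_target` and the example numbers are my own illustration, not part of any particular library:

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.9, done=False):
    """Q-learning target: reward + gamma * max_a' Q(s', a').

    next_q_values: the network's output vector for the next state s'
    (one entry per action). For a terminal transition the target is
    just the reward, since there is no next state to bootstrap from.
    """
    if done:
        return reward
    return reward + gamma * np.max(next_q_values)

# Example: reward 0.1, and the network outputs [0.2, 0.5, 0.3] for s'
target = td_target(0.1, np.array([0.2, 0.5, 0.3]), gamma=0.9)
# target = 0.1 + 0.9 * 0.5 = 0.55
```

This target would then be used as the training label for the output neuron of the action that was taken, leaving the other output neurons' targets at their current predictions.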

So how do I apply this equation to get a valid target for the neural network if the targets must be between 0.0 and 1.0? And how do I choose good reward values? Should moving toward the goal be worth more than moving away from it is penalized (i.e., a larger positive reward for getting closer than the negative reward for increasing the distance)?
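To make the conflict I am describing concrete, here is a tiny numerical sketch (the reward and Q-value numbers are made up purely for illustration) showing how the target can leave the sigmoid's (0, 1) range even when all network outputs stay inside it:

```python
gamma = 0.9
reward = 0.5        # hypothetical reward for moving toward the goal
max_next_q = 0.8    # network's best Q-estimate for the next state (a sigmoid output, so < 1)

# TD target: reward + gamma * max_a' Q(s', a')
target = reward + gamma * max_next_q
print(target)  # 1.22 -> outside the sigmoid's output range (0, 1)
```

So unless the rewards (and with them the true Q-values) are scaled down, a sigmoid output layer cannot even represent the target, which is exactly why I am unsure how to pick reward values.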

I think I have some misunderstandings here. I hope you can help me answer these questions. Thank you very much!