I have implemented Q-learning as described in:

__http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf__

To approximate Q(S, A) I use a neural network with the following structure:

- Activation: sigmoid
- Inputs: the state inputs plus one extra neuron for the action (all inputs scaled to 0-1)
- Output: a single neuron, the Q-value
- Hidden layers: N layers of M neurons each
- Exploration: take a random action when 0 < rand() < propExplore
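Put concretely, the architecture above could look something like this (a minimal NumPy sketch; the single hidden layer, layer sizes, and weight initialisation are my own illustrative choices, and `prop_explore` stands for the `propExplore` above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QNet:
    """Q(s, a) approximator: state inputs + 1 action input -> 1 Q-value.
    One hidden layer here for brevity; the post allows N hidden layers."""
    def __init__(self, n_state, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_state + 1, n_hidden))  # +1 for the action input
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
        self.b2 = np.zeros(1)

    def q(self, state, action):
        x = np.append(state, action)              # all inputs scaled to 0-1
        h = sigmoid(x @ self.W1 + self.b1)
        return sigmoid(h @ self.W2 + self.b2)[0]  # sigmoid output => Q in (0, 1)

def choose_action(net, state, actions, prop_explore, rng):
    """Epsilon-greedy exploration: random action when rand() < propExplore."""
    if rng.random() < prop_explore:
        return rng.choice(actions)
    return max(actions, key=lambda a: net.q(state, a))
```

Note that the sigmoid on the output neuron confines every Q-value this network can emit to (0, 1), which is relevant to Q2 below.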

At each learning iteration I calculate a Q-target value using the standard Q-learning update,

QTarget = reward + gamma * max_a' Q(s', a')

then compute an error using

error = QTarget - LastQValueReturnedFromNN

and backpropagate the error through the neural network.
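For example, a single learning step might look like the following (a hedged sketch: `gamma`, the terminal-state case, and all the numbers are illustrative details not spelled out in the post):

```python
def q_target(reward, gamma, next_q_values, terminal):
    """Q-learning target: r if terminal, else r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

# One update: compute the target, then the error to backpropagate.
last_q = 0.40                                  # Q-value the NN returned for (s, a)
target = q_target(reward=1.0, gamma=0.9,
                  next_q_values=[0.20, 0.50],  # Q(s', a') for each possible next action
                  terminal=False)
error = target - last_q                        # QTarget - LastQValueReturnedFromNN
```

Here the target is 1.0 + 0.9 * 0.50 = 1.45, so the error 1.05 is what gets backpropagated.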

Q1: Am I on the right track? I have seen some papers that instead use an NN with one output neuron per action.

Q2: My reward function returns a number between -1 and 1. Is it OK to return rewards in that range when the output activation is a sigmoid, whose range is (0, 1)?

Q3: From my understanding, given enough training instances this method should be guaranteed to find an optimal policy, right? Yet when training on XOR it sometimes learns after 2k iterations, and sometimes fails to learn even after 40k-50k iterations.