
I'm trying to implement the deep Q-learning algorithm for a Pong game. I've already implemented Q-learning using a table as the Q-function. It works very well and learns to beat the naive AI within 10 minutes. But I can't make it work using a neural network as the Q-function approximator.

I want to know if I am on the right track, so here is a summary of what I am doing:

I'm storing the current state, the action taken and the reward as the current Experience in the replay memory.
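The target formula further down uses `qMaxPostState` and `endOfEpisode`, so each stored experience also needs the state that followed the action and a flag saying whether the episode ended there. A minimal sketch of a fixed-size replay memory, assuming that layout (class and field names are hypothetical). It samples transitions independently at random rather than as a run of consecutive ones; random minibatches are what the original DQN work uses, because consecutive Pong frames are strongly correlated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** One stored transition (s, a, r, s', done); field names are hypothetical. */
final class Transition {
    final double[] state;     // s
    final int action;         // a
    final double reward;      // r
    final double[] nextState; // s', needed for max over a' of Q(s', a')
    final boolean terminal;   // true if the episode ended after this action

    Transition(double[] state, int action, double reward, double[] nextState, boolean terminal) {
        this.state = state;
        this.action = action;
        this.reward = reward;
        this.nextState = nextState;
        this.terminal = terminal;
    }
}

/** Fixed-capacity replay memory that overwrites the oldest transition once full. */
final class ReplayMemory {
    private final int capacity;
    private final List<Transition> buffer = new ArrayList<>();
    private final Random rng;
    private int next = 0; // slot to overwrite once the buffer is full

    ReplayMemory(int capacity, long seed) {
        this.capacity = capacity;
        this.rng = new Random(seed);
    }

    void add(Transition t) {
        if (buffer.size() < capacity) {
            buffer.add(t);
        } else {
            buffer.set(next, t); // ring buffer: replace the oldest entry
        }
        next = (next + 1) % capacity;
    }

    int size() {
        return buffer.size();
    }

    /** Draws batchSize transitions uniformly at random (with replacement). */
    List<Transition> sample(int batchSize) {
        if (buffer.isEmpty()) throw new IllegalStateException("replay memory is empty");
        List<Transition> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            batch.add(buffer.get(rng.nextInt(buffer.size())));
        }
        return batch;
    }
}
```

Usage: call `memory.add(new Transition(s, a, r, sNext, done))` after every game step and `memory.sample(32)` whenever a training step is due.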

I'm using a multi-layer perceptron with 1 hidden layer of 512 hidden units as the Q-function. For the input -> hidden layer I am using a sigmoid activation function; for the hidden -> output layer I'm using a linear activation function.
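A plain-Java sketch of the network described above (sigmoid hidden layer, linear output layer), written without Encog so as not to guess at its API. It assumes one output unit per action, so a single forward pass yields Q(s, a) for every action, which is what the max in the target needs. The input and action counts in the usage note are assumptions.

```java
import java.util.Random;

/** Q-network sketch: inputs -> sigmoid hidden layer -> linear outputs, one per action. */
final class QMlp {
    private final double[][] w1; // [hidden][input]
    private final double[] b1;
    private final double[][] w2; // [action][hidden]
    private final double[] b2;

    QMlp(int nInputs, int nHidden, int nActions, long seed) {
        Random rng = new Random(seed);
        // Small initial weights keep the sigmoids out of their flat, saturated regions.
        w1 = randomMatrix(nHidden, nInputs, 1.0 / Math.sqrt(nInputs), rng);
        b1 = new double[nHidden];
        w2 = randomMatrix(nActions, nHidden, 1.0 / Math.sqrt(nHidden), rng);
        b2 = new double[nActions];
    }

    private static double[][] randomMatrix(int rows, int cols, double scale, Random rng) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int j = 0; j < cols; j++) row[j] = (2 * rng.nextDouble() - 1) * scale; // uniform in [-scale, scale)
        }
        return m;
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Returns Q(s, a) for every action a in one forward pass. */
    double[] qValues(double[] state) {
        double[] h = new double[b1.length];
        for (int i = 0; i < h.length; i++) {
            double z = b1[i];
            for (int j = 0; j < state.length; j++) z += w1[i][j] * state[j];
            h[i] = sigmoid(z);
        }
        double[] q = new double[b2.length];
        for (int k = 0; k < q.length; k++) {
            double z = b2[k];
            for (int i = 0; i < h.length; i++) z += w2[k][i] * h[i];
            q[k] = z; // linear output, so Q-values are not squashed into (0, 1)
        }
        return q;
    }
}
```

Usage, assuming 6 state inputs and 3 actions (up, down, stay): `new QMlp(6, 512, 3, 42L).qValues(encodedState)`. The linear output layer matters because Q-values can be negative or larger than 1, which a sigmoid output could not represent.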

A state is represented by the positions of both players and the ball, as well as the velocity of the ball. Positions are remapped to a much smaller state space.
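The coarse remapping is what makes a Q-table feasible; a network accepts continuous inputs, and what tends to matter more there is that every input lies in a similar small range, especially with sigmoid hidden units. A sketch contrasting the two encodings, assuming a hypothetical raw state in screen pixels with each paddle described by its y coordinate (the screen size and maximum speed are assumed parameters):

```java
/** Hypothetical raw Pong state in screen units (pixels, pixels per frame). */
final class PongState {
    final double leftPaddleY, rightPaddleY, ballX, ballY, ballVx, ballVy;

    PongState(double leftPaddleY, double rightPaddleY, double ballX, double ballY,
              double ballVx, double ballVy) {
        this.leftPaddleY = leftPaddleY;
        this.rightPaddleY = rightPaddleY;
        this.ballX = ballX;
        this.ballY = ballY;
        this.ballVx = ballVx;
        this.ballVy = ballVy;
    }
}

final class StateEncoder {
    private final double width, height, maxSpeed;

    StateEncoder(double width, double height, double maxSpeed) {
        this.width = width;
        this.height = height;
        this.maxSpeed = maxSpeed;
    }

    /** Coarse remapping of a coordinate in [0, max] to one of nBuckets cells, as for a Q-table. */
    static int bucket(double value, double max, int nBuckets) {
        int b = (int) Math.floor(value / max * nBuckets);
        return Math.max(0, Math.min(nBuckets - 1, b)); // clamp values on or beyond the border
    }

    /** Continuous network input: each feature scaled to roughly [-1, 1]. */
    double[] encode(PongState s) {
        return new double[] {
            2 * s.leftPaddleY / height - 1,
            2 * s.rightPaddleY / height - 1,
            2 * s.ballX / width - 1,
            2 * s.ballY / height - 1,
            s.ballVx / maxSpeed,
            s.ballVy / maxSpeed
        };
    }
}
```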

I am using an epsilon-greedy approach to explore the state space, where epsilon gradually decreases to 0.
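A sketch of epsilon-greedy action selection with a linearly decaying epsilon. The linear schedule and the parameter names are assumptions; `epsEnd` can be 0, as in the question, or a small floor if some exploration should remain.

```java
import java.util.Random;

final class EpsilonGreedy {
    private final Random rng;
    private final double epsStart, epsEnd;
    private final int decaySteps;

    EpsilonGreedy(double epsStart, double epsEnd, int decaySteps, long seed) {
        this.epsStart = epsStart;
        this.epsEnd = epsEnd;
        this.decaySteps = decaySteps;
        this.rng = new Random(seed);
    }

    /** Linear decay from epsStart to epsEnd over decaySteps steps, constant afterwards. */
    double epsilon(int step) {
        if (step >= decaySteps) return epsEnd;
        return epsStart + (epsEnd - epsStart) * step / (double) decaySteps;
    }

    /** With probability epsilon a uniformly random action, otherwise the greedy one. */
    int select(double[] qValues, int step) {
        if (rng.nextDouble() < epsilon(step)) return rng.nextInt(qValues.length);
        return argmax(qValues);
    }

    static int argmax(double[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) {
            if (v[i] > v[best]) best = i;
        }
        return best;
    }
}
```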

When learning, I select a random batch of 32 subsequent experiences. Then I compute the target Q-value for each current state-action pair Q(s, a):

    for each Experience e in batch
        if e == endOfEpisode
            target = e.getReward
        else
            target = e.getReward + discountFactor * qMaxPostState   // max over a' of Q(s', a'), s' = state after e
        end
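The target loop above in runnable Java, as a sketch. It assumes each experience carries its post-state and an end-of-episode flag, and takes the Q-function as a `Function` returning one Q-value per action; the `Experience` class here is a hypothetical stand-in.

```java
import java.util.function.Function;

final class TargetCalculator {
    /** Minimal experience record for this sketch; field names are hypothetical. */
    static final class Experience {
        final double[] state;
        final int action;
        final double reward;
        final double[] nextState;
        final boolean endOfEpisode;

        Experience(double[] state, int action, double reward, double[] nextState, boolean endOfEpisode) {
            this.state = state;
            this.action = action;
            this.reward = reward;
            this.nextState = nextState;
            this.endOfEpisode = endOfEpisode;
        }
    }

    /** y = r at the end of an episode, otherwise y = r + discountFactor * max over a' of Q(s', a'). */
    static double[] targets(Experience[] batch, Function<double[], double[]> q, double discountFactor) {
        double[] y = new double[batch.length];
        for (int i = 0; i < batch.length; i++) {
            Experience e = batch[i];
            if (e.endOfEpisode) {
                y[i] = e.reward; // no future reward after a terminal state
            } else {
                double qMaxPostState = max(q.apply(e.nextState));
                y[i] = e.reward + discountFactor * qMaxPostState;
            }
        }
        return y;
    }

    static double max(double[] v) {
        double m = Double.NEGATIVE_INFINITY;
        for (double x : v) m = Math.max(m, x);
        return m;
    }
}
```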

Now that I have a set of 32 target Q-values, I train the neural network on them using batch gradient descent. I only do 1 training step. How many should I do?
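Each target belongs to one action only, so the error should flow only through the output of the action that was actually taken. A sketch of one batch gradient-descent step with that masking, assuming a linear output layer over some state features `phi(s)` (for example the hidden-layer activations) stands in for the whole network. With a library that trains on full target vectors, the same effect is commonly obtained by copying the network's current outputs and overwriting only the taken action's entry. Repeating many steps on the same 32 targets tends to fit that one batch; DQN-style training usually takes one step per freshly sampled minibatch instead.

```java
/** Linear Q-layer Q(s, a) = w[a] . phi(s) + b[a]; a stand-in for the network's output layer. */
final class LinearQLayer {
    final double[][] w; // [action][feature]
    final double[] b;   // [action]

    LinearQLayer(int nActions, int nFeatures) {
        w = new double[nActions][nFeatures];
        b = new double[nActions];
    }

    double q(double[] phi, int a) {
        double s = b[a];
        for (int j = 0; j < phi.length; j++) s += w[a][j] * phi[j];
        return s;
    }

    /** Mean squared error over the batch, using only the output of the action taken. */
    double loss(double[][] phis, int[] actions, double[] targets) {
        double sum = 0;
        for (int i = 0; i < phis.length; i++) {
            double err = q(phis[i], actions[i]) - targets[i];
            sum += err * err;
        }
        return sum / phis.length;
    }

    /** One gradient step on (1 / 2n) * sum of squared errors; untaken actions get zero error. */
    void step(double[][] phis, int[] actions, double[] targets, double lr) {
        int n = phis.length;
        double[][] gw = new double[w.length][w[0].length];
        double[] gb = new double[b.length];
        for (int i = 0; i < n; i++) {
            int a = actions[i];
            double err = q(phis[i], a) - targets[i]; // derivative of 0.5 * err^2 w.r.t. Q(s, a)
            for (int j = 0; j < phis[i].length; j++) gw[a][j] += err * phis[i][j] / n;
            gb[a] += err / n;
        }
        for (int a = 0; a < w.length; a++) {
            for (int j = 0; j < w[a].length; j++) w[a][j] -= lr * gw[a][j];
            b[a] -= lr * gb[a];
        }
    }
}
```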

I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and the resulting performance is very weak. I think I am missing something but can't figure out what. I would expect at least a somewhat decent result, as the table approach has no problems.


In Q-learning, where do we need neural networks?

If there is a very large number of state-action pairs, it is not feasible to store every Q-factor separately.

In that case, it makes sense to store the Q-factors for a given action in one neural network.

When a Q-factor is required, it is fetched from its neural network. When a Q-factor is to be updated, the new Q-factor is used to update the neural network itself. For any given action a, Q(i, a) is a function of i, the state. Hence, in what follows we will call it a Q-function.
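The fetch/update scheme above can be sketched with one approximator per action. Assumptions: each "network" is replaced by a linear model over state features, refreshed with a single delta-rule (Widrow-Hoff) step toward the new Q-factor; class and method names are hypothetical, and this is not necessarily the exact scheme in the linked notes.

```java
/** One approximator per action; a linear model stands in for each action's network. */
final class PerActionQ {
    private final double[][] weights; // weights[a] = parameters for action a; last entry is the bias
    private final double lr;

    PerActionQ(int nActions, int nFeatures, double lr) {
        this.weights = new double[nActions][nFeatures + 1];
        this.lr = lr;
    }

    /** Fetch Q(i, a) from the model belonging to action a. */
    double fetch(double[] stateFeatures, int action) {
        double[] w = weights[action];
        double q = w[w.length - 1];
        for (int j = 0; j < stateFeatures.length; j++) q += w[j] * stateFeatures[j];
        return q;
    }

    /** Move action a's model toward a newly computed Q-factor with one delta-rule step. */
    void refresh(double[] stateFeatures, int action, double newQ) {
        double err = newQ - fetch(stateFeatures, action);
        double[] w = weights[action];
        for (int j = 0; j < stateFeatures.length; j++) w[j] += lr * err * stateFeatures[j];
        w[w.length - 1] += lr * err;
    }
}
```

Only the model for the updated action changes; the other actions' Q-factors for the same state are left untouched, which mirrors keeping one network per action.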

http://web.mst.edu/~gosavia/neural_networks_RL.pdf