TD(λ) in Delphi/Pascal (Temporal Difference Learning)

Question

asked Aug 29, 2019 in AI and Deep Learning by ashely (50.2k points)

I have an artificial neural network that plays Tic-Tac-Toe - but it is not complete yet.

What I have so far:

The reward array "R[t]" with integer values for every time step or move "t" (1 = player A wins, 0 = draw, -1 = player B wins).
The input values are correctly propagated through the network.
The formula for adjusting the weights:

[Image: weight-update formula, not preserved]

What is missing:

the TD learning: I still need a procedure which "backpropagates" the network's errors using the TD(λ) algorithm.

But I don't understand this algorithm.

My approach so far ...

The trace decay parameter λ should be "0.1" as distal states should not get that much of the reward.

The learning rate is "0.5" in both layers (input and hidden).

It's a case of delayed reward: The reward remains "0" until the game ends. Then the reward becomes "1" for the first player's win, "-1" for the second player's win or "0" in case of a draw.

My questions:

How and when do you calculate the net's error (TD error)?
How can you implement the "backpropagation" of the error?
How are the weights adjusted using TD(λ)?

Thank you so much in advance :)

1 Answer

vinita · Answer 1 · 2019-08-29T08:32:02+0000

TD(λ) learns a mapping from each game state to the reward expected at the end of the game. As more games are played, states that more often lead to a win acquire higher expected-reward values.
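To make the three questions concrete, here is the usual textbook form of the TD(λ) update with eligibility traces. The notation is an assumption rather than something taken from the question: V(s) is the network's output for state s, w is the weight vector, α is the learning rate, γ is a discount factor (1 is a common choice for a short episodic game), and r_{t+1} is the reward received after move t, corresponding to an entry of the "R[t]" array.

```latex
\begin{align*}
\delta_t &= r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)
  && \text{(TD error, with } V(s_{\text{terminal}}) = 0\text{)} \\
e_t &= \gamma \lambda\, e_{t-1} + \nabla_{w} V(s_t)
  && \text{(eligibility trace, } e_{-1} = 0 \text{ at the start of each game)} \\
w &\leftarrow w + \alpha\, \delta_t\, e_t
  && \text{(weight update after every move)}
\end{align*}
```

Read this way, the TD error is computed after each move from two successive value estimates, the gradient of the output with respect to each weight is what ordinary backpropagation already provides, and each weight keeps its own trace that decays by γλ per move. With delayed reward, δ is driven only by the difference between successive predictions until the last move, when the final reward enters.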

For a simple game like tic-tac-toe, you're better off starting with a tabular mapping (just track an expected reward value for every possible game state). Then once you've got that working, you can try using a NN for the mapping instead.
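A minimal sketch of the tabular variant could look like the following. It is written in Python for brevity, although the loop maps directly onto Pascal arrays. The function name `td_lambda_episode`, the string state labels, and the choice of accumulating traces are illustrative assumptions, not something from the question. The defaults α = 0.5 and λ = 0.1 mirror the values mentioned in the question.

```python
from collections import defaultdict


def td_lambda_episode(values, states, final_reward, alpha=0.5, lam=0.1, gamma=1.0):
    """Update `values` in place from one finished game (hypothetical helper).

    values: defaultdict(float) mapping a board position to its expected reward.
    states: hashable board positions in the order they occurred.
    final_reward: 1, 0 or -1, received only on the last transition.
    """
    traces = defaultdict(float)  # one eligibility trace per visited state
    for t, state in enumerate(states):
        last = t == len(states) - 1
        reward = final_reward if last else 0.0          # delayed reward
        next_value = 0.0 if last else values[states[t + 1]]
        delta = reward + gamma * next_value - values[state]  # TD error
        traces[state] += 1.0                            # accumulating trace
        for s in list(traces):
            values[s] += alpha * delta * traces[s]      # credit earlier states
            traces[s] *= gamma * lam                    # decay the trace
    return values


# Example: one game that player A wins (state labels are made up).
values = defaultdict(float)
game = ["empty", "X-centre", "O-corner", "X-wins"]
td_lambda_episode(values, game, final_reward=1)
```

After this single game the last position has moved halfway toward the reward, and each earlier position receives a share that shrinks by a factor of λ per step back. This is the sense in which a small λ keeps distal states from getting much of the reward.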
