in AI and Deep Learning by (50.2k points)

I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.

- Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?

- What exactly is getting backpropagated when TD is used with a neural net? That is, where does the error that gets backpropagated come from when using TD?

1 Answer

by (108k points)

Temporal-Difference (TD) learning combined with function approximation can produce solutions that are worse than those obtained by Monte-Carlo regression, even in the simple case of on-policy evaluation.

To better understand this problem, we investigate how approximation errors in areas of sharp discontinuities of the value function are further propagated by bootstrap updates.

We show empirical evidence of this leakage propagation and show analytically that it must occur, in a simple Markov chain, when function approximation errors are present.
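To make the bootstrap update concrete, and to connect it to the questions above: the prediction V_t is simply the current network's output for state s_t, V_(t+1) is the same network's output for s_(t+1), and what gets backpropagated is the TD error delta_t = r_(t+1) + gamma * V_(t+1) - V_t, applied to the gradient of V_t only. Below is a minimal, illustrative sketch of one semi-gradient TD(0) step in PyTorch; the 4-dimensional state, network sizes, learning rate, and names such as value_net and td0_update are assumptions for illustration, not code from the paper.

import torch
import torch.nn as nn

# Illustrative setup: a 4-dimensional state and a small value network (all sizes are assumptions).
gamma = 0.99
value_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-3)

def td0_update(state, reward, next_state, done):
    """One semi-gradient TD(0) step on a single transition (s_t, r_{t+1}, s_{t+1})."""
    v_t = value_net(state)                     # prediction V_t = V(s_t; w), produced by the current network
    with torch.no_grad():                      # the bootstrap target is held fixed (no gradient through it)
        v_next = value_net(next_state)         # V_{t+1} = V(s_{t+1}; w), same network, next state
        target = reward + gamma * (1.0 - done) * v_next
    td_error = target - v_t                    # delta_t: the error that drives learning
    loss = 0.5 * td_error.pow(2).mean()        # d(loss)/dw = -delta_t * dV(s_t)/dw
    optimizer.zero_grad()
    loss.backward()                            # backpropagates delta_t through the value network
    optimizer.step()                           # w <- w + lr * delta_t * dV(s_t)/dw
    return td_error.item()

# Example call with made-up tensors:
s, s_next = torch.randn(4), torch.randn(4)
td0_update(s, reward=1.0, next_state=s_next, done=0.0)

Because the target itself is a network output (the bootstrap), any approximation error in V(s_(t+1)) is pulled into the update for V(s_t); this is exactly the leakage propagation discussed in the paragraphs above.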

For reversible policies, the result can be interpreted as a tension between two terms of the loss function that TD minimizes, as recently described. The resulting upper bounds hold, but they do not determine whether, or under what conditions, leakage propagation occurs.

Lastly, we examine whether the problem can be mitigated with a better state representation, and whether such a representation can be learned in an unsupervised manner, without rewards or privileged information.

For more information, refer to the following link: https://www.researchgate.net/publication/326290611_Temporal_Difference_Learning_with_Neural_Networks_-_Study_of_the_Leakage_Propagation_Problem

If you wish to learn the basics of Neural Networks, then visit this Neural Network Tutorial.
