in Machine Learning by (19k points)

I'm looking at this SARSA-Lambda implementation (i.e., SARSA with eligibility traces), and there's a detail I still don't get.

So I understand that all Q(s,a) are updated rather than only the one the agent has chosen for the given time-step. I also understand the E matrix is not reset at the start of each episode.

Let's assume for a minute that panel 3 of Figure 7.12 was the end-state of episode 1.

At the start of episode 2, the agent moves north instead of east, and let's assume this gives it a reward of -500. Wouldn't this also affect all the states that were visited in the previous episode?

If the idea is to reward those states that have been visited in the current episode, then why isn't the matrix containing all e(s,a) values reset at the beginning of each episode? With this implementation, it seems that states visited in the previous episode are 'punished' or 'rewarded' for actions the agent takes in the new episode.

1 Answer

by (33.1k points)

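The pseudocode never resets the e-matrix at the start of an episode, and as far as I can tell that is an error in the pseudocode: the e-matrix should be reinitialized between episodes. Otherwise, any traces left over from the previous episode are still nonzero when the first TD error of the new episode arrives, so state-action pairs the agent has not visited in this episode are updated by it, which is exactly the effect you describe. A well-cited paper on the subject confirms this.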

Here the eligibility traces are initialized to zero, and in episodic tasks, they can be reinitialized to zero after every episode.

The eligibility traces should be reset to zero at the start of each trial.
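
To make the effect concrete, here is a minimal sketch of tabular SARSA(λ) with accumulating traces, run once with the reset and once without. The function name `run_episode`, the transition tuple format, the state/action numbering, and the toy rewards are all illustrative assumptions, not taken from the pseudocode in the question.

```python
import numpy as np

def run_episode(Q, E, transitions, alpha=0.1, gamma=0.9, lam=0.9, reset_traces=True):
    """Apply tabular SARSA(lambda) updates for one episode, in place.

    transitions: list of (s, a, r, s_next, a_next, done) tuples
    (a hypothetical format chosen for this sketch).
    """
    if reset_traces:
        E[:] = 0.0  # the step the pseudocode in question leaves out
    for s, a, r, s_next, a_next, done in transitions:
        # TD error; no bootstrapping from a terminal state
        target = r if done else r + gamma * Q[s_next, a_next]
        delta = target - Q[s, a]
        E[s, a] += 1.0           # accumulating trace for the visited pair
        Q += alpha * delta * E   # every pair is updated in proportion to its trace
        E *= gamma * lam         # decay all traces
    return Q, E

n_states, n_actions = 4, 2       # action 0 = "east", action 1 = "north" (assumed)
episode_1 = [(0, 0, 0.0, 1, 0, False), (1, 0, 0.0, 2, 0, True)]
episode_2 = [(0, 1, -500.0, 3, 0, True)]  # agent goes north and is penalized

results = {}
for reset in (True, False):
    Q = np.zeros((n_states, n_actions))
    E = np.zeros_like(Q)
    run_episode(Q, E, episode_1, reset_traces=reset)
    run_episode(Q, E, episode_2, reset_traces=reset)
    results[reset] = Q.copy()

print("with reset:   ", results[True][0])
print("without reset:", results[False][0])
```

With the reset, only the pair taken in episode 2 (state 0, north) absorbs the -500. Without it, the pairs visited in episode 1 (state 0 east and state 1 east) are also pushed negative, even though the agent never took them in episode 2.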
