I'm looking at this SARSA(λ) implementation (i.e. SARSA with eligibility traces) and there's a detail that I still don't get.
So I understand that all Q(s,a) values are updated at each time-step, each weighted by its eligibility trace e(s,a), rather than only the one for the action the agent actually chose. I also understand that the E matrix is not reset at the start of each episode.
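To make sure I'm reading the mechanics correctly, here is a minimal tabular sketch of what I think one SARSA(λ) update step does with accumulating traces. The grid size, hyperparameters, and variable names are just placeholders I made up, not taken from the implementation I'm asking about:

```python
import numpy as np

n_states, n_actions = 48, 4            # hypothetical gridworld size
alpha, gamma, lam = 0.1, 1.0, 0.9      # hypothetical step size, discount, lambda

Q = np.zeros((n_states, n_actions))    # action values Q(s, a)
E = np.zeros((n_states, n_actions))    # eligibility traces e(s, a)

def sarsa_lambda_step(s, a, r, s_next, a_next):
    """One time-step: the single TD error is applied to every Q(s, a),
    scaled by that pair's eligibility trace."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    E[s, a] += 1.0                     # accumulate the trace of the visited pair
    Q[...] += alpha * delta * E        # every entry moves, weighted by its trace
    E[...] *= gamma * lam              # all traces decay towards zero
```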
Let's assume for a minute that panel 3 of Figure 7.12 was the end-state of episode 1.
At the start of episode 2, the agent moves north instead of east, and let's assume this gives it a reward of -500. Wouldn't this also affect all the states that were visited in the previous episode?
If the idea is to reward the states that have been visited in the current episode, then why isn't the matrix containing all the e(s,a) values reset at the beginning of each episode? With this implementation it seems that states visited in the previous episode get 'punished' or 'rewarded' for actions the agent takes in the new episode.
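Concretely, this is roughly the episode loop I would have expected, continuing the sketch above, with the reset I'm asking about marked. `env` and `epsilon_greedy` are hypothetical stand-ins, not taken from the implementation I'm reading:

```python
for episode in range(100):
    E[...] = 0.0                       # <-- reset e(s, a) here?
    s = env.reset()
    a = epsilon_greedy(Q, s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next)
        sarsa_lambda_step(s, a, r, s_next, a_next)
        s, a = s_next, a_next
```

Without that reset, a large negative reward like the -500 above would still be propagated (scaled by the decayed traces) into Q(s,a) pairs from episode 1, which is exactly the behaviour that confuses me.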