Talk:SARSA
Date
When did this algorithm get invented? XApple 19:46, 7 May 2007 (UTC)
- First published in 1994, added info. 220.253.135.178 16:50, 21 May 2007 (UTC)
- Hey, thanks a lot for contributing to Wikipedia! XApple 23:05, 27 May 2007 (UTC)
Updates
For updates, SARSA uses the next action actually chosen, not the best next action, so the update reflects the value of the last state/action pair under the current policy. If you update with the best next action instead, you end up with Watkins's Q-Learning, to which SARSA was proposed as an alternative. Updating with the value of the best next action (Watkins's Q-Learning) can over-estimate values, because the control method will not pick that action every time (it has to balance exploration and exploitation). A comparison between Q-Learning and SARSA, perhaps the Cliff Walking example from Rich Sutton's 'Reinforcement Learning: An Introduction' (1998), may be useful to clarify the differences and the resulting behaviour --131.217.6.6 08:17, 29 May 2007 (UTC)
This is the update rule presented in Q-Learning:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
SARSA:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
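To make the difference concrete, here is a minimal Python sketch of the two tabular updates (the function names, the dict-based Q table, and the ε-greedy behaviour policy are illustrative choices, not from the article): the SARSA target bootstraps from the action the policy actually picks in the next state, while the Q-Learning target bootstraps from the max over actions, even when exploration means that action will not be taken.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA target uses the action a_next the policy actually chose in s_next (on-policy)."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-Learning target uses the best action in s_next, whether or not it gets taken (off-policy)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical single transition, just to show how the two updates are called.
Q = defaultdict(float)                      # action values, keyed by (state, action)
actions = ["left", "right"]
a = epsilon_greedy(Q, "s0", actions)        # action taken in state s0
r, s_next = 1.0, "s1"                       # observed reward and next state
a_next = epsilon_greedy(Q, "s1", actions)   # action the policy chooses in s1
sarsa_update(Q, "s0", a, r, s_next, a_next)
q_learning_update(Q, "s0", a, r, s_next, actions)
```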
Uses "backpropagation"? updates previous Q entry with future reward? Dspattison (talk) 19:20, 19 March 2008 (UTC)

