STD(λ): learning state differences with TD(λ)
BibTeX
@MISC{Weaver_std(λ):learning,
author = {Lex Weaver and Jonathan Baxter},
title = {STD(λ): learning state differences with TD(λ)},
year = {}
}
Abstract
TD(λ) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD(λ) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is important, rather than error in the state values. We illustrate this point with a simple two-state system in which TD(λ) abandons the optimal policy to converge to a suboptimal policy. We also observe this trait of policy degradation by TD(λ) in backgammon. We then present a modified form of TD(λ), called STD(λ), in which function approximators are trained with respect to relative state values. We characterise the limiting behaviour of this algorithm, and present a theorem guaranteeing optimality of the limiting parameter vector for two-state BMDPs. A comparison with Bertsekas' differential training method is also presented, which highlights a significant difference between the algorithms. This is followed by successful demonstrations of STD(λ) on the two-state system and the well-known acrobot problem.
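The abstract's central observation, that a value estimate with small squared error can still rank states incorrectly, can be illustrated with a minimal sketch. This is not the authors' two-state system and not the STD(λ) algorithm; the numbers and the helper names `squared_error` and `greedy_choice` are hypothetical, chosen only to show the effect.

```python
# Illustrative toy only: all values below are made-up assumptions,
# not taken from the paper's two-state system.

def squared_error(approx, true):
    """Sum of squared differences between approximate and true state values."""
    return sum((a - t) ** 2 for a, t in zip(approx, true))

def greedy_choice(values):
    """Index of the state a greedy policy would prefer (highest value)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical true values: state 0 is better than state 1.
true_values = [1.0, 0.9]

# Close to the true values, but the ordering of the two states is flipped.
approx_low_error = [0.95, 1.0]

# Far from the true values, but the ordering is preserved.
approx_high_error = [0.5, 0.4]

for name, approx in [("low error", approx_low_error),
                     ("high error", approx_high_error)]:
    same_choice = greedy_choice(approx) == greedy_choice(true_values)
    print(name, squared_error(approx, true_values), same_choice)
```

Under these assumed numbers, the estimate with the smaller squared error leads a greedy policy to the wrong state, while the estimate with the larger error preserves the correct choice. This is the sense in which relative ordering, rather than absolute value error, matters for policy.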







