  BACKWARDS

by Justin A. Boyan, Andrew W. Moore
http://www.ius.cs.cmu.edu/afs/cs/project/reinforcement/papers/boyan.acyclic.ps

Abstract:

Some of the most successful recent applications of reinforcement learning have used neural networks and the TD(λ) algorithm to learn evaluation functions. In this paper, we examine the intuition that TD(λ) operates by approximating asynchronous value iteration. We note that on the important subclass of acyclic tasks, value iteration is inefficient compared with another graph algorithm, DAG-SP, which assigns values to states by working strictly backwards from the goal. We then present ROUT, an algorithm analogous to DAG-SP that can be used in large stochastic state spaces requiring function approximation. We close by comparing the behavior of ROUT and TD on a simple example domain and on two domains with much larger state spaces.
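
To make the "working strictly backwards from the goal" idea concrete, here is a minimal sketch of shortest-path value assignment on a small deterministic acyclic graph, in the spirit of the textbook DAG shortest-paths procedure. It is not the paper's ROUT algorithm and uses no function approximation; the function name `dag_sp_values`, the graph encoding, and the example states are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: state -> list of (successor, step cost).
Graph = Dict[str, List[Tuple[str, float]]]


def dag_sp_values(graph: Graph, goal: str) -> Dict[str, float]:
    """Cost-to-go of every state in a deterministic acyclic graph.

    Assumes the goal is terminal (no outgoing edges that matter) and
    that the graph really is acyclic; neither is checked here.
    """
    order: List[str] = []
    visited = set()

    def visit(state: str) -> None:
        # Post-order DFS: a state is emitted only after all of its
        # successors, so `order` lists states from the goal backwards.
        if state in visited:
            return
        visited.add(state)
        for successor, _ in graph.get(state, []):
            visit(successor)
        order.append(state)

    for state in graph:
        visit(state)

    values: Dict[str, float] = {}
    for state in order:
        if state == goal:
            values[state] = 0.0
        else:
            # Every successor already has its final value, so one backup
            # per state suffices. Dead ends get infinite cost.
            values[state] = min(
                (cost + values[successor]
                 for successor, cost in graph.get(state, [])),
                default=float("inf"),
            )
    return values


example_graph: Graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("goal", 5)],
    "c": [("goal", 1)],
    "goal": [],
}
example_values = dag_sp_values(example_graph, "goal")
print(example_values)
```

Each state is backed up exactly once, after all of its successors are final. Value iteration, by contrast, may sweep over the states repeatedly before the values propagated from the goal settle, which is the inefficiency the abstract refers to.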

Citations

5825 Introduction to Algorithms – Cormen, Leiserson, et al. - 1992
1399 Dynamic Programming – Bellman - 1957
885 Learning to predict by the methods of temporal differences – Sutton - 1988
287 Practical issues in temporal difference learning – Tesauro - 1992
215 Improving elevator performance using reinforcement learning – Crites, Barto - 1996
170 Generalization in reinforcement learning: Safely approximating the value function – Boyan, Moore - 1995
147 Learning to predict by the methods of temporal differences – Sutton - 1988
125 Stable function approximation in dynamic programming – Gordon - 1995
117 An analysis of temporal-difference learning with function approximation – Tsitsiklis
111 Bandit Problems: Sequential Allocation of Experiments – Berry, Fristedt - 1985
109 Real-time learning and control using asynchronous dynamic programming – Barto, Bradtke, et al. - 1991
90 Technical Note: Q-Learning – Watkins, Dayan - 1992
80 A Reinforcement Learning Approach to Job-shop Scheduling – Zhang, Dietterich - 1995
64 The convergence of TD(λ) for general λ – Dayan - 1992
60 Practical issues in temporal difference learning – Tesauro - 1992
30 A generalized reinforcement-learning model: Convergence and applications – Littman, Szepesvári - 1996
16 A counterexample to temporal differences learning – Bertsekas - 1995
16 An analysis of temporal-difference learning with function approximation – Tsitsiklis, Roy - 1997
5 Q-learning for bandit problems – Duff - 1995
5 A counterexample to temporal differences learning – Bertsekas - 1995
2 Q-learning for bandit problems – Duff - 1995