by Justin A. Boyan, Andrew W. Moore
http://www.ius.cs.cmu.edu/afs/cs/project/reinforcement/papers/boyan.acyclic.ps
Abstract:
Some of the most successful recent applications of reinforcement learning have used neural networks and the TD(λ) algorithm to learn evaluation functions. In this paper, we examine the intuition that TD(λ) operates by approximating asynchronous value iteration. We note that on the important subclass of acyclic tasks, value iteration is inefficient compared with another graph algorithm, DAG-SP, which assigns values to states by working strictly backwards from the goal. We then present ROUT, an algorithm analogous to DAG-SP that can be used in large stochastic state spaces requiring function approximation. We close by comparing the behavior of ROUT and TD on a simple example domain and on two domains with much larger state spaces.
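
The idea of assigning values by working strictly backwards from the goal can be illustrated with a minimal sketch. It assumes a small deterministic acyclic task with additive step costs; the function name `dag_sp`, the `edges` layout, and the toy graph are hypothetical and are not taken from the paper. Each state is visited once, after all of its successors, so its value is final as soon as it is computed, whereas value iteration may sweep the same states repeatedly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dag_sp(edges, goal):
    """Cost-to-go of every state in a deterministic acyclic task.

    edges maps state -> {successor: step_cost} (hypothetical layout).
    """
    # TopologicalSorter expects node -> predecessors. Passing successors
    # instead makes static_order() emit a state only after all of its
    # successors, which is the backwards-from-the-goal order we want.
    sorter = TopologicalSorter({s: list(succ) for s, succ in edges.items()})
    values = {}
    for state in sorter.static_order():
        if state == goal:
            values[state] = 0.0
            continue
        successors = edges.get(state, {})
        # One backup per state: successor values are already final.
        values[state] = min(
            (cost + values[nxt] for nxt, cost in successors.items()),
            default=float("inf"),  # dead end that cannot reach the goal
        )
    return values


# Toy acyclic task (assumed for illustration only).
edges = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 1.0, "G": 5.0},
    "C": {"G": 1.0},
    "G": {},
}
values = dag_sp(edges, "G")
print(values)
```

This sketch covers only the exact, tabular, deterministic case; the stochastic, function-approximation setting that ROUT targets is not represented here.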