We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
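To make the horizon time in the discounted case concrete, here is a minimal sketch. The abstract does not define T precisely, so the definition used here is an assumption: T is the smallest number of steps after which the remaining discounted return can change the total by at most epsilon, given rewards bounded in [0, r_max]. The function name `horizon_time` and the parameters `gamma`, `epsilon`, and `r_max` are hypothetical names chosen for illustration.

```python
import math


def horizon_time(gamma: float, epsilon: float, r_max: float = 1.0) -> int:
    """Smallest T with gamma**T * r_max / (1 - gamma) <= epsilon.

    Assumed setting (not taken from the abstract): rewards lie in
    [0, r_max], so the discounted return collected after step T is at
    most gamma**T * r_max / (1 - gamma).
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if epsilon <= 0.0 or r_max <= 0.0:
        raise ValueError("epsilon and r_max must be positive")

    # Largest possible discounted return over an infinite horizon.
    max_return = r_max / (1.0 - gamma)
    if epsilon >= max_return:
        return 0  # even the whole return is within the tolerance

    # Solve gamma**T * max_return <= epsilon for the smallest integer T.
    return math.ceil(math.log(max_return / epsilon) / math.log(1.0 / gamma))


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, horizon_time(gamma, epsilon=0.1))
```

Under this assumed definition, T grows roughly like 1/(1 - gamma) up to a logarithmic factor, which is why a bound polynomial in T becomes more demanding as the discount factor approaches 1.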