  Near-optimal reinforcement learning in polynomial time (1998) [88 citations — 1 self]

by Michael Kearns, Satinder Singh
Machine Learning
http://deas.harvard.edu/courses/cs281r/Readings/ks.reinforcement.ml98.ps

Abstract:

We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
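
To give a feel for the "horizon time T" in the discounted case, here is a minimal sketch based on the standard geometric-series argument, not on anything stated in the abstract itself. It assumes rewards lie in [0, r_max] and a discount factor gamma in (0, 1), so the return ignored after T steps is at most gamma^T * r_max / (1 - gamma). The function names `tail_bound` and `horizon_time`, and the parameters `epsilon` and `r_max`, are hypothetical and chosen only for illustration.

```python
import math


def tail_bound(gamma: float, t: int, r_max: float = 1.0) -> float:
    """Upper bound on the discounted return collected after step t.

    Assumes every reward lies in [0, r_max]; the bound is the geometric
    tail sum gamma**t * r_max / (1 - gamma).
    """
    return gamma ** t * r_max / (1.0 - gamma)


def horizon_time(gamma: float, epsilon: float, r_max: float = 1.0) -> int:
    """Smallest T for which the tail bound is at most epsilon.

    This is an illustrative calculation under the assumptions above,
    not the paper's definition or algorithm.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if epsilon <= 0.0 or r_max <= 0.0:
        raise ValueError("epsilon and r_max must be positive")
    full_return_bound = r_max / (1.0 - gamma)
    if full_return_bound <= epsilon:
        return 0  # even the whole return is within epsilon
    # Solve gamma**T * r_max / (1 - gamma) <= epsilon for T.
    return math.ceil(math.log(full_return_bound / epsilon) / math.log(1.0 / gamma))


if __name__ == "__main__":
    for gamma in (0.9, 0.99):
        print(gamma, horizon_time(gamma, epsilon=0.01))
```

Under these assumptions T grows roughly like 1 / (1 - gamma) times a logarithmic factor, which is why bounds that are polynomial in T remain meaningful as gamma approaches 1.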

Citations

1673 Reinforcement learning: An introduction – Sutton, Barto - 1998
938 Learning from Delayed Rewards – Watkins - 1989
885 Learning to Predict by the Methods of Temporal Differences – Sutton - 1988
548 Markov decision processes: discrete stochastic dynamic programming – Puterman - 1994
293 Dynamic Programming: Deterministic and Stochastic Models – Bertsekas - 1987
292 Nonlinear Programming – Bertsekas - 1995
239 Prioritized sweeping: Reinforcement learning with less data and less real time – Moore, Atkeson - 1993
239 Generalization in reinforcement learning: Successful examples using sparse tile coding – Sutton - 1996
175 Learning Policies for Partially Observable Environments: Scaling Up – Littman, Cassandra, et al. - 1995
162 On the convergence of stochastic iterative dynamic programming algorithms – Jaakkola, Jordan, et al. - 1994
156 On-line Q-learning using connectionist systems – Rummery, Niranjan - 1994
155 Reinforcement learning with perceptual aliasing: The Perceptual Distinctions Approach – Chrisman - 1992
147 Learning to predict by the methods of temporal differences – Sutton - 1988
143 Algorithms for random generation and counting: A Markov chain approach – Sinclair - 1993
136 Reinforcement learning with replacing eligibility traces – Singh, Sutton - 1996
125 Stable function approximation in dynamic programming – Gordon - 1995
122 Reinforcement Learning Algorithm for Partially Observable – Jaakkola, Singh, et al. - 1994
108 The role of exploration in learning control – Thrun - 1992
105 Asynchronous stochastic approximation and Q-learning – Tsitsiklis - 1994
76 Reinforcement learning with soft state aggregation – Singh, Jaakkola, et al. - 1995
60 Convergence results for single-step on-policy reinforcement learning algorithms – Singh, Jaakkola, et al. - 1998
28 Efficient Reinforcement Learning in Factored MDPs – Kearns, Koller - 1999
27 Sequential decision problems and neural networks – Barto, Sutton, et al. - 1990
25 Efficient reinforcement learning – Fiechter - 1994
24 Convergence of indirect adaptive asynchronous value iteration algorithms – Gullapalli, Barto - 1994
19 Analytical mean squared error curves in temporal difference learning – Singh, Dayan - 1998
11 Expected mistake bound model for online reinforcement learning – Fiechter - 1997
8 On the worst-case analysis of temporal-difference learning algorithms – Schapire, Warmuth - 1994
3 A distributed asynchronous algorithm for expected average cost dynamic programming – Jalali, Ferguson - 1990
2 Learning curve bounds for Markov decision processes with undiscounted rewards – Saul, Singh - 1996
1 On the worst-case analysis of temporal-difference learning algorithms – Schapire - 1994
1 Feature-based methods for large scale dynamic programming – Roy - 1996