Incremental Natural Actor-Critic Algorithms

by Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, Mark Lee
Citations: 71 (8 self)

Citations

5462 Reinforcement Learning: An Introduction - Sutton, Barto - 1998

Citation Context

... the use of eligibility traces but we believe the extension to that case would be straightforward. 2 The Policy Gradient Framework We consider the standard reinforcement learning framework (e.g., see Sutton & Barto, 1998), in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time Markov decision process. The state, action, and reward at each time t ∈ {0, 1, 2...
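
For readers skimming this excerpt, the quantity being optimized in that framework is the long-run average reward. A standard way to write it (assuming an ergodic chain with stationary state distribution d^π, consistent with the notation used in the excerpts below) is:

    J(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
             \mathbb{E}\!\left[\, \sum_{t=0}^{T-1} r_{t+1} \;\middle|\; \pi \right]
           = \sum_{s \in S} d^{\pi}(s) \sum_{a \in A} \pi(a \mid s)\, r(s, a)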

1849 Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994
1488 Learning to predict by the methods of temporal differences - Sutton - 1988

Citation Context

...A variety of methods can be used to solve the prediction problem, but the ones that have proved most effective in large applications are those based on some form of temporal difference (TD) learning (Sutton, 1988) in which estimates are updated on the basis of other estimates. Such bootstrapping methods can be viewed as a way of accelerating learning by trading bias for variance. Actor-critic methods were amo...
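
As a concrete illustration of "estimates updated on the basis of other estimates", a minimal TD(0) prediction step with linear function approximation might look like the sketch below. The discounted formulation, step size, and variable names are illustrative assumptions (the paper itself works in the average-reward setting).

    import numpy as np

    def td0_update(v, phi_s, phi_s_next, reward, alpha=0.1, gamma=0.99):
        """One TD(0) step for a linear value estimate V(s) ~= v . phi(s).

        The target bootstraps on the current estimate at the next state,
        which is what trades bias for variance.
        """
        delta = reward + gamma * np.dot(v, phi_s_next) - np.dot(v, phi_s)
        return v + alpha * delta * phi_s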

712 Dynamic programming and optimal control. Athena Scientific - Bertsekas - 1995
526 Adaptive Algorithms and Stochastic Approximation. - Benveniste, Metivier, et al. - 1990
484 Stochastic Approximation Algorithms and Applications - Kushner, Yin - 1997
477 Temporal difference learning and TD-gammon - Tesauro - 1995
440 Simple statistical gradient-following algorithms for connectionist reinforcement learning - Williams - 1992
430 Generalization in reinforcement learning: Successful examples using sparse coarse coding - Sutton - 1996
427 Policy gradient methods for reinforcement learning with function approximation - Sutton, McAllester, et al. - 2000
378 Online Q-learning using connectionist systems (Technical Report) - Rummery, Niranjan - 1994
364 Stochastic approximation methods for constrained and unconstrained systems - Kushner, Clark - 1978
311 An Analysis of Temporal-Difference Learning with Function Approximation - Tsitsiklis, Roy - 1997
307 Generalization in reinforcement learning: Safely approximating the value function - Munos, Boyan, et al. - 1995
305 Residual algorithms: Reinforcement learning with function approximation - Baird - 1995
271 Temporal Credit Assignment in Reinforcement Learning - Sutton - 1984

Citation Context

...ng methods can be viewed as a way of accelerating learning by trading bias for variance. Actor-critic methods were among the earliest to be investigated in reinforcement learning (Barto et al., 1983; Sutton, 1984). They were largely supplanted in the 1990’s by methods that estimate action-value functions and use them directly to select actions without an explicit policy structure. This approach was appealing ...
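
To make the actor/critic split concrete (an explicit policy plus a learned value function, as opposed to acting directly from action values), a minimal incremental update in the average-reward setting is sketched below. It follows the general structure described in these excerpts, but the step sizes and variable names are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np

    def actor_critic_step(v, theta, avg_reward, phi_s, phi_s_next, psi,
                          reward, alpha=0.01, beta=0.1, xi=0.01):
        """One incremental actor-critic step (average-reward flavour).

        v          -- critic weights for linear value features phi(s)
        theta      -- actor (policy) parameters
        psi        -- score vector grad_theta log pi_theta(a|s) of the taken action
        avg_reward -- running estimate of the average reward
        """
        avg_reward = avg_reward + xi * (reward - avg_reward)
        delta = reward - avg_reward + np.dot(v, phi_s_next) - np.dot(v, phi_s)
        v = v + beta * delta * phi_s          # critic: bootstrapped TD update
        theta = theta + alpha * delta * psi   # actor: gradient step scaled by the TD error
        return v, theta, avg_reward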

261 Stable function approximation in dynamic programming - Gordon - 1995
254 Linear least-squares algorithms for temporal difference learning. Machine Learning - Bradtke, Barto - 1996
240 Actor-Critic Algorithms - Konda, Tsitsiklis - 2003

Citation Context

...re to converge. These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high variance of their gradient estimates. Combining them ...
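
The "high variance of their gradient estimates" refers to Monte-Carlo (non-bootstrapped) likelihood-ratio estimators. A minimal sketch, assuming a hypothetical policy.grad_log_prob(s, a) helper that returns the score vector grad_theta log pi_theta(a|s):

    import numpy as np

    def reinforce_gradient(policy, trajectories):
        """Monte-Carlo policy-gradient estimate from complete trajectories.

        Each trajectory is a (states, actions, rewards) triple. Because the
        whole return multiplies the summed score, the estimate has high
        variance, which is why these methods converge slowly.
        """
        grads = []
        for states, actions, rewards in trajectories:
            ret = float(np.sum(rewards))
            score = sum(policy.grad_log_prob(s, a)
                        for s, a in zip(states, actions))
            grads.append(ret * score)
        return np.mean(grads, axis=0)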

205 Infinite-horizon policy-gradient estimation - Baxter, Bartlett - 2001

Citation Context

...lems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high variance of their gradient estimates. Combining them with bootstrapping is a p...

203 Neuronlike adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983

Citation Context

...es. Such bootstrapping methods can be viewed as a way of accelerating learning by trading bias for variance. Actor-critic methods were among the earliest to be investigated in reinforcement learning (Barto et al., 1983; Sutton, 1984). They were largely supplanted in the 1990’s by methods that estimate action-value functions and use them directly to select actions without an explicit policy structure. This approach ...

170 Control Techniques for Complex Networks - Meyn
168 Inverted autonomous helicopter flight via reinforcement learning - Ng, Coates, et al. - 2004
145 Policy gradient reinforcement learning for fast quadrupedal locomotion - Kohl, Stone - 2004
142 A Natural Policy Gradient - Kakade - 2001
131 Reinforcement learning for humanoid robotics - Peters, Vijayakumar, et al. - 2003
123 Convergent activation dynamics in continuous time neural networks, Neural Networks 2 - Hirsch - 1989
122 Reinforcement learning of motor skills with policy gradients - Peters, Schaal - 2008
116 Least-squares temporal difference learning - Boyan - 1999
104 Natural Gradient Works Efficiently - Amari - 1998
103 Simulation-based optimization of Markov reward processes - Marbach, Tsitsiklis - 1998
98 Elevator group control using multiple reinforcement learning agents - Crites, Barto - 1998
96 The O.D.E. method for convergence of stochastic approximation and reinforcement learning - Borkar, Meyn
96 Numerical Dynamic Programming in Economics - Rust - 1996
89 Natural actor-critic - Peters, Vijayakumar, et al. - 2005
85 Stochastic approximation with two time scales - Borkar - 1997
79 Experiments with Infinite-Horizon, Policy-Gradient Estimation - Baxter, Bartlett, et al. - 2001
74 Likelihood ratio gradient estimation for stochastic systems - Glynn - 1990
62 Covariant policy search - Bagnell, Schneider - 2003
59 Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability - Pemantle - 1990
58 An optimal one-way multigrid algorithm for discretetime stochastic control - Chow, Tsitsiklis - 1991
57 Variance reduction techniques for gradient estimates in reinforcement learning - Greensmith, Bartlett, et al. - 2002

Citation Context

... of the average reward can be written as ∇J(π) = Σ_{s∈S} d^π(s) Σ_{a∈A} ∇π(a|s) (Q^π(s, a) ± b(s)). (3) The baseline can be chosen in such a way that the variance of the gradient estimates is minimized (Greensmith et al., 2004). The natural gradient, denoted ˜∇J(π), can be calculated by linearly transforming the regular gradient, using the inverse Fisher information matrix of the policy: ˜∇J(π) = G⁻¹(θ)∇J(π). The Fishe...
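
To make the natural-gradient transformation concrete, a sample-based sketch is given below: G(θ) is estimated from the score vectors of sampled state-action pairs and the resulting linear system is solved. policy.grad_log_prob is the same hypothetical helper as above, and the small ridge term is a numerical safeguard added here, not something taken from the paper.

    import numpy as np

    def natural_gradient(policy, samples, grad_J, ridge=1e-6):
        """Estimate the natural gradient G(theta)^{-1} grad J(theta).

        samples -- list of (state, action) pairs drawn under the current policy
        grad_J  -- ordinary policy-gradient estimate (1-D array)
        """
        d = grad_J.shape[0]
        G = np.zeros((d, d))
        for s, a in samples:
            psi = policy.grad_log_prob(s, a)   # score: grad_theta log pi(a|s)
            G += np.outer(psi, psi)
        G /= len(samples)
        # Solve G x = grad_J rather than forming an explicit inverse.
        return np.linalg.solve(G + ridge * np.eye(d), grad_J)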

53 Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences - Bhatnagar, Fu, et al. - 2003
51 Advantage updating - Baird - 1993
51 Perturbation realization, potentials and sensitivity analysis of Markov processes - Cao, Chen - 1997
45 A simulated annealing algorithm with constant temperature for discrete stochastic optimization - Alrefaei, Andradóttir - 1999
42 Functional approximations and dynamic programming - Bellman, Dreyfus - 1959
42 KnightCap: A Chess Program That Learns by Combining TD(lambda) with Game-Tree Search - Baxter, Tridgell, et al. - 1998
38 A survey of applications of Markov decision processes - White - 1993
32 Improved Temporal Difference Methods with Linear Function Approximation - Bertsekas, Borkar, et al. - 2003
28 Simulation-Based Methods for Markov Decision Processes - Marbach - 1998

Citation Context

...ties including in some cases a failure to converge. These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high varian...

28 Stability by Lyapunov’s Direct Method with Applications - Lasalle, Lefschetz - 1961
25 Bayesian actor-critic algorithms - Ghavamzadeh, Yaakov - 2007
25 Analytical Mean Squared Error Curves for Temporal Difference Learning - Singh, Dayan - 1998
20 Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization - Bhatnagar - 2005
18 Natural actor-critic for road traffic optimisation - Richter, Aberdeen, et al. - 2007
16 Splines and efficiency in dynamic programming - Daniel - 1976
14 Actor–critic like learning algorithms for Markov decision processes - Konda, Borkar - 1999
12 Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization - Bhatnagar - 2007
10 Some pathological traps for stochastic approximation - Brandiere - 1998
5 Reinforcement learning based algorithms for average cost Markov decision processes. Discrete Event Dynamic Systems: Theory and Applications - Abdulla, Bhatnagar - 2007
5 Average cost temporal-difference learning. Automatica - Tsitsiklis, Roy - 1999
3 Reinforcement learning – a bridge between numerical methods and Monte-Carlo - Borkar - 2008
3 Information theoretic justification of Boltzmann selection and its generalization to Tsallis case - Dukkipati, Murty, et al. - 2005
2 Learning Algorithms for Markov Decision Processes with Average Cost - Abounadi, Bertsekas, et al. - 2001
2 Asynchronous stochastic approximation and Q-learning - Tsitsiklis - 1994
1 On the Convergence of Temporal Difference Learning with Linear Function Approximation. Machine Learning 42(3):241–267 - Tadic - 2001