
## Incremental Natural Actor-Critic Algorithms


### Download Links

- [www.cs.ualberta.ca]
- [webdocs.cs.ualberta.ca]
- [www-anw.cs.umass.edu]
- [books.nips.cc]
- [incompleteideas.net]
- [chercheurs.lille.inria.fr]
- [hal.univ-lille3.fr]
- [hal.inria.fr]
- DBLP

### Other Repositories/Bibliography

Citations: 71 (8 self)

### Citations

5462 | Reinforcement Learning: An Introduction - Sutton, Barto - 1998
Citation Context: ... the use of eligibility traces but we believe the extension to that case would be straightforward. 2 The Policy Gradient Framework We consider the standard reinforcement learning framework (e.g., see Sutton & Barto, 1998), in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time Markov decision process. The state, action, and reward at each time t ∈ {0, 1, 2... |
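The agent-environment interaction described in this context (a discrete-time MDP with a state, action, and reward at each time t) can be sketched as a simple simulation loop. The function names and the toy two-state chain below are hypothetical illustrations, not details from the cited paper:

```python
import random

def rollout(policy, transition, reward, s0, T=100):
    """Simulate T steps of agent-environment interaction in a discrete-time MDP.

    policy(s) samples an action a_t, transition(s, a) samples the next state,
    and reward(s, a, s_next) returns the scalar reward r_t.
    """
    s, trajectory = s0, []
    for t in range(T):
        a = policy(s)
        s_next = transition(s, a)
        r = reward(s, a, s_next)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

# Toy two-state chain: action 0 stays in place, action 1 flips the state,
# and being in state 1 yields reward 1.
traj = rollout(policy=lambda s: random.choice([0, 1]),
               transition=lambda s, a: s ^ a,
               reward=lambda s, a, s2: 1.0 if s2 == 1 else 0.0,
               s0=0, T=10)
```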

1849 | Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994 |

1488 | Learning to predict by the methods of temporal differences - Sutton - 1988
Citation Context: ...A variety of methods can be used to solve the prediction problem, but the ones that have proved most effective in large applications are those based on some form of temporal difference (TD) learning (Sutton, 1988) in which estimates are updated on the basis of other estimates. Such bootstrapping methods can be viewed as a way of accelerating learning by trading bias for variance. Actor-critic methods were amo... |
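The bootstrapping idea described in this context — estimates updated on the basis of other estimates — is exactly what the tabular TD(0) rule implements: the target for V[s] uses the current estimate of the next state rather than a full Monte-Carlo return. A minimal sketch (the step-size and discount values are arbitrary illustrations):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next].

    Because the target contains the current estimate V[s_next], the update
    trades bias (from an inaccurate bootstrap target) for lower variance
    than a full Monte-Carlo return.
    """
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

V = np.zeros(5)                       # tabular value estimates for 5 states
td0_update(V, s=0, r=1.0, s_next=1)   # V[0] moves toward 1.0 + 0.95 * V[1]
```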

712 | Dynamic Programming and Optimal Control. Athena Scientific - Bertsekas - 1995 |

526 | Adaptive Algorithms and Stochastic Approximation. - Benveniste, Metivier, et al. - 1990 |

484 | Stochastic Approximation Algorithms and Applications - Kushner, Yin - 1997 |

477 | Temporal difference learning and TD-Gammon - Tesauro - 1995 |

440 | Simple statistical gradient-following algorithms for connectionist reinforcement learning - Williams - 1992 |

430 | Generalization in reinforcement learning: Successful examples using sparse coarse coding - Sutton - 1996 |

427 | Policy gradient methods for reinforcement learning with function approximation - Sutton, McAllester, et al. - 2000 |

378 | Online Q-learning using connectionist systems (Tech - Rummery, Niranjan - 1994 |

364 | Stochastic approximation methods for constrained and unconstrained systems - Kushner, Clark - 1978 |

311 | An Analysis of Temporal-Difference Learning with Function Approximation - Tsitsiklis, Van Roy - 1997 |

307 | Generalization in reinforcement learning: Safely approximating the value function - Munos, Boyan, et al. - 1995 |

305 | Residual algorithms : Reinforcement learning with function approximation - Baird - 1995 |

271 | Temporal Credit Assignment in Reinforcement Learning - Sutton - 1984
Citation Context: ...ng methods can be viewed as a way of accelerating learning by trading bias for variance. Actor-critic methods were among the earliest to be investigated in reinforcement learning (Barto et al., 1983; Sutton, 1984). They were largely supplanted in the 1990s by methods that estimate action-value functions and use them directly to select actions without an explicit policy structure. This approach was appealing ... |

261 | Stable function approximation in dynamic programming - Gordon - 1995 |

254 | Linear least-squares algorithms for temporal difference learning. Machine Learning - Bradtke, Barto - 1996 |

240 | Actor-Critic Algorithms - Konda, Tsitsiklis - 2003
Citation Context: ...re to converge. These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high variance of their gradient estimates. Combining them ... |

205 | Infinite-horizon policy-gradient estimation - Baxter, Bartlett - 2001
Citation Context: ...lems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high variance of their gradient estimates. Combining them with bootstrapping is a p... |

203 | Neuronlike elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983
Citation Context: ...es. Such bootstrapping methods can be viewed as a way of accelerating learning by trading bias for variance. Actor-critic methods were among the earliest to be investigated in reinforcement learning (Barto et al., 1983; Sutton, 1984). They were largely supplanted in the 1990s by methods that estimate action-value functions and use them directly to select actions without an explicit policy structure. This approach ... |

170 | Control Techniques for Complex Networks - Meyn |

168 | Inverted autonomous helicopter flight via reinforcement learning - Ng, Coates, et al. - 2004 |

145 | Policy gradient reinforcement learning for fast quadrupedal locomotion - Kohl, Stone - 2004 |

142 | A Natural Policy Gradient - Kakade - 2001 |

131 | Reinforcement learning for humanoid robotics - Peters, Vijayakumar, et al. - 2003 |

123 | Convergent activation dynamics in continuous time neural networks, Neural Networks 2 - Hirsch - 1989 |

122 | Reinforcement learning of motor skills with policy gradients - Peters, Schaal - 2008 |

116 | Least-squares temporal difference learning - Boyan - 1999 |

104 | Natural Gradient Works Efficiently in Learning - Amari - 1998 |

103 | Simulation-based optimization of Markov reward processes - Marbach, Tsitsiklis - 1998 |

98 | Elevator group control using multiple reinforcement learning agents - Crites, Barto - 1998 |

96 | The O.D.E. method for convergence of stochastic approximation and reinforcement learning - Borkar, Meyn |

96 | Numerical Dynamic Programming in Economics - Rust - 1996 |

89 | Natural actor-critic - Peters, Vijayakumar, et al. - 2005 |

85 | Stochastic approximation with two time scales - Borkar - 1997 |

79 | Experiments with Infinite-Horizon, Policy-Gradient Estimation - Baxter, Bartlett, et al. - 2001 |

74 | Likelihood ratio gradient estimation for stochastic systems - Glynn - 1990 |

62 | Covariant policy search - Bagnell, Schneider - 2003 |

59 | Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability - Pemantle - 1990 |

58 | An optimal one-way multigrid algorithm for discretetime stochastic control - Chow, Tsitsiklis - 1991 |

57 | Variance reduction techniques for gradient estimates in reinforcement learning - Greensmith, Bartlett, et al. - 2002
Citation Context: ... of the average reward can be written as ∇J(π) = Σ_{s∈S} d^π(s) Σ_{a∈A} ∇π(a|s) (Q^π(s, a) − b(s)). (3) The baseline b(s) can be chosen in such a way that the variance of the gradient estimates is minimized (Greensmith et al., 2004). The natural gradient, denoted ˜∇J(π), can be calculated by linearly transforming the regular gradient using the inverse Fisher information matrix of the policy: ˜∇J(π) = G^{−1}(θ) ∇J(π). The Fishe... |
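The relation ˜∇J(π) = G^{−1}(θ) ∇J(π) from this context can be illustrated with a small sketch. The softmax policy parameterization, the sample format, and the ridge term are assumptions for this example, not details from the cited paper; the Fisher matrix G(θ) is estimated as the average outer product of the score vectors ∇log π(a|s):

```python
import numpy as np

def softmax_probs(theta, features):
    """pi(a|s) proportional to exp(theta . phi(s, a)); features has shape (num_actions, d)."""
    prefs = features @ theta
    prefs -= prefs.max()          # numerical stability before exponentiating
    e = np.exp(prefs)
    return e / e.sum()

def score(theta, features, a):
    """Gradient of log pi(a|s) wrt theta: phi(s, a) - E_pi[phi(s, .)]."""
    pi = softmax_probs(theta, features)
    return features[a] - pi @ features

def natural_gradient(theta, samples, eps=1e-6):
    """Estimate G^{-1}(theta) grad J(pi) from (features, action, q) samples.

    grad J is approximated by the average of score * q (with q standing in
    for Q(s, a) minus a baseline b(s)), and the Fisher matrix G(theta) by
    the average outer product of the score vectors.
    """
    d = theta.size
    G, grad = np.zeros((d, d)), np.zeros(d)
    for features, a, q in samples:
        psi = score(theta, features, a)
        G += np.outer(psi, psi)
        grad += psi * q
    G /= len(samples)
    grad /= len(samples)
    # Small ridge term keeps the estimated Fisher matrix invertible.
    return np.linalg.solve(G + eps * np.eye(d), grad)
```

Solving the linear system rather than explicitly inverting G is the usual numerically safer choice; incremental natural actor-critic methods avoid even that by maintaining a running estimate instead.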

53 | Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences - Bhatnagar, Fu, et al. - 2003 |

51 | Advantage updating - Baird - 1993 |

51 | Perturbation realization, potentials and sensitivity analysis of Markov processes - Cao, Chen - 1997 |

45 | A simulated annealing algorithm with constant temperature for discrete stochastic optimization - Alrefaei, Andradóttir - 1999 |

42 | Functional approximations and dynamic programming - Bellman, Dreyfus - 1959 |

42 | KnightCap: A Chess Program That Learns by Combining TD(λ) with Game-Tree Search - Baxter, Tridgell, et al. - 1998 |

38 | A survey of applications of Markov decision processes - White - 1993 |

32 | Improved Temporal Difference Methods with Linear Function Approximation - Bertsekas, Borkar, et al. - 2003 |

28 | Simulation-Based Methods for Markov Decision Processes - Marbach - 1998
Citation Context: ...ties including in some cases a failure to converge. These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high varian... |

28 | Stability by Lyapunov’s Direct Method with Applications - Lasalle, Lefschetz - 1961 |

25 | Bayesian actor-critic algorithms - Ghavamzadeh, Engel - 2007 |

25 | Analytical Mean Squared Error Curves for Temporal Difference Learning - Singh, Dayan - 1998 |

20 | Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization - Bhatnagar - 2005 |

18 | Natural actor-critic for road traffic optimisation - Richter, Aberdeen, et al. - 2007 |

16 | Splines and efficiency in dynamic programming - Daniel - 1976 |

14 | Actor–critic like learning algorithms for Markov decision processes - Konda, Borkar - 1999 |

12 | Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization - Bhatnagar - 2007 |

10 | Some pathological traps for stochastic approximation - Brandiere - 1998 |

5 | Reinforcement learning based algorithms for average cost Markov decision processes. Discrete Event Dynamic Systems: Theory and Applications - Abdulla, Bhatnagar - 2007 |

5 | Average cost temporal-difference learning. Automatica - Tsitsiklis, Van Roy - 1999 |

3 | Reinforcement learning – a bridge between numerical methods and Monte-Carlo - Borkar - 2008 |

3 | Information theoretic justification of Boltzmann selection and its generalization to Tsallis case - Dukkipati, Murty, et al. - 2005 |

2 | Learning Algorithms for Markov Decision Processes with Average Cost - Abounadi, Bertsekas, et al. - 2001 |

2 | Asynchronous stochastic approximation and Q-learning - Tsitsiklis - 1994 |

1 | On the Convergence of Temporal Difference Learning with Linear Function Approximation. Machine Learning 42(3):241–267 - Tadic - 2001 |
