
## Policy gradient methods for reinforcement learning with function approximation (1999)

### Download Links

- [web.eecs.umich.edu]
- [www.cis.upenn.edu]
- [www.math.tau.ac.il]
- [ftp.cs.umass.edu]
- [www.cs.ualberta.ca]
- [www-anw.cs.umass.edu]
- [webdocs.cs.ualberta.ca]
- [incompleteideas.net]
- [www.damas.ift.ulaval.ca]
- [web.eecs.utk.edu]
- [homes.cs.washington.edu]

Venue: NIPS

Citations: 438 (20 self)

### Citations

5613 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1999
Citation Context ...fitted value iteration is also convergent and value-based, but does not find a locally optimal policy. 1 Policy Gradient Theorem We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time t ∈ {0, 1, 2, ...} are denoted s_t ∈ S, a_t ∈ A, and r_t ∈ ℜ, respectively. The environme... |

1224 | Nonlinear Programming. Athena Scientific
- Bertsekas
- 1999
Citation Context ...est estimated value). The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of “best” is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods. In this paper we explore an alternative approach to function ap... |

306 | Residual algorithms: Reinforcement learning with function approximation
- Baird
- 1995
Citation Context ...lis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of “best” is in ... |

274 | Temporal Credit Assignment in Reinforcement Learning
- Sutton
- 1984
Citation Context ...s (1998). Our result also suggests a way of proving the convergence of a wide variety of algorithms based on “actor-critic” or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximatio... |

263 | Stable function approximation in dynamic programming.
- Gordon
- 1995
Citation Context ...ertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the noti... |

205 | Neuronlike Elements That Can Solve Difficult Learning Control Problems
- Barto, Sutton, et al.
- 1983
Citation Context ...in prep.) and Marbach and Tsitsiklis (1998). Our result also suggests a way of proving the convergence of a wide variety of algorithms based on “actor-critic” or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable functio... |

178 | Feature-based methods for large scale dynamic programming - Tsitsiklis, Van Roy - 1996 |

171 | Reinforcement learning algorithms for partially observable Markov decision problems - Jaakkola, Singh, et al. - 1995 |

158 | Learning Without State-Estimation in Partially Observable Markovian Decision Processes.
- Singh, Jaakkola, et al.
- 1994
Citation Context ...as several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing co... |

140 | Gradient descent for general reinforcement learning
- Baird, Moore
- 1999
Citation Context ...nction approximation corresponding to tabular POMDPs. Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators. Our result also suggests a way of proving the convergence of a wide variety of algorithms based on “actor-critic” or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy. Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods. However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure combining performance and value-function accuracy. As a result, VAPS does not converge to a locally optimal policy, except in the case that no weight is put upon value-function accuracy, in which case VAPS degenerates to REINFORCE. Similarly, Gordon’s (1995) fitted value iteration... |

103 | Simulation-based optimization of Markov reward processes.
- Marbach, Tsitsiklis
- 1999
Citation Context ...E{∑_{t=1}^∞ γ^{t−1} r_t | s0, π} and Q^π(s, a) = E{∑_{k=1}^∞ γ^{k−1} r_{t+k} | s_t = s, a_t = a, π}, where γ ∈ [0, 1] is a discount rate (γ = 1 is allowed only in episodic tasks). In this formulation, we define d^π(s) as a discounted weighting of states encountered starting at s0 and then following π: d^π(s) = ∑_{t=0}^∞ γ^t Pr{s_t = s | s0, π}. Our first result concerns the gradient of the performance metric with respect to the policy parameter: Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations, ∂ρ/∂θ = ∑_s d^π(s) ∑_a [∂π(s, a)/∂θ] Q^π(s, a). (2) Proof: See the appendix. Marbach and Tsitsiklis (1998) describe a related but different expression for the gradient in terms of the state-value function, citing Jaakkola, Singh, and Jordan (1995) and Cao and Chen (1997). In both that expression and ours, the key point is that there are no terms of the form ∂d^π(s)/∂θ: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if s was sampled from the distribution obtained by following π, then ∑_a [∂π(s, a)/∂θ] Q^π(s, a) would be an unbiased estimate of ∂ρ/∂θ. Of course, Q^π(s, a) is also not normally known an... |
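The gradient expression in Theorem 1 can be checked numerically on a small MDP. Below is a minimal sketch: the two-state MDP, its rewards, and the softmax parameterization are all made-up assumptions, not from the paper. It computes d^π and Q^π exactly by solving linear systems, evaluates ∂ρ/∂θ = ∑_s d^π(s) ∑_a [∂π(s,a)/∂θ] Q^π(s,a), and compares against finite differences of ρ.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (all numbers are assumptions).
nS, nA, gamma, s0 = 2, 2, 0.9, 0
P = np.zeros((nS, nA, nS))                  # P[s, a, s'] transition probabilities
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.9, 0.1]
R = np.array([[1.0, 0.0], [0.0, 2.0]])      # R[s, a] expected reward

def policy(theta):
    """Softmax (Gibbs) policy with one parameter per state-action pair."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rho(theta):
    """Performance rho = V^pi(s0) in the start-state formulation."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return V[s0]

def grad_rho(theta):
    """Exact gradient via Theorem 1: sum_s d(s) sum_a dpi/dtheta Q(s,a)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sax,x->sa', P, V)
    # d(s) = sum_t gamma^t Pr{s_t = s | s0}: discounted state weighting.
    e0 = np.zeros(nS); e0[s0] = 1.0
    d = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, e0)
    # For a per-state softmax, sum_a [dpi(s,a)/dtheta(s,b)] Q(s,a)
    #   = pi(s,b) * (Q(s,b) - sum_a pi(s,a) Q(s,a)).
    return d[:, None] * pi * (Q - (pi * Q).sum(axis=1, keepdims=True))

theta = np.array([[0.3, -0.2], [0.1, 0.5]])
g = grad_rho(theta)

# Cross-check against central finite differences of rho.
eps, fd = 1e-6, np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps; tm[s, a] -= eps
        fd[s, a] = (rho(tp) - rho(tm)) / (2 * eps)
print(np.abs(g - fd).max())  # should be near zero
```

The agreement illustrates the key point quoted above: no ∂d^π(s)/∂θ term is needed, even though d^π does change with the policy.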

50 | Perturbation realization, potentials, and sensitivity analysis of Markov Processes,
- Cao, Chen
- 1997
Citation Context ...formulation, we define d^π(s) as a discounted weighting of states encountered starting at s0 and then following π: d^π(s) = ∑_{t=0}^∞ γ^t Pr{s_t = s | s0, π}. Our first result concerns the gradient of the performance metric with respect to the policy parameter: Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations, ∂ρ/∂θ = ∑_s d^π(s) ∑_a [∂π(s, a)/∂θ] Q^π(s, a). (2) Proof: See the appendix. Marbach and Tsitsiklis (1998) describe a related but different expression for the gradient in terms of the state-value function, citing Jaakkola, Singh, and Jordan (1995) and Cao and Chen (1997). In both that expression and ours, the key point is that there are no terms of the form ∂d^π(s)/∂θ: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if s was sampled from the distribution obtained by following π, then ∑_a [∂π(s, a)/∂θ] Q^π(s, a) would be an unbiased estimate of ∂ρ/∂θ. Of course, Q^π(s, a) is also not normally known and must be estimated. One approach is to use the actual returns, R_t = ∑_{k=1}^∞ (r_{t+k} − ρ(π)) (or R_t = ∑_{k=1}^∞ γ^{k−1} r_{t+k} in the start-state formulation) as an approximatio... |

47 | Toward a theory of reinforcement-learning connectionist systems
- Williams
- 1988
Citation Context ...ation) as an approximation for each Q^π(s_t, a_t). This leads to Williams's episodic REINFORCE algorithm, ∆θ_t ∝ [∂π(s_t, a_t)/∂θ] R_t / π(s_t, a_t) (the 1/π(s_t, a_t) corrects for the oversampling of actions preferred by π), which is known to follow ∂ρ/∂θ in expected value (Williams, 1988, 1992). 2 Policy Gradient with Approximation Now consider the case in which Q^π is approximated by a learned function approximator. I... |
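The REINFORCE update quoted in this context can be sketched in its simplest setting, a two-armed bandit with a softmax policy. The arm rewards, step size, and iteration count below are illustrative assumptions. For a softmax, the increment [∂π(a)/∂θ] / π(a) equals ∂ log π(a)/∂θ, so the update is R_t · ∂ log π(a_t)/∂θ, which follows ∂ρ/∂θ in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # action preferences (assumed setup)
arm_reward = np.array([1.0, 0.2])    # made-up deterministic rewards
alpha = 0.1                          # step size (illustrative)

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = arm_reward[a]
    # Gradient of log pi(a) wrt theta for a softmax: onehot(a) - p.
    # Scaling by r gives Williams's REINFORCE increment,
    # i.e. r * (dpi(a)/dtheta) / pi(a).
    grad_log = -p
    grad_log[a] += 1.0
    theta += alpha * r * grad_log

print(softmax(theta))  # the policy concentrates on the better arm
```

Because the 1/π(a) factor undoes the oversampling of actions the policy already prefers, the stochastic updates are unbiased and the preference for the higher-reward arm grows steadily.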

45 | An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions
- Kimura, Kobayashi
- 1998
Citation Context ...result also suggests a way of proving the convergence of a wide variety of algorithms based on “actor-critic” or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally... |

36 | Neuronlike elements that can solve difficult learning control problems
- Barto, Sutton, et al.
- 1983
Citation Context ...value functions and has received relatively little attention. Learning a value function and using it to reduce the variance of the gradient estimate appears to be essential for rapid learning. Jaakkola, Singh and Jordan (1995) proved a result very similar to ours for the special case of function approximation corresponding to tabular POMDPs. Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators. Our result also suggests a way of proving the convergence of a wide variety of algorithms based on “actor-critic” or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy. Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods. However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure ... |

13 | Reinforcement comparison
- Dayan
- 1991
Citation Context ...ems, but can substantially affect the variance of the gradient estimators. The issues here are entirely analogous to those in the use of reinforcement baselines in earlier work (e.g., Williams, 1992; Dayan, 1991; Sutton, 1984). In practice, v should presumably be set to the best available approximation of V^π. Our results establish that that approximation process can proceed without affecting the expected e... |

11 | Chattering in SARSA(λ). CMU Learning Lab - Gordon - 1996 |


2 | Advantage Updating. Wright Lab.
- Baird
- 1993
Citation Context ...it have zero mean for each state: ∑_a π(s, a) f_w(s, a) = 0, ∀s ∈ S. In this sense it is better to think of f_w as an approximation of the advantage function, A^π(s, a) = Q^π(s, a) − V^π(s) (much as in Baird, 1993), rather than of Q^π. Our convergence requirement (3) is really that f_w get the relative value of the actions correct in each state, not the absolute value, nor the variation from state to state. Our... |
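The zero-mean property quoted in this context holds mechanically when f_w is the "compatible" critic, f_w(s, a) = wᵀ ∂ log π(s, a)/∂θ, because the score function has zero mean under π. A quick numerical check for a single state with a linear-softmax policy (the feature dimensions and random values are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
nA, dim = 4, 3
phi = rng.normal(size=(nA, dim))   # per-action features (arbitrary)
theta = rng.normal(size=dim)       # policy parameters (arbitrary)
w = rng.normal(size=dim)           # critic weights (arbitrary)

# Linear-softmax policy over actions in this one state.
pref = phi @ theta
pi = np.exp(pref - pref.max())
pi /= pi.sum()

# Compatible features: dlog pi(a)/dtheta = phi(a) - E_pi[phi].
grad_log = phi - pi @ phi          # shape (nA, dim)
f_w = grad_log @ w                 # f_w(s, a) for each action a

mean_f = pi @ f_w                  # sum_a pi(s, a) f_w(s, a)
print(mean_f)                      # ~0 for any w
```

Since each row of grad_log is the feature vector recentered by its mean under π, the π-weighted average of f_w vanishes for every w, which is why f_w can only represent relative (advantage-like) values, never state values.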
