Policy gradient methods for reinforcement learning with function approximation (1999)

by Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour
Venue: NIPS
Citations: 434 (20 self)

BibTeX

@INPROCEEDINGS{Sutton99policygradient,
    author = {Richard S. Sutton and David McAllester and Satinder Singh and Yishay Mansour},
    title = {Policy gradient methods for reinforcement learning with function approximation},
    booktitle = {Advances in Neural Information Processing Systems (NIPS)},
    year = {1999},
    pages = {1057--1063}
}

Abstract

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with the highest estimated value). The value-function approach has worked well in many applications, but it has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities.

In this paper we explore an alternative approach to function approximation in RL. Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly using an independent function approximator with its own parameters. For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action-selection probabilities, and whose weights are the policy parameters. Let $\theta$ denote the vector of policy parameters and $\rho$ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately in proportion to the gradient:

$$\Delta\theta \approx \alpha \frac{\partial\rho}{\partial\theta}, \qquad (1)$$

where $\alpha$ is a positive-definite step size. If the above can be achieved, then $\theta$ can usually be assured to converge to a locally optimal policy in the performance measure $\rho$. Unlike the value-function approach, here small changes in $\theta$ can cause only small changes in the policy and in the state-visitation distribution.

In this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties. Our result also suggests a way of proving the convergence of a wide variety of algorithms based on "actor-critic" or policy-iteration architectures.
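As a concrete (and deliberately simple) illustration of update (1) in the spirit of Williams's REINFORCE, the following Python sketch parameterizes a stochastic policy as a softmax over hashed state-action features and nudges θ along a sampled estimate of the performance gradient. The feature map phi, the function names, and the use of the whole-episode Monte Carlo return as the gradient signal are illustrative assumptions, not the paper's prescription.

    # Minimal sketch (not from the paper): a stochastic policy with its own
    # parameters theta, updated along an estimated performance gradient.
    import numpy as np

    def phi(s, a, n_features):
        """Hypothetical state-action feature vector (hashed one-hot, for simplicity)."""
        x = np.zeros(n_features)
        x[hash((s, a)) % n_features] = 1.0
        return x

    def pi(theta, s, actions, n_features):
        """Softmax policy: action-selection probabilities computed from theta."""
        prefs = np.array([theta @ phi(s, a, n_features) for a in actions])
        prefs -= prefs.max()                      # numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    def grad_log_pi(theta, s, a, actions, n_features):
        """Score function: d/dtheta log pi(s,a) = phi(s,a) - sum_b pi(s,b) phi(s,b)."""
        probs = pi(theta, s, actions, n_features)
        expected = sum(p * phi(s, b, n_features) for p, b in zip(probs, actions))
        return phi(s, a, n_features) - expected

    def reinforce_step(theta, episode, actions, n_features, alpha=0.01):
        """One REINFORCE-style update: Delta(theta) ~ alpha * G * grad log pi(s,a)."""
        G = sum(r for (_, _, r) in episode)       # crude whole-episode Monte Carlo return
        for (s, a, _) in episode:
            theta = theta + alpha * G * grad_log_pi(theta, s, a, actions, n_features)
        return theta

Here episode is assumed to be a list of (state, action, reward) tuples collected by following the current policy; using the return from each time step instead of the whole-episode return is a common variance-reduction refinement.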
Policy Gradient Theorem

We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time $t \in \{0, 1, 2, \ldots\}$ are denoted $s_t \in S$, $a_t \in A$, and $r_t \in \mathbb{R}$, respectively. The environment's dynamics are characterized by state transition probabilities, $\mathcal{P}^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$, and expected rewards $\mathcal{R}^a_s = E\{r_{t+1} \mid s_t = s, a_t = a\}$, $\forall s, s' \in S$, $a \in A$. The agent's decision-making procedure at each time is characterized by a policy, $\pi(s, a, \theta) = \Pr\{a_t = a \mid s_t = s, \theta\}$, $\forall s \in S$, $a \in A$, where $\theta \in \mathbb{R}^l$, for $l \ll |S|$, is a parameter vector. We assume that $\pi$ is differentiable with respect to its parameter, i.e., that $\frac{\partial\pi(s,a)}{\partial\theta}$ exists. We also usually write just $\pi(s, a)$ for $\pi(s, a, \theta)$.

With function approximation, two ways of formulating the agent's objective are useful. One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, $\rho(\pi)$:

$$\rho(\pi) = \lim_{n \to \infty} \frac{1}{n} E\{r_1 + r_2 + \cdots + r_n \mid \pi\} = \sum_s d^\pi(s) \sum_a \pi(s, a) \mathcal{R}^a_s,$$

where $d^\pi(s) = \lim_{t \to \infty} \Pr\{s_t = s \mid s_0, \pi\}$ is the stationary distribution of states under $\pi$, which we assume exists and is independent of $s_0$ for all policies. In the average reward formulation, the value of a state-action pair given a policy is defined as

$$Q^\pi(s, a) = \sum_{t=1}^{\infty} E\{r_t - \rho(\pi) \mid s_0 = s, a_0 = a, \pi\}, \qquad \forall s \in S,\ a \in A.$$

The second formulation we cover is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it. We will give our results only once, but they will apply to this formulation as well under the definitions

$$\rho(\pi) = E\Big\{\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, s_0, \pi\Big\} \quad\text{and}\quad Q^\pi(s, a) = E\Big\{\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \,\Big|\, s_t = s, a_t = a, \pi\Big\},$$

where $\gamma \in [0, 1]$ is a discount rate ($\gamma = 1$ is allowed only in episodic tasks). In this formulation, we define $d^\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\pi$: $d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr\{s_t = s \mid s_0, \pi\}$.

Our first result concerns the gradient of the performance metric with respect to the policy parameter.

Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations,

$$\frac{\partial\rho}{\partial\theta} = \sum_s d^\pi(s) \sum_a \frac{\partial\pi(s, a)}{\partial\theta}\, Q^\pi(s, a). \qquad (2)$$

Proof: See the appendix.

Marbach and Tsitsiklis (1998) describe a related but different expression for the gradient in terms of the state-value function, citing Jaakkola et al. In both cases the key point is that there are no terms of the form $\frac{\partial d^\pi(s)}{\partial\theta}$: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if $s$ was sampled from the distribution obtained by following $\pi$, then $\sum_a \frac{\partial\pi(s, a)}{\partial\theta} Q^\pi(s, a)$ would be an unbiased estimate of $\frac{\partial\rho}{\partial\theta}$. Of course, $Q^\pi(s, a)$ is also not normally known and must be estimated. One approach is to use the actual returns as an approximation to each $Q^\pi(s_t, a_t)$, as in Williams's REINFORCE algorithm, $\Delta\theta_t \propto \frac{\partial\pi(s_t, a_t)}{\partial\theta} R_t \frac{1}{\pi(s_t, a_t)}$, where $R_t$ is the return following time $t$ (the factor $\frac{1}{\pi(s_t, a_t)}$ corrects for the oversampling of actions preferred by $\pi$), which is known to follow $\frac{\partial\rho}{\partial\theta}$ in expected value.

Policy Gradient with Approximation

Now consider the case in which $Q^\pi$ is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of $Q^\pi$ in (2) and still point roughly in the direction of the gradient. Let $f_w : S \times A \to \mathbb{R}$ be our approximation to $Q^\pi$, with parameter $w$. It is natural to learn $f_w$ by following $\pi$ and updating $w$ by a rule such as $\Delta w_t \propto \big[\hat{Q}^\pi(s_t, a_t) - f_w(s_t, a_t)\big] \frac{\partial f_w(s_t, a_t)}{\partial w}$, where $\hat{Q}^\pi(s_t, a_t)$ is some unbiased estimate of $Q^\pi(s_t, a_t)$, such as the observed return.
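To make the sampling discussion following Theorem 1 concrete, here is a rough Python sketch, again not taken from the paper: it assumes a linear approximator f_w(s, a) = w · psi(s, a) with a hypothetical feature map psi, a caller-supplied grad_pi returning ∂π(s, a)/∂θ, and Monte Carlo return targets for the critic. The critic step mirrors the Δw_t rule above, and the gradient estimate averages Σ_a ∂π(s, a)/∂θ · f_w(s, a) over states visited while following π; whether this still points in the direction of the true gradient once f_w replaces Q^π is exactly the question the paper goes on to answer.

    # Minimal sketch (assumptions as stated above): approximate the gradient in
    # Theorem 1 from sampled states, with a learned linear critic in place of Q^pi.
    import numpy as np

    def f_w(w, psi, s, a):
        """Approximate action value: f_w(s, a) = w . psi(s, a)."""
        return w @ psi(s, a)

    def critic_step(w, psi, s, a, q_target, beta=0.05):
        """Delta(w) proportional to [Q_hat(s,a) - f_w(s,a)] * d f_w / d w (= psi(s,a) here)."""
        return w + beta * (q_target - f_w(w, psi, s, a)) * psi(s, a)

    def estimate_policy_gradient(theta, w, psi, grad_pi, visited_states, actions):
        """Sample-based estimate of d rho / d theta, with f_w standing in for Q^pi:
        average over visited states s of sum_a d pi(s,a)/d theta * f_w(s,a)."""
        g = np.zeros_like(theta)
        for s in visited_states:                  # states encountered while following pi
            for a in actions:
                g += grad_pi(theta, s, a) * f_w(w, psi, s, a)
        return g / max(len(visited_states), 1)

In this sketch, grad_pi is expected to return a vector the same shape as θ, and q_target would typically be an observed return collected while following π.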
