Results 1–10 of 31
Least-Squares Policy Iteration
Journal of Machine Learning Research, 2003
Cited by 461 (12 self)
Abstract:
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach ...
Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Journal of Machine Learning Research, 2001
Cited by 57 (1 self)
Abstract:
We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem.
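The baseline idea described above can be illustrated with a minimal sketch (the two-armed bandit setup, reward values, and function names here are invented for illustration and are not from the paper): subtracting a baseline from the reward leaves the score-function gradient estimate unbiased, while a well-chosen baseline can reduce its variance dramatically.

```python
import random, math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gradients(theta, baseline, n, seed=0):
    """Score-function gradient estimates g = (r - b) * d/dtheta log pi(a)
    for a two-armed bandit with Bernoulli policy pi(a=1) = sigmoid(theta)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    rewards = {0: 1.0, 1: 2.0}              # fixed toy rewards per action
    grads = []
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        score = (1 - p) if a == 1 else -p   # d/dtheta log pi(a | theta)
        grads.append((rewards[a] - baseline) * score)
    return grads

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

m0, v0 = mean_var(sample_gradients(0.0, baseline=0.0, n=20000))
m1, v1 = mean_var(sample_gradients(0.0, baseline=1.5, n=20000))
print(f"no baseline : mean={m0:.3f} var={v0:.3f}")
print(f"baseline 1.5: mean={m1:.3f} var={v1:.3f}")
```

Both estimators have the same expectation (the true gradient, 0.25 here), but the baseline version has far lower variance. In this two-value toy the mean-reward baseline happens to be exactly optimal; the paper shows this is not true in general.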
Sensitivity analysis using Itô–Malliavin calculus and application to stochastic optimal control
2002
Cited by 30 (9 self)
Abstract:
We consider a multidimensional diffusion process (X_t), 0 ≤ t ≤ T, whose dynamics depend on a parameter θ. Our first purpose is to give representation formulae for the sensitivity ∇J(θ) of the expected cost J(θ) = E(f(X_T)) as an expectation: this issue is motivated by stochastic control problems (where the controller is parameterized and the optimization problem is then reduced to a parametric one) or by model misspecification in finance. Known results concerning the evaluation of ∇J(θ) by simulation concern the case of smooth cost functions f or of diffusion coefficients not depending on θ (see Kushner and Yang, SIAM J. Control Optim. 29 (5), pp. 1216–1249, 1991). Here, we handle the general case, removing these two restrictions and deriving three new types of formulae to evaluate ∇J(θ), which we call the Malliavin calculus approach, the adjoint approach and the martingale approach. For this, our basic tools are Itô calculus, Malliavin calculus and martingale arguments. In the second part of this work, we provide discretization procedures to simulate the relevant random variables and we analyze the associated weak error: results of this nature are new in that context. We prove that the discretization error is essentially linear with respect to the time step. Finally, numerical experiments deal with examples in random mechanics and finance: we compare different methods in terms of variance, complexity, computational time and influence of the time discretization step.
Policy Gradient in Continuous Time
Journal of Machine Learning Research, 2006
Cited by 22 (0 self)
Abstract:
Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to apply a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics.
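As a toy illustration of the quantity being estimated (not the paper's method: the dynamics, the linear policy, and the finite-difference scheme below are all illustrative assumptions), one can approximate the sensitivity of a continuous-time rollout cost to a policy parameter from simulated trajectories alone, without giving the learner the dynamics model:

```python
def rollout_cost(theta, dt=0.01, T=2.0):
    """Euler-discretized rollout of a deterministic continuous-time system.
    The learner only calls this as a black box; the dynamics stay hidden."""
    x, cost = 1.0, 0.0
    for _ in range(int(T / dt)):
        u = -theta * x              # linear feedback policy u = -theta * x
        x += dt * (0.5 * x + u)     # hidden dynamics dx/dt = 0.5*x + u
        cost += dt * (x * x + 0.1 * u * u)
    return cost

def policy_gradient_fd(theta, eps=1e-4):
    """Central finite-difference estimate of dJ/dtheta from two rollouts."""
    return (rollout_cost(theta + eps) - rollout_cost(theta - eps)) / (2 * eps)

# plain gradient descent on the policy parameter
theta = 0.0
for _ in range(100):
    theta -= 0.2 * policy_gradient_fd(theta)
print(f"theta={theta:.3f}, cost={rollout_cost(theta):.3f}")
```

Even this crude model-free sensitivity estimate drives the rollout cost down by more than an order of magnitude on the toy problem; the paper develops much better-behaved estimators for this setting.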
Policy Search via the Signed Derivative
Cited by 15 (0 self)
Abstract:
We consider policy search for reinforcement learning: learning policy parameters, for some fixed policy class, that optimize the performance of a system. In this paper, we propose a novel policy gradient method based on an approximation we call the Signed Derivative; the approximation is based on the intuition that it is often very easy to guess the direction in which control inputs affect future state variables, even if we do not have an accurate model of the system. The resulting algorithm is very simple and requires no model of the environment, and we show that it can outperform standard stochastic estimators of the gradient; indeed, we show that the Signed Derivative algorithm can in fact perform as well as the true (model-based) policy gradient, but without knowledge of the model. We evaluate the algorithm's performance on both a simulated task and two real-world tasks (driving an RC car along a specified trajectory, and jumping onto obstacles with a quadruped robot), and in all cases achieve good performance after very little training.
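A minimal sketch of the signed-derivative intuition (the one-step dynamics, scalar policy, and step sizes below are invented for illustration, not taken from the paper): replace the unknown derivative of the state with respect to the control by its guessed sign, and run gradient descent anyway.

```python
def simulate(u, a=0.7):
    """One-step dynamics, unknown to the learner: next state grows with u."""
    return a * u

def signed_derivative_update(theta, target, lr=0.1, steps=200):
    """Gradient-like update that replaces the true d(state)/du = a with its
    guessed sign (+1): the 'signed derivative' approximation."""
    for _ in range(steps):
        x = simulate(theta)      # policy here is simply u = theta
        err = x - target         # gradient of cost (x - target)^2, scaled
        theta -= lr * err * 1.0  # +1 stands in for the unknown dx/du
    return theta

theta = signed_derivative_update(0.0, target=1.0)
print(f"theta={theta:.3f}, final state={simulate(theta):.3f}")
```

Because the guessed sign agrees with the true derivative, each update still moves the state toward the target; only the effective step size is off, which the learning rate absorbs.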
Stochastic Optimization of Controlled Partially Observable Markov Decision Processes
In Proceedings of the 39th IEEE Conference on Decision and Control (CDC'00)
Cited by 14 (1 self)
Abstract:
We introduce an online algorithm for finding local maxima of the average reward in a Partially Observable Markov Decision Process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
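The single-sample-path update described above can be sketched as follows (a deliberately memoryless two-action toy stands in for a real POMDP, and all specifics here are assumptions rather than the paper's algorithm verbatim): an eligibility trace z is discounted by β at each step, and the gradient estimate is a running average of reward times trace.

```python
import random, math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def online_gradient_estimate(theta, beta, T, seed=0):
    """Single-sample-path estimate of the average-reward gradient using an
    eligibility trace z discounted by beta in [0, 1)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    rewards = {0: 1.0, 1: 2.0}
    z, delta = 0.0, 0.0
    for t in range(1, T + 1):
        a = 1 if rng.random() < p else 0
        score = (1 - p) if a == 1 else -p      # d/dtheta log pi(a | theta)
        z = beta * z + score                   # eligibility trace
        delta += (rewards[a] * z - delta) / t  # running average of r_t * z_t
    return delta

for beta in (0.0, 0.5, 0.9):
    est = online_gradient_estimate(0.0, beta, 200000)
    print(f"beta={beta}: gradient estimate = {est:.3f}")
```

In this memoryless toy every β gives an unbiased estimate (the true gradient is 0.25), so only the variance grows with β; in a genuine POMDP, larger β trades higher variance for lower bias, as the abstract indicates.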
Distributed optimization in adaptive networks
Advances in Neural Information Processing Systems 16, 2004
Cited by 13 (1 self)
Abstract:
We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective.
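A toy of the asynchronous, distributed flavor of such a protocol (the quadratic objective and the exact local derivatives are stand-ins; the paper's agents estimate their gradients from local interactions via policy gradient methods): agents wake at random times and adjust only their own parameter, yet the aggregate objective still follows its gradient flow downhill.

```python
import random

def aggregate_cost(theta, targets):
    """Global objective the team minimizes; no single agent evaluates it."""
    return sum((th - tg) ** 2 for th, tg in zip(theta, targets))

def asynchronous_descent(n_agents=5, rounds=400, lr=0.2, seed=0):
    """Each agent updates only its own coordinate, at random (asynchronous)
    times, using only locally available gradient information."""
    rng = random.Random(seed)
    targets = [rng.uniform(-1, 1) for _ in range(n_agents)]  # toy optimum
    theta = [0.0] * n_agents
    for _ in range(rounds):
        i = rng.randrange(n_agents)               # a random agent wakes up
        local_grad = 2 * (theta[i] - targets[i])  # local partial derivative
        theta[i] -= lr * local_grad
    return aggregate_cost(theta, targets)

print(f"final aggregate cost: {asynchronous_descent():.2e}")
```

No coordination is needed beyond each agent knowing its own partial derivative; the randomized wake-ups mimic asynchronous distributed computation.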
Automated design of adaptive controllers for modular robots using reinforcement learning
Accepted for publication in International Journal of Robotics Research, Special Issue on Self-Reconfigurable Modular Robots, 2007
Cited by 10 (4 self)
Abstract:
Designing distributed controllers for self-reconfiguring modular robots has been consistently challenging. We have developed a reinforcement learning approach which can be used both to automate controller design and to adapt robot behavior online. In this paper, we report on our study of reinforcement learning in the domain of self-reconfigurable modular robots: the underlying assumptions, the applicable algorithms, and the issues of partial observability, large search spaces and local optima. We propose and validate experimentally in simulation a number of techniques designed to address these and other scalability issues that arise in applying machine learning to distributed systems such as modular robots. We discuss ways to make learning faster, more robust and amenable to online application by giving scaffolding to the learning agents in the form of policy representation, structured experience and additional information. With enough structure, modular robots can run learning algorithms both to automate the generation of distributed controllers and to adapt to the changing environment, delivering on the self-organization promise with less interference from human designers, programmers and operators.
Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
Journal of Machine Learning Research, 2006
Cited by 9 (1 self)
Abstract:
We study a sequential variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of samples. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error V − AV, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. AV = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in policy iteration for Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is, computing the sensitivity of the performance measure V with respect to some parameter of the transition probabilities. For example, in parametric optimization of the policy, an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations, of the value function and of the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both representations.
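The control-variate principle behind the geometric scheme can be shown in its simplest, non-sequential form (the integrand exp(U), the variate U, and the pilot estimate of the coefficient below are illustrative choices, not the paper's construction): subtract a correlated quantity with known mean, so that only the residual needs to be estimated by sampling.

```python
import random, math

def mc_estimates(n, seed=0):
    """Plain Monte Carlo vs. control-variate estimate of E[exp(U)], U ~ U(0,1).
    The control variate is U itself, whose mean 1/2 is known exactly."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    fs = [math.exp(u) for u in us]
    plain = sum(fs) / n
    # pilot estimate of the optimal coefficient c = Cov(f, U) / Var(U)
    mu_f, mu_u = plain, sum(us) / n
    cov = sum((f - mu_f) * (u - mu_u) for f, u in zip(fs, us)) / n
    var_u = sum((u - mu_u) ** 2 for u in us) / n
    c = cov / var_u
    cv = sum(f - c * (u - 0.5) for f, u in zip(fs, us)) / n
    return plain, cv

plain, cv = mc_estimates(2000)
print(f"plain MC: {plain:.4f}   with control variate: {cv:.4f}   (true = e - 1)")
```

Replacing the fixed variate with successive approximations AV of the target, updated as estimation proceeds, is what upgrades the O(1/N) Monte Carlo rate to the geometric rate described in the abstract.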
Stochastic direct reinforcement: Application to simple games with recurrence
In Proceedings of Artificial Multiagent Learning, Papers from the 2004 AAAI Fall Symposium, 2004
Cited by 7 (0 self)
Abstract:
We investigate repeated matrix games with stochastic players as a microcosm for studying dynamic, multi-agent interactions using the Stochastic Direct Reinforcement (SDR) policy gradient algorithm. SDR is a generalization of Recurrent Reinforcement Learning (RRL) that supports stochastic policies. Unlike other RL algorithms, SDR and RRL use recurrent policy gradients to properly address temporal credit assignment resulting from recurrent structure. Our main goals in this paper are to (1) distinguish recurrent memory from standard, non-recurrent memory for policy gradient RL, (2) compare SDR with Q-type learning methods for simple games, (3) distinguish reactive from endogenous dynamical agent behavior, and (4) explore the use of recurrent learning for interacting, dynamic agents. We find that SDR players learn much faster, and hence outperform recently proposed Q-type learners, for the simple game Rock, Paper, Scissors (RPS). With more complex, dynamic SDR players and opponents, we demonstrate that recurrent representations and SDR's recurrent policy gradients yield better performance than non-recurrent players. For the Iterated Prisoner's Dilemma, we show that non-recurrent SDR agents learn only to defect (Nash equilibrium), while SDR agents with recurrent gradients can learn a variety of interesting behaviors, including cooperation.