Results 1  10
of
436
LeastSquares Policy Iteration
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
"... We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration. This new approach ..."
Abstract

Cited by 462 (12 self)
 Add to MetaCart
(Show Context)
We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration. This new approach
ActorCritic Algorithms
 SIAM JOURNAL ON CONTROL AND OPTIMIZATION
, 2001
"... In this paper, we propose and analyze a class of actorcritic algorithms. These are twotimescale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on in ..."
Abstract

Cited by 244 (1 self)
 Add to MetaCart
(Show Context)
In this paper, we propose and analyze a class of actorcritic algorithms. These are twotimescale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actorcritic algorithms for Markov decision processes with general state and action spaces. We state and prove two results regarding their convergence.
Multiagent Learning Using a Variable Learning Rate
 Artificial Intelligence
, 2002
"... Learning to act in a multiagent environment is a difficult problem since the normal definition of an optimal policy no longer applies. The optimal policy at any moment depends on the policies of the other agents and so creates a situation of learning a moving target. Previous learning algorithms hav ..."
Abstract

Cited by 225 (8 self)
 Add to MetaCart
Learning to act in a multiagent environment is a difficult problem since the normal definition of an optimal policy no longer applies. The optimal policy at any moment depends on the policies of the other agents and so creates a situation of learning a moving target. Previous learning algorithms have one of two shortcomings depending on their approach. They either converge to a policy that may not be optimal against the specific opponents' policies, or they may not converge at all. In this article we examine this learning problem in the framework of stochastic games. We look at a number of previous learning algorithms showing how they fail at one of the above criteria. We then contribute a new reinforcement learning technique using a variable learning rate to overcome these shortcomings. Specifically, we introduce the WoLF principle, "Win or Learn Fast", for varying the learning rate. We examine this technique theoretically, proving convergence in selfplay on a restricted class of iterated matrix games. We also present empirical results on a variety of more general stochastic games, in situations of selfplay and otherwise, demonstrating the wide applicability of this method.
Infinitehorizon policygradient estimation
 Journal of Artificial Intelligence Research
, 2001
"... Gradientbased approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in valuefunction methods. In this paper we introduce � � , a si ..."
Abstract

Cited by 208 (5 self)
 Add to MetaCart
Gradientbased approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in valuefunction methods. In this paper we introduce � � , a simulationbased algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes ( � s) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter � � (which has a natural interpretation in terms of biasvariance tradeoff), and requires no knowledge of the underlying state. We prove convergence of � � , and show how the correct choice of the parameter is related to the mixing time of the controlled �. We briefly describe extensions of � � to controlled Markov chains, continuous state, observation and control spaces, multipleagents, higherorder derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by � � can be used in both a traditional stochastic gradient algorithm and a conjugategradient procedure to find local optima of the average reward. 1.
A Tutorial on the CrossEntropy Method
, 2005
"... The crossentropy (CE) method is a new generic approach to combinatorial and multiextremal optimization and rare event simulation. The purpose of this tutorial is to give a gentle introduction to the CE method. We present the CE methodology, the basic algorithm and its modifications, and discuss a ..."
Abstract

Cited by 181 (18 self)
 Add to MetaCart
(Show Context)
The crossentropy (CE) method is a new generic approach to combinatorial and multiextremal optimization and rare event simulation. The purpose of this tutorial is to give a gentle introduction to the CE method. We present the CE methodology, the basic algorithm and its modifications, and discuss applications in combinatorial optimization and machine learning.
KernelBased Reinforcement Learning
 Machine Learning
, 1999
"... We present a kernelbased approach to reinforcement learning that overcomes the stability problems of temporaldifference learning in continuous statespaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second ..."
Abstract

Cited by 153 (1 self)
 Add to MetaCart
(Show Context)
We present a kernelbased approach to reinforcement learning that overcomes the stability problems of temporaldifference learning in continuous statespaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernelbased approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the biasvariance tradeo in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or nonparametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion
 in Proceedings of the IEEE International Conference on Robotics and Automation
, 2004
"... This paper presents a machine learning approach to optimizing a quadrupedal trot gait for forward speed. Given a parameterized walk designed for a specific robot, we propose using a form of policy gradient reinforcement learning to automatically search the set of possible parameters with the goal of ..."
Abstract

Cited by 149 (15 self)
 Add to MetaCart
(Show Context)
This paper presents a machine learning approach to optimizing a quadrupedal trot gait for forward speed. Given a parameterized walk designed for a specific robot, we propose using a form of policy gradient reinforcement learning to automatically search the set of possible parameters with the goal of finding the fastest possible walk. We implement and test our approach on a commercially available quadrupedal robot platform, namely the Sony Aibo robot. After about three hours of learning, all on the physical robots and with no human intervention other than to change the batteries, the robots achieved a gait faster than any previously known gait for the Aibo, significantly outperforming a variety of existing handcoded and learned solutions.
A Natural Policy Gradient
"... We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy opt ..."
Abstract

Cited by 148 (0 self)
 Add to MetaCart
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as deo/ned by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
Reinforcement learning for RoboCupsoccer keepaway
 Adaptive Behavior
, 2005
"... 1 RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMD ..."
Abstract

Cited by 134 (36 self)
 Add to MetaCart
(Show Context)
1 RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tilecoding function approximation and variable λ to learning higherlevel decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers, ” tries to keep control of the ball for as long as possible despite the efforts of “the takers. ” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
Reinforcement learning for humanoid robotics
 Autonomous Robot
, 2003
"... Abstract. The complexity of the kinematic and dynamic structure of humanoid robots make conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and lea ..."
Abstract

Cited by 132 (21 self)
 Add to MetaCart
Abstract. The complexity of the kinematic and dynamic structure of humanoid robots make conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for realtime learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even highdimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: the learning of inverse dynamics models for modelbased control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast, i.e., within seconds or minutes, learning convergence with highly accurate final peformance. We conclude that realtime learning for complex motor system like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future. 1.