Results 1–9 of 9
Least-Squares Policy Iteration
Journal of Machine Learning Research, 2003
Cited by 461 (12 self)
Abstract: We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach ...
Reinforcement Learning as Classification: Leveraging Modern Classifiers
In Proceedings of the Twentieth International Conference on Machine Learning, 2003
Cited by 84 (5 self)
Abstract: The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques. ...
Policy Search via Density Estimation
In Advances in Neural Information Processing Systems 12, 1999
Cited by 34 (3 self)
Abstract: We propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. However, rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly using estimates of the probability densities that the policy induces on states at the different points in time. This enables our algorithms to exploit the many techniques for efficient and robust approximate density propagation in stochastic systems. We show how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP). ...
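The deterministic density-propagation idea this abstract describes can be sketched on a toy example. Everything below (the 3-state transition matrix, rewards, horizon) is made up for illustration and is not from the paper; it only shows propagating a state distribution as d_{t+1} = d_t P to score a fixed policy:

```python
import numpy as np

# Hypothetical 3-state MDP, purely illustrative: P[s, s'] is the transition
# matrix already induced by some fixed stochastic policy.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.7, 0.2],
              [0.0, 0.3, 0.7]])
R = np.array([0.0, 1.0, 2.0])   # expected reward in each state
gamma = 0.9
d = np.array([1.0, 0.0, 0.0])   # start-state distribution

# Deterministic density propagation: d_{t+1} = d_t @ P.  The discounted
# value of the policy is sum_t gamma^t * <d_t, R>, truncated here at t < 200.
value = 0.0
for t in range(200):
    value += (gamma ** t) * (d @ R)
    d = d @ P
print(round(value, 4))
```

The point of the indirection is that anything able to approximate the densities d_t (not just exact matrix products, as here) can be used to estimate the policy's value.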
Model-Free Least-Squares Policy Iteration
In Advances in Neural Information Processing Systems, 2001
Cited by 25 (1 self)
Abstract: We propose a new approach to reinforcement learning which combines least-squares function approximation with policy iteration. Our method is model-free and completely off-policy. We are motivated by the least-squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems; however, it has heretofore not had a straightforward application to control problems.
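As a rough illustration of the LSTD machinery this abstract builds on (a sketch of the prediction algorithm, not the paper's control method), the following estimates value-function weights on a hypothetical 5-state chain with tabular features; the toy MDP and all names are assumptions:

```python
import numpy as np

gamma = 0.9

# Hypothetical 5-state chain: from state s we move to min(s+1, 4) and
# receive reward 1 only when the next state is state 4.
def step(s):
    s2 = min(s + 1, 4)
    r = 1.0 if s2 == 4 else 0.0
    return r, s2

def phi(s):                      # tabular features, for illustration only
    f = np.zeros(5)
    f[s] = 1.0
    return f

# LSTD: accumulate A = sum phi(s)(phi(s) - gamma*phi(s'))^T and
# b = sum phi(s)*r over sampled transitions, then solve A w = b.
A = np.zeros((5, 5))
b = np.zeros(5)
for episode in range(50):
    s = 0
    for _ in range(10):
        r, s2 = step(s)
        f = phi(s)
        A += np.outer(f, f - gamma * phi(s2))
        b += f * r
        s = s2
w = np.linalg.solve(A + 1e-6 * np.eye(5), b)   # tiny ridge term for safety
print(np.round(w, 3))
```

With tabular features the solved weights coincide with the true discounted values of the chain's states, which is the sample-efficiency property the abstract highlights.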
Least-squares methods in reinforcement learning for control
In SETN '02: Proceedings of the Second Hellenic Conference on AI, 2002
Cited by 25 (1 self)
Abstract: Least-squares methods have been successfully used for prediction problems in the context of reinforcement learning, but little has been done in extending these methods to control problems. This paper presents an overview of our research efforts in using least-squares techniques for control. In our early attempts, we considered a direct extension of the Least-Squares Temporal Difference (LSTD) algorithm in the spirit of Q-learning. Later, an effort to remedy some limitations of this algorithm (approximation bias, poor sample utilization) led to the Least-Squares Policy Iteration (LSPI) algorithm, which is a form of model-free approximate policy iteration and makes efficient use of training samples collected in any arbitrary manner. The algorithms are demonstrated on a variety of learning domains, including algorithm selection, inverted pendulum balancing, bicycle balancing and riding, multi-agent learning in factored domains, and, recently, on two-player zero-sum Markov games and the game of Tetris.
Self-Growth of Basic Behaviors in an Action Selection Based Agent
In From Animals to Animats 8: Eighth International Conference on the Simulation of Adaptive Behaviour (SAB'04), 2004
Cited by 2 (1 self)
Abstract: We investigate the design of agents that face multiple objectives simultaneously, which creates difficult situations even if each objective is of low complexity. The present paper builds on an existing action selection process based on basic behaviors (resulting in a modular architecture) and proposes an algorithm for automatically selecting and learning the required basic behaviors through an incremental reinforcement learning approach. This leads to a very autonomous architecture, as hand-coding is reduced to a minimum.
Policy invariance under reward transformations: Theory and application to reward shaping
In Proceedings of the Sixteenth International Conference on Machine Learning, 1999
Abstract: This paper investigates conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy. It is shown that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the difference in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known "bugs" in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-ba...
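The potential-based condition this abstract describes can be checked numerically: shaping rewards of the form F(s, s') = gamma*Phi(s') - Phi(s) telescope along any trajectory, so they change every return by the same state-dependent constant. The potential, trajectory, and rewards below are arbitrary inventions for illustration:

```python
import math

gamma = 0.9

def potential(s):            # arbitrary potential function Phi, illustrative only
    return float(s * s)

def shaping(s, s2):          # potential-based shaping: F(s, s') = gamma*Phi(s') - Phi(s)
    return gamma * potential(s2) - potential(s)

# Any trajectory of states and base rewards.
states = [0, 1, 3, 2, 4]
rewards = [0.5, -1.0, 2.0, 0.0]

plain = sum((gamma ** t) * r for t, r in enumerate(rewards))
shaped = sum((gamma ** t) * (r + shaping(states[t], states[t + 1]))
             for t, r in enumerate(rewards))

# Telescoping: the shaped return exceeds the plain return by exactly
# gamma^T * Phi(s_T) - Phi(s_0), independent of the actions taken,
# so the ranking of policies (hence the optimal policy) is unchanged.
T = len(rewards)
assert math.isclose(shaped - plain,
                    (gamma ** T) * potential(states[-1]) - potential(states[0]))
```

Because the difference depends only on the start and end states, any shaping reward of this form leaves the optimal policy intact, which is the invariance result the paper proves.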
Learning and Control in a Chaotic System
1999
Abstract: Most real-world problems can be solved at least partially by simple, non-adaptive means. When we want to use an adaptive learning technique like reinforcement learning for practical purposes, we need to be able to cooperate with one or several of these partial solutions. In this paper we consider the problem of making a reinforcement learning agent cooperate with a hand-crafted local controller and a global chaotic controller, and of designing a regime that improves upon these controllers.
1 Introduction: There exist many methods for finding hand-crafted solutions to problems, or methods that work locally. It is reasonable to believe that a general learning algorithm could benefit from those methods when trying to learn to solve the problem. We could think of local methods as a limited part of the state space that has a suggestion for a solution, while the agent is on its own in the rest of the state space. The situation could also be inverted: it could be required that the agent coopera...
Using Markov-k Memory for Problems with ...
School of Computer Science and Software Engineering, Monash University. In Machine Learning: Models, Technologies and Applications, 2003
Abstract: TRACA (Temporal Reinforcement learning and Classification Architecture) is a new learning system developed for robot-navigation problems. One difficulty in this area is dealing with problems which contain hidden state.