Results 1–10 of 10
Recursive Least-Squares Learning with Eligibility Traces
 European Workshop on Reinforcement Learning (EWRL 11)
, 2011
Abstract

Cited by 4 (2 self)
In the framework of Markov Decision Processes, we consider the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We describe a systematic approach for adapting on-policy least-squares learning algorithms from the literature (LSTD [5], LSPE [15], FPKF [7] and GPTD [8]/KTD [10]) to off-policy learning with eligibility traces. This leads to two known algorithms, LSTD(λ)/LSPE(λ) [21], and suggests new extensions of FPKF and GPTD/KTD. We describe their recursive implementation, discuss their convergence properties, and illustrate their behavior experimentally. Overall, our study suggests that the state-of-the-art LSTD(λ) [21] remains the best least-squares algorithm.
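The recursive off-policy LSTD(λ) scheme this abstract describes can be sketched roughly as follows. This is a generic illustration, not the paper's exact algorithm: the importance weight `rho`, the feature map `phi`, the regularization constant `eps`, and the Sherman-Morrison maintenance of the inverse matrix are standard textbook ingredients whose details (e.g. initialization) may differ from the published recursion.

```python
import numpy as np

def recursive_lstd_lambda(trajectory, phi, gamma=0.99, lam=0.9, eps=1e-3):
    """Sketch of recursive off-policy LSTD(lambda).

    trajectory: list of (s, s_next, reward, rho) tuples, where rho is the
    importance weight pi(a|s) / mu(a|s) of target vs. behavior policy.
    phi: feature map s -> np.ndarray of shape (d,).
    """
    d = len(phi(trajectory[0][0]))
    C = np.eye(d) / eps          # running inverse of the regularized A matrix
    b = np.zeros(d)
    z = np.zeros(d)              # eligibility trace
    for s, s_next, r, rho in trajectory:
        f, f_next = phi(s), phi(s_next)
        z = rho * (gamma * lam * z + f)          # off-policy trace update
        u = f - gamma * f_next
        # Sherman-Morrison rank-one update of C = A^{-1} after A += z u^T
        Cz = C @ z
        C -= np.outer(Cz, u @ C) / (1.0 + u @ Cz)
        b += r * z
    return C @ b                 # theta such that V(s) ~ phi(s) @ theta
```

Maintaining the inverse directly is what makes the method recursive: each transition costs O(d^2) instead of re-solving a d-by-d linear system.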
Monte-Carlo Swarm Policy Search
 In Symposium on Swarm Intelligence and Differential Evolution, Lecture Notes in Artificial Intelligence (LNAI)
, 2012
Abstract

Cited by 2 (1 self)
Finding optimal controllers of stochastic systems is a particularly challenging problem tackled by the optimal control and reinforcement learning communities. A classic paradigm for handling such problems is provided by Markov Decision Processes. However, the resulting underlying optimization problem is difficult to solve. In this paper, we explore the possible use of Particle Swarm Optimization to learn optimal controllers and show through some non-trivial experiments that it is a particularly promising approach.
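The idea of Particle Swarm Optimization for policy search can be illustrated with a generic, gradient-free sketch. Here `rollout_return` is a hypothetical Monte-Carlo estimator of a policy's expected return (not from the paper), and the inertia and attraction constants are common textbook defaults rather than the authors' settings.

```python
import numpy as np

def pso_policy_search(rollout_return, dim, n_particles=20, iters=50,
                      w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO over controller parameters theta in R^dim.

    rollout_return(theta) -> estimated return of the policy with
    parameters theta; PSO maximizes it without any gradients.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))    # particle positions
    v = np.zeros_like(x)                          # particle velocities
    pbest = x.copy()                              # per-particle best positions
    pbest_val = np.array([rollout_return(p) for p in x])
    g = pbest[np.argmax(pbest_val)]               # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # inertia + attraction toward personal and global bests
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([rollout_return(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[np.argmax(pbest_val)]
    return g
```

Because each evaluation is just a rollout, the method is indifferent to the non-convexity and stochasticity of the underlying MDP objective, which is the appeal the abstract alludes to.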
Algorithms for Fast Gradient Temporal Difference Learning
Abstract

Cited by 1 (0 self)
Temporal difference learning is one of the oldest and most widely used techniques in reinforcement learning for estimating value functions. Many modifications and extensions of the classical TD methods have been proposed. Recent examples are TDC and GTD(2) [Sutton et al., 2009b], the first approaches that are as fast as classical TD and have proven convergence with linear function approximation in both on- and off-policy cases. This paper introduces these methods to newcomers to TD learning by presenting the important concepts of the new algorithms. Moreover, the methods are compared against each other and against alternative approaches, both theoretically and empirically. Finally, the experimental results call into question the practical relevance of the convergence guarantees that TDC and GTD(2) provide for off-policy prediction.
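A single TDC step, as defined in Sutton et al. [2009b], is a small modification of linear TD(0): a second weight vector `w` tracks the expected TD error in feature space, and its estimate corrects the main update. A minimal sketch (step-sizes and feature map are illustrative, not from the paper):

```python
import numpy as np

def tdc_update(theta, w, f, f_next, r, gamma=0.99, alpha=0.01, beta=0.05):
    """One TDC (TD with gradient correction) step.

    theta: value-function weights; w: auxiliary weights estimating the
    projection of the TD error onto the features; f, f_next: feature
    vectors phi(s), phi(s'). Returns updated (theta, w).
    """
    delta = r + gamma * f_next @ theta - f @ theta            # TD error
    theta = theta + alpha * (delta * f - gamma * (w @ f) * f_next)
    w = w + beta * (delta - w @ f) * f                        # track correction
    return theta, w
```

The extra term `- gamma * (w @ f) * f_next` is what turns plain TD into an approximate stochastic gradient method on the projected Bellman error, giving the off-policy convergence guarantee the abstract mentions.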
Behavior Specific User Simulation in Spoken Dialogue Systems
Abstract

Cited by 1 (1 self)
Spoken dialogue systems provide an opportunity for man-machine interaction using spoken language as the medium of interaction. In recent years, reinforcement-learning-based dialogue policy optimization has evolved to become the state of the art. In order to cope with the data requirements of policy optimization, and also to evaluate dialogue policies, user simulators are introduced. Almost all existing data-driven methods for user modelling aim at simulating some generic user behavior from a reference dialogue corpus. However, this corpus consists of dialogues from multiple users and thus exhibits different user behaviors. In this paper, we explore the possibility of identifying and simulating the different user behaviors observed in the corpus. For this purpose, an inverse-reinforcement-learning-based user simulation method is employed. Using experimental results, we validate the effectiveness of the proposed method for building multiple behavior-specific user simulators.
Recursive Least-Squares Off-policy Learning with Eligibility Traces
Abstract

Cited by 1 (1 self)
In the framework of Markov Decision Processes, we consider the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We review on-policy least-squares learning algorithms from the literature …
Distributed Policy Evaluation Under Multiple Behavior Strategies
, 2014
Cooperative Off-policy Prediction of Markov Decision Processes in Adaptive Networks
 in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP)
, 2013
Cited by 1 (1 self)
Cooperative Off-policy Prediction of Markov Decision Processes in Adaptive Networks
"... We apply diffusion strategies to propose a cooperative reinforcement learning algorithm, in which agents in a network communicate with their neighbors to improve predictions about their environment. The algorithm is suitable to learn offpolicy even in large state spaces. We provide a meansquareer ..."
Abstract
We apply diffusion strategies to propose a cooperative reinforcement learning algorithm in which agents in a network communicate with their neighbors to improve predictions about their environment. The algorithm is suitable for off-policy learning even in large state spaces. We provide a mean-square-error performance analysis under constant step-sizes. The gain of cooperation, in the form of more stability and less bias and variance in the prediction error, is illustrated in the context of a classical model. We show that the improvement in performance is especially significant when the behavior policy of the agents is different from the target policy under evaluation. Index Terms — adaptive networks, dynamic programming, diffusion strategies, gradient temporal difference, mean-square error, reinforcement learning.
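The diffusion strategies mentioned here typically follow an adapt-then-combine pattern: each agent first takes a local stochastic step, then averages the intermediate estimates of its neighbors through a combination matrix. A generic sketch, with an abstract `local_grads` callback standing in for the paper's gradient-TD update (all names here are illustrative, not the authors'):

```python
import numpy as np

def diffusion_td_step(thetas, A, local_grads, mu=0.01):
    """One adapt-then-combine diffusion step over a network of agents.

    thetas: (n_agents, d) current estimates; A: (n_agents, n_agents)
    left-stochastic combination matrix (A[l, k] weights neighbor l at
    agent k); local_grads(k, theta) -> stochastic update direction at k.
    """
    n, d = thetas.shape
    # Adapt: each agent takes a local stochastic step (e.g. a GTD update)
    psi = np.stack([thetas[k] + mu * local_grads(k, thetas[k]) for k in range(n)])
    # Combine: each agent averages its neighbors' intermediate estimates
    return A.T @ psi
```

The combine step is what pools the agents' differently distributed data: even when each agent's behavior policy explores only part of the state space, the network as a whole can converge toward a common estimate, which is the cooperation gain the abstract reports.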
Monte-Carlo Swarm Policy Search
 Author manuscript, published in "Symposium on Swarm Intelligence and Differential Evolution, Zakopane: Poland (2012)". DOI: 10.1007/978-3-642-29353-5_9
, 2012
Abstract
Finding optimal controllers of stochastic systems is a particularly challenging problem tackled by the optimal control and reinforcement learning communities. A classic paradigm for handling such problems is provided by Markov Decision Processes. However, the resulting underlying optimization problem is difficult to solve. In this paper, we explore the possible use of Particle Swarm Optimization to learn optimal controllers and show through some non-trivial experiments that it is a particularly promising approach.
A Non-Parametric Approach to Approximate Dynamic Programming
Abstract
Approximate Dynamic Programming (ADP) is a machine learning method aiming at learning an optimal control policy for a dynamic and stochastic system from a logged set of observed interactions between the system and one or several non-optimal controllers. It defines a class of particular Reinforcement Learning (RL) algorithms, RL being a general paradigm for learning such a control policy from interactions. ADP addresses the problem of systems exhibiting a state space that is too large to be enumerated in the memory of a computer. Because of this, approximation schemes are used to generalize estimates over continuous state spaces. Nevertheless, RL still suffers from a lack of scalability to multi-dimensional continuous state spaces. In this paper, we propose the use of the Locally Weighted Projection Regression (LWPR) method to handle this scalability problem. We demonstrate the efficacy of our approach on two standard benchmarks modified to exhibit larger state spaces.
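The ADP pattern this abstract describes, learning from logged transitions with a non-parametric regressor, can be sketched as fitted Q-iteration. The paper uses LWPR; purely for illustration, a tiny k-nearest-neighbour regressor stands in for it below, and all names (`KNNRegressor`, `fitted_q_iteration`, the sample format) are hypothetical, not the authors' API.

```python
import numpy as np

class KNNRegressor:
    """Tiny 1-D k-nearest-neighbour regressor standing in for LWPR."""
    def __init__(self, k=5):
        self.k = k
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y, float)
        return self
    def predict(self, X):
        # distance from each query point to every stored sample
        d = np.abs(np.asarray(X, float)[:, None] - self.X[None, :])
        idx = np.argsort(d, axis=1)[:, :self.k]
        return self.y[idx].mean(axis=1)

def fitted_q_iteration(samples, actions, gamma=0.95, iters=50, k=5):
    """Non-parametric fitted Q-iteration on logged transitions.

    samples: list of (s, a, r, s_next); one regressor per discrete action.
    """
    data = {a: [(s, r, sn) for (s, a2, r, sn) in samples if a2 == a]
            for a in actions}
    models = {}
    for _ in range(iters):
        new_models = {}
        for a in actions:
            S = np.array([s for s, _, _ in data[a]])
            R = np.array([r for _, r, _ in data[a]])
            SN = np.array([sn for _, _, sn in data[a]])
            if models:  # bootstrap from the previous iteration's Q-estimate
                target = R + gamma * np.max(
                    [models[a2].predict(SN) for a2 in actions], axis=0)
            else:       # first pass: regress on immediate rewards only
                target = R
            new_models[a] = KNNRegressor(k).fit(S, target)
        models = new_models
    return models
```

The regressor is the only component that touches the continuous state space, so swapping in a more scalable non-parametric method such as LWPR is exactly the substitution the paper proposes.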