Results 1–10 of 51
Reinforcement learning with Gaussian processes
 In Proc. of the 22nd International Conference on Machine Learning
, 2005
"... Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). ..."
Abstract

Cited by 134 (11 self)
Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding online algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA-based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world model.
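As a rough illustration of the generative model this abstract describes, the following numpy sketch computes a GPTD-style posterior mean of V over a single trajectory, assuming deterministic transitions, scalar states and an RBF kernel (the function names and kernel choice are illustrative, not taken from the paper):

```python
import numpy as np

def rbf_kernel(a, b, length=1.0):
    """Squared-exponential kernel on scalar states (illustrative choice)."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * length ** 2))

def gptd_posterior_mean(states, rewards, gamma=0.9, noise=0.1):
    """Posterior mean of V at the visited states.

    Generative model (deterministic transitions):
        r_t = V(x_t) - gamma * V(x_{t+1}) + eps,
    with a zero-mean GP prior on V. `states` has one more entry than `rewards`.
    """
    T = len(rewards)
    K = rbf_kernel(states, states)            # prior covariance over T+1 states
    H = np.zeros((T, T + 1))                  # temporal-difference operator
    for t in range(T):
        H[t, t], H[t, t + 1] = 1.0, -gamma
    G = H @ K @ H.T + noise**2 * np.eye(T)    # marginal covariance of the rewards
    return K @ H.T @ np.linalg.solve(G, rewards)
```

A call such as `gptd_posterior_mean(np.array([0.0, 1.0, 2.0, 3.0]), np.array([1.0, 1.0, 1.0]))` returns the posterior value estimates at the four visited states; the same linear-Gaussian structure also yields the posterior covariance, which the paper exploits.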
Sparse temporal difference learning using lasso
 In IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning
, 2007
"... All intext references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Abstract

Cited by 35 (1 self)
 Add to MetaCart
(Show Context)
All intext references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.
Bayesian Actor-Critic Algorithms
"... We 1 present a new actorcritic learning model in which a Bayesian class of nonparametric critics, using Gaussian process temporal difference learning is used. Such critics model the stateaction value function as a Gaussian process, allowing Bayes ’ rule to be used in computing the posterior distr ..."
Abstract

Cited by 26 (3 self)
We present a new actor-critic learning model in which a Bayesian class of non-parametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the state-action value function as a Gaussian process, allowing Bayes' rule to be used in computing the posterior distribution over state-action value functions, conditioned on the observed data. Appropriate choices of the prior covariance (kernel) between state-action values and of the parametrization of the policy allow us to obtain closed-form expressions for the posterior distribution of the gradient of the average discounted return with respect to the policy parameters. The posterior mean, which serves as our estimate of the policy gradient, is used to update the policy, while the posterior covariance allows us to gauge the reliability of the update.
Kalman Temporal Differences
 Journal of Artificial Intelligence Research (JAIR)
, 2010
"... Because reinforcement learning suffers from a lack of scalability, online value (and Q) function approximation has received increasing interest this last decade. This contribution introduces a novel approximation scheme, namely the Kalman Temporal Differences (KTD) framework, that exhibits the foll ..."
Abstract

Cited by 25 (18 self)
Because reinforcement learning suffers from a lack of scalability, online value (and Q-) function approximation has received increasing interest this last decade. This contribution introduces a novel approximation scheme, namely the Kalman Temporal Differences (KTD) framework, that exhibits the following features: sample efficiency, nonlinear approximation, non-stationarity handling and uncertainty management. A first KTD-based algorithm is provided for deterministic Markov Decision Processes (MDPs), which produces biased estimates in the case of stochastic transitions. Then the eXtended KTD framework (XKTD), which handles stochastic MDPs, is described. Convergence is analyzed for special cases, for both deterministic and stochastic transitions. Related algorithms are evaluated on classical benchmarks. They compare favorably to the state of the art while exhibiting the announced features.
Bayesian policy gradient algorithms
 Advances in Neural Information Processing Systems 19
, 2007
"... Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use MonteCarlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large numbe ..."
Abstract

Cited by 21 (3 self)
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte-Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost.
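For context, this is the conventional Monte-Carlo estimator the abstract refers to, in a minimal REINFORCE-style sketch (the trajectory format and the `grad_log_pi` callback are assumptions made for illustration, not the paper's notation):

```python
import numpy as np

def mc_policy_gradient(trajectories, grad_log_pi):
    """Conventional Monte-Carlo policy gradient estimate (REINFORCE-style).

    Each trajectory is a tuple (states, actions, discounted_return);
    grad_log_pi(s, a) returns the score function, i.e. the gradient of
    log pi(a|s) with respect to the policy parameters. Averaging over a
    small number of trajectories yields a high-variance estimate, which
    is the problem the Bayesian framework in this paper addresses.
    """
    grads = []
    for states, actions, ret in trajectories:
        score = sum(grad_log_pi(s, a) for s, a in zip(states, actions))
        grads.append(ret * score)
    return np.mean(grads, axis=0)
```

The Bayesian alternative replaces this sample average with the posterior mean of a Gaussian process over the gradient, and the posterior covariance quantifies how trustworthy the estimate is.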
Kalman Temporal Differences: the deterministic case
 In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009)
, 2009
"... Abstract — This paper deals with value function and Qfunction approximation in deterministic Markovian decision processes. A general statistical framework based on the Kalman filtering paradigm is introduced. Its principle is to adopt a parametric representation of the value function, to model the ..."
Abstract

Cited by 18 (13 self)
This paper deals with value function and Q-function approximation in deterministic Markovian decision processes. A general statistical framework based on the Kalman filtering paradigm is introduced. Its principle is to adopt a parametric representation of the value function, to model the associated parameter vector as a random variable and to minimize the mean-squared error of the parameters conditioned on past observed transitions. From this general framework, which will be called Kalman Temporal Differences (KTD), and using an approximation scheme called the unscented transform, a family of algorithms is derived, namely KTD-V, KTD-SARSA and KTD-Q, which aim respectively at estimating the value function of a given policy, the Q-function of a given policy and the optimal Q-function. The proposed approach holds for linear and nonlinear parameterization. This framework is discussed and potential advantages and shortcomings are highlighted.
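To give a feel for the approach, here is a minimal sketch of a single KTD-V update with a linear parameterization, where the unscented transform reduces to an exact linear Kalman update; the feature map, noise levels and function names are illustrative assumptions, not the paper's algorithm verbatim:

```python
import numpy as np

def ktd_v_step(theta, P, phi_t, phi_next, r, gamma=0.9,
               proc_noise=1e-3, obs_noise=1e-2):
    """One Kalman Temporal Differences (KTD-V) update for a linear value
    function V(x) = phi(x) @ theta, in a deterministic MDP.
    """
    d = len(theta)
    # Prediction step: random-walk evolution model for the parameters.
    P = P + proc_noise * np.eye(d)
    # Observation model: r = (phi_t - gamma * phi_next) @ theta + noise.
    h = phi_t - gamma * phi_next
    innov = r - h @ theta                 # temporal-difference innovation
    s = h @ P @ h + obs_noise             # innovation variance (scalar)
    k = P @ h / s                         # Kalman gain
    theta = theta + k * innov             # correction step
    P = P - np.outer(k, h @ P)            # posterior parameter covariance
    return theta, P
```

The covariance matrix `P` is what provides the uncertainty management the KTD papers emphasize; with a nonlinear parameterization, the unscented transform replaces the exact linear propagation of `h` through the observation model.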
Bayesian Multi-Task Reinforcement Learning
"... We consider the problem of multitask reinforcement learning where the learner is provided with a set of tasks, for which only a smallnumber ofsamplescanbe generatedfor any given policy. As the number of samples may not be enough to learn an accurate evaluation of the policy, it would be necessary t ..."
Abstract

Cited by 17 (1 self)
We consider the problem of multi-task reinforcement learning where the learner is provided with a set of tasks, for which only a small number of samples can be generated for any given policy. As the number of samples may not be enough to learn an accurate evaluation of the policy, it would be necessary to identify classes of tasks with similar structure and to learn them jointly. We consider the case where the tasks share structure in their value functions, and model this by assuming that the value functions are all sampled from a common prior. We adopt the Gaussian process temporal difference value function model and use a hierarchical Bayesian approach to model the distribution over the value functions. We study two cases, where all the value functions belong to the same class and where they belong to an undefined number of classes. For each case, we present a hierarchical Bayesian model, and derive inference algorithms for (i) joint learning of the value functions, and (ii) efficient transfer of the information gained in (i) to assist learning the value function of a newly observed task.
POMDP-based statistical spoken dialogue systems: a review
 Proceedings of the IEEE
, 2013
"... Statistical dialogue systems are motivated by the need for a datadriven framework that reduces the cost of laboriously handcrafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Baye ..."
Abstract

Cited by 12 (5 self)
Statistical dialogue systems are motivated by the need for a data-driven framework that reduces the cost of laboriously handcrafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimising the policy via a reward-driven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, exact model representation and optimisation is computationally intractable. Hence, the practical application of POMDP-based systems requires efficient algorithms and carefully constructed approximations. This review article provides an overview of the current state of the art in the development of POMDP-based spoken dialogue systems.
Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers
"... In this paper we address the widelyexperienced difficulty in tuning Monte Carlo sampler based on simulating Hamiltonian dynamics. We develop an algorithm that allows for the adaptation of Hamiltonian and Riemann manifold Hamiltonian Monte Carlo samplers using Bayesian optimization that allows for in ..."
Abstract

Cited by 12 (4 self)
In this paper we address the widely experienced difficulty in tuning Monte Carlo samplers based on simulating Hamiltonian dynamics. We develop an algorithm for the adaptation of Hamiltonian and Riemann manifold Hamiltonian Monte Carlo samplers using Bayesian optimization, which allows for infinite adaptation of the parameters of these samplers. We show that the resulting samplers are ergodic, and that the use of our adaptive algorithms makes it easy to obtain more efficient samplers, in some cases precluding the need for more complex solutions. Hamiltonian-based Monte Carlo samplers are widely known to be an excellent choice of MCMC method, and we aim with this paper to remove a key obstacle towards the more widespread use of these samplers in practice.
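As a point of reference for what is being adapted, here is a plain, non-adaptive HMC transition with hand-fixed `step_size` and `n_leapfrog`, the two parameters this paper tunes via Bayesian optimization; the sketch is illustrative and not the paper's algorithm:

```python
import numpy as np

def hmc_step(x, logp, grad_logp, step_size=0.2, n_leapfrog=10, rng=None):
    """One Hamiltonian Monte Carlo transition targeting exp(logp(x)).

    step_size and n_leapfrog are fixed by hand here; mistuning them is
    exactly the difficulty the adaptive scheme in the paper removes.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = rng.standard_normal(x.shape)          # resample momentum
    x_new, p_new = x.copy(), p.copy()
    for _ in range(n_leapfrog):               # leapfrog integration
        p_new = p_new + 0.5 * step_size * grad_logp(x_new)
        x_new = x_new + step_size * p_new
        p_new = p_new + 0.5 * step_size * grad_logp(x_new)
    # Metropolis accept/reject on the Hamiltonian H = -log p(x) + |p|^2 / 2.
    h_old = -logp(x) + 0.5 * float(p @ p)
    h_new = -logp(x_new) + 0.5 * float(p_new @ p_new)
    if np.log(rng.uniform()) < h_old - h_new:
        return x_new
    return x
```

For a standard normal target one would pass `logp = lambda x: -0.5 * float(x @ x)` and `grad_logp = lambda x: -x`; the adaptive samplers in the paper instead treat the integrator settings as free parameters to be optimized while preserving ergodicity.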