Results 1-10 of 10
Finite-Sample Analysis of Least-Squares Policy Iteration
Journal of Machine Learning Research (JMLR), 2011
Abstract

Cited by 19 (6 self)
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
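The LSTD policy-evaluation step that this analysis bounds amounts to estimating a matrix A and vector b from sampled transitions and solving a linear system. A minimal sketch (function and variable names are ours; the small ridge term is our practical assumption to keep A invertible on short trajectories):

```python
import numpy as np

def lstd(transitions, phi, gamma, reg=1e-6):
    """Least-squares temporal-difference (LSTD) policy evaluation.

    Solves A @ theta = b with
      A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T
      b = sum_t r_t * phi(s_t)
    The ridge term `reg` is our addition, not part of plain LSTD.
    """
    d = len(phi(transitions[0][0]))
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

# Two-state chain: 0 -> 1 (reward 0), 1 -> 1 (reward 1); with gamma = 0.5
# the true values are V(0) = 1 and V(1) = 2.
phi = lambda s: np.eye(2)[s]  # tabular (one-hot) features
data = [(0, 0.0, 1), (1, 1.0, 1), (1, 1.0, 1)]
theta = lstd(data, phi, gamma=0.5)
```

With one-hot features the recovered parameters coincide with the state values, which makes the tiny example easy to check by hand.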
Regularized Off-Policy TD-Learning
, 2012
Abstract

Cited by 8 (5 self)
We present a novel ℓ1-regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm.
R.: Regularized least squares temporal difference learning with nested ℓ2 and ℓ1 penalization
 In: European Workshop on Reinforcement Learning (2011)
Abstract

Cited by 4 (0 self)
The construction of a suitable set of features to approximate value functions is a central problem in reinforcement learning (RL). A popular approach to this problem is to use high-dimensional feature spaces together with least-squares temporal difference learning (LSTD). Although this combination allows for very accurate approximations, it often exhibits poor prediction performance because of overfitting when the number of samples is small compared to the number of features in the approximation space. In the linear regression setting, regularization is commonly used to overcome this problem. In this paper, we review some regularized approaches to policy evaluation and we introduce a novel scheme (L21) which uses an ℓ2 regularization in the projection operator and an ℓ1 penalty in the fixed-point step. We show that such a formulation reduces to a standard Lasso problem. As a result, any off-the-shelf solver can be used to compute its solution and standardization techniques can be applied to the data. We report experimental results showing that L21 is effective in avoiding overfitting and that it compares favorably to existing ℓ1-regularized methods.
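Because the fixed-point step reduces to a standard Lasso, any generic solver applies. A minimal coordinate-descent Lasso sketch (our own illustrative code, not the authors' implementation) shows the soft-thresholding that performs the feature selection:

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    """Coordinate descent for (1 / 2n) * ||y - X @ w||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n          # per-coordinate curvature
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]     # residual excluding feature j
            rho = X[:, j] @ r / n
            # Soft-threshold: small correlations are zeroed out entirely.
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

# Orthonormal toy design: only the first feature survives the penalty,
# shrunk from 1.0 to 0.6; the weak second feature is set exactly to zero.
X = np.eye(4)
y = np.array([1.0, 0.2, 0.0, 0.0])
w = lasso_cd(X, y, lam=0.1)
```

The exact zeros produced by the thresholding step are what make the ℓ1 penalty act as feature selection rather than mere shrinkage.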
L1 regularized linear temporal difference learning
, 2012
Abstract

Cited by 2 (0 self)
Several recent efforts in the field of reinforcement learning have focused attention on the importance of regularization, but the techniques for incorporating regularization into reinforcement learning algorithms, and the effects of these changes upon the convergence of these algorithms, are ongoing areas of research. In particular, little has been written about the use of regularization in online reinforcement learning. In this paper, we describe a novel online stochastic approximation algorithm for reinforcement learning. We prove convergence of the online algorithm and show that the ℓ1-regularized linear fixed point of LARS-TD and LC-TD is an equilibrium fixed point of the algorithm.
D.: Learning from limited demonstrations
 In: Proc. of NIPS (2013)
Abstract

Cited by 2 (0 self)
We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if the demonstrations are very few or inaccurate. We achieve this by using both expert data and reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that the expert's suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task.
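The expert constraints in APID ask that the demonstrated action score at least as well as the alternatives by some margin. A hinge-loss relaxation of that idea can be sketched as follows (the function name, the margin value, and the exact penalty form are our illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def demo_hinge_penalty(Q, demos, margin=1.0):
    """Total hinge violation of expert constraints of the form
    Q[s, a*] >= max_{a != a*} Q[s, a] + margin, suitable for adding
    to a policy-evaluation objective as a soft constraint."""
    total = 0.0
    for s, a_star in demos:
        others = np.delete(Q[s], a_star)          # scores of non-expert actions
        total += max(0.0, others.max() + margin - Q[s, a_star])
    return total

Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])
ok = demo_hinge_penalty(Q, [(0, 0)])   # expert action already dominates: 0.0
bad = demo_hinge_penalty(Q, [(0, 1)])  # violated constraint: 2.0 + 1.0 - 0.0 = 3.0
```

A soft penalty of this kind lets noisy or suboptimal demonstrations be traded off against the reinforcement signal rather than enforced rigidly.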
Q-learning for history-based reinforcement learning
Abstract

Cited by 1 (1 self)
We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e., to POMDPs. We do this in a natural manner by adding ℓ0 regularisation to the pathwise squared Q-learning objective function and then optimising this over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class nested with classical Q-learning of the parameters. This algorithm fits perfectly into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criterion used so far for feature reinforcement learning has been model-based and aimed at predicting future states and rewards. Instead, we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting where features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly at learning the Q-function of the optimal policy and we use ℓ0 instead of ℓ1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvement in convergence speed as well as in economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward. We use less than half the CPU time and 36 times less memory. Overall, our algorithm hQL provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
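The inner loop run for a fixed history-to-state map is classical tabular Q-learning; a minimal sketch of that update (our code; the stochastic map search and the ℓ0 cost are not shown):

```python
import numpy as np

def q_learning(transitions, n_states, n_actions, alpha=0.5, gamma=0.9):
    """Tabular Q-learning:
    Q[s, a] += alpha * (r + gamma * max_a' Q[s', a'] - Q[s, a])."""
    Q = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Single self-looping state with reward 1 and gamma = 0.9:
# Q converges toward 1 / (1 - 0.9) = 10.
Q = q_learning([(0, 0, 1.0, 0)] * 200, n_states=1, n_actions=1)
```

In the history-based setting the state index `s` would be produced by the learned map from the observation history, with the same update applied unchanged.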
Statistical analysis of L1-penalized linear estimation with applications (University of Alberta)
Abstract
 Add to MetaCart
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

We study linear estimation based on perturbed data when performance is measured by a matrix norm of the expected residual error, in particular, the case in which there are many unknowns, but the "best" estimator is sparse, or has small ℓ1-norm. We propose a Lasso-like procedure that finds the minimizer of an ℓ1-penalized squared norm of the residual. For linear regression we show O 1
Finite-Sample Analysis of Proximal Gradient TD Algorithms
Abstract
 Add to MetaCart
In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. Two novel GTD algorithms are also proposed, namely projected GTD2 and GTD2-MP, which use proximal "mirror maps" to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms is comparable and may indeed be preferred over existing least-squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
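For reference, the GTD2 member of this family maintains a secondary weight vector alongside the value parameters. A sketch of the standard two-timescale updates (our code; the step sizes and toy problem are illustrative assumptions):

```python
import numpy as np

def gtd2(transitions, d, alpha=0.1, beta=0.5, gamma=0.5):
    """GTD2 updates for each sample (phi, r, phi_next):
      delta  = r + gamma * theta @ phi_next - theta @ phi    (TD error)
      w     += beta  * (delta - phi @ w) * phi               (auxiliary weights)
      theta += alpha * (phi - gamma * phi_next) * (phi @ w)  (value parameters)
    """
    theta, w = np.zeros(d), np.zeros(d)
    for phi, r, phi_next in transitions:
        delta = r + gamma * theta @ phi_next - theta @ phi
        w += beta * (delta - phi @ w) * phi
        theta += alpha * (phi - gamma * phi_next) * (phi @ w)
    return theta

# One-state Markov reward process with reward 1 and gamma = 0.5:
# the TD fixed point is theta = 1 / (1 - 0.5) = 2.
theta = gtd2([(np.ones(1), 1.0, np.ones(1))] * 2000, d=1)
```

Each update costs O(d), which is the linear complexity the abstract contrasts with least-squares TD methods.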
Acknowledgments
, 2014
Abstract
 Add to MetaCart
I hereby affirm that I have written this Master's thesis without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. This thesis has not been submitted in the same or a similar form to any examination authority.
ℓ1-penalized projected Bellman residual
 In: European Workshop on Reinforcement Learning (EWRL 11)
, 2011
Abstract
 Add to MetaCart
We consider the task of feature selection for value function approximation in reinforcement learning. A promising approach consists in combining the Least-Squares Temporal Difference (LSTD) algorithm with ℓ1-regularization, which has proven to be effective in the supervised learning community. This has been done recently with the LARS-TD algorithm, which replaces the projection operator of LSTD with an ℓ1-penalized projection and solves the corresponding fixed-point problem. However, this approach is not guaranteed to be correct in the general off-policy setting. We take a different route by adding an ℓ1-penalty term to the projected Bellman residual, which requires weaker assumptions while offering comparable performance. However, this comes at the cost of a higher computational complexity if only a part of the regularization path is computed. Nevertheless, our approach reduces to a supervised learning problem, which opens the way to easy extensions to other penalties.