Regularization and feature selection in least-squares temporal difference learning
2009
Cited by 79 (1 self)
We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the Least-Squares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can overfit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1-regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.
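As a rough illustration of what regularizing LSTD means, the sketch below solves the LSTD linear system with an optional ridge (l2) term. This is not the l1/LARS-style algorithm the abstract proposes; the l2 penalty is a simpler stand-in chosen for brevity, and the function name `lstd`, the parameter `l2_reg`, and the toy three-state chain are all assumptions made for this example.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma, l2_reg=0.0):
    """Estimate linear value-function weights from sampled transitions.

    phi, phi_next: (n_samples, n_features) features of s_t and s_{t+1}.
    rewards:       (n_samples,) observed rewards.
    l2_reg:        ridge coefficient (assumption: a stand-in for an l1 penalty).
    """
    n_features = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next)   # LSTD matrix
    b = phi.T @ rewards                    # LSTD vector
    return np.linalg.solve(A + l2_reg * np.eye(n_features), b)

# Hypothetical toy problem: deterministic cycle 0 -> 1 -> 2 -> 0,
# one-hot features, reward 1 when leaving state 2.
phi = np.eye(3)
phi_next = np.roll(np.eye(3), -1, axis=0)  # row i holds the features of state i+1
rewards = np.array([0.0, 0.0, 1.0])

w_plain = lstd(phi, phi_next, rewards, gamma=0.9)
w_ridge = lstd(phi, phi_next, rewards, gamma=0.9, l2_reg=1.0)
```

With one-hot features and no penalty, the weights coincide with the exact state values; the ridge term shrinks them toward zero, which is the bias one accepts in exchange for stability when features outnumber samples.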
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
Cited by 30 (5 self)
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and the Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, we describe some simulations that suggest that although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.
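To make the two approaches concrete, here is a minimal sketch that computes both solutions for a known model. It assumes the standard textbook forms: the TD(0) fixed point makes the Bellman residual orthogonal to the features, while BR minimization is a weighted least-squares fit. The two-state chain, the single feature, and the uniform weighting `d` are made-up values for illustration, not taken from the paper.

```python
import numpy as np

def td_fixed_point(Phi, P, R, gamma, d):
    """TD(0) fixed point: Bellman residual is d-orthogonal to the features."""
    D = np.diag(d)
    A = Phi.T @ D @ (Phi - gamma * P @ Phi)
    return np.linalg.solve(A, Phi.T @ D @ R)

def bellman_residual_min(Phi, P, R, gamma, d):
    """BR solution: weighted least squares on Psi = Phi - gamma * P @ Phi."""
    D = np.diag(d)
    Psi = Phi - gamma * P @ Phi
    return np.linalg.solve(Psi.T @ D @ Psi, Psi.T @ D @ R)

def bellman_residual_norm(w, Phi, P, R, gamma, d):
    """d-weighted norm of R + gamma * P @ Phi @ w - Phi @ w."""
    res = R + gamma * P @ Phi @ w - Phi @ w
    return float(np.sqrt(res @ (d * res)))

# Hypothetical two-state chain with one feature (assumed values).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
R = np.array([1.0, 0.0])
Phi = np.array([[1.0], [2.0]])
d = np.array([0.5, 0.5])   # assumed uniform state weighting
gamma = 0.95

w_td = td_fixed_point(Phi, P, R, gamma, d)
w_br = bellman_residual_min(Phi, P, R, gamma, d)
```

On this toy problem the two weights differ noticeably. By construction the BR solution never has a larger Bellman residual than the TD(0) solution, and the TD(0) system matrix `A` can become nearly singular for some feature choices, which is one way to read the numerical instability the abstract mentions.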
Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes
Cited by 28 (10 self)
Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause existing algorithms to overfit because of a limited number of samples. We address this shortcoming using L1 regularization in approximate linear programming. Because the proposed method can automatically select the appropriate richness of features, its performance does not degrade with an increasing number of features. These results rely on new and stronger sampling bounds for regularized approximate linear programs. We also propose a computationally efficient homotopy method. The empirical evaluation of the approach shows that the proposed method performs well on simple MDPs and standard benchmark problems.
Finite-Sample Analysis of Least-Squares Policy Iteration
Journal of Machine Learning Research (JMLR), 2011
Cited by 19 (5 self)
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
LSTD with random projections
In Advances in Neural Information Processing Systems, 2010
Cited by 17 (5 self)
We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.
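The idea of running LSTD in a randomly projected feature space can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact construction: the Gaussian projection with variance 1/d, the function name `lstd_random_projection`, and the synthetic data are all choices made for this example.

```python
import numpy as np

def lstd_random_projection(phi, phi_next, rewards, gamma, d, seed=0):
    """Project D-dimensional features to d dimensions, then run LSTD there.

    Assumption: projection entries are i.i.d. Gaussian with variance 1/d.
    Returns the projection matrix and the low-dimensional weights.
    """
    rng = np.random.default_rng(seed)
    D = phi.shape[1]
    proj = rng.normal(0.0, 1.0 / np.sqrt(d), size=(D, d))
    psi, psi_next = phi @ proj, phi_next @ proj     # low-dimensional features
    A = psi.T @ (psi - gamma * psi_next)
    b = psi.T @ rewards
    theta = np.linalg.lstsq(A, b, rcond=None)[0]    # tolerates a singular A
    return proj, theta

# Synthetic data with more features (200) than samples (50); values are made up.
rng = np.random.default_rng(1)
phi = rng.normal(size=(50, 200))
phi_next = rng.normal(size=(50, 200))
rewards = rng.normal(size=50)

proj, theta = lstd_random_projection(phi, phi_next, rewards, gamma=0.9, d=5)
values = phi @ proj @ theta   # value estimates at the sampled states
```

The linear system solved here is only d by d, so its cost no longer depends on the original feature dimension beyond the projection itself; the price is the extra approximation error introduced by discarding directions of the original space.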
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
In: Advances in Neural Information Processing Systems (NIPS), 2012
Cited by 12 (9 self)
Finite-sample analysis of Lasso-TD
In: Proceedings of the International Conference on Machine Learning, 2011
Cited by 10 (3 self)
In this paper, we analyze the performance of Lasso-TD, a modification of LSTD in which the projection operator is defined as a Lasso problem. We first show that Lasso-TD is guaranteed to have a unique fixed point and its algorithmic implementation coincides with the recently presented LARS-TD and LC-TD methods. We then derive two bounds on the prediction error of Lasso-TD in the Markov design setting, i.e., when the performance is evaluated on the same states used by the method. The first bound makes no assumption, but has a slow rate w.r.t. the number of samples. The second bound is under an assumption on the empirical Gram matrix, called the compatibility condition, but has an improved rate and directly relates the prediction error to the sparsity of the value function in the feature space at hand. For the full
Parametric Value Function Approximation: a Unified View
Cited by 10 (6 self)
Reinforcement learning (RL) is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important RL subtopic is to approximate this function when the system is too large for an exact representation. This survey reviews and unifies state-of-the-art methods for parametric value function approximation by grouping them into three main categories: bootstrapping, residuals and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach. Index Terms: reinforcement learning, value function approximation, survey.
Greedy algorithms for sparse reinforcement learning
In International Conference on Machine Learning, 2012
Cited by 8 (0 self)
Feature selection and regularization are becoming increasingly prominent tools in the efforts of the reinforcement learning (RL) community to expand the reach and applicability of RL. One approach to the problem of feature selection is to impose a sparsity-inducing form of regularization on the learning method. Recent work on L1 regularization has adapted techniques from the supervised learning literature for use with RL. Another approach that has received renewed attention in the supervised learning community is that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the L1 regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. This paper considers variants of orthogonal matching pursuit (OMP) applied to reinforcement learning. The resulting algorithms are analyzed and compared experimentally with existing L1 regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant, OMP-BRM, provides promising theoretical guarantees under certain assumptions on the feature dictionary. Another variant, OMP-TD, empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.
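A greedy, matching-pursuit-style selection loop for temporal difference learning might look like the sketch below. It is an interpretation of the general idea (pick the feature most correlated with the current TD residual, then refit LSTD on the selected set), not a reproduction of the paper's OMP-TD; the function name `omp_td`, the absence of feature normalization, and the toy chain are assumptions.

```python
import numpy as np

def omp_td(phi, phi_next, rewards, gamma, n_select):
    """Greedily select n_select features, refitting LSTD after each addition."""
    n_features = phi.shape[1]
    active = []
    w = np.zeros(n_features)
    for _ in range(n_select):
        # TD residual of the current weights on the sampled transitions.
        residual = rewards + gamma * phi_next @ w - phi @ w
        corr = np.abs(phi.T @ residual)
        if active:
            corr[active] = -np.inf       # never reselect a feature
        active.append(int(np.argmax(corr)))
        # LSTD restricted to the active features.
        pa, pna = phi[:, active], phi_next[:, active]
        A = pa.T @ (pa - gamma * pna)
        b = pa.T @ rewards
        w = np.zeros(n_features)
        w[active] = np.linalg.lstsq(A, b, rcond=None)[0]
    return w, active

# Hypothetical toy problem: cycle 0 -> 1 -> 2 -> 0, one-hot features,
# reward 1 when leaving state 2.
phi = np.eye(3)
phi_next = np.roll(np.eye(3), -1, axis=0)
rewards = np.array([0.0, 0.0, 1.0])

w, active = omp_td(phi, phi_next, rewards, gamma=0.9, n_select=3)
```

On this chain the loop picks the rewarding state's feature first and then works backwards along the cycle, since each refit pushes the remaining residual onto the predecessor state. In practice the features would usually be normalized before correlations are compared.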
Regularized Fitted Q-iteration: Application to Planning
Cited by 7 (1 self)
We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of controlling model-complexity. The algorithm is presented in detail for the case when the function space is a reproducing-kernel Hilbert space underlying a user-chosen kernel function. We derive bounds on the quality of the solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example is used to illustrate the benefits of using a penalized procedure.