Results 1–10 of 17
Finite-Sample Analysis of Least-Squares Policy Iteration
Journal of Machine Learning Research (JMLR), 2011
Abstract

Cited by 19 (6 self)
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
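As a rough illustration of the LSTD estimator this abstract analyzes, here is a minimal numpy sketch; the two-state chain MDP, the tabular features, and the sample count are hypothetical choices for demonstration, not from the paper:

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """Least-squares temporal-difference (LSTD) solution for a fixed policy.

    phi, phi_next: (T, d) feature matrices for states s_t and s_{t+1};
    rewards: (T,) observed rewards. Solves A w = b with
    A = sum_t phi_t (phi_t - gamma * phi_next_t)^T and b = sum_t phi_t r_t.
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Hypothetical two-state cycle with one-hot features: state 0 -> state 1 -> state 0,
# reward 1 on every transition, so the true value is 1 / (1 - gamma) in both states.
gamma = 0.9
phi      = np.array([[1.0, 0.0], [0.0, 1.0]] * 50)  # 100 alternating visits
phi_next = np.array([[0.0, 1.0], [1.0, 0.0]] * 50)
rewards  = np.ones(100)

w = lstd(phi, phi_next, rewards, gamma)
```

With tabular (one-hot) features there is no approximation error, so the LSTD weights coincide with the true values.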
Finite-Sample Analysis of Lasso-TD
In: Proceedings of the International Conference on Machine Learning, 2011
Abstract

Cited by 10 (3 self)
Abstract. In this paper, we analyze the performance of Lasso-TD, a modification of LSTD in which the projection operator is defined as a Lasso problem. We first show that Lasso-TD is guaranteed to have a unique fixed point and its algorithmic implementation coincides with the recently presented LARS-TD and LC-TD methods. We then derive two bounds on the prediction error of Lasso-TD in the Markov design setting, i.e., when the performance is evaluated on the same states used by the method. The first bound makes no assumption, but has a slow rate w.r.t. the number of samples. The second bound is under an assumption on the empirical Gram matrix, called the compatibility condition, but has an improved rate and directly relates the prediction error to the sparsity of the value function in the feature space at hand. For the full …
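A hedged sketch of the Lasso-TD fixed point the abstract describes, computed here with a simple iterative soft-thresholding scheme; the solver choice and the toy chain are illustrative assumptions (the paper's own implementations are LARS-TD and LC-TD, which differ):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_td(phi, phi_next, r, gamma, lam, iters=2000, lr=0.1):
    """Iterate toward the Lasso-TD fixed point: w solving the
    l1-penalized projected Bellman equation
      w = argmin_u ||phi u - (r + gamma * phi_next w)||^2 / T + lam ||u||_1.
    Each step takes a gradient step on the squared loss against the
    current bootstrapped targets, then soft-thresholds.
    """
    T, d = phi.shape
    w = np.zeros(d)
    for _ in range(iters):
        target = r + gamma * phi_next @ w        # bootstrapped TD targets
        grad = phi.T @ (phi @ w - target) / T    # gradient of the squared loss
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

# Hypothetical two-state cycle, one-hot features, reward 1 everywhere (true V = 10).
phi      = np.array([[1.0, 0.0], [0.0, 1.0]] * 50)
phi_next = np.array([[0.0, 1.0], [1.0, 0.0]] * 50)
r = np.ones(100)
w = lasso_td(phi, phi_next, r, gamma=0.9, lam=0.0)  # lam=0 reduces to plain LSTD
```

With a positive `lam` the same routine drives weights on irrelevant features toward exactly zero, which is the sparsity behavior the bounds in the paper quantify.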
Sketch-Based Linear Value Function Approximation
Abstract

Cited by 7 (2 self)
Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing.
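The tug-of-war sketch the abstract relies on can be illustrated in a few lines. The construction below (a bucket hash combined with random ±1 signs, AMS-style) is the standard inner-product estimator; the dimensions are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10_000, 256               # original feature dimension, sketch width

h = rng.integers(0, m, d)        # bucket hash h: [d] -> [m]
s = rng.choice([-1.0, 1.0], d)   # sign hash s: [d] -> {-1, +1}

def sketch(x):
    """Tug-of-war sketch of a d-dim vector into m buckets:
    t[h(i)] += s(i) * x[i], so colliding features pull in random directions
    and cancel in expectation rather than systematically adding up."""
    t = np.zeros(m)
    np.add.at(t, h, s * x)       # unbuffered accumulation handles collisions
    return t

x = rng.standard_normal(d)
y = rng.standard_normal(d)
est = sketch(x) @ sketch(y)      # estimate of x @ y, unbiased over the hash draw
```

The random signs are what distinguish this from plain hashing: collision terms have zero mean, so the inner-product estimate is unbiased over the choice of hash functions.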
Bellman Error Based Feature Generation Using Random Projections
Abstract

Cited by 5 (3 self)
The accuracy of parametrized policy evaluation depends on the quality of the features used for estimating the value function. Hence, feature generation/selection in reinforcement learning (RL) has received a lot of attention (Di Castro and Mannor, 2010). We focus on methods that aim to generate features in the direction of the Bellman error of the current value estimates (Bellman Error …
Incremental Basis Construction from Temporal Difference Error
2011
Abstract

Cited by 3 (0 self)
In many reinforcement learning (RL) systems, the value function is approximated as a linear combination of a fixed set of basis functions. Performance can be improved by adding to this set. Previous approaches construct a series of basis functions that in sufficient number can eventually represent the value function. In contrast, we show that there is a single, ideal basis function, which can directly represent the value function. Its addition to the set immediately reduces the error to zero—without changing existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP’s reward function with its Bellman error. This result suggests a novel method for improving value-function estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner’s new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to existing methods, especially in the case of large discount factors.
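The identity the abstract states — that the value function of the Bellman error, treated as a reward function, exactly corrects any current estimate — can be checked directly on a small tabular MDP. The 3-state chain, reward vector, and rough estimate below are hypothetical:

```python
import numpy as np

def value_of(rewards, P, gamma):
    """Exact value function for reward vector `rewards` under transition
    matrix P: solves (I - gamma * P) v = rewards."""
    n = len(rewards)
    return np.linalg.solve(np.eye(n) - gamma * P, rewards)

# Hypothetical deterministic 3-state cycle, used purely to illustrate the identity.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([1.0, 0.0, 2.0])

v_true = value_of(r, P, gamma)               # true value function
v_hat = np.array([5.0, -1.0, 0.5])           # some arbitrary current estimate

bellman_err = r + gamma * P @ v_hat - v_hat  # TD/Bellman error of the estimate
correction = value_of(bellman_err, P, gamma) # value fn of the error-as-reward
# v_hat + correction equals v_true exactly, for any v_hat.
```

Algebraically, `correction = (I - γP)⁻¹(r + γP v̂ - v̂) = v_true - v̂`, which is why adding this single basis function drives the error to zero without changing existing weights.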
Statistical linear estimation with penalized estimators: an application to reinforcement learning
LSTD on sparse spaces
In: NIPS Workshop on New Frontiers in Model Order Selection, 2011
Abstract

Cited by 1 (1 self)
Efficient model selection and value function approximation are tricky tasks in reinforcement learning (RL), when dealing with large feature spaces. Even in batch settings, when the number of observed trajectories is small and the feature set is high-dimensional, there is little hope that we can learn a good value function directly based on all the features. To get better convergence and handle the overfitting …
Finite-Sample Analysis of Lasso-TD
In: International Conference on Machine Learning, 2011 (author manuscript)
Abstract
In value function approximation in RL, however, the objective is not to recover a target function given its noisy observations, but is instead to approximate the fixed point of the Bellman operator given sample trajectories. This creates some difficulties in applying Lasso and ridge to this problem. Despite these difficulties, both ℓ1 and ℓ2 regularizations have been previously studied in value function approximation in RL. Farahmand et al. presented several such algorithms wherein ℓ2-regularization was added to LSTD and modified Bellman residual minimization (Farahmand et al., 2008), and to fitted Q-iteration (Farahmand et al., 2009), and finite-sample performance bounds for these algorithms were proved. There has also been …
Compressive Reinforcement Learning with Oblique Random Projections
2011
Abstract
Compressive sensing has been rapidly growing as a non-adaptive dimensionality reduction framework, wherein high-dimensional data is projected onto a randomly generated subspace. In this paper we explore a paradigm called compressive reinforcement learning, where approximately optimal policies are computed in a low-dimensional subspace generated from a high-dimensional feature space through random projections. We use the framework of oblique projections that unifies two popular methods to approximately solve MDPs – fixed point (FP) and Bellman residual (BR) methods – and derive error bounds on the quality of approximations obtained from combining random projections and oblique projections on a finite set of samples. We investigate the effectiveness of fixed point, Bellman residual, as well as hybrid least-squares methods in feature spaces generated by random projections. Finally, we present simulation results in various continuous MDPs, which show both gains in computation time and effectiveness in problems with large feature spaces and small sample sets.
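A minimal sketch of the random-projection step this paradigm builds on, assuming a Gaussian Johnson–Lindenstrauss-style projection; the dimensions and the downstream use are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 5_000, 50   # original and compressed feature dimensions

# Gaussian random projection, scaled so that norms and inner products
# are approximately preserved in the d-dimensional subspace.
proj = rng.standard_normal((d, D)) / np.sqrt(d)

def compress(phi):
    """Map a batch of high-dimensional features (T, D) down to (T, d)."""
    return phi @ proj.T

phi = rng.standard_normal((3, D))
low = compress(phi)
# Any linear method (LSTD, BR minimization, ...) can now be run on `low`
# at cost depending on d rather than D.
```

The point of the compressive-RL framework is that the policy-evaluation error introduced by this projection can be bounded, so the d-dimensional computation trades a controlled amount of accuracy for a large drop in cost.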
Statistical analysis of L1-penalized linear estimation with applications
University of Alberta
Abstract
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author’s prior written permission.

We study linear estimation based on perturbed data when performance is measured by a matrix norm of the expected residual error, in particular, the case in which there are many unknowns, but the “best” estimator is sparse, or has small ℓ1-norm. We propose a Lasso-like procedure that finds the minimizer of an ℓ1-penalized squared norm of the residual. For linear regression we show …
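A minimal sketch of a Lasso-like procedure of the kind the abstract proposes, solved here by plain iterative soft-thresholding (ISTA) on a synthetic sparse regression problem; the solver, data, and penalty level are assumptions for illustration, not the thesis's actual estimator or setting:

```python
import numpy as np

def lasso_ista(X, y, lam, iters=5000, lr=None):
    """Minimize ||X w - y||^2 / (2 n) + lam * ||w||_1 by ISTA:
    gradient step on the smooth part, then soft-threshold."""
    n, d = X.shape
    if lr is None:
        lr = n / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
    w = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (X @ w - y) / n           # squared-loss gradient
        z = w - lr * g
        w = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)
    return w

# Sparse ground truth: only 2 of 20 coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[[0, 5]] = [3.0, -2.0]
y = X @ w_true
w_hat = lasso_ista(X, y, lam=0.01)
```

On noiseless data with a small penalty, the recovered coefficients match the sparse ground truth up to the usual small ℓ1 shrinkage bias.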