Results 1-10 of 29
Linear Complementarity for Regularized Policy Evaluation and Improvement
2010
Cited by 23 (4 self)
Abstract:
Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-the-shelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization.
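For context on the object the abstract invokes (this is the standard LCP definition, not the paper's specific construction, which should be checked against the original): a linear complementarity problem with data M and q asks for

```latex
\text{find } z \in \mathbb{R}^n \quad \text{such that} \quad
z \ge 0, \qquad Mz + q \ge 0, \qquad z^{\top}(Mz + q) = 0 .
```

The paper's contribution, per the abstract, is to recast the optimality conditions of the L1 regularized linear fixed point in this form, which is what makes generic LCP solvers and warm starting applicable.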
LSTD with random projections
In Advances in Neural Information Processing Systems, 2010
Cited by 17 (5 self)
Abstract:
We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.
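The algorithm the abstract describes is mechanically simple: project the high-dimensional features with a random Gaussian matrix, then run ordinary LSTD in the projected space. A minimal sketch, with entirely synthetic features, rewards, and dimensions (the paper's sampling scheme and constants may differ):

```python
import random

random.seed(0)
D, d, gamma = 50, 2, 0.9  # high dim, projected dim, discount (all made up)

# Random projection matrix G (d x D) with N(0, 1/d) entries.
G = [[random.gauss(0.0, 1.0 / d ** 0.5) for _ in range(D)] for _ in range(d)]

def project(phi):
    # psi(s) = G phi(s): the low-dimensional feature vector.
    return [sum(G[i][j] * phi[j] for j in range(D)) for i in range(d)]

def random_phi():
    return [random.gauss(0.0, 1.0) for _ in range(D)]

# Synthetic transition samples (phi(s), r, phi(s')).
samples = [(random_phi(), random.random(), random_phi()) for _ in range(200)]

# LSTD in the projected space: solve A w = b with
# A = sum psi(s)(psi(s) - gamma psi(s'))^T,  b = sum psi(s) r.
A = [[0.0] * d for _ in range(d)]
b = [0.0] * d
for phi, r, phi_next in samples:
    psi, psi_next = project(phi), project(phi_next)
    for i in range(d):
        b[i] += psi[i] * r
        for j in range(d):
            A[i][j] += psi[i] * (psi[j] - gamma * psi_next[j])

# Solve the 2x2 system by Cramer's rule (d = 2 keeps the sketch tiny).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w = [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]
print(w)  # learned weights in the d-dimensional projected space
```

The point of the paper is the analysis, not this mechanism: the bounds quantify what projecting from D down to d costs in approximation error.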
Greedy algorithms for sparse reinforcement learning
In International Conference on Machine Learning, 2012
Cited by 9 (1 self)
Abstract:
Feature selection and regularization are becoming increasingly prominent tools in the efforts of the reinforcement learning (RL) community to expand the reach and applicability of RL. One approach to the problem of feature selection is to impose a sparsity-inducing form of regularization on the learning method. Recent work on L1 regularization has adapted techniques from the supervised learning literature for use with RL. Another approach that has received renewed attention in the supervised learning community is that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the L1 regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. This paper considers variants of orthogonal matching pursuit (OMP) applied to reinforcement learning. The resulting algorithms are analyzed and compared experimentally with existing L1 regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant, OMP-BRM, provides promising theoretical guarantees under certain assumptions on the feature dictionary. Another variant, OMP-TD, empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.
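The greedy scheme underlying these variants is classical OMP: repeatedly add the feature most correlated with the current residual, then refit by least squares on the active set. A minimal supervised sketch on synthetic data (OMP-BRM/OMP-TD substitute a Bellman or TD residual for the regression residual; this is not the paper's algorithm):

```python
import random

random.seed(1)
n, D = 100, 20
true_support = {3, 7}  # the sparse target uses only features 3 and 7

X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(n)]
y = [2.0 * row[3] - 1.5 * row[7] for row in X]

def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        x[c] = (M[c][m] - sum(M[c][k] * x[k] for k in range(c + 1, m))) / M[c][c]
    return x

active, residual = [], y[:]
for _ in range(len(true_support)):
    # Greedy step: pick the inactive feature most correlated with the residual.
    corr = lambda j: abs(sum(X[i][j] * residual[i] for i in range(n)))
    j_best = max((j for j in range(D) if j not in active), key=corr)
    active.append(j_best)
    # "Orthogonal" step: refit by least squares on the active columns.
    Gram = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in active] for a in active]
    t = [sum(X[i][a] * y[i] for i in range(n)) for a in active]
    w = solve(Gram, t)
    residual = [y[i] - sum(w[k] * X[i][active[k]] for k in range(len(active)))
                for i in range(n)]

print(sorted(active))  # expected to recover the true support {3, 7}
```

On this clean synthetic problem greedy selection recovers the support; the paper's contribution is characterizing when the analogous recovery does or does not hold for value functions.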
Regularized Off-Policy TD-Learning
2012
Cited by 8 (5 self)
Abstract:
We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of nonsmooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical analysis of RO-TD is presented, and a variety of experiments illustrate its off-policy convergence, sparse feature selection capability, and low computational cost.
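The saddle-point device mentioned here is a standard identity worth spelling out (the paper's exact operator splitting should be checked against the original): the nonsmooth l1 penalty can be written as a maximum over a bounded dual variable,

```latex
\rho \, \lVert x \rVert_1 \;=\; \max_{\lVert y \rVert_\infty \le \rho} \; y^{\top} x ,
```

which turns an l1 regularized objective into a convex-concave bilinear saddle-point problem, exactly the form that first-order primal-dual solvers handle.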
Generalized Value Functions for Large Action Sets
Cited by 8 (2 self)
Abstract:
The majority of value function approximation based reinforcement learning algorithms available today focus on approximating the state (V) or state-action (Q) value function, and efficient action selection comes as an afterthought. On the other hand, real-world problems tend to have large action spaces, where evaluating every possible action becomes impractical. This mismatch presents a major obstacle in successfully applying reinforcement learning to real-world problems. In this paper we present a unified view of V and Q functions and arrive at a new space-efficient representation, where action selection can be done exponentially faster, without the use of a model. We then describe how to calculate this new value function efficiently via approximate linear programming and provide experimental results that demonstrate the effectiveness of the proposed approach.
Multi-Task Reinforcement Learning: Shaping and Feature Selection
Cited by 7 (2 self)
Abstract:
Shaping functions can be used in multi-task reinforcement learning (RL) to incorporate knowledge from previously experienced source tasks to speed up learning on a new target task. Earlier work has not clearly motivated choices for the shaping function. This paper discusses and empirically compares several alternatives, and demonstrates that the most intuitive one may not always be the best option. In addition, we extend previous work on identifying good representations for the value and shaping functions, and show that selecting the right representation results in improved generalization over tasks.
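For readers unfamiliar with the mechanism: the standard form such shaping functions take is potential-based shaping, where gamma * Phi(s') - Phi(s) is added to each reward; in a transfer setting a natural (hypothetical here) choice of potential Phi is a value function learned on a source task. A minimal sketch with made-up numbers:

```python
gamma = 0.95

def shaped_reward(r, s, s_next, potential):
    # Potential-based shaping (Ng, Harada & Russell, 1999): adding
    # gamma * Phi(s') - Phi(s) to the reward leaves optimal policies unchanged.
    return r + gamma * potential(s_next) - potential(s)

# Toy potential on integer states, e.g. a source-task value function.
v_source = {0: 0.0, 1: 1.0}
print(shaped_reward(1.0, 0, 1, v_source.get))  # 1.0 + 0.95 * 1.0 - 0.0 = 1.95
```

The paper's question is precisely which potential to plug in, and it finds that the most intuitive choice is not always best.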
Robust Approximate Bilinear Programming for Value Function Approximation
Cited by 4 (1 self)
Abstract:
Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems.
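The usual approximate scheme for bilinear programs, which the paper's algorithm generalizes, is alternating best responses: with one variable block fixed the objective is linear, so each subproblem is easy. A toy sketch where both blocks range over a probability simplex (so each best response is a vertex); the matrix is made up and this is not the paper's algorithm:

```python
# Minimize x^T C y over x, y in the probability simplex by alternation.
C = [[3.0, 1.0],
     [2.0, 4.0]]

def best_x(y):
    # With y fixed the objective is linear in x, so a simplex vertex is optimal.
    scores = [sum(C[i][j] * y[j] for j in range(2)) for i in range(2)]
    i = min(range(2), key=lambda k: scores[k])
    return [1.0 if k == i else 0.0 for k in range(2)]

def best_y(x):
    scores = [sum(C[i][j] * x[i] for i in range(2)) for j in range(2)]
    j = min(range(2), key=lambda k: scores[k])
    return [1.0 if k == j else 0.0 for k in range(2)]

x, y = [0.5, 0.5], [0.5, 0.5]
for _ in range(10):
    x = best_x(y)
    y = best_y(x)
value = sum(C[i][j] * x[i] * y[j] for i in range(2) for j in range(2))
print(value)  # 1.0 here, the smallest entry of C
```

Alternation like this only guarantees a local solution in general, which matches the abstract's point that optimal solution is NP-hard; the paper's analysis is about making such an approximate scheme a convergent generalization of approximate policy iteration.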
Characterizing Reinforcement Learning Methods through Parameterized Learning Problems
2011
Cited by 4 (0 self)
Abstract:
The field of reinforcement learning (RL) has been energized in the past few decades by elegant theoretical results indicating under what conditions, and how quickly, certain algorithms are guaranteed to converge to optimal policies. However, in practical problems, these conditions are seldom met. When we cannot achieve optimality, the performance of RL algorithms must be measured empirically. Consequently, in order to meaningfully differentiate learning methods, it becomes necessary to characterize their performance on different problems, taking into account factors such as state estimation, exploration, function approximation, and constraints on computation and memory. To this end, we propose parameterized learning problems, in which such factors can be controlled systematically and their effects on learning methods characterized through targeted studies. Apart from providing very precise control of the parameters that affect learning, our parameterized learning problems enable benchmarking against optimal behavior; their relatively small sizes facilitate extensive experimentation. Based on a survey of existing RL applications, in this article, we focus our attention on two predominant, “first-order” factors: partial observability and function approximation. We design
Nonparametric Approximate Linear Programming for MDPs
Cited by 3 (1 self)
Abstract:
The Approximate Linear Programming (ALP) approach to value function approximation for MDPs is a parametric value function approximation method, in that it represents the value function as a linear combination of features which are chosen a priori. Choosing these features can be a difficult challenge in itself. One recent effort, Regularized Approximate Linear Programming (RALP), uses L1 regularization to address this issue by combining a large initial set of features with a regularization penalty that favors a smooth value function with few nonzero weights. Rather than using smoothness as a backhanded way of addressing the feature selection problem, this paper starts with smoothness and develops a nonparametric approach to ALP that is consistent with the smoothness assumption. We show that this new approach has some favorable practical and analytical properties in comparison to (R)ALP.
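For context, the ALP referred to above has the standard form (RALP's exact constraint set should be checked against the original papers):

```latex
\min_{w} \; \sum_{s} c(s)\,(\Phi w)(s)
\quad \text{s.t.} \quad
(\Phi w)(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\,(\Phi w)(s')
\quad \forall (s,a),
```

where \(\Phi\) is the feature matrix and \(c\) a state-relevance weighting; the constraints enforce \(\Phi w \ge T(\Phi w)\) for the Bellman operator \(T\). RALP adds an L1 constraint of the form \(\lVert w \rVert_1 \le \psi\) to this program, and the nonparametric approach here replaces the fixed feature basis \(\Phi\) altogether.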