Results 1-10 of 57
Regularization and feature selection in least-squares temporal difference learning
2009
"... We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the LeastSquares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is la ..."
Abstract

Cited by 80 (1 self)
 Add to MetaCart
(Show Context)
We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the Least-Squares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can overfit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1-regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.
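To make the l1-regularized LSTD fixed point concrete, here is a minimal sketch that repeatedly solves a Lasso regression against bootstrapped TD targets. This is not the paper's LARS-style solver (and this simple iteration is not guaranteed to converge in general); the names phi, phi_next, and rewards are assumed sample matrices, not the authors' code.

```python
# Hypothetical sketch: approximating the l1-regularized LSTD fixed point by
# iterated Lasso regression on bootstrapped TD targets. NOT the paper's
# LARS-style algorithm; convergence of this naive iteration is not guaranteed.
import numpy as np
from sklearn.linear_model import Lasso

def l1_lstd_fixed_point(phi, phi_next, rewards, gamma=0.99, lam=0.01, iters=50):
    """phi, phi_next: (n_samples, n_features) arrays; rewards: (n_samples,)."""
    beta = np.zeros(phi.shape[1])
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    for _ in range(iters):
        targets = rewards + gamma * phi_next @ beta  # bootstrapped TD targets
        lasso.fit(phi, targets)                      # l1-penalized regression
        beta = lasso.coef_.copy()
    return beta
```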
Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes
"... Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause existing algorithms to overfit because of a limited number of samples. ..."
Abstract

Cited by 29 (11 self)
 Add to MetaCart
(Show Context)
Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause existing algorithms to overfit because of a limited number of samples. We address this shortcoming using L1 regularization in approximate linear programming. Because the proposed method can automatically select the appropriate richness of features, its performance does not degrade with an increasing number of features. These results rely on new and stronger sampling bounds for regularized approximate linear programs. We also propose a computationally efficient homotopy method. The empirical evaluation shows that the proposed method performs well on simple MDPs and standard benchmark problems.
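As a rough illustration of an L1-regularized approximate linear program, the sketch below encodes sampled Bellman constraints and the l1 budget for a standard LP solver. It is an assumption-laden toy, not the paper's homotopy method: Phi, P, r, rho, and psi are hypothetical inputs, and for simplicity one sampled constraint per state is assumed.

```python
# Hypothetical sketch of a regularized ALP: minimize rho^T Phi w subject to
# sampled Bellman constraints and ||w||_1 <= psi, using w = w_plus - w_minus.
# Assumes one sampled constraint per state; NOT the paper's homotopy method.
import numpy as np
from scipy.optimize import linprog

def regularized_alp(Phi, P, r, rho, gamma=0.99, psi=10.0):
    """Phi: (n, k) features; P: (n, n) sampled transitions; r, rho: (n,)."""
    n, k = Phi.shape
    # Phi w >= r + gamma P Phi w  rewritten as  (gamma P Phi - Phi) w <= -r
    G = gamma * P @ Phi - Phi
    A_ub = np.block([[G, -G],                              # Bellman rows
                     [np.ones((1, k)), np.ones((1, k))]])  # ||w||_1 <= psi
    b_ub = np.concatenate([-r, [psi]])
    c = np.concatenate([Phi.T @ rho, -(Phi.T @ rho)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:k] - res.x[k:]    # recover w from its positive/negative parts
```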
Projected equation methods for approximate solution of large linear systems
Journal of Computational and Applied Mathematics
"... ..."
A unifying framework for computational reinforcement learning theory
2009
"... Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervisedlearning algorithms such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understand ..."
Abstract

Cited by 23 (7 self)
 Add to MetaCart
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, for example in terms of their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploration, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and science. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts, in the hope of finding a better game-playing strategy. It has been known since the early days of RL that a good trade-off between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies quickly).
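The exploration/exploitation trade-off the abstract describes is often illustrated with an epsilon-greedy rule; a minimal hypothetical sketch (not from the paper) follows.

```python
# Hypothetical epsilon-greedy sketch of the exploration/exploitation
# trade-off: mostly exploit the highest-valued action, occasionally explore.
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated long-term utility."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore a random action
    return max(q_values, key=q_values.get)     # exploit the current estimate
```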
Reinforcement Learning for Dialog Management using Least-Squares Policy Iteration and Fast Feature Selection
"... Reinforcement learning (RL) is a promising technique for creating a dialog manager. RL accepts features of the current dialog state and seeks to find the best action given those features. Although it is often easy to posit a large set of potentially useful features, in practice, it is difficult to f ..."
Abstract

Cited by 21 (1 self)
 Add to MetaCart
(Show Context)
Reinforcement learning (RL) is a promising technique for creating a dialog manager. RL accepts features of the current dialog state and seeks to find the best action given those features. Although it is often easy to posit a large set of potentially useful features, in practice it is difficult to find a subset that is large enough to contain useful information yet compact enough to reliably learn a good policy. In this paper, we propose a method for RL optimization which automatically performs feature selection. The algorithm is based on least-squares policy iteration, a state-of-the-art RL algorithm which is highly sample-efficient and can learn from a static corpus or online. Experiments in dialog simulation show it is more stable than a baseline RL algorithm taken from a working dialog system.
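For context, the policy-evaluation core of least-squares policy iteration is the LSTD-Q estimator; here is a minimal sketch under assumed helpers phi(s, a) and policy(s), not the paper's feature-selection algorithm.

```python
# Hypothetical sketch of the LSTD-Q step inside LSPI: estimate Q-function
# weights for the current policy from a fixed batch of (s, a, r, s')
# transitions. phi and policy are assumed helpers, not the authors' code.
import numpy as np

def lstdq(transitions, phi, policy, n_features, gamma=0.99, ridge=1e-6):
    A = ridge * np.eye(n_features)          # small ridge keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))  # action chosen by current policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)            # weights of the approximate Q
```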
LSTD with random projections
In Advances in Neural Information Processing Systems
2010
"... We consider the problem of reinforcement learning in highdimensional spaces when the number of features is bigger than the number of samples. In particular, we study the leastsquares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection f ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
(Show Context)
We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is larger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a low-dimensional space is generated by a random projection from a high-dimensional space. We provide a thorough theoretical analysis of LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.
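The mechanics the abstract describes are straightforward to sketch: project the high-dimensional features with a Gaussian random matrix, then run standard LSTD in the projected space. The function and variable names below are assumptions for illustration.

```python
# Hypothetical sketch of LSTD with random projections: map features from
# dimension D to d << D with a Gaussian random matrix, then run LSTD.
import numpy as np

def lstd_random_projection(phi, phi_next, rewards, d, gamma=0.99, seed=None):
    """phi, phi_next: (n_samples, D); rewards: (n_samples,); d: target dim."""
    rng = np.random.default_rng(seed)
    D = phi.shape[1]
    # Variance 1/d keeps inner products approximately preserved in expectation.
    proj = rng.normal(0.0, 1.0 / np.sqrt(d), size=(D, d))
    psi, psi_next = phi @ proj, phi_next @ proj
    A = psi.T @ (psi - gamma * psi_next)
    b = psi.T @ rewards
    return np.linalg.solve(A + 1e-8 * np.eye(d), b)  # low-dim value weights
```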
Online Discovery of Feature Dependencies
"... Online representational expansion techniques have improved the learning speed of existing reinforcement learning (RL) algorithms in low dimensional domains, yet existing online expansion methods do not scale well to high dimensional problems. We conjecture that one of the main difficulties limiting ..."
Abstract

Cited by 14 (7 self)
 Add to MetaCart
(Show Context)
Online representational expansion techniques have improved the learning speed of existing reinforcement learning (RL) algorithms in low-dimensional domains, yet existing online expansion methods do not scale well to high-dimensional problems. We conjecture that one of the main difficulties limiting this scaling is that features defined over the full-dimensional state space often generalize poorly. Hence, we introduce incremental Feature Dependency Discovery (iFDD) as a computationally inexpensive method for representational expansion that can be combined with any online, value-based RL method that uses binary features. Unlike other online expansion techniques, iFDD creates new features in low-dimensional subspaces of the full state space where feedback errors persist. We provide convergence and computational complexity guarantees for iFDD, and show empirically that iFDD scales well to high-dimensional multi-agent planning domains with hundreds of millions of state-action pairs.
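A minimal sketch of the expansion idea, under assumed names and an assumed threshold: accumulate TD-error magnitude for conjunctions of active binary features and promote a candidate once its accumulated error is large enough. This is an illustration of the mechanism, not the authors' implementation.

```python
# Hypothetical sketch of the iFDD expansion rule: track accumulated |TD error|
# for pairs of simultaneously active binary features, and add a pair as a new
# conjunction feature once its accumulated error crosses a threshold.
from collections import defaultdict
from itertools import combinations

class IFDDSketch:
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.relevance = defaultdict(float)  # candidate pair -> sum |TD error|
        self.features = set()                # discovered conjunction features

    def observe(self, active_features, td_error):
        """active_features: indices of binary features active in this state."""
        for pair in combinations(sorted(active_features), 2):
            if pair in self.features:
                continue
            self.relevance[pair] += abs(td_error)
            if self.relevance[pair] > self.threshold:
                self.features.add(pair)      # expand the representation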
Reinforcement learning algorithms for MDPs
2009
"... This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDP). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare increment ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
(Show Context)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here, online and active learning are discussed first, followed by a description of direct and actor-critic methods.
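The bootstrapping idea the survey opens with fits in a few lines; a minimal TD(0) sketch (a generic illustration, not code from the survey) follows.

```python
# Minimal TD(0) sketch of bootstrapping: the update target uses the current
# estimate V[s_next] in place of the full (unknown) return.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * V[s_next]    # bootstrapped target
    V[s] += alpha * (target - V[s])   # move the estimate toward the target
```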
Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems
 In Proceedings of the 2009 American Control Conference
2009
"... AbstractReinforcement learning with linear and nonlinear function approximation has been studied extensively in the last decade. However, as opposed to other fields of machine learning such as supervised learning, the effect of finite sample has not been thoroughly addressed within the reinforcem ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
(Show Context)
Reinforcement learning with linear and nonlinear function approximation has been studied extensively in the last decade. However, as opposed to other fields of machine learning such as supervised learning, the effect of finite samples has not been thoroughly addressed within the reinforcement learning framework. In this paper we propose to use L2 regularization to control the complexity of the value function in reinforcement learning and planning problems. We consider the Regularized Fitted Q-Iteration algorithm and provide generalization bounds that account for small sample sizes. Finally, a realistic visual-servoing problem is used to illustrate the benefits of using the regularization procedure.
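To make the L2-regularized fitted Q-iteration loop concrete, here is a hypothetical sketch with a linear ridge regressor: each iteration refits the regressor to bootstrapped Q targets from a fixed batch of transitions. The helpers phi and actions are assumptions, and the paper's analysis covers more general regularized regressors than this linear one.

```python
# Hypothetical sketch of regularized fitted Q-iteration: each iteration fits
# a ridge (L2-regularized) regressor to bootstrapped Q targets computed from
# a fixed batch of (s, a, r, s') transitions. phi and actions are assumed.
import numpy as np
from sklearn.linear_model import Ridge

def regularized_fqi(transitions, phi, actions, gamma=0.99, lam=1.0, iters=50):
    """transitions: list of (s, a, r, s'); phi(s, a) -> 1-D feature vector."""
    X = np.array([phi(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    model = Ridge(alpha=lam, fit_intercept=False)
    model.fit(X, rewards)  # initialize from the immediate-reward regression
    for _ in range(iters):
        # Bootstrapped targets: r + gamma * max_a' Q(s', a') under current model
        q_next = np.array([
            max(model.predict(phi(s2, a2).reshape(1, -1))[0] for a2 in actions)
            for _, _, _, s2 in transitions
        ])
        model.fit(X, rewards + gamma * q_next)
    return model
```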