Results 1  10
of
45
A unifying framework for computational reinforcement learning theory
, 2009
"... Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervisedlearning algorithms such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understand ..."
Abstract

Cited by 23 (7 self)
 Add to MetaCart
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervisedlearning algorithms such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize longterm utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploring the problem that may reduce shortterm utility. The need for active exploration is common in many problems in daily life, engineering, and sciences. For example, a Backgammon program strives to take good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts in the hope of discovering a better gameplaying strategy. It has been known since the early days of RL that a good tradeoff between exploration and exploitation is critical for the agent to learn fast (i.e., to reach nearoptimal strategies
Linear Complementarity for Regularized Policy Evaluation and Improvement
, 2010
"... Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over th ..."
Abstract

Cited by 23 (4 self)
 Add to MetaCart
(Show Context)
Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARSinspired formulation, LARSTD. The LCP formulation allows the use of efficient offtheshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARSTD homotopy path that combines policy evaluation and optimization.
Modelling transition dynamics in mdps with rkhs embeddings
 In arXiv
, 2012
"... We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of ..."
Abstract

Cited by 20 (9 self)
 Add to MetaCart
(Show Context)
We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the underactuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with leastsquares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
Predictive state temporal difference learning
"... We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either highdimensional or partially observable. Th ..."
Abstract

Cited by 17 (7 self)
 Add to MetaCart
(Show Context)
We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either highdimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1
Informing sequential clinical decisionmaking through Reinforcement learning: an empirical study
 Machine Learning
, 2011
"... Abstract This paper highlights the role that reinforcement learning can play in the optimization of treatment policies for chronic illnesses. Before applying any offtheshelf reinforcement learning methods in this setting, we must first tackle a number of challenges. We outline some of these challe ..."
Abstract

Cited by 13 (5 self)
 Add to MetaCart
(Show Context)
Abstract This paper highlights the role that reinforcement learning can play in the optimization of treatment policies for chronic illnesses. Before applying any offtheshelf reinforcement learning methods in this setting, we must first tackle a number of challenges. We outline some of these challenges and present methods for overcoming them. First, we describe a multiple imputation approach to overcome the problem of missing data. Second, we discuss the use of function approximation in the context of a highly variable observation set. Finally, we discuss approaches to summarizing the evidence in the data for recommending a particular action and quantifying the uncertainty around the Qfunction of the recommended policy. We present the results of applying these methods to real clinical trial data of patients with schizophrenia.
Reinforcement learning algorithms for MDPs
, 2009
"... This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDP). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare increment ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDP). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actorcritic methods.
Automatic Feature Selection for ModelBased Reinforcement Learning in Factored MDPs
"... Abstract—Feature selection is an important challenge in machine learning. Unfortunately, most methods for automating feature selection are designed for supervised learning tasks and are thus either inapplicable or impractical for reinforcement learning. This paper presents a new approach to feature ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
(Show Context)
Abstract—Feature selection is an important challenge in machine learning. Unfortunately, most methods for automating feature selection are designed for supervised learning tasks and are thus either inapplicable or impractical for reinforcement learning. This paper presents a new approach to feature selection specifically designed for the challenges of reinforcement learning. In our method, the agent learns a model, represented as a dynamic Bayesian network, of a factored Markov decision process, deduces a minimal feature set from this network, and efficiently computes a policy on this feature set using dynamic programming methods. Experiments in a stocktrading benchmark task demonstrate that this approach can reliably deduce minimal feature sets and that doing so can substantially improve performance and reduce the computational expense of planning. KeywordsReinforcement learning; feature selection; factored MDPs I.
Feature selection for reinforcement learning: Evaluating implicit statereward dependency via conditional mutual information
 In ECML/PKDD
, 2010
"... Abstract. Modelfree reinforcement learning (RL) is a machine learning approach to decision making in unknown environments. However, realworld RL tasks often involve highdimensional state spaces, and then standard RL methods do not perform well. In this paper, we propose a new feature selection fra ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
(Show Context)
Abstract. Modelfree reinforcement learning (RL) is a machine learning approach to decision making in unknown environments. However, realworld RL tasks often involve highdimensional state spaces, and then standard RL methods do not perform well. In this paper, we propose a new feature selection framework for coping with high dimensionality. Our proposed framework adopts conditional mutual information between return and statefeature sequences as a feature selection criterion, allowing the evaluation of implicit statereward dependency. The conditional mutual information is approximated by a leastsquares method, which results in a computationally efficient feature selection procedure. The usefulness of the proposed method is demonstrated on gridworld navigation problems. 1