Results 1–10 of 30
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
Abstract

Cited by 31 (5 self)
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, we describe simulations suggesting that although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.
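The two approaches this abstract contrasts can be sketched on a tiny Markov reward process: the TD(0) fix point solves a projected linear system, while BR minimization is a weighted least-squares problem. A minimal illustration (the transition matrix, rewards, features, and state weights below are invented for the sketch, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 3, 2, 0.9                 # states, features, discount
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])         # transition matrix under the policy
r = np.array([1.0, 0.0, 2.0])           # expected one-step rewards
Phi = rng.standard_normal((n, k))       # feature matrix of the linear approximation
D = np.diag([1/3, 1/3, 1/3])            # state-weighting matrix (uniform here)

# TD(0) fix point: solve Phi^T D (Phi - gamma P Phi) theta = Phi^T D r
A_td = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ r
theta_td = np.linalg.solve(A_td, b)

# BR minimization: weighted least squares on (Phi - gamma P Phi) theta ~ r
M = Phi - gamma * P @ Phi
theta_br = np.linalg.solve(M.T @ D @ M, M.T @ D @ r)

# Both approximate the true value function V = (I - gamma P)^{-1} r,
# but in general theta_td != theta_br.
V_true = np.linalg.solve(np.eye(n) - gamma * P, r)
print(np.linalg.norm(Phi @ theta_td - V_true),
      np.linalg.norm(Phi @ theta_br - V_true))
```

Note that the TD(0) system matrix `A_td` is not symmetric and can be nearly singular, which is one way to see the numerical instability the abstract mentions; the BR system matrix is symmetric positive semidefinite.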
Random sampling of states in dynamic programming
 in Proc. NIPS Conf., 2007
Abstract

Cited by 23 (4 self)
We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady-state policies for deterministic, time-invariant, discrete-time control problems with continuous states and actions often found in robotics. In this paper, we describe our approach and provide initial results on several simulated robotics problems. Index Terms—Dynamic programming, optimal control, random sampling.
Optimal approximate dynamic programming algorithms for a general class of storage problems
, 2007
DOI: 10.1287/moor.1080.0360
Approximate Modified Policy Iteration
Abstract

Cited by 18 (14 self)
In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005), Antos et al. (2007), and Munos & Szepesvári (2008) and the classification-based API algorithm of Lagoudakis & Parr (2003), Fern et al. (2006), Lazaric et al. (2010), and Gabillon et al. (2011). We then provide an error propagation analysis of AMPI (Sec. 4), which shows how the Lp-norm of its performance loss can be controlled by the error at each iteration of the algorithm. We show that the error propagation analysis of AMPI is more involved than that of AVI and API. This is due to the fact that neither the contraction nor the monotonicity arguments that the error propagation analyses of these two algorithms rely on hold for AMPI. The analysis of this section unifies those for AVI and API and is applied to the AMPI implementations presented in Sec. 3. We detail the analysis of the classification-based implementation of MPI (CBMPI) of Sec. 3 by providing its finite-sample analysis in Sec. 5. Our analysis indicates that the parameter m allows us to balance the estimation error of the classifier with the overall quality of the value approximation.
Reinforcement learning algorithms for MDPs
, 2009
Abstract

Cited by 11 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here, online and active learning are discussed first, followed by a description of direct and actor-critic methods.
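The bootstrapping idea this survey opens with fits in a few lines: a tabular TD(0) update moves each value estimate toward the observed reward plus the discounted estimate of the successor state, rather than waiting for a full return. A minimal sketch on an invented two-state chain (state 0 transitions to state 1, which is terminal with reward 1; the step size is illustrative):

```python
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0]                  # value estimates for states 0 and 1
for _ in range(200):            # repeatedly replay the episode 0 -> 1 -> end
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])  # bootstrap from V[1]
    V[1] += alpha * (1.0 - V[1])                 # terminal target: reward 1
print(V)  # both estimates approach the true values (1.0, 1.0)
```

The key point is that the update for state 0 uses the current estimate `V[1]` as a stand-in for the future return, which is exactly what "bootstrapping" means here.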
Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories
 Ann. Oper. Res.
, 2012
Multistage stochastic programming: A scenario tree based approach to planning under uncertainty
 APPLICATIONS IN ARTIFICIAL INTELLIGENCE: CONCEPTS AND SOLUTIONS, CHAPTER 6
, 2011
Abstract

Cited by 8 (6 self)
In this chapter, we present the multistage stochastic programming framework for sequential decision making under uncertainty and stress its differences with Markov Decision Processes. We describe the main approximation technique used for solving problems formulated in the multistage stochastic programming framework, which is based on a discretization of the disturbance space. We explain that one issue of the approach is that the discretization scheme leads in practice to ill-posed problems: the complexity of the numerical optimization algorithms used for computing the decisions restricts the number of samples and optimization variables that one can use for approximating expectations, and therefore makes the numerical solutions very sensitive to the parameters of the discretization. As the framework is weak in the absence of efficient tools for evaluating and ultimately selecting competing approximate solutions, we show how one can extend it by using machine learning based techniques, so as to yield a sound and generic method to solve approximately a large class of multistage decision problems under uncertainty. The framework and solution techniques presented in the chapter are explained and illustrated on several examples. Along the way, we describe notions from decision theory that are relevant to sequential decision making under uncertainty in general.
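The scenario-based discretization this abstract describes can be illustrated with the smallest possible instance: a two-stage problem in which the disturbance (here, demand) is discretized into three weighted scenarios and the first-stage decision maximizes expected profit over that discrete tree. All numbers below are invented for the sketch, not taken from the chapter:

```python
# Two-stage stochastic program, newsvendor style: choose an order quantity
# now; demand is revealed afterwards. The continuous demand distribution is
# replaced by three (demand, probability) scenarios -- the discretization.
scenarios = [(5, 0.3), (10, 0.5), (15, 0.2)]
cost, price = 1.0, 2.0

def expected_profit(order):
    # Expectation approximated as a weighted sum over the scenario tree.
    return sum(p * (price * min(order, d) - cost * order)
               for d, p in scenarios)

best = max(range(0, 21), key=expected_profit)
print(best, expected_profit(best))
```

The sensitivity the abstract warns about shows up even here: changing the three sampled demand values or their weights can shift the optimal first-stage decision, and with few scenarios the discretized optimum may be far from the optimum of the underlying continuous problem.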
Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains
 Mach. Learn
Abstract

Cited by 8 (0 self)
We propose a model-based learning algorithm, the Adaptive-resolution Reinforcement Learning (ARL) algorithm, that aims to solve the online, continuous-state-space reinforcement learning problem in a deterministic domain. Our goal is to combine an adaptive-resolution approximation scheme with efficient exploration in order to obtain fast (polynomial) learning rates. The proposed algorithm uses an adaptive approximation of the optimal value function using kernel-based averaging, going from a coarse to a fine kernel-based representation of the state space, which enables the use of finer resolution in the “important” areas of the state space, and coarser resolution elsewhere. We consider an online learning approach, in which we discover these important areas online, using an uncertainty-intervals exploration technique. Polynomial learning rates in terms of mistake bound (in a PAC framework) are established for this algorithm, under appropriate continuity assumptions.
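The kernel-based averaging this abstract refers to can be sketched directly: the value at a query state is a kernel-weighted average of sampled values, and the kernel bandwidth plays the role of the resolution (a smaller bandwidth gives a finer representation). The samples, kernel choice, and bandwidth below are invented for illustration:

```python
import math

# Sampled (state, observed value) pairs on a 1-D state space.
samples = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]

def kernel_value(x, bandwidth):
    # Gaussian-kernel weighted average of the sampled values; the bandwidth
    # controls how local (fine) or smooth (coarse) the estimate is.
    weights = [math.exp(-((x - s) / bandwidth) ** 2) for s, _ in samples]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, samples)) / total

print(kernel_value(0.5, 0.2))  # dominated by the nearby sample
```

An adaptive-resolution scheme in this spirit would shrink the bandwidth (add samples) only in regions the exploration mechanism flags as important, keeping the representation coarse elsewhere.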
Approximate dynamic programming using Bellman residual elimination and Gaussian process regression
 In Proceedings of the American Control Conference
, 2009
Abstract

Cited by 7 (6 self)
The overarching goal of the thesis is to devise new strategies for multiagent planning and control problems, especially in the case where the agents are subject to random failures, maintenance needs, or other health management concerns, or in cases where the system model is not perfectly known. We argue that dynamic programming techniques, in particular Markov Decision Processes (MDPs), are a natural framework for addressing these planning problems, and present an MDP problem formulation for a persistent surveillance mission that incorporates stochastic fuel usage dynamics and the possibility of randomly occurring failures into the planning process. We show that this problem formulation and its optimal policy lead to good mission performance in a number of real-world scenarios. Furthermore, an online, adaptive solution framework is developed that allows the planning system to improve its performance over time, even in the case where the true system model is uncertain or time-varying. Motivated by the difficulty of solving the persistent mission problem exactly when the number of agents becomes large, we then develop a new family of approximate dynamic programming algorithms, called Bellman Residual Elimination (BRE) methods, which can be employed to approximately solve large-scale MDPs. We analyze these methods and prove a number of desirable theoretical properties about them, including reduction to exact policy iteration under certain conditions. Finally, we apply these BRE methods to large-scale persistent surveillance problems and show that they yield good performance, and furthermore, that they can be successfully integrated into the adaptive planning framework.