Online planning algorithms for POMDPs
- Journal of Artificial Intelligence Research, 2008
Cited by 42 (0 self)
Partially Observable Markov Decision Processes (POMDPs) provide a rich framework for sequential decision-making under uncertainty in stochastic domains. However, solving a POMDP is often intractable except for small problems due to their complexity. Here, we focus on online approaches that alleviate the computational complexity by computing good local policies at each decision step during the execution. Online algorithms generally consist of a lookahead search to find the best action to execute at each time step in an environment. Our objectives here are to survey the various existing online POMDP methods, analyze their properties and discuss their advantages and disadvantages; and to thoroughly evaluate these online approaches in different environments under various metrics (return, error bound reduction, lower bound improvement). Our experimental results indicate that state-of-the-art online heuristic search methods can handle large POMDP domains efficiently.
Evolutionary policy iteration for solving Markov decision processes
- IEEE Transactions on Automatic Control, 2005
Cited by 6 (3 self)
[12] J. Schumacher, “Finite-dimensional regulators for a class of infinite-dimensional systems,”
An asymptotically efficient simulation-based algorithm for finite horizon stochastic dynamic programming
- IEEE Trans. on Automatic Control, 2007
Cited by 5 (3 self)
We present a simulation-based algorithm called “Simulated Annealing Multiplicative Weights” (SAMW) for solving large finite-horizon stochastic dynamic programming problems. At each iteration of the algorithm, a probability distribution over candidate policies is updated by a simple multiplicative weight rule, and with proper annealing of a control parameter, the generated sequence of distributions converges to a distribution concentrated only on the best policies. The algorithm is “asymptotically efficient,” in the sense that for the goal of estimating the value of an optimal policy, a provably convergent finite-time upper bound for the sample mean is obtained. Index Terms: stochastic dynamic programming, Markov decision processes, simulation, learning algorithms, simulated annealing.
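The multiplicative weight rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' SAMW implementation: the annealing schedule `beta = 1 + 1/sqrt(i)`, the function names, and the toy "policies" (each represented only by a hypothetical mean value in [0, 1]) are all assumptions made for illustration.

```python
import math
import random


def multiplicative_weights(policies, sample_value, iterations=200, seed=0):
    """Anneal a distribution over candidate policies toward the best ones.

    sample_value(policy, rng) must return a noisy value estimate in [0, 1].
    """
    rng = random.Random(seed)
    n = len(policies)
    probs = [1.0 / n] * n  # start from the uniform distribution
    for i in range(1, iterations + 1):
        # Assumed annealing schedule: beta > 1 and beta -> 1 as i grows.
        beta = 1.0 + 1.0 / math.sqrt(i)
        values = [sample_value(p, rng) for p in policies]
        # Multiplicative update: policies with higher sampled value gain weight.
        weights = [q * beta ** v for q, v in zip(probs, values)]
        total = sum(weights)
        probs = [w / total for w in weights]
    return probs


def noisy_value(mean, rng):
    """Hypothetical simulator: a policy is just its mean value plus noise."""
    return min(1.0, max(0.0, mean + rng.uniform(-0.1, 0.1)))


if __name__ == "__main__":
    probs = multiplicative_weights([0.2, 0.5, 0.8], noisy_value)
    print([round(p, 4) for p in probs])  # mass concentrates on the last policy
```

In this toy setting the distribution concentrates on the candidate with the highest mean value, which mirrors the convergence behaviour the abstract describes.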
Multi-time Scale Markov Decision Processes
- IEEE Transactions on Automatic Control, 2002
Cited by 4 (1 self)
This paper proposes a simple analytical model called the multi-time-scale Markov Decision Process (MMDP) for hierarchically structured sequential decision-making processes, where decisions at each level of the hierarchy are made on a different discrete time-scale. In this model, the state space and the control space of each level in the hierarchy are non-overlapping with those of the other levels, and the hierarchy is structured in a "pyramid" sense: a decision made at a higher level (slower time-scale), and/or the state at that level, affects the evolutionary decision-making process of the lower level (faster time-scale) until a new decision is made at the higher level, but the lower-level decisions themselves do not affect the transition dynamics of higher levels. The performance produced by the lower-level decisions will affect the higher-level decisions. A hierarchical objective function is defined such that the finite-horizon value of following a (nonstationary) policy at the lower level over a decision epoch of the higher level, plus an immediate reward at the higher level, is the single-step reward for the decision-making process at the higher level. From this we define a "multi-level optimal value function" and derive a "multi-level optimality equation". We discuss how to solve MMDPs exactly and study some approximation methods, along with heuristic sampling-based schemes, to solve MMDPs.
Approximate receding horizon approach for Markov decision processes: average reward case
- Journal of Mathematical Analysis and Applications, 2003
Cited by 2 (0 self)
We consider an approximation scheme for solving Markov Decision Processes (MDPs) with countable state space, finite action space, and bounded rewards that uses an approximate solution of a fixed finite-horizon sub-MDP of a given infinite-horizon MDP to create a stationary policy, which we call “approximate receding horizon control”. We first analyze the performance of the approximate receding horizon control for infinite-horizon average reward under an ergodicity assumption, which also generalizes the result obtained by White [36]. We then study two examples of the approximate receding horizon control via lower bounds to the exact solution to the sub-MDP. The first control policy is based on a finite-horizon approximation of Howard’s policy improvement of a single policy and the second policy is based on a generalization of the single policy improvement for multiple policies. Along the study, we also provide a simple alternative proof on the policy improvement for countable state space. We finally discuss practical implementations of these schemes via simulation.
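The idea of turning a finite-horizon sub-MDP solution into a stationary policy can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's scheme: it uses a tiny finite MDP, solves the sub-MDP exactly by undiscounted backward induction rather than approximately, and the names `receding_horizon_policy`, `P`, and `R` are hypothetical.

```python
def receding_horizon_policy(P, R, horizon):
    """Return a stationary policy from an H-step lookahead.

    P[s][a] is a list of (next_state, probability) pairs.
    R[s][a] is the immediate reward for action a in state s.
    """
    n = len(P)

    def q_values(s, V):
        # One-step lookahead values of every action in state s.
        return [
            R[s][a] + sum(p * V[t] for t, p in P[s][a])
            for a in range(len(P[s]))
        ]

    V = [0.0] * n  # value with zero steps to go
    for _ in range(horizon - 1):
        # Backward induction over the remaining horizon - 1 steps.
        V = [max(q_values(s, V)) for s in range(n)]

    # In every state, apply the first action of the H-step plan.
    policy = []
    for s in range(n):
        q = q_values(s, V)
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy


# Toy two-state MDP: action 0 stays put, action 1 moves to the other state.
P = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(1, 1.0)], [(0, 1.0)]],
]
R = [
    [1.0, 0.0],  # staying in state 0 pays 1, moving pays 0
    [2.0, 0.0],  # staying in state 1 pays 2, moving pays 0
]

if __name__ == "__main__":
    print(receding_horizon_policy(P, R, horizon=1))  # [0, 0]
    print(receding_horizon_policy(P, R, horizon=3))  # [1, 0]
```

With a horizon of 1 the policy is myopic and stays in state 0. With a horizon of 3 it gives up the immediate reward to reach the better-paying state 1.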
A SURVEY OF SOME SIMULATION-BASED ALGORITHMS FOR MARKOV DECISION PROCESSES
Cited by 2 (0 self)
Many problems modeled by Markov decision processes (MDPs) have very large state and/or action spaces, leading to the well-known curse of dimensionality that makes solution of the resulting models intractable. In other cases, the system of interest is complex enough that it is not feasible to explicitly specify some of the MDP model parameters, but simulated sample paths can be readily generated (e.g., for random state transitions and rewards), albeit at a non-trivial computational cost. For these settings, we have developed various sampling and population-based numerical algorithms to overcome the computational difficulties of computing an optimal solution in terms of a policy and/or value function. Specific approaches presented in this survey include multi-stage adaptive sampling, evolutionary policy iteration and evolutionary random policy search. Key words: (adaptive) sampling, Markov decision process, population-based algorithms
Monte-Carlo-Based Partially Observable Markov Decision Process Approximations for Adaptive Sensing
Cited by 2 (0 self)
Adaptive sensing involves actively managing sensor resources to achieve a sensing task, such as object detection, classification, and tracking, and represents a promising direction for new applications of discrete event system methods. We describe an approach to adaptive sensing based on approximately solving a partially observable Markov decision process (POMDP) formulation of the problem. Such approximations are necessary because of the very large state space involved in practical adaptive sensing problems, precluding exact computation of optimal solutions. We review the theory of POMDPs and show how the theory applies to adaptive sensing problems. We then describe Monte-Carlo-based approximation methods, with an example to illustrate their application in adaptive sensing. The example also demonstrates the gains that are possible from nonmyopic methods relative to myopic methods.
MULTI-ARMED BANDIT PROBLEMS
Cited by 2 (0 self)
Multi-armed bandit (MAB) problems are a class of sequential resource allocation problems concerned with allocating one or more resources among several alternative (competing) projects. Such problems are paradigms of a fundamental conflict between making decisions (allocating resources) that yield
An Evolutionary Random Policy Search Algorithm for Solving Markov Decision Processes
- INFORMS Journal on Computing, 2007
Cited by 1 (1 self)
This paper presents a new randomized search method called evolutionary random policy search (ERPS) for solving infinite-horizon discounted-cost Markov-decision-process (MDP) problems. The algorithm is particularly targeted at problems with large or uncountable action spaces. ERPS approaches a given MDP by iteratively dividing it into a sequence of smaller, random, sub-MDP problems based on information obtained from random sampling of the entire action space and local search. Each sub-MDP is then solved approximately by using a variant of the standard policy-improvement technique, where an elite policy is obtained. We show that the sequence of elite policies converges to an optimal policy with probability one. Some numerical studies are carried out to illustrate the algorithm and compare it with existing procedures. Key words: dynamic programming, Markov, finite state; analysis of algorithms; programming, nonlinear; queues
Optimization of Joint Replacement Policies for Multipart Systems by a Rollout Framework
Maintaining an asset with life-limited parts, e.g., a jet engine or an electric generator, may be costly. Certain costs, e.g., setup cost, can be shared if some parts of the asset are replaced jointly. Reducing the maintenance cost by good joint replacement policies is difficult in view of complicated asset dynamics, large problem sizes, and irregular optimal policy structures. This paper addresses these difficulties by using a rollout optimization framework. Based on a novel application of time-aggregated Markov decision processes, the “One-Stage Analysis” method is first developed. The policies obtained from the method are investigated and their effectiveness is demonstrated by examples. This method and the existing threshold method are then improved by the “rollout algorithm” for the total cost case and the average cost case. Based on ordinal optimization, it is shown that excessive simulations are not necessary for the rollout algorithm. Numerical

