
## Planning and acting in partially observable stochastic domains (1998)

### Download Links

- [www.cis.upenn.edu]
- [www.cs.tufts.edu]
- [ftp.cs.brown.edu]
- [www.cs.duke.edu]
- [damas.ift.ulaval.ca]
- [www.damas.ift.ulaval.ca]
- [www.cs.brown.edu]
- [msl.cs.uiuc.edu]
- [www.cs.ubc.ca]
- [www.ai.mit.edu]
- [csail.mit.edu]
- [people.csail.mit.edu]
- [staff.science.uva.nl]
- [www.cs.rutgers.edu]
- [classes.engr.oregonstate.edu]
- [elite.polito.it]
- DBLP
- DBLP

### Other Repositories/Bibliography

Venue: Artificial Intelligence

Citations: 1095 (38 self)

### Citations

5887 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context ...o get good solutions to large problems. Another area that is not addressed in this paper is the acquisition of a world model. One approach is to extend techniques for learning hidden Markov models [43,53] to learn pomdp models. Then, we could apply algorithms of the type described in this paper to the learned models. Another approach is to combine the learning of the model with the computation of the ...

3854 |
A new approach to linear filtering and prediction problems
- Kalman
- 1960
Citation Context ...the underlying dynamics of the world (the map and other information), to maintain an estimate of its location. Many engineering applications follow this approach, using methods like the Kalman filter [20] to maintain a running estimate of the robot's spatial uncertainty, expressed as an ellipsoid or normal distribution in Cartesian space. This approach will not do for our robot, though. Its uncertaint...
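The running-estimate idea in the passage above can be sketched in a few lines. The 1-D Kalman filter below is an illustrative toy with invented numbers, not the robot model from the paper: the motion step grows the variance, the measurement step shrinks it.

```python
# Minimal 1-D Kalman filter sketch: a position estimate kept as a
# normal distribution (mean, variance). All numbers are illustrative.

def kf_predict(mean, var, motion, motion_var):
    """Motion step: shift the mean, grow the uncertainty."""
    return mean + motion, var + motion_var

def kf_update(mean, var, measurement, meas_var):
    """Measurement step: blend estimate and observation by precision."""
    k = var / (var + meas_var)            # Kalman gain
    new_mean = mean + k * (measurement - mean)
    new_var = (1.0 - k) * var
    return new_mean, new_var

mean, var = 0.0, 1.0
mean, var = kf_predict(mean, var, motion=1.0, motion_var=0.5)
mean, var = kf_update(mean, var, measurement=1.2, meas_var=0.5)
print(mean, var)    # variance shrinks after incorporating the measurement
```

As the passage notes, this Gaussian representation is exactly what breaks down for the paper's robot, whose uncertainty can be multimodal.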

3732 |
Genetic programming: on the programming of computers by means of natural selection.
- Koza
- 1992
Citation Context ...resent optimal plans in general. This argues that, in the limit, a plan is actually a program. Several techniques have been proposed recently for searching for good program-like controllers in pomdps [46,23]. We restrict our attention to the simpler finite-horizon case and a small set of infinite-horizon problems that have optimal finite-state plans. 7 Extensions and Conclusions The pomdp model provides a firm f...

1920 |
Theory of Linear and Integer Programming
- Schrijver
- 1998
Citation Context ...he number of bits of precision used in specifying the model is polynomial in these quantities since the polynomial running time of linear programming is expressed as a function of the input precision [48]. 4.5 Alternative Approaches One paragraph each on Cheng, Sondik 1 and 2, Incremental Pruning??. And a short discussion of their relative efficiencies. 4.6 The Infinite Horizon Be sure this is right, ...

1171 | Fast planning through planning graph analysis
- Blum, Furst
- 1997
Citation Context ... when the initial state is known and all actions are deterministic. A slightly more elaborate structure is the partially ordered plan (generated, for example, by snlp and ucpop), or the parallel plan [4]. In this type of plan, actions can be left unordered if all orderings are equivalent under the performance metric. When actions are stochastic, partially ordered plans can still be used (as in Burida...

788 |
Markov Decision Processes
- Puterman
- 1994
Citation Context ...ct perceptual abilities. Figure 1: An mdp models the synchronous interaction between agent and world. Markov decision processes are described in depth in a variety of texts [2, 20]; we will just briefly cover the necessary background. 2.1 Basic Framework A Markov decision process can be described as a tuple ⟨S, A, T, R⟩, where S is a finite set of states of the world; A i...

751 |
Dynamic Programming and Optimal Control. Athena Scientific
- Bertsekas
- 2000
Citation Context ...gent's actions, there is never any uncertainty about the agent's current state: it has complete and perfect perceptual abilities. Markov decision processes are described in depth in a variety of texts [3,42]; we will just briefly cover the necessary background. Fig. 1. An mdp models the synchronous interaction between agent and world. 2.1 Basic Framework A Markov decision proc...

742 |
Dynamic Programming and Markov Processes,
- Howard
- 1960
Citation Context ...erived from V_{t-1} and V_{t-2}: max_a [R(s,a) + Σ_{s'} T(s,a,s') V_{t-1}(s')]. In the infinite-horizon discounted case, for any initial state s, we want to execute the policy π that maximizes V_π(s). Howard [18] showed that there exists a stationary policy, π*, that is optimal for every starting state. The value function for this policy, V_{π*}, also written V*, is defined by the set of equations V*(s) = max_a ...
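The backup Howard's result rests on, V_t(s) = max_a [R(s,a) + γ Σ_{s'} T(s,a,s') V_{t-1}(s')], can be iterated to a fixed point on a toy problem. The 2-state, 2-action MDP and discount γ = 0.9 below are invented for illustration; the greedy policy extracted at the fixed point is a single stationary policy, as the passage states.

```python
# Value iteration sketch for the Bellman backup:
#   V_t(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V_{t-1}(s') ]
# The tiny MDP below is invented for illustration.

S, A = [0, 1], [0, 1]
gamma = 0.9
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}
# T[(s, a)] = probability distribution over next states s'
T = {(0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8],
     (1, 0): [0.7, 0.3], (1, 1): [0.0, 1.0]}

def backup(V):
    return [max(R[s, a] + gamma * sum(T[s, a][s2] * V[s2] for s2 in S)
                for a in A) for s in S]

V = [0.0, 0.0]
for _ in range(200):                 # iterate until (nearly) converged
    V = backup(V)

# Howard's result: one stationary greedy policy is optimal everywhere.
policy = [max(A, key=lambda a: R[s, a] +
              gamma * sum(T[s, a][s2] * V[s2] for s2 in S)) for s in S]
print(V, policy)
```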

493 | UCPOP: A sound, complete, partial order planner for ADL
- Penberthy, Weld
- 1992
Citation Context ...s will be known with certainty during plan execution. In the mdp framework, the agent is informed of the current state each time it takes an action. In many classical planners (e.g., snlp [32], ucpop [38]), the current state can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains. Research o...

449 | Systematic Nonlinear Planning .
- McAllester, D
- 1991
Citation Context ...f the process will be known with certainty during plan execution. In the mdp framework, the agent is informed of the current state each time it takes an action. In many classical planners (e.g., snlp [32], ucpop [38]), the current state can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains...

424 |
The optimal control of partially observable Markov processes over a finite horizon
- Smallwood, Sondik
- 1973
Citation Context ... amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision ...

420 |
The Optimal Control of Partially Observable Markov Processes
- Sondik
- 1971
Citation Context ...he amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision pr...

378 | Universal plans for reactive robots in unpredictable environments
- Schoppers
- 1987
Citation Context ...entation is a policy which maps the current state (situation) to a choice of action. Because there is an action choice specified for all possible initial states, policies are also called universal plans [47]. This representation is not appropriate for pomdps, since the underlying state is not fully observable. However, pomdp policies can be viewed as universal plans over belief space. It is interesting t...

356 |
A formal theory of knowledge and action. In
- Moore
- 1985
Citation Context ...d an optimal way to behave. In the artificial intelligence (AI) literature, a deterministic version of this problem has been addressed by adding knowledge preconditions to traditional planning systems [36]. Because we are interested in stochastic domains, however, we must depart from the traditional AI planning model. Rather than taking plans to be sequences of actions, which may only rarely execute as...

327 | Acting optimally in partially observable stochastic domains.
- Cassandra, Kaelbling, et al.
- 1994
Citation Context ...mately optimal. 5 Understanding Policies In this section we introduce a very simple example and use it to illustrate some properties of pomdp policies. Other examples are explored in an earlier paper [7]. 5.1 The Tiger Problem Imagine an agent standing in front of two closed doors. Behind one of the doors is a tiger and behind the other is a large reward. If the agent opens the door with the tiger, t...

296 | Learning Policies for Partially Observable Environments: Scaling Up.
- Littman, Cassandra, et al.
- 1995
Citation Context ...res the use of function-approximation methods for representing value functions and the use of simulation in order to concentrate the approximations on the frequently visited parts of the belief space [27]. The results of this work are encouraging and have allowed us to get a very good solution to an 89 state, 16 observation instance of a hallway navigation problem similar to the one described in the i...

286 | An algorithm for probabilistic planning.
- Kushmerick, Hanks, et al.
- 1993
Citation Context ...objective of planning, the representation of domains, and plan structures. The most closely related work to our own is that of Kushmerick, Hanks, and Weld [24] on the Buridan system, and Draper, Hanks and Weld [13] on the C-Buridan system. 6.1 Imperfect Knowledge Plans generated using standard mdp algorithms and classical (strips-like or partial-order) plan...

256 |
A survey of partially observable Markov decision processes: Theory, models, and algorithms
- Monahan
- 1982
Citation Context ...he amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision pr...

239 | Conditional Nonlinear Planning
- Peot, Smith
- 1992
Citation Context ...rovide makes efficient reasoning very difficult. A step towards building a working planning system that reasons about knowledge is to relax the generality of the logic-based schemes. The approach of cnlp [39] uses three-valued propositions where, in addition to true and false, there is a value unknown, which represents the state when the truth of the proposition is not known. Operators can then refer to w...

224 |
A survey of algorithmic methods for partially observed Markov decision processes
- Lovejoy
- 1991
Citation Context ...he amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision pr...

223 | Probabilistic Planning with Information Gathering and Contingent Execution.
- Draper, Hanks, et al.
- 1994
Citation Context ...of domains, and plan structures. The most closely related work to our own is that of Kushmerick, Hanks, and Weld [24] on the Buridan system, and Draper, Hanks and Weld [13] on the C-Buridan system. 6.1 Imperfect Knowledge Plans generated using standard mdp algorithms and classical (strips-like or partial-order) planning algorithms assume that the underlying state of the...

217 | Reinforcement learning with perceptual aliasing: The perceptual distinctions approach.
- Chrisman
- 1992
Citation Context ...tential significant advantage of being able to learn a model that is complex enough to support optimal (or good) behavior without making irrelevant distinctions; this idea has been pursued by Chrisman [10] and McCallum [33,34]. A Appendix Theorem 1 Let U^a be a non-empty set of useful policy trees, and Q_t^a be the complete set of useful policy trees. Then U^a ≠ Q_t^a if and only if there is some tree...

211 | Algorithms for Sequential Decision Making
- Littman
- 1996
Citation Context ...ahan [35], is to test R(α, Ṽ) for every α in Ṽ and remove those that are nowhere dominant. A much more efficient pruning method was proposed by Lark and White [57] and is described in detail by Littman [29] and by Cassandra [8]. Because it has many subtle technical details, it is not described here. 4.3 One Step of Value Iteration The value function for a pomdp can be computed using value iteration, wit...
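The pruning step described above removes α-vectors that are nowhere dominant over the belief simplex. The exact test is a linear program; the sketch below implements only the cheaper pointwise-dominance check, a sufficient but incomplete filter, with invented vectors for illustration.

```python
# Prune alpha-vectors that are componentwise dominated by another
# vector. This is a quick sufficient check; the full "nowhere
# dominant" test described in the passage requires a linear program.

def prune_pointwise(vectors):
    """Keep vectors not dominated componentwise by some other vector."""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(all(w[k] >= v[k] for k in range(len(v))) and w != v
                        for j, w in enumerate(vectors) if j != i)
        if not dominated:
            kept.append(v)
    return kept

# Three alpha-vectors over a 2-state belief space (illustrative numbers):
alphas = [(1.0, 0.0), (0.0, 1.0), (-0.5, 0.2)]
print(prune_pointwise(alphas))   # third vector is dominated by the second
```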

206 | The complexity of stochastic games.
- Condon
- 1992
Citation Context ...transformation holds in the opposite direction: any total expected discounted reward problem (completely observable or finite horizon) can be transformed into a goal-achievement problem of similar size [11,60]. Roughly, the transformation simulates the discount factor by introducing an absorbing state with a small probability of being entered on each step. Rewards are then simulated by normalizing all rewa...

202 | Incremental Pruning: A simple, fast, exact method for Partially Observable Markov Decision Process,”
- Cassandra, Littman, et al.
- 1997
Citation Context ...the geometric approaches are useful only in pomdps with extremely small state spaces. Zhang and Liu [67] describe the incremental-pruning algorithm, later generalized by Cassandra, Littman, and Zhang [7]. This algorithm is simple to implement and empirically faster than the witness algorithm, while sharing its good worst-case complexity in terms of Σ_a |Q_t^a|. The basic algorithm works like the exh...

186 | Exact and Approximate Algorithms for Partially Observable Markov Decision Processes”,
- Cassandra
- 1998
Citation Context ...ich dominates; that is, R(α, V) = {b | b·α > b·α̃ for all α̃ ∈ V, and b ∈ B}. It is relatively easy, using a linear program, to find a point in R(α, V) if one exists, or to determine that the region is empty [8]. The simplest pruning strategy, described by Monahan [35], is to test R(α, Ṽ) for every α in Ṽ and remove those that are nowhere dominant. A much more efficient pruning method was proposed by Lark and ...

183 |
Information value theory.
- Howard
- 1966
Citation Context ...f the simplex, the agent can take actions more likely to be appropriate for the current state of the world and, so, gain more reward. This has some connection to the notion of "value of information" [19], where an agent can incur a cost to move it from a high-entropy to a low-entropy state; this is only worthwhile when the value of the information (the difference in value between the two states) exceeds...

179 | Planning under time constraints in stochastic domains.
- Dean, Kaelbling, et al.
- 1995
Citation Context ...what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest [14,54,12]. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets or ...


162 | Hidden Markov model induction by Bayesian model merging.
- Stolcke, Omohundro
- 1993
Citation Context ...o get good solutions to large problems. Another area that is not addressed in this paper is the acquisition of a world model. One approach is to extend techniques for learning hidden Markov models [43,53] to learn pomdp models. Then, we could apply algorithms of the type described in this paper to the learned models. Another approach is to combine the learning of the model with the computation of the ...


149 |
Optimal Control of Markov Decision Processes with Incomplete State Estimation,
- Astrom
- 1965
Citation Context ... agent: given the agent's current belief state (properly computed), no additional data about its past actions or observations would supply any further information about the current state of the world [1,50]. This means that the process over belief states is Markov, and that no additional data about the past would help to increase the agent's expected reward. To illustrate the evolution of a belief state...
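The belief-state evolution the passage refers to follows the standard state-estimator update, b'(s') ∝ O(s',a,o) Σ_s T(s,a,s') b(s). A minimal sketch, with an invented two-state model where observations are 85% reliable:

```python
# Belief-state update: new belief over s' is proportional to
#   O(s', a, o) * sum_s T(s, a, s') * b(s),
# normalized by Pr(o | a, b). The model below is invented for
# illustration: 2 states, 1 action (index 0), 2 observations.

def belief_update(b, T, O, a, o):
    unnorm = [O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in range(len(b)))
              for s2 in range(len(b))]
    z = sum(unnorm)                    # Pr(o | a, b); assumed > 0 here
    return [x / z for x in unnorm]

T = [[[1.0, 0.0]], [[0.0, 1.0]]]       # T[s][a] = dist over s' (identity)
O = [[[0.85, 0.15]], [[0.15, 0.85]]]   # O[s'][a] = dist over observations
b = [0.5, 0.5]                         # start uniform
b = belief_update(b, T, O, a=0, o=0)   # observation favours state 0
print(b)
```

Because the belief state is a sufficient statistic, repeating this update is all the bookkeeping the agent needs about its history.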

142 | The complexity of mean payoff games on graphs,
- Zwick, Paterson
- 1996
Citation Context ...ansformation holds in the opposite direction: any total expected discounted reward problem (completely observable or finite horizon) can be transformed into a goal-achievement problem of similar size [11,60]. Roughly, the transformation simulates the discount factor by introducing an absorbing state with a small probability of being entered on each step. Rewards are then simulated by normalizing all rewa...

129 | Computing optimal policies for partially observable markov decision processes using compact representations.
- Boutilier, Poole
- 1996
Citation Context ...rough compositional representations such as Bayes nets or probabilistic operator descriptions. However, this work has served as a substrate for development of more complex and efficient representations [6]. Section 6 describes the relation between the present approach and prior research in more detail. One important facet of the pomdp approach is that there is no distinction drawn between actions taken...

123 | Anytime synthetic projection: Maximizing the probability of goal satisfaction
- Bresina, Drummond
- 1990
Citation Context ...what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest [14,54,12]. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets or ...

119 | Overcoming incomplete perception with utile distinction memory.
- McCallum
- 1993
Citation Context ... advantage of being able to learn a model that is complex enough to support optimal (or good) behavior without making irrelevant distinctions; this idea has been pursued by Chrisman [10] and McCallum [33,34]. A Appendix Theorem 1 Let U^a be a non-empty set of useful policy trees, and Q_t^a be the complete set of useful policy trees. Then U^a ≠ Q_t^a if and only if there is some tree p ∈ U^a, o ∈ Ω, and p...

112 | The frame problem and knowledge-producing actions.
- Scherl, Levesque
- 1993
Citation Context ...te can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains. Research on epistemic logic [36,37,45] relaxes this assumption by making it possible to reason about what is and is not known at a given time. Unfortunately, epistemic logics have not been used as a representation in automatic planning sy...

111 |
Markov Decision Processes—Discrete Stochastic Dynamic Programming.
- Puterman
- 1994
Citation Context ...gent's actions, there is never any uncertainty about the agent's current state: it has complete and perfect perceptual abilities. Markov decision processes are described in depth in a variety of texts [3,42]; we will just briefly cover the necessary background. Fig. 1. An mdp models the synchronous interaction between agent and world. 2.1 Basic Framework A Markov decision proc...

110 | Planning for contingencies: a decision-based approach.
- Pryor, Collins
- 1996
Citation Context ...not easily modeled with deterministic actions, since an action can have different results, even when applied in exactly the same state. Extensions to classical planning, such as cnlp [39] and Cassandra [41] have considered operators with nondeterministic effects. For each operator, there is a set of possible next states that could occur. A drawback of this approach is that it gives no information about t...

109 | Instance-based utile distinctions for reinforcement learning with hidden state. In:
- McCallum
- 1995
Citation Context ... advantage of being able to learn a model that is complex enough to support optimal (or good) behavior without making irrelevant distinctions; this idea has been pursued by Chrisman [10] and McCallum [33,34]. A Appendix Theorem 1 Let U^a be a non-empty set of useful policy trees, and Q_t^a be the complete set of useful policy trees. Then U^a ≠ Q_t^a if and only if there is some tree p ∈ U^a, o ∈ Ω, and p...

108 | Utility models for goal-directed decisiontheoretic planners
- Haddawy, Hanks
- 1998
Citation Context ...the duration of a run. Koenig and Simmons [22] examine risk-sensitive planning and showed how planners for the total-reward criterion could be used to optimize risk-sensitive behavior. Haddawy et al. [16] looked at a broad family of decision-theoretic objectives that make it possible to specify trade-offs between partially satisfying goals quickly and satisfying them completely. Bacchus, Boutilier, and...

107 | Memoryless policies: Theoretical limitations and practical results.
- Littman
- 1994
Citation Context ...ions with the same appearance, increasing the probability that it might choose a good action; in practice deterministic observation-action mappings are prone to getting trapped in deterministic loops [26]. In order to behave truly effectively in a partially observable world, it is necessary to use memory of previous actions and observations to aid in the disambiguation of the states of the world. The p...

104 | Tight performance bounds on greedy policies based on imperfect value functions.
- Williams, Baird
- 1993
Citation Context ... optimal infinite-horizon policy, π*. Rather than calculating a bound on t in advance and running value iteration for that long, we instead use the following result regarding the Bellman error magnitude [58] in order to terminate with a near-optimal policy. If |V_t(s) − V_{t−1}(s)| < ε for all s, then the value of the greedy policy with respect to V_t does not differ from V* by more than 2εγ/(1 − γ) at any state. ...

87 |
Algorithms for Partially Observable Markov Decision Processes.
- Cheng
- 1988
Citation Context ...d, we would like to generate the elements of V_t directly. If we could do this, we might be able to reach a computation time per iteration that is polynomial in |S|, |A|, |Ω|, |V_{t−1}|, and |V_t|. Cheng [9] and Smallwood and Sondik [50] also try to avoid generating all of V_t^+ by constructing V_t directly. However, their algorithms still have worst-case running times exponential in at least one of the p...

76 | MAXPLAN: A new approach to probabilistic planning.
- Majercik, Littman
- 1998
Citation Context ...ning model is called "completely observable." The mdp model, as well as some planning systems such as cnlp and Plinth [18,19] assume complete observability. Other systems, such as Buridan and maxplan [37], have no observation model and can attack "completely unobservable" problems. Classical planning systems typically have no observation model, but the fact that the initial state is known and operator...

72 |
Knowledge preconditions for actions and plans’,
- Morgenstern
- 1987
Citation Context ...te can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains. Research on epistemic logic [36,37,45] relaxes this assumption by making it possible to reason about what is and is not known at a given time. Unfortunately, epistemic logics have not been used as a representation in automatic planning sy...

68 |
A new approach to linear filtering and prediction problems
- Kalman
- 1960
Citation Context ...f the underlying dynamics of the world (the map and other information), to maintain an estimate of its location. Many engineering applications follow this approach, using methods like the Kalman filter [20] to maintain a running estimate of the robot's spatial uncertainty, expressed as an ellipsoid or normal distribution in Cartesian space. This approach will not do for our robot, though. Its uncertaint...

59 |
Planning with external events
- Blythe
- 1994
Citation Context ...ossible to assess whether a plan is likely to reach the goal even if it is not guaranteed to do so. This type of action model is used in mdps and pomdps as well as in Buridan and C-Buridan. Other work [5,14] has used representations that can be used to compute probability distributions over future states. 6.4 Observation Model When the starting state is known and actions are deterministic, there is no...

54 | The Witness Algorithm: Solving Partially Observable Markov Decision Processes”,
- Littman
- 1994
Citation Context ... compute the maximum of their value functions to get V_t. If the value functions are represented by sets of policy trees, the test for termination can be implemented exactly using linear programming [12]. This is, of course, hopelessly computationally intractable. Each t-step policy tree contains (|Ω|^t − 1)/(|Ω| − 1) nodes (the branching factor is |Ω|, the number of possi...
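The node count quoted above, (|Ω|^t − 1)/(|Ω| − 1), makes the intractability concrete: with |A| actions to label each node, there are |A| raised to that many nodes distinct t-step policy trees. A quick sketch with small illustrative numbers:

```python
# Count policy-tree nodes and distinct t-step policy trees, using the
# geometric-series formula from the passage (|O| = observations > 1).

def tree_nodes(n_obs, t):
    return (n_obs ** t - 1) // (n_obs - 1)

def n_policy_trees(n_actions, n_obs, t):
    return n_actions ** tree_nodes(n_obs, t)

print(tree_nodes(2, 4))          # 15 nodes in a 4-step tree, 2 observations
print(n_policy_trees(2, 2, 4))   # 2**15 = 32768 distinct trees already
```

Even two actions and two observations give tens of thousands of 4-step trees, which is why exhaustive enumeration gives way to pruning methods like the witness algorithm.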

44 |
The optimal search for a moving target when the search path is constrained,
- Eagle
- 1984
Citation Context ... algorithm [9]. As pointed out in Section 4, value functions in belief space have a natural geometric interpretation. For small state spaces, algorithms that exploit this geometry are quite efficient [16]. An excellent example of this is Cheng's linear support algorithm [10]. This algorithm can be viewed as a variation of the witness algorithm in which witness points are sought at the corners of regio...

43 |
Solving H-horizon, Stationary Markov Decision Problems in Time Proportional to Log(H
- Tseng
- 1990
Citation Context ...(t−1)-step non-stationary policy. The algorithm terminates when the maximum difference between two successive value functions (known as the Bellman error magnitude) is less than some ε. It can be shown [55] that there exists a t*, polynomial in |S|, |A|, the magnitude of the largest value of R(s,a), and 1/(1 − γ), such that the greedy policy with respect to V_t is equal to the optimal infinite-horizon po...

43 | Conditional linear planning.
- Goldman, Boddy
- 1994
Citation Context ... state. If observations reveal the precise identity of the current state, the planning model is called "completely observable." The mdp model, as well as some planning systems such as cnlp and Plinth [18,19] assume complete observability. Other systems, such as Buridan and maxplan [37], have no observation model and can attack "completely unobservable" problems. Classical planning systems typically have ...

40 | An improved policy iteration algorithm for partially observable MDPs
- Hansen
- 1998
Citation Context ...d piecewise-linear and convex approximation to the desired value function. From the approximate value function we can extract a stationary policy that is approximately optimal. Sondik [59] and Hansen [23] have shown how to use algorithms like the witness algorithm that perform exact dynamic-programming backups in pomdps in a policy-iteration algorithm to find exact solutions to many infinite-horizon probl...

37 | Control strategies for stochastic planner.
- Tash, Russell
- 1994
Citation Context ...what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest [14,54,12]. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets or ...

35 | Planning in stochastic domains: Problem characteristics and approximation
- Zhang, Liu
- 1996
Citation Context ...in which S and V_t are very small and Q_t^a is exponentially larger for some action a. From the definition of the state estimator, SE, and the t-step value function, 7 A more recent algorithm by Zhang [59], inspired by the witness algorithm, has the same asymptotic complexity but appears to be the current fastest algorithm empirically for this problem. V_1 := {⟨0, 0, ..., 0⟩}; t := 1; loop t := t + 1; fo...

34 | Epsilon-Safe Planning
- Goldman, Boddy
- 1994
Citation Context ...ssible to assess whether a plan is likely to reach the goal even if it is not guaranteed to do so. This type of action model is used in mdps and pomdps as well as in Buridan and C-Buridan. Other work [5,15,19] has used representations that can be used to compute probability distributions over future states. 6.4 Observation Model When the starting state is known and actions are deterministic, there is no...

32 | Rewarding behaviors, in:
- Bacchus, Boutilier, et al.
- 1996
Citation Context ... at a broad family of decision-theoretic objectives that make it possible to specify trade-offs between partially satisfying goals quickly and satisfying them completely. Bacchus, Boutilier, and Grove [2] show how some richer objectives based on evaluations of sequences of actions can actually be converted to total-reward problems. Other objectives considered in planning systems, aside from simple goa...

30 |
Partially observed Markov decision processes: A survey
- White
- 1991

30 | Efficient dynamicprogramming updates in partially observable Markov decision processes,” Brown University
- Littman, Cassandra, et al.
- 1995
Citation Context ...Sondik [50] also try to avoid generating all of V_t^+ by constructing V_t directly. However, their algorithms still have worst-case running times exponential in at least one of the problem parameters [28]. In fact, the existence of an algorithm that runs in time polynomial in |S|, |A|, |Ω|, |V_{t−1}|, and |V_t| would settle the long-standing complexity-theoretic question "Does NP=RP?" in th...

28 |
Modelfree reinforcement learning for non-markovian decision problems
- Singh, Jaakkola, et al.
- 1994
Citation Context ...ardly a promising approach. Somewhat better results can be obtained by adding randomness to the agent's behavior: a policy can be a mapping from observations to probability distributions over actions [49]. Randomness effectively allows the agent to sometimes choose different actions in different locations with the same appearance, increasing the probability that it might choose a good action; in practice...

27 |
Solutions procedures for partially observed Markov decision processes
- White, Scherer
- 1989
Citation Context ...y trees during the generation procedure. As a result, compared to exhaustive enumeration, very few nonuseful policy trees are considered and the algorithm runs extremely quickly. White and Scherer [65] propose an alternative approach in which the reward function is changed so that all of the algorithms discussed in this chapter will tend to run more efficiently. This technique has not yet been comb...

23 |
Optimal probabilistic and decision-theoretic planning using Markovian decision theory
- Koenig
- 1991
(Show Context)
Citation Context ...n of a plan (finite or infinite horizon) constitutes the value of the plan. This objective is used extensively in most work with mdps and pomdps, including ours. Several authors (for example, Koenig =-=[21]-=-) have pointed out that, given a completely observable problem stated as one of goal achievement, reward functions can be constructed so that a policy that maximizes reward can be used to maximize the... |
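The construction Koenig points out can be sketched as follows: pay reward 1 exactly once, on first entering the goal, so that undiscounted expected total reward equals the probability of achieving the goal. State names here are illustrative assumptions:

```python
# Goal achievement recast as reward maximization: reward 1 is earned only
# on the transition that first enters the (absorbing) goal state, so total
# reward per run is 0 or 1 and expected total reward is P(reach goal).
GOAL = "goal"  # hypothetical goal-state label

def reward(state: str, next_state: str) -> float:
    """Reward for the transition state -> next_state."""
    return 1.0 if next_state == GOAL and state != GOAL else 0.0
```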

23 | The complexity of mean-payoff games on graphs. Theoret. Comput. Sci - Zwick, Paterson - 1996 |

21 | On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes
- Fernandez-Gaucherand, Arapostathis, et al.
- 1991
(Show Context)
Citation Context ...al-probability problems and discounted-optimality problems, it is hard to make technical sense of this difference. In fact, many pomdp models should probably be addressed in an average-reward context =-=[15]-=-. Using a discounted-optimal policy in a truly infinite-duration setting is a convenient approximation, similar to the use of a situation-action mapping from a finite-horizon policy in receding horizon c... |

20 | Cost-Effective Sensing During Plan Execution
- Hansen
(Show Context)
Citation Context ...ph. Each node of the graph is labeled with an action and there is one labeled outgoing edge for each possible outcome of the action. It is possible to generate this type of plan graph for some pomdps =-=[40,52,7,17]-=-. For completely observable problems with a high branching factor, a more convenient representation is a policy which maps the current state (situation) to a choice of action. Because there is an acti... |
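A minimal sketch of such a plan graph (finite-state controller), with action and observation names invented for illustration in the style of the tiger problem:

```python
# Each node of the plan graph carries an action; each possible observation
# labels an outgoing edge to the next node. Names below are hypothetical.
plan_graph = {
    "n0": {"action": "listen",     "edges": {"growl-left": "n1", "growl-right": "n2"}},
    "n1": {"action": "open-right", "edges": {"growl-left": "n0", "growl-right": "n0"}},
    "n2": {"action": "open-left",  "edges": {"growl-left": "n0", "growl-right": "n0"}},
}

def step(node: str, observation: str) -> tuple[str, str]:
    """Execute one step: emit the current node's action, then follow
    the edge labeled by the observation to reach the next node."""
    action = plan_graph[node]["action"]
    return action, plan_graph[node]["edges"][observation]
```

Executing the plan is then just a walk on the graph driven by observations, which is what makes this representation attractive for infinite-horizon problems: loops in the graph encode unbounded behavior with finite memory.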

18 | Risk-Sensitive Planning with Probabilistic Decision Graphs
- Koenig, Simmons
- 1994
(Show Context)
Citation Context ...natives to the total-reward criterion, all of which are based on the idea that the objective value for a plan is based on a summary of immediate rewards over the duration of a run. Koenig and Simmons =-=[22]-=- examined risk-sensitive planning and showed how planners for the total-reward criterion could be used to optimize risk-sensitive behavior. Haddawy et al. [16] looked at a broad family of decision-theo... |

17 | Representing uncertainty in simple planners
- Goldman, Boddy
- 1994
(Show Context)
Citation Context ...dvance, what aspects of the state will be known and unknown. It is insufficient for multiple agents reasoning about each others' knowledge and for representing certain types of correlated uncertainty =-=[20]-=-. Formulating knowledge as predicate values that are either known or unknown makes it impossible to reason about gradations of knowledge. For example, an agent that is fairly certain that it knows the... |

13 | Generating optimal policies for high-level plans with conditional branches and loops
- Lin, Dean
- 1995
(Show Context)
Citation Context ...pe than our DAG-structured plans). Our work on pomdps finds DAG-structured plans for finite-horizon problems. For infinite-horizon problems, it is necessary to introduce loops into the plan representation =-=[39,25]-=-. (Loops might also be useful in long finite-horizon pomdps for representational succinctness.) A simple loop-based plan representation depicts a plan as a labeled directed graph. Each node of the gr... |

13 |
A method for planning given uncertain and incomplete information
- Mansell
- 1993
(Show Context)
Citation Context ...d, the pomdp perspective applies. 6.2 Initial State Many classical planning systems (snlp, ucpop, cnlp) require the starting state to be known during the planning phase. An exception is the U-Plan =-=[31]-=- system, which creates a separate plan for each possible initial state with the aim of making these plans easy to merge to form a single plan. Conditional planners typically have some aspects of the i... |

13 |
A feasible computational approach to infinite-horizon partially-observed Markov decision problems
- Platzman
- 1981
(Show Context)
Citation Context ...ph. Each node of the graph is labeled with an action and there is one labeled outgoing edge for each possible outcome of the action. It is possible to generate this type of plan graph for some pomdps =-=[40,52,7,17]-=-. For completely observable problems with a high branching factor, a more convenient representation is a policy which maps the current state (situation) to a choice of action. Because there is an acti... |

11 |
The complexity of mean-payoff games on graphs
- Zwick, Paterson
- 1996
(Show Context)
Citation Context ...transformation holds in the opposite direction: any total expected discounted reward problem (completely observable or finite horizon) can be transformed into a goal-achievement problem of similar size =-=[11,60]-=-. Roughly, the transformation simulates the discount factor by introducing an absorbing state with a small probability of being entered on each step. Rewards are then simulated by normalizing all rewa... |
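The absorbing-state construction described in the fragment can be sketched directly on a transition function; the dict-of-dicts encoding and state names are illustrative assumptions:

```python
# Simulate a discount factor gamma by rescaling every transition by gamma
# and routing the remaining (1 - gamma) probability mass into a new
# absorbing "done" state, entered with probability 1 - gamma on each step.
def add_absorbing_state(P: dict, gamma: float) -> dict:
    """P maps state -> {next_state: prob}. Returns transformed dynamics."""
    Pg = {}
    for s, dist in P.items():
        Pg[s] = {s2: gamma * p for s2, p in dist.items()}
        Pg[s]["done"] = 1.0 - gamma  # chance of terminating this step
    Pg["done"] = {"done": 1.0}       # absorbing
    return Pg
```

Under these dynamics the expected undiscounted return of the transformed process matches the gamma-discounted return of the original, which is the essence of the cited reduction.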

10 | Incremental self-improvement for life-time multi-agent reinforcement learning
- Zhao, Schmidhuber
- 1996
(Show Context)
Citation Context ...resent optimal plans in general. This argues that, in the limit, a plan is actually a program. Several techniques have been proposed recently for searching for good program-like controllers in pomdps =-=[68,29]-=-. We restrict our attention to the simpler finite-horizon case and a small set of infinite-horizon problems that have optimal finite-state plans. 7 Extensions and Conclusions The pomdp model provides a firm ... |

7 |
Application of Jensen’s inequality to adaptive suboptimal design
- White, Harrington
- 1980
(Show Context)
Citation Context ...vious section, we showed that the optimal t-step value function is always piecewise-linear and convex. This is not necessarily true for the infinite-horizon discounted value function; it remains convex =-=[56]-=-, but may have infinitely many facets. Still, the optimal infinite-horizon discounted value function can be approximated arbitrarily closely by a finite-horizon value function for a sufficiently long horizon... |
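A piecewise-linear convex value function over belief space is conventionally represented by a finite set of "alpha vectors", with the value at a belief being the maximum of the dot products; a toy sketch, where the vectors themselves are made-up numbers:

```python
# Each alpha vector gives the expected value of committing to one plan,
# as a linear function of the belief. The PWLC value function is their
# pointwise maximum. Vectors below are illustrative, not from the paper.
alpha_vectors = [
    (1.0, 0.0),  # plan that pays off only in state 0
    (0.0, 1.0),  # plan that pays off only in state 1
    (0.6, 0.6),  # hedging plan, best under high uncertainty
]

def value(belief):
    """Evaluate the PWLC value function at a belief (probability vector)."""
    return max(sum(b * a for b, a in zip(belief, alpha))
               for alpha in alpha_vectors)
```

Infinitely many facets in the infinite-horizon case correspond to this set of vectors being infinite; finite-horizon approximations keep the set finite.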

6 |
Optimal control for partially observable Markov decision processes over an infinite horizon
- Sawaki, Ichikawa
- 1978
(Show Context)
Citation Context ... but may have infinitely many facets. Still, the optimal infinite-horizon discounted value function can be approximated arbitrarily closely by a finite-horizon value function for a sufficiently long horizon =-=[52,44]-=-. The optimal infinite-horizon discounted value function can be approximated via value iteration, in which the series of t-step discounted value functions is computed; the iteration is stopped when the... |
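The stopping rule mentioned here can be sketched with ordinary value iteration on a tiny fully observable MDP; the belief-space version follows the same schema with sets of alpha vectors in place of the table V. Dynamics and rewards below are invented for illustration:

```python
# Value iteration: compute successive t-step value functions and stop
# when consecutive iterates differ by less than a tolerance.
GAMMA, TOL = 0.9, 1e-6
states, actions = [0, 1], ["stay", "move"]
# P[s][a]: list of (next_state, prob); R[s][a]: immediate reward (toy data)
P = {0: {"stay": [(0, 1.0)], "move": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)], "move": [(0, 1.0)]}}
R = {0: {"stay": 0.0, "move": 1.0}, 1: {"stay": 2.0, "move": 0.0}}

def value_iteration():
    V = {s: 0.0 for s in states}
    while True:
        V_next = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                         for a in actions)
                  for s in states}
        # Stop once the sup-norm difference between iterates is tiny.
        if max(abs(V_next[s] - V[s]) for s in states) < TOL:
            return V_next
        V = V_next
```

With the toy numbers above, staying in state 1 yields 2/(1-0.9) = 20, and state 0's best move reaches it one step later, so the iteration converges near V(1) = 20 and V(0) = 19.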

5 |
Efficient dynamic-programming updates in partially observable Markov decision processes
- Littman, Cassandra, et al.
- 1996
(Show Context)
Citation Context ... Sondik [50] also try to avoid generating all of V_t^+ by constructing V_t directly. However, their algorithms still have worst-case running times exponential in at least one of the problem parameters =-=[28]-=-. In fact, the existence of an algorithm that runs in time polynomial in |S|, |A|, |Ω|, |V_{t-1}|, and |V_t| would settle the long-standing complexity-theoretic question "Does NP=RP?" in the affirmative [28... |

4 |
Incremental self-improvement for life-time multi-agent reinforcement learning
- Zhao, Schmidhuber
- 1996
(Show Context)
Citation Context ...resent optimal plans in general. This argues that, in the limit, a plan is actually a program. Several techniques have been proposed recently for searching for good program-like controllers in pomdps =-=[46,23]-=-. We restrict our attention to the simpler finite-horizon case and a small set of infinite-horizon problems that have optimal finite-state plans. 7 Extensions and Conclusions The pomdp model provides a firm f... |

4 |
An efficient algorithm for dynamic programming in partially observable Markov decision processes
- Littman, Cassandra, et al.
- 1995
(Show Context)
Citation Context ...Sondik [25] also try to avoid generating all of V_t^+ by constructing V_t directly. However, their algorithms still have worst-case running times exponential in at least one of the problem parameters =-=[13]-=-. In fact, if we could solve this problem, then RP=NP [13], so we will pursue a slightly different approach. Instead of computing V_t directly, we will compute, for each action a, a set Q_t^a of t-ste... |

3 |
A feasible computational approach to infinite-horizon partially-observed Markov decision problems
- Platzman
- 1981
(Show Context)
Citation Context ...ph. Each node of the graph is labeled with an action and there is one labeled outgoing edge for each possible outcome of the action. It is possible to generate this type of plan graph for some pomdps =-=[40,52,7,17]-=-. For completely observable problems with a high branching factor, a more convenient representation is a policy which maps the current state (situation) to a choice of action. Because there is an action... |

3 | Solving -horizon stationary Markov decision process in time proportional to - Tseng - 1990 |

2 |
Cost-Effective Sensing During Plan Execution
- Hansen
- 1994
(Show Context)
Citation Context ...ph. Each node of the graph is labeled with an action and there is one labeled outgoing edge for each possible outcome of the action. It is possible to generate this type of plan graph for some pomdps =-=[40,52,7,17]-=-. For completely observable problems with a high branching factor, a more convenient representation is a policy which maps the current state (situation) to a choice of action. Because there is an action... |

2 |
Anytime synthetic projection
- Drummond, Bresina
- 1990
(Show Context)
Citation Context ...what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest =-=[13, 11, 22]-=-. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets o... |

2 |
Representation and evaluation of plans with loops. Working notes for the 1995 Stanford Spring Symposium on Extended Theories of Action
- Smith, Williamson
- 1995
(Show Context)
Citation Context ...an our DAG-structured plans). Our work on pomdps finds DAG-structured plans for finite-horizon problems. For infinite-horizon problems, it is necessary to introduce loops into the plan representation =-=[57,31]-=-. (Loops might also be useful in long finite-horizon pomdps for representational succinctness.) A simple loop-based plan representation depicts a plan as a labeled directed graph. Each node of the ... |

2 |
The optimal search for a moving target when the search path is constrained
- Eagle
- 1984
(Show Context)
Citation Context ...ss algorithm [9]. As pointed out in Section 4, value functions in belief space have a natural geometric interpretation. For small state spaces, algorithms that exploit this geometry are quite efficient =-=[16]-=-. An excellent example of this is Cheng's linear support algorithm [10]. This algorithm can be viewed as a variation of the witness algorithm in which witness points are sought at the corners of regio... |

1 |
The complexity of mean-payoff games on graphs
- Zwick, Paterson
- 1996
(Show Context)
Citation Context ...transformation holds in the opposite direction: any total expected discounted reward problem (completely observable or nite horizon) can be transformed into a goal-achievement problem of similar size =-=[12,69]-=-. Roughly, the transformation simulates the discount factorby introducing an absorbing state with a small probability of being entered on each step. Rewards are then simulated by normalizing all rewar... |