Results 1–10 of 18
On the Undecidability of Probabilistic Planning and Related Stochastic Optimization Problems
Artificial Intelligence, 2003
"... Automated planning, the problem of how an agent achieves a goal given a repertoire of actions, is one of the foundational and most widely studied problems in the AI literature. The original formulation of the problem makes strong assumptions regarding the agent's knowledge and control over the ..."
Abstract

Cited by 73 (0 self)
Automated planning, the problem of how an agent achieves a goal given a repertoire of actions, is one of the foundational and most widely studied problems in the AI literature. The original formulation of the problem makes strong assumptions regarding the agent's knowledge and control over the world, namely that its information is complete and correct, and that the results of its actions are deterministic and known.
Learning Accuracy and Availability of Humans who Help Mobile Robots
"... When mobile robots perform tasks in environments with humans, it seems appropriate for the robots to rely on such humans for help instead of dedicated human oracles or supervisors. However, these humans are not always available nor always accurate. In this work, we consider human help to a robot as ..."
Abstract

Cited by 15 (3 self)
When mobile robots perform tasks in environments with humans, it seems appropriate for the robots to rely on such humans for help instead of dedicated human oracles or supervisors. However, these humans are neither always available nor always accurate. In this work, we consider human help to a robot as concretely providing observations about the robot’s state to reduce state uncertainty as it executes its policy autonomously. We model the probability of receiving an observation from a human in terms of their availability and accuracy by introducing Human Observation Providers POMDPs (HOP-POMDPs). We contribute an algorithm to learn human availability and accuracy online while the robot is executing its current task policy. We demonstrate that our algorithm is effective in approximating the true availability and accuracy of humans without depending on oracles to learn, thus increasing the tractability of deploying a robot that can occasionally ask for help.
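The online availability/accuracy estimation described in this abstract can be illustrated with a minimal Beta-Bernoulli counter kept per human helper. The class name, priors, and update rule below are assumptions for the sketch, not the paper's HOP-POMDP learning algorithm.

```python
# Hypothetical sketch: online Beta-Bernoulli estimates of a helper's
# availability (did they respond?) and accuracy (was the answer right?).
# Priors and names are illustrative assumptions, not from the paper.

class HelperModel:
    def __init__(self, prior_a=1.0, prior_b=1.0):
        # Beta(prior_a, prior_b) priors for both quantities
        self.avail = [prior_a, prior_b]  # responded / did not respond
        self.acc = [prior_a, prior_b]    # correct / incorrect observations

    def record_query(self, responded, correct=None):
        # Update availability on every query; accuracy only when answered.
        self.avail[0 if responded else 1] += 1
        if responded and correct is not None:
            self.acc[0 if correct else 1] += 1

    def availability(self):
        a, b = self.avail
        return a / (a + b)

    def accuracy(self):
        a, b = self.acc
        return a / (a + b)
```

Each estimate is just the posterior mean of a Beta distribution, so it stays well-defined before any data arrives and sharpens as queries accumulate.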
Average-reward decentralized Markov decision processes
, 2007
"... Formal analysis of decentralized decision making has become a thriving research area in recent years, producing a number of multiagent extensions of Markov decision processes. While much of the work has focused on optimizing discounted cumulative reward, optimizing average reward is sometimes a mor ..."
Abstract

Cited by 13 (4 self)
Formal analysis of decentralized decision making has become a thriving research area in recent years, producing a number of multiagent extensions of Markov decision processes. While much of the work has focused on optimizing discounted cumulative reward, optimizing average reward is sometimes a more suitable criterion. We formalize a class of such problems and analyze its characteristics, showing that it is NP-complete and that optimal policies are deterministic. Our analysis lays the foundation for designing two optimal algorithms. Experimental results with a standard problem from the literature illustrate the applicability of these solution techniques.
Intervention in Gene Regulatory Networks via a Stationary Mean-First-Passage-Time Control Policy
IEEE Transactions on Biomedical Engineering, 2008
"... Abstract—A prime objective of modeling genetic regulatory networks is the identification of potential targets for therapeutic intervention. To date, optimal stochastic intervention has been studied in the context of probabilistic Boolean networks, with the control policy based on the transition p ..."
Abstract

Cited by 12 (6 self)
A prime objective of modeling genetic regulatory networks is the identification of potential targets for therapeutic intervention. To date, optimal stochastic intervention has been studied in the context of probabilistic Boolean networks, with the control policy based on the transition probability matrix of the associated Markov chain and dynamic programming used to find optimal control policies. Dynamic programming algorithms are problematic owing to their high computational complexity. Two additional computationally burdensome issues that arise are determining whether the network can be controlled and identifying the best gene for intervention. This paper proposes an algorithm based on mean first-passage time that assigns a stationary control policy for each gene candidate. It serves as an approximation to an optimal control policy and, owing to its reduced computational complexity, can be used to predict the best control gene. Once the best control gene is identified, one can derive an optimal policy or simply utilize the approximate policy for this gene when the network size precludes a direct application of dynamic programming algorithms. A salient point is that the proposed algorithm can be model-free: it can be designed directly from time-course data without having to infer the transition probability matrix of the network.
Index Terms—Dynamic programming, genetic regulatory networks, mean first-passage time, probabilistic Boolean networks, stochastic optimal control.
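For a chain whose transition matrix is known, the mean-first-passage-time quantity underlying the proposed policy reduces to one linear solve, m_i = 1 + Σ_{j≠t} P[i,j]·m_j with m_t = 0. The sketch below assumes a known matrix (whereas a key point of the paper is avoiding that inference), and the three-state chain is purely illustrative.

```python
# Sketch: mean first-passage times to a target state of a Markov chain,
# from m_i = 1 + sum_{j != t} P[i, j] * m_j, with m_t = 0.
# The example chain is illustrative, not a gene-network model.
import numpy as np

def mean_first_passage(P, target):
    n = P.shape[0]
    keep = [i for i in range(n) if i != target]
    Q = P[np.ix_(keep, keep)]  # transitions among non-target states
    # Solve (I - Q) m = 1 for the non-target states.
    m_sub = np.linalg.solve(np.eye(len(keep)) - Q, np.ones(len(keep)))
    m = np.zeros(n)
    m[keep] = m_sub
    return m

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
m = mean_first_passage(P, 2)
print(m)
```

The paper's point is precisely that this O(n³) solve over the full state space becomes impractical for large networks, motivating the approximate stationary policy.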
On policy iteration as a Newton’s method and polynomial policy iteration algorithms
, 2002
"... Policy iteration is a popular technique for solving Markov decision processes (MDPs). It is easy to describe and implement, and has excellent performance in practice. But not much is known about its complexity. The best upper bound remains exponential, and the best lower bound is a trivial Ω(n) on t ..."
Abstract

Cited by 9 (2 self)
Policy iteration is a popular technique for solving Markov decision processes (MDPs). It is easy to describe and implement, and has excellent performance in practice, but not much is known about its complexity. The best upper bound remains exponential, and the best lower bound is a trivial Ω(n) on the number of iterations, where n is the number of states. This paper improves the upper bound to a polynomial for policy iteration on MDP problems with special graph structure. Our analysis is based on the connection between policy iteration and Newton’s method for finding the zero of a convex function, and it offers an explanation as to why policy iteration is fast. It also leads to polynomial bounds on several variants of policy iteration for MDPs whose linear programming formulation requires at most two variables per inequality (MDP(2)). The MDP(2) class includes deterministic MDPs under discounted and average reward criteria. The resulting run-time bounds include O(mn^2 log m log W) for MDP(2) and O(mn^2 log m) for deterministic MDPs, where m denotes the number of actions and W denotes the magnitude of the largest number in the problem description.
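For reference, a minimal sketch of the loop this paper analyzes, on a toy discounted MDP (the tabular encoding and toy parameters are assumptions of the sketch): each iteration solves the policy-evaluation system (I − γP_π)v = r_π and then improves greedily, the step the paper relates to a Newton step on a convex function.

```python
# Minimal tabular policy iteration on a toy discounted MDP.
# P: (A, S, S) transition tensor, r: (A, S) rewards -- an illustrative
# encoding, not tied to the paper's MDP(2) class.
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)
    while True:
        P_pi = P[pi, np.arange(S)]  # (S, S): row s is P[pi[s], s, :]
        r_pi = r[pi, np.arange(S)]
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Greedy improvement over all actions.
        q = r + gamma * P @ v       # (A, S)
        new_pi = q.argmax(axis=0)
        if np.array_equal(new_pi, pi):
            return pi, v
        pi = new_pi
```

On a two-state toy MDP where action 1 swaps states and action 0 stays, with reward only for staying in state 1, the loop converges in two iterations to "swap from state 0, stay in state 1".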
A risk-sensitive approach to total productive maintenance
 Automatica
"... www.elsevier.com/locate/automatica While risksensitive (RS) approaches for designing plans of total productive maintenance are critical in manufacturing systems, there is little in the literature by way of theoretical modeling. Developing such plans often requires the solution of a discretetime st ..."
Abstract

Cited by 8 (6 self)
While risk-sensitive (RS) approaches for designing plans of total productive maintenance are critical in manufacturing systems, there is little in the literature by way of theoretical modeling. Developing such plans often requires the solution of a discrete-time stochastic control-optimization problem. Renewal theory and Markov decision processes (MDPs) are commonly employed tools for solving the underlying problem. The literature on preventive maintenance, for the most part, focuses on minimizing the expected net cost and disregards issues related to minimizing risks. RS maintenance managers employ safety factors to modify the risk-neutral solution in an attempt to heuristically accommodate elements of risk in their decision making. In this paper, our efforts are directed toward developing a formal theory for RS preventive-maintenance plans. We employ the Markowitz paradigm, in which one seeks to optimize a function of the expected cost and its variance. In particular, we present (i) a result for an RS approach in the setting of renewal processes and (ii) a result for solving an RS MDP. We also provide computational results to demonstrate the efficacy of these results. Finally, the theory developed here is sufficiently general that it can be applied to problems in other relevant domains.
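A minimal illustration of the Markowitz-style criterion mentioned here: score each candidate maintenance interval by E[cost] + θ·Var[cost] estimated over simulated renewal cycles. The exponential failure-time model, cost values, and θ below are invented for the sketch and are not from the paper.

```python
# Illustrative mean-variance (Markowitz-style) scoring of maintenance
# intervals. All parameters (fail_rate, costs, theta) are assumptions.
import random

def cycle_cost(interval, fail_rate=0.05, pm_cost=1.0, cm_cost=10.0,
               rng=random):
    # One renewal cycle: corrective maintenance if the unit fails before
    # the planned preventive maintenance, otherwise the cheaper PM cost.
    t_fail = rng.expovariate(fail_rate)
    return cm_cost if t_fail < interval else pm_cost

def risk_sensitive_score(interval, theta=0.5, n=20000, seed=0):
    rng = random.Random(seed)
    costs = [cycle_cost(interval, rng=rng) for _ in range(n)]
    mean = sum(costs) / n
    var = sum((c - mean) ** 2 for c in costs) / n
    return mean + theta * var  # Markowitz objective: E[C] + theta * Var[C]
```

A risk-neutral planner would compare intervals by `mean` alone; setting θ > 0 penalizes high-variance plans, which is the behavior the safety factors mentioned in the abstract try to mimic heuristically.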
Active Reinforcement Learning
"... When the transition probabilities and rewards of a Markov Decision Process (MDP) are known, an agent can obtain the optimal policy without any interaction with the environment. However, exact transition probabilities are difficult for experts to specify. One option left to an agent is a long and pot ..."
Abstract

Cited by 6 (0 self)
When the transition probabilities and rewards of a Markov Decision Process (MDP) are known, an agent can obtain the optimal policy without any interaction with the environment. However, exact transition probabilities are difficult for experts to specify. One option left to an agent is a long and potentially costly exploration of the environment. In this paper, we propose another alternative: given an initial (possibly inaccurate) specification of the MDP, the agent determines the sensitivity of the optimal policy to changes in transitions and rewards. It then focuses its exploration on the regions of space to which the optimal policy is most sensitive. We show that the proposed exploration strategy performs well on several control and planning problems.
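One crude way to realize the sensitivity idea is a finite-difference probe: perturb each reward entry of the specified MDP, re-solve, and rank entries by how much the optimal values move. This is an assumption-laden sketch of the general idea, not the authors' procedure, which also covers transition perturbations.

```python
# Finite-difference sensitivity of optimal values to reward entries of a
# small tabular MDP. The probe and toy parameters are assumptions.
import numpy as np

def optimal_values(P, r, gamma=0.9, iters=500):
    # Plain value iteration; 500 sweeps is ample at gamma = 0.9.
    v = np.zeros(P.shape[1])
    for _ in range(iters):
        v = (r + gamma * P @ v).max(axis=0)
    return v

def reward_sensitivities(P, r, gamma=0.9, eps=1e-3):
    base = optimal_values(P, r, gamma)
    sens = np.zeros_like(r)
    for a in range(r.shape[0]):
        for s in range(r.shape[1]):
            r2 = r.copy()
            r2[a, s] += eps  # nudge one reward entry
            sens[a, s] = np.abs(optimal_values(P, r2, gamma) - base).max() / eps
    return sens
```

High-sensitivity entries mark the (action, state) pairs where a misspecified model most distorts the solution, so an exploring agent would sample those first.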
Discounted deterministic Markov decision processes and discounted all-pairs shortest paths
ACM Transactions on Algorithms
"... We present two new algorithms for finding optimal strategies for discounted, infinitehorizon, Deterministic Markov Decision Processes (DMDP). The first one is an adaptation of an algorithm of Young, Tarjan and Orlin for finding minimum mean weight cycles. It runs in O(mn + n 2 log n) time, where n ..."
Abstract

Cited by 6 (1 self)
We present two new algorithms for finding optimal strategies for discounted, infinite-horizon, Deterministic Markov Decision Processes (DMDPs). The first is an adaptation of an algorithm of Young, Tarjan and Orlin for finding minimum mean weight cycles. It runs in O(mn + n^2 log n) time, where n is the number of vertices (or states) and m is the number of edges (or actions). The second is an adaptation of a classical algorithm of Karp for finding minimum mean weight cycles. It runs in O(mn) time. The first algorithm has a slightly slower worst-case complexity, but is faster than the second in many situations. Both algorithms improve on a recent O(mn^2)-time algorithm of Andersson and Vorobyov. We also present a randomized Õ(m^{1/2} n^2)-time algorithm for finding Discounted All-Pairs Shortest Paths (DAPSP), improving on several previous algorithms.
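The classical subroutine named here, Karp's O(mn) minimum mean weight cycle algorithm, can be sketched as follows; the dict-of-adjacency-lists encoding and the example graph are illustrative, and this is the plain cycle algorithm, not the paper's DMDP adaptation.

```python
# Karp's minimum mean weight cycle algorithm (O(mn)).
# graph: {u: [(v, w), ...]} with vertices 0..n-1; vertex 0 is the source
# and is assumed to reach every vertex.
def min_mean_cycle(graph, n):
    INF = float("inf")
    # d[k][v] = minimum weight of a walk with exactly k edges from 0 to v
    d = [[INF] * n for _ in range(n + 1)]
    d[0][0] = 0.0
    for k in range(1, n + 1):
        for u in range(n):
            if d[k - 1][u] < INF:
                for v, w in graph.get(u, []):
                    if d[k - 1][u] + w < d[k][v]:
                        d[k][v] = d[k - 1][u] + w
    # Karp's theorem: mu* = min_v max_k (d[n][v] - d[k][v]) / (n - k)
    best = INF
    for v in range(n):
        if d[n][v] < INF:
            worst = max((d[n][v] - d[k][v]) / (n - k)
                        for k in range(n) if d[k][v] < INF)
            best = min(best, worst)
    return best
```

The O(mn) bound follows directly from the table fill: n rounds, each relaxing every edge once.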
Purpose Restrictions on Information Use
"... Abstract. Privacy policies in sectors as diverse as Web services, finance and healthcare often place restrictions on the purposes for which a governed entity may use personal information. Thus, automated methods for enforcing privacy policies require a semantics of purpose restrictions to determine ..."
Abstract

Cited by 5 (2 self)
Privacy policies in sectors as diverse as Web services, finance and healthcare often place restrictions on the purposes for which a governed entity may use personal information. Thus, automated methods for enforcing privacy policies require a semantics of purpose restrictions to determine whether a governed agent used information for a purpose. We provide such a semantics using a formalism based on planning. We model planning using Partially Observable Markov Decision Processes (POMDPs), which support an explicit model of information. We argue that information use is for a purpose if and only if the information is used while planning to optimize the satisfaction of that purpose under the POMDP model. We determine information use by simulating ignorance of the information prohibited by the purpose restriction, which we relate to noninterference. We use this semantics to develop a sound audit algorithm to automate the enforcement of purpose restrictions.
Partially-Synchronized DEC-MDPs in Dynamic Mechanism Design
"... In this paper, we combine for the first time the methods of dynamic mechanism design with techniques from decentralized decision making under uncertainty. Consider a multiagent system with selfinterested agents acting in an uncertain environment, each with private actions, states and rewards. Ther ..."
Abstract

Cited by 3 (1 self)
In this paper, we combine for the first time the methods of dynamic mechanism design with techniques from decentralized decision making under uncertainty. Consider a multiagent system with self-interested agents acting in an uncertain environment, each with private actions, states and rewards. There is also a social planner with its own actions, rewards, and states, acting as a coordinator and able to influence the agents via actions (e.g., resource allocations). Agents can only communicate with the center, but may become inaccessible, e.g., when their communication device fails. When accessible to the center, agents can report their local state (and models) and receive recommendations from the center about local policies to follow for the present period and also, should they become inaccessible, until becoming accessible again. Without self-interest, this poses a new problem class which we call partially-synchronized DEC-MDPs, for which we establish some positive complexity results under reasonable assumptions. Allowing for self-interested agents, we are able to bridge to methods of dynamic mechanism design, aligning incentives so that agents truthfully report local state when accessible and choose to follow the prescribed “emergency policies” of the center.