Results 1 - 4 of 4
Sample-Based Policy Iteration for Constrained DEC-POMDPs
Abstract

Cited by 1 (1 self)
Abstract. We introduce constrained DEC-POMDPs, an extension of the standard DEC-POMDP that includes constraints on the optimality of the overall team rewards. Constrained DEC-POMDPs present a natural framework for modeling cooperative multi-agent problems with limited resources. To solve such DEC-POMDPs, we propose a novel sample-based policy iteration algorithm. The algorithm builds on multi-agent dynamic programming and benefits from several recent advances in DEC-POMDP algorithms such as MBDP [12] and TBDP [13]. Specifically, it improves the joint policy by solving a series of standard nonlinear programs (NLPs), thereby building on recent advances in NLP solvers. Our experimental results confirm the algorithm can efficiently solve constrained DEC-POMDPs that cause general DEC-POMDP algorithms to fail.
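The paper's core computational step, improving a joint policy by solving constrained NLPs, can be illustrated on a hypothetical one-step team problem. The model, numbers, and use of SciPy's SLSQP solver below are illustrative assumptions, not the paper's formulation: two agents each choose action 1 with probabilities p and q, the team is rewarded only for coordinated choices, and a shared resource budget couples the two agents' probabilities.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical one-step team problem (not from the paper): two agents
# each pick action 1 with probabilities p and q; the team earns reward
# 2 only when both pick action 1, and each use of action 1 consumes one
# unit of a shared resource, with an expected budget of 1.
R_both = 2.0   # team reward when both agents choose action 1
budget = 1.0   # bound on expected resource consumption p + q

def neg_value(x):
    p, q = x
    return -(R_both * p * q)           # minimize negative expected reward

cons = [{"type": "ineq", "fun": lambda x: budget - (x[0] + x[1])}]
res = minimize(neg_value, x0=[0.3, 0.3], method="SLSQP",
               bounds=[(0.0, 1.0), (0.0, 1.0)], constraints=cons)
p, q = res.x
print(p, q, -res.fun)  # optimum splits the budget: p = q = 0.5
```

The resource constraint couples the two agents' action probabilities, which is what makes the improvement step a joint constrained NLP rather than independent per-agent updates.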
Cost-Sensitive Exploration in Bayesian Reinforcement Learning
Abstract
In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long-term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems.
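A minimal sketch of the CMDP machinery this abstract relies on, using an invented two-state toy problem (this is not BEETLE, and the model and numbers are assumptions for illustration): the classical occupancy-measure linear program maximizes expected discounted reward subject to an expected-cost budget, and a randomized policy is read off the optimal occupancies.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy CMDP (not from the paper): 2 states, 2 actions.
# State 1 is a rewarding absorbing state; the "risky" action (a=1)
# in state 0 reaches it but incurs unit cost, capped by a budget.
gamma = 0.9
P = np.array([  # P[a, s, s']
    [[1.0, 0.0], [0.0, 1.0]],   # a=0 "safe": stay put
    [[0.1, 0.9], [0.0, 1.0]],   # a=1 "risky": 0 -> 1 w.p. 0.9
])
r = np.array([[0.0, 0.0], [1.0, 1.0]])  # r[s, a]: reward 1 in state 1
c = np.array([[0.0, 1.0], [0.0, 0.0]])  # c[s, a]: risky action costs 1
mu0 = np.array([1.0, 0.0])              # start in state 0
budget = 0.5

nS, nA = 2, 2

def idx(s, a):
    return s * nA + a

# Bellman-flow constraints on the discounted occupancy x(s,a):
#   sum_a x(s,a) - gamma * sum_{s',a'} P(s | s',a') x(s',a') = mu0(s)
A_eq = np.zeros((nS, nS * nA))
for s in range(nS):
    for a in range(nA):
        A_eq[s, idx(s, a)] += 1.0
    for sp in range(nS):
        for a in range(nA):
            A_eq[s, idx(sp, a)] -= gamma * P[a, sp, s]
b_eq = mu0

# Expected discounted cost must stay within the budget.
A_ub = c.reshape(1, -1)
b_ub = [budget]

res = linprog(-r.reshape(-1), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(nS, nA)
# Randomized policy: pi(a|s) proportional to occupancy x(s,a).
pi = x / x.sum(axis=1, keepdims=True)
print("value:", -res.fun)          # expected discounted reward
print("cost :", float(c.reshape(-1) @ res.x))
```

The binding cost constraint is what forces a randomized policy: in state 0 the optimal solution mixes the safe and risky actions so that the expected discounted cost exactly meets the budget.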
Rapid Multirobot Deployment with Time Constraints
Abstract
Abstract—In this paper we consider the problem of multirobot deployment under temporal deadlines. The objective is to compute strategies trading off safety for speed to maximize the probability of reaching a given set of target locations within a preassigned temporal deadline. We formulate this problem using the theory of Constrained Markov Decision Processes and show that, thanks to this framework, it is possible to determine deployment strategies that maximize the probability of success while satisfying the deadline. Moreover, the formulation allows the exact computation of the failure probability of complex deployment tasks. Simulation results illustrate how the proposed method works in different scenarios and show how informed decisions can be made regarding the size of the robot team.
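The safety-versus-speed trade-off under a deadline can be sketched with a backward dynamic program on a hypothetical toy chain (the sites, actions, and probabilities below are invented for illustration and are not the paper's CMDP model): a safe action never fails but is slow, a fast action can crash, and the optimal mix depends on how much time remains.

```python
import numpy as np

# Hypothetical deployment chain (not the paper's model): a robot must
# reach site 3 from site 0 within a deadline of T steps.  "slow" is
# safe (advance w.p. 0.95, else stay); "fast" jumps two sites w.p. 0.7
# but crashes w.p. 0.3 (absorbing failure).
SITES, TARGET, CRASH = 5, 3, 4  # sites 0..3 plus a crash state

def step_dist(s, a):
    """Return the list of (next_state, probability) pairs for (s, a)."""
    if s in (TARGET, CRASH):
        return [(s, 1.0)]               # absorbing states
    if a == "slow":
        return [(s + 1, 0.95), (s, 0.05)]
    return [(min(s + 2, TARGET), 0.7), (CRASH, 0.3)]

def success_prob(T, start=0):
    """Max probability of reaching TARGET within T steps (backward DP)."""
    V = np.zeros(SITES)
    V[TARGET] = 1.0
    for _ in range(T):
        V = np.array([max(sum(p * V[sp] for sp, p in step_dist(s, a))
                          for a in ("slow", "fast"))
                      for s in range(SITES)])
    return V[start]

print(success_prob(2))  # tight deadline: must risk the fast action
print(success_prob(4))  # loose deadline: mostly slow, fast as fallback
```

The DP also yields the exact failure probability (one minus the success probability), mirroring the abstract's point that the constrained formulation supports exact computation rather than simulation-based estimates.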
Properly Acting under Partial Observability with Action Feasibility Constraints
Abstract
Abstract. We introduce the Action-Constrained Partially Observable Markov Decision Process (AC-POMDP), which arose from studying critical robotic applications with damaging actions. AC-POMDPs restrict the optimized policy to apply only feasible actions: each action is feasible in a subset of the state space, and the agent can observe the set of applicable actions in the current hidden state, in addition to the standard observations. We present optimality equations for AC-POMDPs, which require operating on α-vectors defined over many different belief subspaces. We propose an algorithm named Pre-Condition Value Iteration (PCVI), which fully exploits this specific structure of AC-POMDP α-vectors. We also design a relaxed version of PCVI whose complexity is exponentially smaller than that of PCVI. Experimental results on POMDP robotic benchmarks with action feasibility constraints show the benefits of explicitly exploiting the semantic richness of action-feasibility observations in AC-POMDPs over equivalent but unstructured POMDPs.
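The distinguishing ingredient of AC-POMDPs, observing which actions are feasible in the hidden state, can be sketched as a belief-restriction step. This is a toy illustration under assumed per-state feasible-action sets, not the PCVI algorithm: observing the feasible set zeroes out every state whose feasible set does not match, shrinking the belief to the consistent subspace before any standard observation update.

```python
import numpy as np

# Toy illustration (not the paper's algorithm): in an AC-POMDP the
# agent also observes which actions are feasible in the hidden state,
# so the belief can be restricted to the matching subspace before any
# standard observation update.
feasible = {             # hypothetical per-state feasible-action sets
    0: frozenset({"move", "sample"}),
    1: frozenset({"move"}),
    2: frozenset({"move", "sample"}),
}

def restrict_belief(belief, observed_actions):
    """Zero out states inconsistent with the observed feasible set."""
    b = np.array([belief[s] if feasible[s] == observed_actions else 0.0
                  for s in sorted(feasible)])
    total = b.sum()
    if total == 0.0:
        raise ValueError("observed feasible set impossible under belief")
    return b / total

b = restrict_belief(np.array([0.5, 0.3, 0.2]), frozenset({"move"}))
print(b)  # all mass on state 1
```

This restriction is why the abstract speaks of α-vectors defined over different belief subspaces: each observable feasible-action set carves out its own subset of states over which values need to be represented.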