Learning All Optimal Policies with Multiple Criteria
Cited by 20 (0 self)
We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for all linear preference assignments over the multiple reward criteria at once. The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where, instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector $\vec{w}$). We present the algorithm along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector $\vec{w}$.
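The special case mentioned at the end of this abstract (a single fixed weight vector) can be sketched as ordinary value iteration on the scalar reward w · R. This is only an illustration of that reduction, not the paper's set-based backup; the function name, array layout, and default parameters are assumptions made for the example.

```python
import numpy as np

def scalarized_value_iteration(P, R, w, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration on the scalar reward w . R (illustrative layout, not from the paper).

    P: transition probabilities, shape (A, S, S), P[a, s, t] = Pr(t | s, a)
    R: vector rewards, shape (A, S, K), one K-dimensional reward per (a, s)
    w: linear preference weights, shape (K,)
    """
    P = np.asarray(P, dtype=float)
    # Collapse the K reward criteria into one scalar reward per (action, state).
    r = np.asarray(R, dtype=float) @ np.asarray(w, dtype=float)  # shape (A, S)
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = r + gamma * (P @ V)      # expected return per (action, state)
        V_new = Q.max(axis=0)        # back up the maximal expected reward
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = r + gamma * (P @ V)
    return V, Q.argmax(axis=0)       # values and a greedy policy

# One state, two actions; action 0 pays (1, 0), action 1 pays (0, 1).
P = np.ones((2, 1, 1))
R = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
V_a, policy_a = scalarized_value_iteration(P, R, w=[1.0, 0.0])
V_b, policy_b = scalarized_value_iteration(P, R, w=[0.0, 1.0])
```

Changing `w` changes which action is greedy, which is why an algorithm that covers all weight vectors at once has to keep a set of value vectors rather than a single scalar.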
Risk-sensitive reinforcement learning applied to control under constraints
Journal of Artificial Intelligence Research, 2005
Cited by 19 (0 self)
In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
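The risk definition above (the probability of entering an error state under a fixed policy) can be illustrated on a small Markov chain. The sketch below assumes the policy-induced transition matrix is known, which the model-free algorithm in the paper does not require; the function name and the three-state example are hypothetical.

```python
import numpy as np

def policy_risk(P, error_states, n_iter=1000):
    """Probability of ever entering an error state, for each start state.

    P: transition matrix induced by a fixed policy, shape (S, S)
    error_states: indices of the error states
    """
    P = np.asarray(P, dtype=float)
    is_error = np.zeros(P.shape[0], dtype=bool)
    is_error[list(error_states)] = True
    rho = is_error.astype(float)
    for _ in range(n_iter):
        # Risk is 1 in an error state; elsewhere it is the expected risk of the successor.
        rho = np.where(is_error, 1.0, P @ rho)
    return rho

# State 0: start, state 1: error (absorbing), state 2: safe goal (absorbing).
P = [[0.5, 0.1, 0.4],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
rho = policy_risk(P, error_states=[1])
```

From state 0 the risk solves rho = 0.5 * rho + 0.1, so a policy with this transition matrix would be feasible only for thresholds of at least 0.2.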
A Robust Geometric Approach to Multi-Criterion Reinforcement Learning
Journal of Machine Learning Research, 2004
Cited by 18 (1 self)
We consider the problem of reinforcement learning in a dynamic environment, where the learning objective is defined in terms of multiple reward functions of the average reward type. The environment is initially unknown, and furthermore may be affected by the actions of other agents, which are observed but cannot be predicted in advance. We model this situation through a stochastic (Markov) game model, between the learning agent and an arbitrary player, with vector-valued rewards. State recurrence conditions are imposed throughout. The objective of the learning agent is to have its long-term average reward vector belong to a desired target set. Starting with a given target set, we devise learning algorithms to achieve this task. These algorithms rely on learning algorithms for appropriately defined scalar rewards, together with the geometric insight of the theory of approachability for stochastic games. We then address the more general problem where the target set itself may depend on the model parameters, and hence is not known in advance to the learning agent. A particular case which falls into this framework is that of stochastic games with average reward constraints. Further specialization provides a reinforcement learning algorithm for constrained Markov decision processes. Some basic examples are provided to illustrate these results.
A Survey of Multi-Objective Sequential Decision-Making
Cited by 16 (5 self)
Sequential decision-making problems with multiple objectives arise naturally in practice and pose unique challenges for research in decision-theoretic planning and learning, which has largely focused on single-objective settings. This article surveys algorithms designed for sequential decision-making problems with multiple objectives. Though there is a growing body of literature on this subject, little of it makes explicit under what circumstances special methods are needed to solve multi-objective problems. Therefore, we identify three distinct scenarios in which converting such a problem to a single-objective one is impossible, infeasible, or undesirable. Furthermore, we propose a taxonomy that classifies multi-objective methods according to the applicable scenario, the nature of the scalarization function (which projects multi-objective values to scalar ones), and the type of policies considered. We show how these factors determine the nature of an optimal solution, which can be a single policy, a convex hull, or a Pareto front. Using this taxonomy, we survey the literature on multi-objective methods for planning and learning. Finally, we discuss key applications of such methods and outline opportunities for future work.
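The distinction between a Pareto front and the solutions reachable by a linear scalarization can be shown on a handful of value vectors. The sketch below is a brute-force illustration under assumed names and made-up values, not an algorithm from the survey; the weight grid only approximates the set of all linear preferences.

```python
def dominates(u, v):
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(values):
    """Value vectors that no other vector Pareto-dominates (maximization)."""
    return [v for v in values if not any(dominates(u, v) for u in values)]

def linear_optimal(values, weight_grid):
    """Value vectors that maximize w . v for at least one weight vector in the grid."""
    best = set()
    for w in weight_grid:
        best.add(max(values, key=lambda v: sum(wi * vi for wi, vi in zip(w, v))))
    return best

# Hypothetical value vectors of four policies under two objectives.
values = [(3.0, 0.0), (0.0, 3.0), (1.0, 1.0), (0.5, 0.5)]
grid = [(i / 10, 1 - i / 10) for i in range(11)]

front = pareto_front(values)          # (0.5, 0.5) is dominated by (1.0, 1.0)
hull = linear_optimal(values, grid)   # (1.0, 1.0) never wins under linear weights
```

Here (1.0, 1.0) is Pareto-optimal but is never the best choice for any linear weighting, so it belongs to the Pareto front without being on the convex hull.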
Risk-Aware Decision Making and Dynamic Programming
Cited by 12 (2 self)
This paper considers sequential decision making problems under uncertainty, the tradeoff between the expected return and the risk of high loss, and methods that use dynamic programming to find optimal policies. It is argued that using the Bellman Principle determines how risk considerations on the return can be incorporated. The discussion centers around returns generated by Markov Decision Processes and conclusions concern a large class of methods in Reinforcement Learning.
Linear fitted-Q iteration with multiple reward functions
Journal of Machine Learning Research
Cited by 9 (1 self)
We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
On the response of EMT-based control to interacting targets and models
In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-06), 2006
Cited by 8 (5 self)
A novel control mechanism was recently introduced based on Extended Markov Tracking (EMT) [9, 10]. In this paper, we present a study of its response to multiple interacting control goals. We show a simple extension that can be integrated into EMT-based control, and which provides it with the ability to handle several behavioral targets. Experimental support for the validity of this extension is provided. We also describe an experiment with a simulated robot, where EMT-based controllers interact and interfere indirectly via the environment. Experiments support the resilience of multiagent EMT-based team control to potential conflicts that may appear within a team.
Multi-Agent Inverse Reinforcement Learning
Cited by 7 (1 self)
Learning the reward function of an agent by observing its behavior is termed inverse reinforcement learning and has applications in learning from demonstration or apprenticeship learning. We introduce the problem of multi-agent inverse reinforcement learning, where reward functions of multiple agents are learned by observing their uncoordinated behavior. A centralized controller then learns to coordinate their behavior by optimizing a weighted sum of reward functions of all the agents. We evaluate our approach on a traffic-routing domain, in which a controller coordinates actions of multiple traffic signals to regulate traffic density. We show that the learner is not only able to match but even significantly outperform the expert.
Efficient QoS provisioning for adaptive multimedia in mobile communication networks by reinforcement learning
Mobile Netw. Appl., 2006
Cited by 5 (2 self)
The scarcity and large fluctuations of link bandwidth in wireless networks have motivated the development of adaptive multimedia services in mobile communication networks, where it is possible to increase or decrease the bandwidth of individual ongoing flows. This paper studies the issues of quality of service (QoS) provisioning in such systems. In particular, call admission control and bandwidth adaptation are formulated as a constrained Markov decision problem. The rapid growth in the number of states and the difficulty in estimating state transition probabilities in practical systems make it very difficult to employ classical methods to find the optimal policy. We present a novel approach that uses a form of discounted reward reinforcement learning known as Q-learning to solve QoS provisioning for wireless adaptive multimedia. Q-learning does not require the explicit state transition model to solve the Markov decision problem; therefore more general and realistic assumptions can be applied to the underlying system model for this approach than in previous schemes. Moreover, the proposed scheme can efficiently handle the large state space and action set of the wireless adaptive multimedia QoS provisioning problem. Handoff dropping probability and average allocated bandwidth are considered as QoS constraints in our model and can be guaranteed simultaneously. Simulation results demonstrate the effectiveness of the proposed scheme in adaptive multimedia mobile communication networks.
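The abstract's point that Q-learning needs no explicit transition model can be seen in the standard tabular update, which uses only an observed transition (state, action, reward, next state). This is the generic discounted Q-learning rule, not the paper's constrained QoS scheme; the function name, table layout, and step-size values are assumptions for the example.

```python
def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step in place and return the table.

    Q: list of lists, Q[state][action] = current estimate
    (s, a, reward, s_next): one observed transition; no transition model is used
    """
    # Bootstrapped target: observed reward plus discounted best next-state estimate.
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

# Two states, two actions; hypothetical starting estimates.
Q = [[0.0, 0.0],
     [1.0, 2.0]]
q_learning_update(Q, s=0, a=1, reward=1.0, s_next=1)
```

With these numbers the target is 1.0 + 0.9 * 2.0 = 2.8, so the updated entry moves one tenth of the way from 0.0 toward 2.8.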
Hypervolume indicator and dominance reward based multi-objective Monte-Carlo tree search
Machine Learning, 2013
Cited by 4 (0 self)
Concerned with multi-objective reinforcement learning (MORL), this paper presents MOMCTS, an extension of Monte-Carlo Tree Search to multi-objective sequential decision making, embedding two decision rules respectively based on the hypervolume indicator and the Pareto dominance reward. The MOMCTS approaches are firstly compared with the MORL state of the art on two artificial problems, the two-objective Deep Sea Treasure problem and the three-objective Resource Gathering problem. The scalability of MOMCTS is also examined in the context of the NP-hard grid scheduling problem, showing that the MOMCTS performance matches the (non-RL based) state of the art albeit with a higher computational cost.
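The hypervolume indicator mentioned above measures the region dominated by a set of value vectors relative to a reference point. A minimal sketch for the two-objective maximization case follows; it shows the indicator only, not the MOMCTS decision rules, and the function name and sample points are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (two objectives, maximization)."""
    # Only points strictly better than the reference in both objectives contribute.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the strip this point contributes above the height already covered.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

front = [(3.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
hv = hypervolume_2d(front, ref=(0.0, 0.0))
hv_with_dominated = hypervolume_2d(front + [(1.0, 1.0)], ref=(0.0, 0.0))
```

Adding a dominated point leaves the indicator unchanged, which is what makes it usable as a scalar quality measure for a set of non-dominated solutions.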