Results 1–10 of 21
Online Markov Decision Processes under Bandit Feedback
Abstract

Cited by 18 (6 self)
We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.
Online bandit learning against an adaptive adversary: from regret to policy regret
 In Proceedings of the 29th International Conference on Machine Learning
, 2012
Abstract

Cited by 16 (5 self)
Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely accepted formal definition of an online algorithm’s ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm’s actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm’s performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary’s memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
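The bounded-memory reduction described in the abstract above can be sketched in a few lines: play each chosen action for a whole mini-batch of rounds (flushing the adversary's memory), then feed the batch-average reward back to an off-the-shelf bandit learner. The Exp3-style learner and the reward function below are illustrative stand-ins, not the paper's exact construction:

```python
import math
import random

class Exp3:
    """Minimal Exp3-style bandit learner (illustrative only)."""
    def __init__(self, n_arms, eta):
        self.n = n_arms
        self.eta = eta
        self.log_w = [0.0] * n_arms  # log-weights, for numerical stability

    def probs(self):
        m = max(self.log_w)
        w = [math.exp(lw - m) for lw in self.log_w]
        s = sum(w)
        return [x / s for x in w]

    def act(self, rng):
        return rng.choices(range(self.n), weights=self.probs())[0]

    def update(self, arm, reward):
        p = self.probs()
        # importance-weighted estimate: only the played arm's weight moves
        self.log_w[arm] += self.eta * reward / p[arm]

def play_with_minibatches(learner, reward_fn, horizon, batch_size, rng):
    """Repeat each chosen arm for batch_size rounds and feed the
    batch-average reward back; against a bounded-memory adversary the
    repeated plays let standard regret bounds translate into policy regret."""
    history, t = [], 0
    while t < horizon:
        arm = learner.act(rng)
        batch = []
        while t < horizon and len(batch) < batch_size:
            batch.append(reward_fn(t, arm))
            history.append(arm)
            t += 1
        learner.update(arm, sum(batch) / len(batch))
    return history
```

Within each batch the wrapper never switches arms, which is exactly what makes the adversary's adaptive memory harmless between the learner's decisions.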
Reinforcement learning algorithms for MDPs
, 2009
Abstract

Cited by 11 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov decision processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.
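The bootstrapping idea the survey opens with is captured by the tabular TD(0) update; the sketch below is a minimal version, where the integer state indices and the (state, reward, next_state) episode format are assumptions of this illustration:

```python
def td0(episodes, n_states, alpha, gamma):
    """Tabular TD(0) value estimation: bootstrap each state's value
    from the observed reward plus the current estimate of the successor."""
    V = [0.0] * n_states
    for episode in episodes:
        for s, r, s_next in episode:
            # s_next is None at episode termination (no bootstrap term)
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])  # temporal-difference update
    return V
```

With repeated visits the estimate converges toward the expected return; e.g. a single state that terminates with reward 1 has its value pulled geometrically toward 1.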
Arbitrarily modulated Markov decision processes
 In Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference
, 2009
Abstract

Cited by 11 (2 self)
We consider Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., non-stationary) fashion. We propose an online Q-learning style algorithm and give a guarantee on its performance evaluated in retrospect against alternative policies. Unlike previous works, the guarantee depends critically on the variability of the uncertainty in the transition probabilities, but holds regardless of arbitrary changes in rewards and transition probabilities over time. Besides its intrinsic computational efficiency, this approach requires neither prior knowledge nor estimation of the transition probabilities.
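The basic backup that any "Q-learning style algorithm" builds on is the one-step off-policy update; this is an illustrative tabular version only, not the paper's algorithm:

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning backup on a table Q (list of per-state action-value
    lists): move Q[s][a] toward r + gamma * max_a' Q[s_next][a']."""
    target = r + gamma * max(Q[s_next])  # greedy bootstrap from next state
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Note the update uses only the sampled transition (s, a, r, s_next), which is why no prior knowledge or estimation of the transition probabilities is needed.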
Online Learning for Global Cost Functions
, 2009
Abstract

Cited by 10 (3 self)
We consider an online learning setting where at each time step the decision maker has to choose how to distribute the future loss between k alternatives, and then observes the loss of each alternative. Motivated by load balancing and job scheduling, we consider a global cost function (over the losses incurred by each alternative), rather than a summation of the instantaneous losses as done traditionally in online learning. Such global cost functions include the makespan (the maximum over the alternatives) and the L_d norm (over the alternatives). Based on approachability theory, we design an algorithm that guarantees vanishing regret for this setting, where the regret is measured with respect to the best static decision that selects the same distribution over alternatives at every time step. For the special case of the makespan cost we devise a simple and efficient algorithm. In contrast, we show that for concave global cost functions, such as L_d norms for d < 1, the worst-case average regret does not vanish.
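The difference between a global cost and a sum of instantaneous losses is easy to see for the makespan. The sketch below is a toy model of the setting; the least-loaded allocation rule is an illustrative heuristic, not the paper's approachability-based algorithm:

```python
def run(alloc_fn, loss_seq):
    """Each round, split a unit of weight over k alternatives; alternative i
    accrues alpha_i * loss_i. The global cost is evaluated on the final
    cumulative losses, e.g. the makespan (their maximum)."""
    cum = [0.0] * len(loss_seq[0])
    for losses in loss_seq:
        alpha = alloc_fn(cum)
        for i, loss in enumerate(losses):
            cum[i] += alpha[i] * loss
    return cum

def least_loaded(cum):
    """Toy rule: put all weight on the currently least-loaded alternative."""
    i = min(range(len(cum)), key=lambda j: cum[j])
    return [1.0 if j == i else 0.0 for j in range(len(cum))]

def makespan(cum):
    return max(cum)
```

On a stream of identical unit losses over two alternatives, the least-loaded rule balances the loads exactly, matching the best static distribution (1/2, 1/2) on this easy sequence.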
Online learning in Markov decision processes with arbitrarily changing rewards and transitions
 In GameNets’09: Proceedings of the First ICST international conference on Game Theory for Networks
Abstract

Cited by 8 (0 self)
We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., non-stationary) fashion. We present algorithms that combine online learning and robust control, and establish guarantees on their performance evaluated in retrospect against alternative policies, i.e., their regret. These guarantees depend critically on the range of uncertainty in the transition probabilities, but hold regardless of the changes in rewards and transition probabilities over time. We present a version of the main algorithm in the setting where the decision maker’s observations are limited to its own trajectory, and another version that allows a trade-off between performance and computational complexity.
Online Markov decision processes with KullbackLeibler control cost
 Proceedings of the American Control Conference
, 2012
Abstract

Cited by 2 (1 self)
This paper considers an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent’s action at each time step is to specify the probability distribution for the next state given the current state. Following the setup of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given by the Kullback-Leibler (KL) divergence between the agent’s next-state distribution and the one determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after selecting an action. Under mild regularity conditions, we give an explicit construction of a computationally efficient strategy with small regret (i.e., expected difference between its actual total cost and the smallest cost attainable using non-causal knowledge of the state costs), along with a demonstration of the proposed strategy’s performance on a simulated target tracking problem. A number of new results on Markov decision processes with KL control cost are also obtained.
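The per-step cost structure described in the abstract above is simple to compute; this sketch assumes discrete distributions given as lists of probabilities over a common support:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions on the same support;
    terms with p_i = 0 contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def step_cost(state_cost, p_next, passive):
    """Per-step cost in the Todorov-style setup: state cost plus the
    KL control cost of deviating from the passive dynamics."""
    return state_cost + kl_divergence(p_next, passive)
```

Choosing the passive dynamics as the next-state distribution makes the control cost vanish, so the agent pays only the state cost; any deviation is penalized by the (nonnegative) KL term.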
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
Abstract

Cited by 1 (1 self)
We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
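For a one-dimensional parameter, the kind of update behind such O(√T) regret bounds is projected online gradient ascent on the per-round (concave) reward; the eta/sqrt(t) step-size schedule and the interval constraint set are assumptions of this sketch, not details from the paper:

```python
import math

def online_gradient_ascent(grads, theta0, lo, hi, eta):
    """Projected online gradient ascent: at round t, step along the
    gradient of the current concave reward, then clip back onto [lo, hi].
    Step size eta / sqrt(t) gives O(sqrt(T)) regret for concave rewards."""
    theta = theta0
    path = [theta]
    for t, grad in enumerate(grads, start=1):
        theta = theta + (eta / math.sqrt(t)) * grad(theta)
        theta = min(hi, max(lo, theta))  # projection onto the interval
        path.append(theta)
    return path
```

For the concave reward r_t(θ) = -(θ - 0.5)² (gradient -2(θ - 0.5)) the iterate is pulled toward the fixed comparator θ* = 0.5.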
Online Learning in Markov Decision Processes with Changing Cost Sequences
Abstract

Cited by 1 (0 self)
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD² (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret bounds (in particular, we take into account the computational costs of performing approximate projections in MD²). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
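On the probability simplex, mirror descent with the entropic regularizer (exponential weights) needs no approximate projection step at all, since the entropic "projection" reduces to renormalization; this is what makes the simplex case simpler than the general decision sets for which approximate projections are needed. A minimal sketch, illustrative only:

```python
import math

def exponentiated_gradient(loss_vectors, n, eta):
    """Entropic mirror descent (exponential weights) over the n-simplex:
    multiply each coordinate by exp(-eta * loss) and renormalize."""
    x = [1.0 / n] * n  # start from the uniform distribution
    for loss in loss_vectors:
        w = [xi * math.exp(-eta * li) for xi, li in zip(x, loss)]
        s = sum(w)
        x = [wi / s for wi in w]  # the entropic 'projection' is a renormalization
    return x
```

Under a repeated loss vector that always charges the first coordinate, the iterate's mass drifts exponentially fast toward the unpenalized coordinate.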