Results 1–9 of 9
Reinforcement learning algorithms for MDPs
, 2009
Abstract

Cited by 10 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal-difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of function-approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.
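The bootstrapping idea this survey opens with can be made concrete with a minimal TD(0) value-estimation loop. The random-walk chain, step size, and episode count below are illustrative assumptions for the sketch, not details from the survey:

```python
import random

def td0_value_estimation(n_states=5, episodes=500, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) on a simple random-walk chain: states 0..n_states-1,
    terminating off the left end (reward 0) or right end (reward 1)."""
    rng = random.Random(seed)
    V = [0.0] * n_states  # value estimates, each bootstrapped from its neighbors
    for _ in range(episodes):
        s = n_states // 2  # start every episode in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:
                r, done = 0.0, True
            elif s_next >= n_states:
                r, done = 1.0, True
            else:
                r, done = 0.0, False
            # bootstrapped target: observed reward plus estimated value of successor
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])  # temporal-difference update
            if done:
                break
            s = s_next
    return V
```

For this symmetric walk the true values are (s + 1) / (n_states + 1), so the estimates should roughly approach [1/6, 2/6, 3/6, 4/6, 5/6]; the incremental update above contrasts with the batch variants the survey goes on to compare.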
Adaptive Bandits: Towards the best history-dependent strategy
Abstract

Cited by 5 (0 self)
We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Θ of constraints based on equivalence classes on the common history (information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model θ* ∈ Θ. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Θ. This allows modeling opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits, and other situations. When Θ = {θ}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by Õ(√(TAC)), where A is the number of actions and C is the number of classes of θ. When many models are available, all known algorithms achieving a nice O(√T) regret are unfortunately not tractable and scale poorly with the number of models |Θ|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3} C^{1/3} log|Θ|^{1/2}.
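The single-model case (Θ = {θ}) reduces to running an ordinary bandit algorithm separately inside each equivalence class. The sketch below is a loose illustration of that reduction, not the paper's algorithm: the class model (class = previous action, i.e. one-step memory) and the per-class UCB1 index are both assumptions made for the example.

```python
import math
import random

def per_class_ucb(reward_fn, n_actions, horizon, seed=0):
    """Illustrative reduction: when rewards depend only on an equivalence
    class of the history (here, a toy model where the class is simply the
    previous action), running UCB1 independently per class treats each
    class as its own bandit problem. Returns the cumulative reward."""
    rng = random.Random(seed)
    # per-class statistics: counts[c][a] pulls and sums[c][a] total reward
    counts = [[0] * n_actions for _ in range(n_actions)]
    sums = [[0.0] * n_actions for _ in range(n_actions)]
    cls, total = 0, 0.0
    for t in range(1, horizon + 1):
        c = counts[cls]
        if 0 in c:
            a = c.index(0)  # play each arm once within this class first
        else:
            # standard UCB1 index, computed from this class's statistics only
            a = max(range(n_actions),
                    key=lambda i: sums[cls][i] / c[i]
                    + math.sqrt(2 * math.log(t) / c[i]))
        r = reward_fn(cls, a, rng)
        counts[cls][a] += 1
        sums[cls][a] += r
        total += r
        cls = a  # next class = current action (the assumed one-step-memory model)
    return total
```

With a reward function whose best arm in class c is arm c itself (mean 0.9 vs. 0.1 otherwise, a made-up example), the learner quickly locks onto the matching arm in whichever class it occupies, earning close to 0.9 per round.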
Online Optimization with Dynamic Temporal Uncertainty: Incorporating Short Term Predictions for Renewable Integration in Intelligent Energy Systems
 PROCEEDINGS OF THE TWENTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE
, 2013
Abstract

Cited by 2 (0 self)
Growing costs, environmental awareness, and government directives have set the stage for an increase in the fraction of electricity supplied using intermittent renewable sources such as solar and wind energy. To compensate for the increased variability in supply and demand, we need algorithms for online energy-resource allocation under temporal uncertainty of future consumption and availability. Recent advances in prediction algorithms offer hope that a reduction in future uncertainty, through short-term predictions, will increase the worth of the renewables. Predictive information is then revealed incrementally in an online manner, leading to what we call dynamic temporal uncertainty. We demonstrate the non-triviality of this problem and provide online algorithms, both randomized and deterministic, to handle time-varying uncertainty in future rewards for non-stationary MDPs in general and for energy-resource allocation in particular. We derive theoretical upper and lower bounds that hold even for a finite horizon, and establish that, in the deterministic case, discounting future rewards can be used as a strategy to maximize the total (undiscounted) reward. We also corroborate the efficacy of our methodology using wind and demand traces.
Better Rates for Any Adversarial Deterministic MDP
Abstract

Cited by 1 (1 self)
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state of the art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.
TTIC
Abstract
We consider a Markov decision process with deterministic state-transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker only observes the rewards it receives. In this setting, we present a novel and efficient online decision-making algorithm named MarcoPolo. Under mild assumptions on the structure of the transition dynamics, we prove that MarcoPolo enjoys a regret of O(T^{3/4} √log T) against the best deterministic policy in hindsight. Notably, our analysis does not rely on the stringent unichain assumption, which dominates much of the previous work on this topic.