Results 1-10 of 79
Near-optimal Regret Bounds for Reinforcement Learning
Abstract

Cited by 94 (11 self)
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm.
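To make the diameter parameter concrete: for a known MDP, the minimal expected travel time between each pair of states can be computed by value iteration on hitting times, and the diameter is the largest such time. The sketch below is my own illustration, not code from the paper; the function name and the (S, A, S) transition-array layout are assumptions.

```python
import numpy as np

def diameter(P, iters=1000, tol=1e-9):
    """Estimate the diameter D of a known MDP.

    P: transition array of shape (S, A, S), P[s, a, s'] = Pr(s' | s, a).
    For each target state, run value iteration on expected hitting times:
        T(s) = 1 + min_a sum_{s'} P[s, a, s'] * T(s'),  T(target) = 0.
    The diameter is the largest hitting time over all (source, target) pairs.
    """
    S, A, _ = P.shape
    D = 0.0
    for target in range(S):
        T = np.zeros(S)
        for _ in range(iters):
            # (P @ T) has shape (S, A); minimize expected time over actions.
            T_new = 1.0 + (P @ T).min(axis=1)
            T_new[target] = 0.0  # already at the target: zero steps remain
            if np.abs(T_new - T).max() < tol:
                T = T_new
                break
            T = T_new
        D = max(D, T.max())
    return D
```

On a two-state MDP where one action stays put and the other moves to the opposite state, this returns D = 1, matching the definition above.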
PAC model-free reinforcement learning
 In: ICML '06: Proceedings of the 23rd International Conference on Machine Learning
, 2006
Abstract

Cited by 58 (13 self)
For a Markov Decision Process with finite state (size S) and action spaces (size A per state), we propose a new algorithm, Delayed Q-learning. We prove it is PAC, achieving near-optimal performance except for Õ(SA) timesteps using O(SA) space, improving on the Õ(S²A) bounds of the best previous algorithms. This result proves that efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience; no resets or parallel sampling is used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
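The batched-update idea behind Delayed Q-learning can be sketched as follows. This is a simplified illustration, not the published algorithm: the LEARN flags and the exact attempted-update bookkeeping of the paper are omitted, and all names (class, parameters) are hypothetical.

```python
import numpy as np
from collections import defaultdict

class DelayedQLearning:
    """Minimal sketch of the delayed-update rule: each (s, a) pair
    accumulates m sampled backup targets, and the stored Q-value is
    lowered to the batch average (plus a slack eps1) only when that
    average is meaningfully smaller than the current estimate."""

    def __init__(self, n_actions, gamma=0.95, m=5, eps1=0.01, v_max=None):
        self.gamma = gamma
        self.m = m
        self.eps1 = eps1
        v_max = v_max if v_max is not None else 1.0 / (1.0 - gamma)
        # Optimistic initialization drives exploration.
        self.Q = defaultdict(lambda: np.full(n_actions, v_max))
        self.acc = defaultdict(float)   # accumulated backup targets
        self.count = defaultdict(int)   # samples since the last update

    def act(self, s):
        return int(np.argmax(self.Q[s]))

    def observe(self, s, a, r, s_next):
        self.acc[(s, a)] += r + self.gamma * self.Q[s_next].max()
        self.count[(s, a)] += 1
        if self.count[(s, a)] == self.m:
            avg = self.acc[(s, a)] / self.m
            if self.Q[s][a] - avg >= 2 * self.eps1:  # attempted update
                self.Q[s][a] = avg + self.eps1
            self.acc[(s, a)] = 0.0
            self.count[(s, a)] = 0
```

Because updates happen only every m visits and each is O(1), the per-experience cost stays constant, which is the property the abstract highlights.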
An Analysis of Model-Based Interval Estimation for Markov Decision Processes
, 2007
Abstract

Cited by 45 (5 self)
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents a theoretical analysis of MBIE and a new variation called MBIE-EB, proving their efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less “online” cousins from the literature.
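MBIE-EB augments the empirical Bellman backup with an exploration bonus of the form β/√n(s, a), which inflates the value of rarely tried state-action pairs. A minimal sketch, assuming visit counts and reward sums are maintained externally (the function name and array layout are my own, not the paper's):

```python
import numpy as np

def mbie_eb_q(counts, reward_sums, gamma=0.95, beta=0.5,
              iters=2000, tol=1e-8):
    """Solve, by value iteration, the optimistic Bellman equations
        Q(s,a) = R_hat(s,a) + beta / sqrt(n(s,a))
                 + gamma * sum_s' P_hat(s'|s,a) * max_a' Q(s',a').

    counts:      (S, A, S) observed transition counts.
    reward_sums: (S, A) summed observed rewards.
    """
    n = counts.sum(axis=2)               # visit counts n(s, a)
    n_safe = np.maximum(n, 1)            # avoid division by zero
    P_hat = counts / n_safe[:, :, None]  # empirical transition model
    R_hat = reward_sums / n_safe         # empirical mean rewards
    bonus = beta / np.sqrt(n_safe)       # the MBIE-EB exploration bonus
    Q = np.zeros_like(R_hat)
    for _ in range(iters):
        V = Q.max(axis=1)
        Q_new = R_hat + bonus + gamma * (P_hat @ V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
    return Q
```

With beta = 0 this reduces to ordinary certainty-equivalence value iteration on the empirical model; the bonus term is what supplies the optimism the analysis relies on.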
Reinforcement Learning in Finite MDPs: PAC Analysis
Abstract

Cited by 45 (5 self)
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E³ and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs
 In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence
, 2009
Abstract

Cited by 40 (1 self)
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√AT). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
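The span sp(h*) = max_s h*(s) - min_s h*(s) of the optimal bias vector, the quantity H in the bound above, can be estimated for a known MDP by relative value iteration. The sketch below is my own construction, not REGAL itself, and assumes a unichain, aperiodic MDP so the iteration converges:

```python
import numpy as np

def bias_span(P, R, iters=5000, tol=1e-10):
    """Estimate sp(h*) for an average-reward MDP via relative value
    iteration: iterate the Bellman operator and re-anchor at a
    reference state so the iterates converge to the bias vector up
    to an additive constant (which the span ignores).

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    """
    S, A, _ = P.shape
    h = np.zeros(S)
    for _ in range(iters):
        q = R + P @ h          # one-step backup, shape (S, A)
        h_new = q.max(axis=1)
        h_new -= h_new[0]      # anchor at state 0 (removes the gain)
        if np.abs(h_new - h).max() < tol:
            h = h_new
            break
        h = h_new
    return h.max() - h.min()
```

On an MDP where every state transitions identically, the bias differences reduce to reward differences, which gives a quick sanity check on the implementation.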
Logarithmic online regret bounds for undiscounted reinforcement learning
 In B. Schölkopf, J. Platt & T. Hoffman, eds., Advances in Neural Information Processing Systems 19
, 2007
Abstract

Cited by 40 (0 self)
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
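For reference, the bandit-style upper-confidence idea the abstract alludes to is the UCB1 index: play the arm maximizing empirical mean plus a confidence radius. A minimal sketch (the callback interface is hypothetical; rewards are assumed to lie in [0, 1]):

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1 for multi-armed bandits. `pull(a)` is a hypothetical
    callback returning a reward in [0, 1] for arm a. Returns the
    total reward collected and the per-arm pull counts."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1  # play each arm once to initialize estimates
        else:
            # Empirical mean plus confidence radius; rarely pulled
            # arms get a large radius and are retried optimistically.
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total, counts
```

UCRL carries the same "optimism in the face of uncertainty" principle from arms to state-action pairs, with confidence regions over the MDP's transition probabilities and rewards instead of per-arm means.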
The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning
Abstract

Cited by 35 (6 self)
The purpose of this paper is threefold. First, we formalize and study a problem of learning probabilistic concepts in the recently proposed KWIK framework. We give details of an algorithm, known as the Adaptive k-Meteorologists Algorithm, analyze its sample-complexity upper bound, and give a matching lower bound. Second, this algorithm is used to create a new reinforcement-learning algorithm for factored-state problems that enjoys significant improvement over the previous state-of-the-art algorithm. Finally, we apply the Adaptive k-Meteorologists Algorithm to remove a limiting assumption in an existing reinforcement-learning algorithm. The effectiveness of our approaches is demonstrated empirically in a couple of benchmark domains as well as a robotics navigation problem.
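The underlying k-meteorologists setup can be illustrated in hindsight form: k forecasters each announce a probability for a binary event, and they are scored by cumulative squared (Brier) loss. This toy sketch is not the adaptive online algorithm of the paper, only the scoring rule against which it competes:

```python
import numpy as np

def pick_meteorologist(predictions, outcomes):
    """Score k forecasters by cumulative squared (Brier) loss and
    return the index of the best one in hindsight.

    predictions: array (T, k) of forecast probabilities in [0, 1].
    outcomes:    array (T,) of realized 0/1 outcomes.
    """
    losses = ((predictions - outcomes[:, None]) ** 2).sum(axis=0)
    return int(np.argmin(losses)), losses
```

The adaptive algorithm's contribution is achieving comparable identification online, choosing which forecaster to query at each step, with near-optimal sample complexity.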
Incremental model-based learners with formal learning-time guarantees
 In Proc. 21st UAI Conference
, 2006
Abstract

Cited by 34 (17 self)
Model-based learning algorithms have been shown to use experience efficiently when learning to solve Markov Decision Processes (MDPs) with finite state and action spaces. However, their high computational cost, due to repeatedly solving an internal model, inhibits their use in large-scale problems. We propose a method based on real-time dynamic programming (RTDP) to speed up two model-based algorithms, R-MAX and MBIE (model-based interval estimation), resulting in computationally much faster algorithms with little loss compared to existing bounds. Specifically, our two new learning algorithms, RTDP-RMAX and RTDP-IE, have considerably smaller computational demands than R-MAX and MBIE. We develop a general theoretical framework that allows us to prove that both are efficient learners in a PAC (probably approximately correct) sense. We also present an experimental evaluation of these new algorithms that helps quantify the tradeoff between computational and experience demands.
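The RTDP device the paper builds on replaces full value-iteration sweeps with backups of only the states visited along simulated trajectories. A generic sketch for a known model; the names and the optimistic initialization constant are my own choices, not the paper's RTDP-RMAX or RTDP-IE:

```python
import numpy as np

def rtdp(P, R, start, gamma=0.95, episodes=200, horizon=50, rng=None):
    """Real-time dynamic programming on a known discounted model.

    P: (S, A, S) transitions, R: (S, A) rewards. Instead of sweeping
    every state, each step backs up only the current state and then
    follows the greedy action, so computation concentrates on the
    reachable part of the state space.
    """
    rng = np.random.default_rng(rng)
    S, A, _ = P.shape
    # Optimistic initialization: upper bound on any achievable value.
    V = np.full(S, R.max() / (1.0 - gamma))
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            q = R[s] + gamma * (P[s] @ V)  # backup of state s only
            a = int(np.argmax(q))
            V[s] = q[a]
            s = rng.choice(S, p=P[s, a])   # simulate the greedy action
    return V
```

The speedup in the paper comes from applying exactly this trajectory-focused backup to the optimistic internal models of R-MAX and MBIE rather than re-solving them to convergence after every model update.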
Percentile Optimization for Markov Decision Processes with Parameter Uncertainty
Abstract

Cited by 29 (7 self)
Markov decision processes are an effective tool in modeling decision-making in uncertain dynamic environments. Since the parameters of these models are typically estimated from data or learned from experience, it is not surprising that the actual performance of a chosen strategy often significantly differs from the designer's initial expectations due to unavoidable modeling ambiguity. In this paper, we present a set of percentile criteria that are conceptually natural and representative of the tradeoff between optimistic and pessimistic points of view on the question. We study the use of these criteria under different forms of uncertainty for both the rewards and the transitions. Some forms will be shown to be efficiently solvable and others highly intractable. In each case, we outline solution concepts that take parametric uncertainty into account in the process of decision-making.
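A percentile criterion of this kind can be approximated by Monte Carlo when one can sample the uncertain parameters: evaluate the policy under each sampled model and read off the δ-quantile of the resulting values. This is a hedged sketch of the criterion itself, not of the paper's solution methods; the callback and names are hypothetical:

```python
import numpy as np

def percentile_value(policy_eval, param_samples, delta=0.1):
    """Monte Carlo estimate of a percentile criterion: the value y
    such that the policy achieves at least y with probability
    roughly 1 - delta over the parameter uncertainty.

    policy_eval:   callback mapping one parameter sample (e.g. a draw
                   of the transition probabilities) to the policy's
                   value under that model.
    param_samples: iterable of parameter samples.
    """
    values = np.array([policy_eval(theta) for theta in param_samples])
    return np.quantile(values, delta)  # lower delta-quantile of value
```

Setting delta near 0 recovers a pessimistic (robust) evaluation, while delta near 1 is optimistic, which is the tradeoff the abstract describes.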