Results 1–10 of 49
Integrating Sample-based Planning and Model-based Reinforcement Learning
Abstract

Cited by 38 (5 self)
Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near-optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these overmatched planners with a class of sample-based planners—whose computation time is independent of the number of states—without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.
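A sample-based planner of the kind this abstract describes can be illustrated with a sparse-sampling sketch: given only a generative model (a simulator), Q-value estimates are built recursively from sampled successor states, so the cost depends on the sampling width and lookahead depth rather than the size of the state space. The simulator interface and parameter choices below are illustrative assumptions, not the paper's algorithm.

```python
def sparse_sample_q(sim, state, actions, depth, width, gamma=0.95):
    """Estimate Q(state, a) for each action by sparse sampling.

    `sim(state, action)` is assumed to return (next_state, reward)
    drawn from the generative model.  Runtime is O((|A| * width)^depth),
    independent of the total number of states.
    """
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):
            s2, r = sim(state, a)
            q2 = sparse_sample_q(sim, s2, actions, depth - 1, width, gamma)
            total += r + gamma * max(q2.values())
        q[a] = total / width
    return q
```

With a deterministic toy simulator the recursion reduces to exact finite-horizon value backup, which makes the behavior easy to check by hand.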
A Monte-Carlo AIXI Approximation
, 2009
Abstract

Cited by 33 (11 self)
This paper describes a computationally feasible approximation to the AIXI agent, a universal reinforcement learning agent for arbitrary environments. AIXI is scaled down in two key ways: First, the class of environment models is restricted to all prediction suffix trees of a fixed maximum depth. This allows a Bayesian mixture of environment models to be computed in time proportional to the logarithm of the size of the model class. Secondly, the finite-horizon expectimax search is approximated by an asymptotically convergent Monte Carlo Tree Search technique. This scaled down AIXI agent is empirically shown to be effective on a wide class of toy problem domains, ranging from simple fully observable games to small POMDPs. We explore the limits of this approximate agent and propose a general heuristic framework for scaling this technique to much larger problems.
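The Monte Carlo Tree Search component relies on a bandit-style rule to pick which action to expand at each tree node; a minimal sketch of UCB1 selection follows, where the exploration constant `c` and the list-based bookkeeping are assumptions for illustration rather than the paper's exact ρUCT variant.

```python
import math

def ucb1_select(counts, values, total, c=1.414):
    """Pick the action index maximizing the UCB1 score.

    counts[i] = visits to action i, values[i] = mean return observed,
    total = total visits to this node.  Unvisited actions are tried
    first (their score is treated as infinite).
    """
    best, best_score = None, float("-inf")
    for i, (n, v) in enumerate(zip(counts, values)):
        score = float("inf") if n == 0 else v + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```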
Generalizing apprenticeship learning across hypothesis classes
 In ICML
, 2010
Cited by 19 (10 self)
Exploration in model-based reinforcement learning by empirically estimating learning progress
 In Neural Information Processing Systems (NIPS)
, 2012
Abstract

Cited by 14 (4 self)
Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions.
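The contrast this abstract draws can be made concrete with a toy exploration bonus that decays only when the model's recent predictions are empirically accurate, rather than with the raw visit count as in R-MAX-style certainty. The window size, 0/1 error, and scaling below are illustrative assumptions, not the paper's estimator.

```python
from collections import defaultdict

class ErrorDrivenBonus:
    """Toy exploration bonus driven by empirical model error.

    Unlike a pure count-based certainty measure, the bonus here shrinks
    only when the learned model's recent predictions actually match
    observed transitions.
    """
    def __init__(self, window=10, scale=1.0):
        self.window = window
        self.scale = scale
        self.errors = defaultdict(list)   # (s, a) -> recent prediction errors

    def record(self, s, a, predicted_s2, actual_s2):
        err = 0.0 if predicted_s2 == actual_s2 else 1.0
        buf = self.errors[(s, a)]
        buf.append(err)
        if len(buf) > self.window:
            buf.pop(0)                    # keep only the recent window

    def bonus(self, s, a):
        buf = self.errors[(s, a)]
        if not buf:
            return self.scale             # never tried: maximal bonus
        return self.scale * sum(buf) / len(buf)
```

A misled count-based bonus would decay even where the model keeps predicting wrongly; here the bonus stays high at such state-action pairs, which is the robustness property the paper targets.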
Agnostic system identification for model-based reinforcement learning
 In ICML
, 2012
Abstract

Cited by 12 (4 self)
A fundamental problem in control is to learn a model of a system from observations that is useful for controller synthesis. To provide good performance guarantees, existing methods must assume that the real system is in the class of models considered during learning. We present an iterative method with strong guarantees even in the agnostic case where the system is not in the class. In particular, we show that any no-regret online learning algorithm can be used to obtain a near-optimal policy, provided some model achieves low training error and access to a good exploration distribution. Our approach applies to both discrete and continuous domains. We demonstrate its efficacy and scalability on a challenging helicopter domain from the literature.
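The iterative scheme can be sketched on a toy 1-D linear system: aggregate data under the current policy, refit the model on everything collected so far (a stand-in for the no-regret online learner), and replan. The system, the least-squares refit, and the greedy "planner" are all illustrative assumptions, not the paper's method.

```python
def agnostic_sysid(true_step, horizon=20, iters=5):
    """Skeleton of an iterative data-aggregation model-learning loop.

    A linear model x' ≈ theta * x + u is refit on the aggregated data
    after every iteration, and the planner greedily drives the
    predicted next state to zero (u = -theta * x).
    """
    data = []          # aggregated (x, u, x_next) triples
    theta = 0.0        # current model parameter
    for _ in range(iters):
        x = 1.0
        for _ in range(horizon):
            u = -theta * x                      # plan under current model
            x_next = true_step(x, u)            # query the real system
            data.append((x, u, x_next))
            x = x_next
        # least-squares refit of theta on all data gathered so far
        num = sum(x * (xn - u) for x, u, xn in data)
        den = sum(x * x for x, u, xn in data)
        theta = num / den if den > 0 else 0.0
    return theta
```

Because the data are gathered under the very policies the model induces, the fit concentrates on the part of the system the controller actually visits, which is the intuition behind the agnostic guarantee.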
Exploration in Relational Domains for Model-based Reinforcement Learning
, 2012
Abstract

Cited by 10 (1 self)
A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E^3 and R-MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques.
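The idea of relational count functions can be illustrated by counting experiences per abstract context rather than per raw state-action pair, so that structurally similar novel states inherit the counts of their context. The abstraction map below is a hypothetical stand-in for the learned rule contexts.

```python
from collections import Counter

def context_counts(experiences, context_of):
    """Relational-style visitation counts.

    `experiences` is a list of (state, action) pairs; `context_of`
    maps each pair to an abstract context (e.g. the learned rule that
    covers it).  Counts accumulate per context, so an unseen state
    covered by a familiar context does not look novel.
    """
    counts = Counter()
    for s, a in experiences:
        counts[context_of(s, a)] += 1
    return counts
```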
Efficient learning of relational models for sequential decision making
, 2010
Abstract

Cited by 8 (1 self)
The exploration-exploitation tradeoff is crucial to reinforcement-learning (RL) agents, and a significant number of sample complexity results have been derived for agents in propositional domains. These results guarantee, with high probability, near-optimal behavior in all but a polynomial number of timesteps in the agent’s lifetime. In this work, we prove similar results for certain relational representations, primarily a class we call “relational action schemas”. These generalized models allow us to specify state transitions in a compact form, for instance describing the effect of picking up a generic block instead of picking up 10 different specific blocks. We present theoretical results on crucial subproblems in action-schema learning using the KWIK framework, which allows us to characterize the sample efficiency of an agent learning these models in a reinforcement-learning setting. These results are extended in an apprenticeship learning paradigm where an agent has access not only to its environment, but also to a teacher that can demonstrate traces of state/action/state sequences. We show that the class of action schemas that are efficiently learnable in this paradigm is strictly larger than those learnable in the online setting. We link ...
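The KWIK ("Knows What It Knows") framework mentioned here requires a learner that either predicts accurately or explicitly admits ignorance; a toy memorizing learner for a deterministic mapping illustrates the protocol (purely illustrative, not the paper's action-schema learner).

```python
class MemorizingKWIK:
    """Toy KWIK learner for a deterministic input -> output mapping.

    It predicts only inputs it has already observed and returns None
    ("I don't know") otherwise; the number of None answers is bounded
    by the number of distinct inputs, which is what makes count-based
    exploration guarantees possible.
    """
    def __init__(self):
        self.memory = {}

    def predict(self, x):
        return self.memory.get(x)    # None signals "don't know"

    def observe(self, x, y):
        self.memory[x] = y
```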
Dynamic Policy Programming
 Journal of Machine Learning Research
Cited by 7 (1 self)
Speedy Q-learning
 In Advances in Neural Information Processing Systems 24
, 2011
Abstract

Cited by 6 (3 self)
We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n)/(ε²(1 − γ)⁴)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1 − γ), and thus is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than incremental methods like Q-learning.
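As we read the paper, SQL keeps the two most recent Q-estimates and combines their empirical Bellman backups with a decaying step size. The sketch below runs the update synchronously on a known deterministic MDP, so the exact Bellman operator stands in for the sampled one; the update rule is our reading of SQL and the toy MDP is an assumption.

```python
def speedy_q_learning(rewards, next_state, n_states, n_actions,
                      gamma=0.5, iters=2000):
    """Synchronous speedy Q-learning on a known deterministic MDP.

    Update (our reading of the SQL rule), with a_k = 1/(k + 1) and T
    the Bellman operator:
      Q_{k+1} = Q_k + a_k (T Q_{k-1} - Q_k) + (1 - a_k)(T Q_k - T Q_{k-1})
    """
    q_prev = [[0.0] * n_actions for _ in range(n_states)]
    q = [[0.0] * n_actions for _ in range(n_states)]

    def bellman(qtab, s, a):
        s2 = next_state[s][a]
        return rewards[s][a] + gamma * max(qtab[s2])

    for k in range(iters):
        a_k = 1.0 / (k + 1)
        q_new = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                t_prev = bellman(q_prev, s, a)   # T applied to Q_{k-1}
                t_cur = bellman(q, s, a)         # T applied to Q_k
                q_new[s][a] = (q[s][a] + a_k * (t_prev - q[s][a])
                               + (1 - a_k) * (t_cur - t_prev))
        q_prev, q = q, q_new
    return q
```

On a one-state MDP where action 1 pays reward 1 and action 0 pays 0 (both self-looping, γ = 0.5), the optimal values are Q*(·, 1) = 2 and Q*(·, 0) = 1, and the iterates approach them.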
Sample Complexity of Multitask Reinforcement Learning
Abstract

Cited by 6 (1 self)
Transferring knowledge across a sequence of reinforcement-learning tasks is challenging, and has a number of important applications. Though there is encouraging empirical evidence that transfer can improve performance in subsequent reinforcement-learning tasks, there has been very little theoretical analysis. In this paper, we introduce a new multitask algorithm for a sequence of reinforcement-learning tasks when each task is sampled independently from (an unknown) distribution over a finite set of Markov decision processes whose parameters are initially unknown. For this setting, we prove under certain assumptions that the per-task sample complexity of exploration is reduced significantly due to transfer compared to standard single-task algorithms. Our multitask algorithm also has the desired characteristic that it is guaranteed not to exhibit negative transfer: in the worst case its per-task sample complexity is comparable to the corresponding single-task algorithm.