Near-Bayesian exploration in polynomial time (full version). Available at http://ai.stanford.edu/~kolter, 2009
Cited by 13 (0 self)
We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.
Reinforcement Learning in Finite MDPs: PAC Analysis
Cited by 8 (2 self)
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E³ and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
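For readers unfamiliar with the model-based side of this comparison, the following is a minimal sketch of the optimism-under-uncertainty idea commonly associated with R-MAX: any state-action pair visited fewer than m times is treated as if it paid the maximum reward forever. The function name, the threshold m, the number of sweeps, and the data layout are assumptions made for illustration; this is not the paper's analysis or a faithful reproduction of any published implementation.

```python
def rmax_values(counts, trans_counts, reward_sums, r_max, m, gamma, sweeps=500):
    """Value iteration on an R-MAX-style optimistic model (illustrative sketch).

    counts[s][a]          -- number of visits to (s, a)
    trans_counts[s][a][t] -- number of observed transitions (s, a) -> t
    reward_sums[s][a]     -- sum of rewards observed at (s, a)
    m                     -- hypothetical visit threshold for a "known" pair
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    v_max = r_max / (1.0 - gamma)  # value of receiving r_max at every step
    values = [0.0] * n_states
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for a in range(n_actions):
                n = counts[s][a]
                if n < m:
                    # "Unknown" pair: assume the best possible outcome.
                    q[s][a] = v_max
                else:
                    # "Known" pair: use empirical reward and transition estimates.
                    r_hat = reward_sums[s][a] / n
                    expected_next = sum(
                        trans_counts[s][a][t] / n * values[t]
                        for t in range(n_states)
                    )
                    q[s][a] = r_hat + gamma * expected_next
            values[s] = max(q[s])
    return q
```

Acting greedily with respect to the returned Q-values steers the agent toward under-visited pairs until they cross the threshold, after which the empirical estimates take over.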
Nonparametric Bayesian Policy Priors for Reinforcement Learning
- In Neural Information Processing Systems (NIPS), 2010
Cited by 3 (1 self)
We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning.
Variance-Based Rewards for Approximate Bayesian Reinforcement Learning
Cited by 2 (1 self)
The explore–exploit dilemma is one of the central challenges in Reinforcement Learning (RL). Bayesian RL solves the dilemma by providing the agent with information in the form of a prior distribution over environments; however, full Bayesian planning is intractable. Planning with the mean MDP is a common myopic approximation of Bayesian planning. We derive a novel reward bonus that is a function of the posterior distribution over environments, which, when added to the reward in planning with the mean MDP, results in an agent which explores efficiently and effectively. Although our method is similar to existing methods when given an uninformative or unstructured prior, unlike existing methods, our method can exploit structured priors. We prove that our method results in a polynomial sample complexity and empirically demonstrate its advantages in a structured exploration task.
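The general recipe described in this abstract, planning with the posterior-mean MDP after adding a bonus to the reward, can be sketched as follows. The sketch assumes independent Dirichlet posteriors over next states and known rewards, and it uses a simple count-based bonus as a stand-in; the variance-based bonus derived in the paper is not reproduced here. All function names and parameters are hypothetical.

```python
def count_bonus(alpha_sa, beta=1.0):
    """Stand-in bonus that shrinks as the posterior for (s, a) concentrates.

    This is NOT the paper's variance-based bonus; it is only a placeholder.
    """
    return beta / (1.0 + sum(alpha_sa))


def no_bonus(alpha_sa):
    """Plain mean-MDP planning with no exploration incentive."""
    return 0.0


def plan_mean_mdp_with_bonus(alpha, rewards, bonus, gamma, sweeps=500):
    """Value iteration on the posterior-mean MDP with an added reward bonus.

    alpha[s][a][t] -- Dirichlet posterior parameters for transition (s, a) -> t
    rewards[s][a]  -- reward for (s, a), assumed known
    bonus          -- function of the Dirichlet parameters of one (s, a) pair
    """
    n_states = len(alpha)
    n_actions = len(alpha[0])
    values = [0.0] * n_states
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for a in range(n_actions):
                total = sum(alpha[s][a])
                # The posterior-mean transition probability is alpha / total.
                expected_next = sum(
                    alpha[s][a][t] / total * values[t] for t in range(n_states)
                )
                q[s][a] = rewards[s][a] + bonus(alpha[s][a]) + gamma * expected_next
            values[s] = max(q[s])
    return q
```

Swapping `no_bonus` for `count_bonus` on the same posterior shows how a bonus can change the greedy action in favor of a pair the agent has seen less often.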
Real Time Targeted Exploration in Large Domains
Cited by 2 (1 self)
A developing agent needs to explore to learn about the world and learn good behaviors. In many real-world tasks, this exploration can take far too long, and the agent must make decisions about which states to explore, and which states not to explore. Bayesian methods attempt to address this problem, but take too much computation time to run in reasonably sized domains. In this paper, we present TEXPLORE, the first algorithm to perform targeted exploration in real time in large domains. The algorithm learns multiple possible models of the domain that generalize action effects across states. We experiment with possible ways of adding intrinsic motivation to the agent to drive exploration. TEXPLORE is fully implemented and tested in a novel domain called Fuel World that is designed to reflect the type of targeted exploration needed in the real world. We show that our algorithm significantly outperforms representative examples of both model-free and model-based RL algorithms from the literature and is able to quickly learn to perform well in a large world in real time.
A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
Cited by 1 (1 self)
Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: reinforcement learning, Bayesian inference, partially observable Markov decision processes
Robust Bayesian reinforcement learning through tight lower bounds
Cited by 1 (1 self)
In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them were particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to the utility of the optimal stationary policy for the decision problem, which is generally different from both the Bayes-optimal policy and the policy which is optimal for the mean MDP. We then show how these can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting.
PAC-MDP Reinforcement Learning with Bayesian Priors
In an effort to build on recent advances in reinforcement learning and Bayesian modeling, this work (Asmuth et al., 2009) combines ideas from two lines of research on exploration in reinforcement learning or RL (Sutton & Barto, 1998). Bayesian RL research (Dearden et al., 1999; Poupart et al., 2006) formulates the RL problem as decision making in the belief space of all possible environment models. As such, it becomes meaningful to talk about optimal RL: selecting actions that maximize the expected long-term reward given the uncertainty in the model. Although progress has been made in approximating optimal policies in model belief space, these techniques have not been shown to scale well and come with no finite-sample guarantees on the quality of the derived policies. PAC-MDP RL approaches (Fiechter, 1994; Kearns &
A Practical and Conceptual Framework for Learning in Control, 2010
We propose a fully Bayesian approach for efficient reinforcement learning (RL) in Markov decision processes with continuous-valued state and action spaces when no expert knowledge is available. Our framework is based on well-established ideas from statistics and machine learning and learns fast since it carefully models, quantifies, and incorporates available knowledge when making decisions. The key ingredient of our framework is a probabilistic model, which is implemented using a Gaussian process (GP), a distribution over functions. In the context of dynamic systems, the GP models the transition function. By considering all plausible transition functions simultaneously, we reduce model bias, a problem that frequently occurs when deterministic models are used. Due to its generality and efficiency, our RL framework can be considered a conceptual and practical approach to learning models and controllers when
Research Experience Research Scientist, 2001
Actively pursuing research into structured dynamical systems modeling with Bayesian nonparametrics, planning and model building for reinforcement learning, structured policy priors for policy learning, and universal inference for probabilistic programming languages. Current applied thrusts include reinforcement learning for multicore systems, machine learning for oil discovery, and generative models of machine vision. Contributed to funding efforts for AFOSR and Shell Oil.

