Results 1–10 of 38
A survey of Monte Carlo tree search methods
 IEEE Transactions on Computational Intelligence and AI, 2012
Cited by 104 (18 self)
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
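The combination of tree-search precision and random-sampling generality described above is usually realised through the UCB1 selection rule of UCT, the most common MCTS variant. The sketch below is a toy illustration, not code from the survey; the one-level bandit setup and all function names are hypothetical.

```python
import math
import random

def ucb1(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: empirical mean plus an exploration bonus that
    shrinks as the child is visited more often (the selection step)."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def uct_bandit(arm_probs, iterations, seed=0):
    """Run the selection/simulation/backpropagation cycle on a
    one-level tree (a bandit); returns visit counts per arm."""
    rng = random.Random(seed)
    value = [0.0] * len(arm_probs)
    visits = [0] * len(arm_probs)
    for t in range(1, iterations + 1):
        # Selection: child with the highest UCB1 score.
        arm = max(range(len(arm_probs)),
                  key=lambda a: ucb1(value[a], visits[a], t))
        # Simulation: sample a random (here Bernoulli) playout reward.
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        # Backpropagation: update the visited node's statistics.
        value[arm] += reward
        visits[arm] += 1
    return visits

visits = uct_bandit([0.9, 0.1], 500)
```

With enough iterations the visit counts concentrate on the better arm while the logarithmic bonus keeps the worse arm from being abandoned entirely.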
Efficient Bayes-adaptive reinforcement learning using sample-based search
 In Neural Information Processing Systems, 2012
Cited by 17 (2 self)
Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty. In this setting, a Bayes-optimal policy captures the ideal trade-off between exploration and exploitation. Unfortunately, finding Bayes-optimal policies is notoriously taxing due to the enormous search space in the augmented belief-state MDP. In this paper we exploit recent advances in sample-based planning, based on Monte-Carlo tree search, to introduce a tractable method for approximate Bayes-optimal planning. Unlike prior work in this area, we avoid expensive applications of Bayes rule within the search tree, by lazily sampling models from the current beliefs. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems.
Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search
Cited by 11 (1 self)
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
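The belief-space MDP referred to above augments each environment state with the agent's posterior over models. Under independent Dirichlet priors on the transition function — a standard modelling assumption, not something this particular paper prescribes — the belief is just a table of visit counts and the Bayes update is a count increment. A minimal sketch (all names hypothetical):

```python
from collections import defaultdict

class DirichletBelief:
    """Belief over an unknown MDP's transition function: an independent
    Dirichlet distribution per (state, action), stored as counts.
    The pair (state, belief) is the state of the belief-space MDP."""

    def __init__(self, prior=1.0):
        self.prior = prior                      # symmetric Dirichlet prior
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, s, a, s_next):
        # Bayes' rule for a Dirichlet prior reduces to a count increment.
        self.counts[(s, a)][s_next] += 1.0

    def posterior(self, s, a, states):
        """Posterior-mean transition probabilities P(s' | s, a)."""
        c = self.counts[(s, a)]
        total = sum(c.values()) + self.prior * len(states)
        return {sp: (c.get(sp, 0.0) + self.prior) / total for sp in states}

b = DirichletBelief()
for _ in range(8):
    b.update("s0", "a", "s1")
b.update("s0", "a", "s2")
post = b.posterior("s0", "a", ["s1", "s2"])
```

The exponential blow-up the abstract mentions comes from the fact that each distinct count table is a distinct belief state.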
RTMBA: A Real-Time Model-Based Reinforcement Learning Architecture for Robot Control
Cited by 11 (3 self)
Reinforcement Learning (RL) is a paradigm for learning decision-making tasks that could enable robots to learn and adapt to their situation online. For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-time. Existing model-based RL methods learn in relatively few samples, but typically take too much time between each action for practical online learning. In this paper, we present a novel parallel architecture for model-based RL that runs in real-time by 1) taking advantage of sample-based approximate planning methods and 2) parallelizing the acting, model learning, and planning processes in a novel way such that the acting process is sufficiently fast for typical robot control cycles. We demonstrate that algorithms using this architecture perform nearly as well as methods using the typical sequential architecture when both are given unlimited time, and greatly outperform these methods on tasks that require real-time actions such as controlling an autonomous vehicle.
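The acting/planning split described in the abstract can be caricatured with two threads sharing a policy under a lock: the planner refines in the background while the actor responds every control cycle with whatever the current policy recommends. This is only a toy sketch — the model-learning thread and any real planner are omitted, and every name is hypothetical.

```python
import threading
import time

class ParallelAgent:
    """Toy acting/planning split: a background planner updates a shared
    recommendation while the actor keeps a fixed control cycle."""

    def __init__(self):
        self.lock = threading.Lock()
        self.best_action = 0     # latest recommendation from the planner
        self.plan_updates = 0
        self.steps_taken = 0
        self.running = True

    def planner(self):
        while self.running:
            with self.lock:      # refine the shared plan in the background
                self.best_action += 1
                self.plan_updates += 1
            time.sleep(0.001)

    def actor(self, cycles):
        for _ in range(cycles):
            with self.lock:      # act immediately on the current plan,
                _action = self.best_action  # never blocking on planning
            self.steps_taken += 1
            time.sleep(0.002)    # fixed control cycle
        self.running = False

agent = ParallelAgent()
t = threading.Thread(target=agent.planner)
t.start()
agent.actor(cycles=20)
t.join()
```

The point of the design is visible even in the toy: the actor's cycle time is bounded by a lock acquisition, not by how long planning takes.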
Action selection for MDPs: Anytime AO* vs. UCT
 In AAAI
Cited by 8 (3 self)
In the presence of non-admissible heuristics, A* and other best-first algorithms can be converted into anytime optimal algorithms over OR graphs, by simply continuing the search after the first solution is found. The same trick, however, does not work for best-first algorithms over AND/OR graphs, which must be able to expand leaf nodes of the explicit graph that are not necessarily part of the best partial solution. Anytime optimal variants of AO* must thus address an exploration-exploitation trade-off: they cannot just “exploit”; they must keep exploring as well. In this work, we develop one such variant of AO* and apply it to finite-horizon MDPs. This Anytime AO* algorithm eventually delivers an optimal policy while using non-admissible random heuristics that can be sampled, as when the heuristic is the cost of a base policy that can be sampled with rollouts. We then test Anytime AO* for action selection over large infinite-horizon MDPs that cannot be solved with existing offline heuristic search and dynamic programming algorithms, and compare it with UCT.
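The "simple trick" for OR graphs — continuing best-first search past the first solution — can be sketched as below. This is a toy illustration under a deliberately misleading (non-admissible) heuristic on a small DAG, not the paper's Anytime AO* algorithm; graph and names are hypothetical.

```python
import heapq

def anytime_best_first(graph, start, goal, h):
    """Best-first search over an OR graph (a DAG here) with a possibly
    non-admissible heuristic h: rather than stopping at the first goal,
    keep expanding and record every improving solution cost."""
    solutions = []
    best = float("inf")           # cost of the incumbent solution
    frontier = [(h(start), 0, start)]
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if g >= best:
            continue              # cannot improve on the incumbent
        if node == goal:
            best = g              # improving solution: record it, go on
            solutions.append(g)
            continue
        for succ, cost in graph.get(node, []):
            heapq.heappush(frontier, (g + cost + h(succ), g + cost, succ))
    return solutions

# Two paths s->a->g (cost 11) and s->b->g (cost 6); h overestimates
# through b, so the suboptimal path is found first, then improved on.
graph = {"s": [("a", 10), ("b", 1)], "a": [("g", 1)], "b": [("g", 5)]}
h = {"s": 0.0, "a": 0.0, "b": 50.0, "g": 0.0}
sols = anytime_best_first(graph, "s", "g", lambda n: h[n])
```

The last recorded cost is optimal, which is exactly the anytime behaviour the abstract says fails to transfer directly to AND/OR graphs.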
Efficient learning of relational models for sequential decision making
 2010
Cited by 8 (1 self)
The exploration-exploitation trade-off is crucial to reinforcement-learning (RL) agents, and a significant number of sample complexity results have been derived for agents in propositional domains. These results guarantee, with high probability, near-optimal behavior in all but a polynomial number of timesteps in the agent’s lifetime. In this work, we prove similar results for certain relational representations, primarily a class we call “relational action schemas”. These generalized models allow us to specify state transitions in a compact form, for instance describing the effect of picking up a generic block instead of picking up 10 different specific blocks. We present theoretical results on crucial subproblems in action-schema learning using the KWIK framework, which allows us to characterize the sample efficiency of an agent learning these models in a reinforcement-learning setting. These results are extended in an apprenticeship learning paradigm where an agent has access not only to its environment, but also to a teacher that can demonstrate traces of state/action/state sequences. We show that the class of action schemas that are efficiently learnable in this paradigm is strictly larger than those learnable in the online setting.
Action Selection for MDPs: Anytime AO* Versus UCT
 Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012
Cited by 7 (0 self)
In the presence of non-admissible heuristics, A* and other best-first algorithms can be converted into anytime optimal algorithms over OR graphs, by simply continuing the search after the first solution is found. The same trick, however, does not work for best-first algorithms over AND/OR graphs, which must be able to expand leaf nodes of the explicit graph that are not necessarily part of the best partial solution. Anytime optimal variants of AO* must thus address an exploration-exploitation trade-off: they cannot just “exploit”; they must keep exploring as well. In this work, we develop one such variant of AO* and apply it to finite-horizon MDPs. This Anytime AO* algorithm eventually delivers an optimal policy while using non-admissible random heuristics that can be sampled, as when the heuristic is the cost of a base policy that can be sampled with rollouts. We then test Anytime AO* for action selection over large infinite-horizon MDPs that cannot be solved with existing offline heuristic search and dynamic programming algorithms, and compare it with UCT.
Simple regret optimization in online planning for Markov decision processes
 CoRR, 2012
Cited by 4 (1 self)
We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. Formally, the performance of algorithms for online planning is assessed in terms of simple regret, the agent’s expected performance loss when the chosen action, rather than an optimal one, is followed. To date, state-of-the-art algorithms for online planning in general MDPs are either best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate and smooth reduction of simple regret. At a high level, BRUE is based on a simple yet nonstandard state-space sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different exploratory objectives. We further extend BRUE with a variant of “learning by forgetting”. The resulting parametrized algorithm, BRUE(α), exhibits even more attractive formal guarantees than BRUE. Our empirical evaluation shows that both BRUE and its generalization, BRUE(α), are also very effective in practice and compare favorably to the state-of-the-art.
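Simple regret, the evaluation criterion defined above, is just the value gap between an optimal action and the one the planner recommends at the current state. A minimal sketch (the Q-values and names are hypothetical):

```python
def simple_regret(q_values, chosen_action):
    """Simple regret: expected loss from following the recommended
    action once, rather than an optimal one, from the current state."""
    return max(q_values.values()) - q_values[chosen_action]

q = {"left": 1.0, "right": 0.75}
```

Note the contrast with cumulative regret: only the quality of the final recommendation matters, not the rewards accrued while deliberating, which is why sampling schemes tuned for cumulative regret (such as plain UCB) are not ideal here.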
Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search
 Journal of Artificial Intelligence Research, 2013
Cited by 4 (2 self)
Bayesian planning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, planning optimally in the face of uncertainty is notoriously taxing, since the search space is enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach avoids expensive applications of Bayes rule within the search tree by sampling models from current beliefs, and furthermore performs this sampling in a lazy manner. This enables it to outperform previous Bayesian model-based reinforcement learning algorithms by a significant margin on several well-known benchmark problems. As we show, our approach can even work in problems with an infinite state space that lie qualitatively out of reach of almost all previous work in Bayesian exploration.
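The lazy sampling the abstract refers to draws one model from the posterior per simulation and commits to it for the whole rollout, so no Bayes-rule application is needed inside the tree. A toy sketch with a two-model posterior — an illustration of the root-sampling idea only, not the authors' algorithm; all names are hypothetical:

```python
import random

def root_sampling_value(sample_model, rollout, n_sims, seed=0):
    """Root-sampling sketch: draw ONE model from the posterior per
    simulation and use it throughout that simulation, instead of
    re-applying Bayes' rule at every node of the search tree."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        model = sample_model(rng)  # lazy: one posterior draw per simulation
        total += rollout(model, rng)
    return total / n_sims

# Toy posterior: the success probability is either 0.2 or 0.8,
# believed equally likely; a rollout is a single Bernoulli reward.
sample_model = lambda rng: rng.choice([0.2, 0.8])
rollout = lambda p, rng: 1.0 if rng.random() < p else 0.0
estimate = root_sampling_value(sample_model, rollout, 2000)
```

Averaging over many simulations marginalises over the posterior, so the estimate converges to the Bayes-expected value (0.5 in this toy) without any explicit belief update during search.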
Efficient planning in R-max
 In Proc. of AAMAS, 2011
Cited by 3 (1 self)
PAC-MDP algorithms are particularly efficient in terms of the number of samples from the environment that the learning agent needs in order to achieve near-optimal performance. These algorithms, however, execute a time-consuming planning step after each new state-action pair becomes known to the agent, that is, after the pair has been sampled sufficiently many times to be considered known by the algorithm. This fact is a serious limitation on broader application of this kind of algorithm. This paper examines the planning problem in PAC-MDP learning. Value iteration, prioritized sweeping, and backward value iteration are investigated. By exploiting the specific nature of the planning problem in the considered reinforcement learning algorithms, we show how these planning algorithms can be improved. Our extensions yield significant improvements in all evaluated algorithms, and in standard value iteration in particular. Theoretical justification is provided for all contributions, and all approaches are further evaluated empirically. With our extensions, we managed to solve problems of sizes that have never been approached by PAC-MDP learning in the existing literature.
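The planning step these extensions accelerate is, in its plainest form, value iteration: repeated Bellman optimality backups until the value function stops changing. A minimal tabular sketch on a hypothetical two-state chain (the MDP and all names are illustrative, not from the paper):

```python
def value_iteration(states, actions, P, R, gamma=0.95, tol=1e-6):
    """Plain value iteration with in-place (Gauss-Seidel) backups:
    the planning routine a PAC-MDP agent re-runs as its learned
    model grows."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup at state s.
            best = max(
                R[(s, a)] + gamma * sum(p * V[sp] for sp, p in P[(s, a)].items())
                for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Two-state toy chain: acting from s1 earns 1 and stays there;
# from s0, "go" reaches s1 for free and "stay" loops with no reward.
states, actions = ["s0", "s1"], ["stay", "go"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 1.0}
V = value_iteration(states, actions, P, R)
```

On this chain the fixed point is V(s1) = 1/(1-γ) = 20 and V(s0) = γ·V(s1) = 19; prioritized sweeping and backward value iteration, the alternatives the paper studies, reorder these same backups to converge in fewer of them.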