Results 1 - 10 of 445
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design
"... Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multiarmed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regre ..."
Abstract
-
Cited by 125 (13 self)
- Add to MetaCart
(Show Context)
Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence-based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.
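As a rough illustration of the selection rule described above, the sketch below runs a GP-UCB-style loop over a discrete candidate set, picking the point that maximizes the posterior mean plus a confidence-scaled posterior standard deviation. The toy objective, the RBF kernel, and the beta_t schedule are assumptions made for the example, not choices taken from the paper.

    # Rough sketch of a GP-UCB-style loop on a discrete 1-D candidate set.
    # The toy objective, kernel, and beta_t schedule are illustrative assumptions.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def objective(x):  # stand-in for the unknown, noisy, expensive function
        return -np.sin(3 * x) - x ** 2 + 0.7 * x + 0.05 * np.random.randn()

    candidates = np.linspace(-1.0, 2.0, 200).reshape(-1, 1)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=0.05 ** 2)

    X, y = [], []
    for t in range(1, 31):
        if X:
            gp.fit(np.array(X), np.array(y))
            mu, sigma = gp.predict(candidates, return_std=True)
        else:  # no observations yet: the posterior is just the prior
            mu, sigma = np.zeros(len(candidates)), np.ones(len(candidates))
        beta_t = 2.0 * np.log(len(candidates) * t ** 2)  # illustrative schedule
        ucb = mu + np.sqrt(beta_t) * sigma  # upper confidence bound per candidate
        x_next = candidates[int(np.argmax(ucb))]  # pick the most optimistic point
        X.append(x_next)
        y.append(objective(x_next[0]))

    print("best observed x:", X[int(np.argmax(y))][0])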
Monte-Carlo planning in large POMDPs
In Proc. Neural Inf. Process. Syst., 2010
"... Abstract This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent's belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte ..."
Abstract
-
Cited by 112 (8 self)
- Add to MetaCart
(Show Context)
This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent's belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.
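The two ingredients the abstract names, Monte-Carlo belief updates and planning from a black-box simulator, can be illustrated roughly as follows. The sketch uses flat Monte-Carlo action evaluation rather than POMCP's full UCT-style tree search, and the simulator interface (simulate, ACTIONS) and its toy dynamics are hypothetical stand-ins.

    # Rough sketch: particle-belief planning using only a black-box simulator.
    # Flat Monte-Carlo action evaluation, a simplification of full POMCP (no tree).
    # The simulator (simulate, ACTIONS) and its toy dynamics are hypothetical stand-ins.
    import random

    ACTIONS = ["left", "right", "listen"]

    def simulate(state, action):
        # Black-box generative model: (state, action) -> (next_state, observation, reward).
        next_state = state if action == "listen" else random.choice([0, 1])
        observation = next_state if random.random() < 0.85 else 1 - next_state
        reward = 1.0 if (action != "listen" and next_state == state) else -0.1
        return next_state, observation, reward

    def update_belief(particles, action, observation, n=200):
        # Monte-Carlo belief update: keep simulated successors whose sampled
        # observation matches the real one (rejection sampling).
        new_particles = []
        while len(new_particles) < n:
            s2, o, _ = simulate(random.choice(particles), action)
            if o == observation:
                new_particles.append(s2)
        return new_particles

    def rollout(state, depth):
        # Random playout of the given depth, returning the accumulated reward.
        if depth == 0:
            return 0.0
        s2, _, r = simulate(state, random.choice(ACTIONS))
        return r + rollout(s2, depth - 1)

    def plan(particles, depth=3, rollouts=100):
        # Pick the action with the best average simulated return over the belief.
        best_a, best_v = None, float("-inf")
        for a in ACTIONS:
            total = 0.0
            for _ in range(rollouts):
                s2, _, r = simulate(random.choice(particles), a)
                total += r + rollout(s2, depth - 1)
            if total / rollouts > best_v:
                best_a, best_v = a, total / rollouts
        return best_a

    belief = [random.choice([0, 1]) for _ in range(200)]  # initial particle belief
    action = plan(belief)
    belief = update_belief(belief, action, observation=1)  # after acting and observing 1
    print("planned action:", action)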
A survey of Monte Carlo tree search methods
IEEE Transactions on Computational Intelligence and AI, 2012
"... Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a ra ..."
Abstract
-
Cited by 104 (18 self)
- Add to MetaCart
(Show Context)
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
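For readers unfamiliar with the core algorithm the survey covers, a generic sketch of the four MCTS phases (selection, expansion, simulation, backpropagation) with UCB1-based selection, as in UCT, is given below. The game interface (legal_moves, play, is_terminal, result) is a hypothetical placeholder and is not drawn from the survey.

    # Rough sketch of the generic MCTS loop: selection, expansion, simulation,
    # backpropagation, with UCB1-based selection as in UCT. The game interface
    # (legal_moves, play, is_terminal, result) is a hypothetical placeholder.
    import math
    import random

    class Node:
        def __init__(self, state, parent=None, move=None):
            self.state, self.parent, self.move = state, parent, move
            self.children, self.visits, self.value = [], 0, 0.0
            self.untried = list(state.legal_moves())  # assumed [] at terminal states

        def ucb1_child(self, c=1.4):
            return max(self.children, key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(self.visits) / ch.visits))

    def mcts(root_state, iterations=1000):
        root = Node(root_state)
        for _ in range(iterations):
            node = root
            # 1. Selection: descend through fully expanded nodes via UCB1.
            while not node.untried and node.children:
                node = node.ucb1_child()
            # 2. Expansion: add one child for an untried move, if any.
            if node.untried:
                move = node.untried.pop(random.randrange(len(node.untried)))
                child = Node(node.state.play(move), parent=node, move=move)
                node.children.append(child)
                node = child
            # 3. Simulation: random playout to a terminal state.
            state = node.state
            while not state.is_terminal():
                state = state.play(random.choice(state.legal_moves()))
            reward = state.result()  # for two-player games, negate at alternating plies
            # 4. Backpropagation: update statistics back up to the root.
            while node is not None:
                node.visits += 1
                node.value += reward
                node = node.parent
        return max(root.children, key=lambda ch: ch.visits).move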
Progressive Strategies for Monte-Carlo Tree Search
"... Two-person zero-sum games with perfect information have been addressed by many AI researchers with great success for fifty years [van den Herik et ..."
Abstract
-
Cited by 98 (29 self)
- Add to MetaCart
Two-person zero-sum games with perfect information have been addressed by many AI researchers with great success for fifty years [van den Herik et
Modification of UCT with Patterns in Monte-Carlo Go
2006
Research report.
Simulation-Based Approach to General Game Playing
In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008
"... The aim of General Game Playing (GGP) is to create intelligent agents that automatically learn how to play many different games at an expert level without any human intervention. The most successful GGP agents in the past have used traditional game-tree search combined with an automatically learned ..."
Abstract
-
Cited by 84 (6 self)
- Add to MetaCart
(Show Context)
The aim of General Game Playing (GGP) is to create intelligent agents that automatically learn how to play many different games at an expert level without any human intervention. The most successful GGP agents in the past have used traditional game-tree search combined with an automatically learned heuristic function for evaluating game states. In this paper we describe a GGP agent that instead uses a Monte Carlo/UCT simulation technique for action selection, an approach recently popularized in computer Go. Our GGP agent has proven its effectiveness by winning last year’s AAAI GGP Competition. Furthermore, we introduce and empirically evaluate a new scheme for automatically learning search-control knowledge for guiding the simulation playouts, showing that it offers significant benefits for a variety of games.
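One way to picture "search-control knowledge for guiding the simulation playouts" is to bias playout moves by a softmax over running average returns per move; the sketch below shows that generic idea only and is not claimed to be the paper's exact learning scheme. The game interface, the temperature parameter, and the assumption that moves are hashable are all illustrative.

    # Rough sketch: bias playout moves by a softmax over running average returns per
    # move. Illustrative only; not claimed to be the paper's exact learning scheme.
    # The game interface is a hypothetical stand-in and moves are assumed hashable.
    import math
    import random
    from collections import defaultdict

    move_sum = defaultdict(float)   # cumulative return observed after each move
    move_count = defaultdict(int)

    def biased_playout(state, tau=1.0):
        # Play to the end, choosing moves with probability proportional to
        # exp(Q(move) / tau), where Q is the running average return of that move.
        trajectory = []
        while not state.is_terminal():
            moves = state.legal_moves()
            q = [move_sum[m] / move_count[m] if move_count[m] else 0.0 for m in moves]
            weights = [math.exp(v / tau) for v in q]
            move = random.choices(moves, weights=weights, k=1)[0]
            trajectory.append(move)
            state = state.play(move)
        reward = state.result()
        for m in trajectory:        # update the move statistics from this playout
            move_sum[m] += reward
            move_count[m] += 1
        return reward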
Pure exploration in multi-armed bandits problems
In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory (ALT 2009), 2009
"... We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that explore sequentially the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simpl ..."
Abstract
-
Cited by 80 (13 self)
- Add to MetaCart
(Show Context)
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that sequentially explore the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simple regrets. The latter are related to the (expected) gains of the decisions that the strategies would recommend for a new one-shot instance of the same multi-armed bandit problem. Here, exploration is constrained only by the number of available rounds (not necessarily known in advance), in contrast to the case of cumulative regrets, where exploitation must be performed at the same time. We start by indicating the links between simple and cumulative regrets: a small cumulative regret entails a small simple regret, but too small a cumulative regret prevents the simple regret from decreasing exponentially towards zero, its optimal distribution-dependent rate. We therefore introduce specific strategies, for which we prove both distribution-dependent and distribution-free bounds. A concluding experimental study puts these theoretical bounds in perspective and shows the value of non-uniform exploration of the arms.
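The distinction between simple and cumulative regret can be made concrete with a toy experiment: explore Bernoulli arms for a fixed budget, recommend one arm, and measure the gap between the best mean and the recommended arm's mean. The arm means, the horizon, and the two strategies compared (round-robin exploration versus a UCB1-style rule aimed at cumulative regret) are illustrative assumptions, not the strategies analyzed in the paper.

    # Rough sketch: simple regret is the gap between the best arm's mean and the
    # mean of the arm recommended after n exploration rounds. The Bernoulli arm
    # means, the horizon, and the two strategies are illustrative assumptions.
    import math
    import random

    MEANS = [0.5, 0.45, 0.6, 0.4]
    BEST = max(MEANS)

    def pull(arm):
        return 1.0 if random.random() < MEANS[arm] else 0.0

    def uniform_then_recommend(n):
        # Round-robin exploration, then recommend the empirically best arm.
        counts, sums = [0] * len(MEANS), [0.0] * len(MEANS)
        for t in range(n):
            a = t % len(MEANS)
            counts[a] += 1
            sums[a] += pull(a)
        rec = max(range(len(MEANS)), key=lambda a: sums[a] / counts[a])
        return BEST - MEANS[rec]

    def ucb1_then_recommend(n):
        # UCB1 (designed for cumulative regret), recommending the most-pulled arm.
        counts, sums = [1] * len(MEANS), [pull(a) for a in range(len(MEANS))]
        for t in range(len(MEANS), n):
            a = max(range(len(MEANS)), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
            counts[a] += 1
            sums[a] += pull(a)
        rec = max(range(len(MEANS)), key=lambda a: counts[a])
        return BEST - MEANS[rec]

    trials = 200
    print("uniform:", sum(uniform_then_recommend(400) for _ in range(trials)) / trials)
    print("ucb1:   ", sum(ucb1_then_recommend(400) for _ in range(trials)) / trials)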
Bandit Algorithms for Tree Search
2007
"... apport de recherche ISSN 0249-6399 ISRN INRIA/RR--6141--FR+ENG ..."
Abstract
-
Cited by 71 (14 self)
- Add to MetaCart
(Show Context)
Research report, ISSN 0249-6399, ISRN INRIA/RR--6141--FR+ENG.
Knows What It Knows: A Framework For Self-Aware Learning
"... We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is ..."
Abstract
-
Cited by 71 (20 self)
- Add to MetaCart
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and open problems.
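The "knows what it knows" requirement, answer accurately or admit ignorance with a bounded number of admissions, is commonly illustrated with a memorization learner for a deterministic function over a finite input space; a minimal sketch is below. The class and method names are illustrative, not taken from the paper.

    # Rough sketch: the memorization KWIK learner for a deterministic function on a
    # finite input set. It either answers correctly or says "I don't know" (None),
    # and it says "I don't know" at most |input space| times. Names are illustrative.
    class MemorizationKWIK:
        def __init__(self):
            self.memory = {}              # input -> observed label

        def predict(self, x):
            # Return the known label, or None ("I don't know") for unseen inputs.
            return self.memory.get(x)     # never guesses, so never inaccurate

        def observe(self, x, label):
            # Called with the true label whenever the learner answered None.
            self.memory[x] = label

    # Usage against a hypothetical deterministic target function.
    target = {("wet", "cold"): 1, ("dry", "hot"): 0}
    learner = MemorizationKWIK()
    for x, y in list(target.items()) * 2:  # second pass triggers no "I don't know"
        pred = learner.predict(x)
        if pred is None:
            learner.observe(x, y)         # the environment reveals the label
        else:
            assert pred == y              # the KWIK accuracy requirement holds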