Results 1  10
of
71
Progressive Strategies for MonteCarlo Tree Search
"... Twoperson zerosum games with perfect information have been addressed by many AI researchers with great success for fifty years [van den Herik et ..."
Abstract

Cited by 98 (29 self)
 Add to MetaCart
Twoperson zerosum games with perfect information have been addressed by many AI researchers with great success for fifty years [van den Herik et
Pure exploration in multiarmed bandits problems
 IN PROCEEDINGS OF THE TWENTIETH INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY (ALT 2009
, 2009
"... We consider the framework of stochastic multiarmed bandit problems and study the possibilities and limitations of strategies that explore sequentially the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simpl ..."
Abstract

Cited by 80 (13 self)
 Add to MetaCart
(Show Context)
We consider the framework of stochastic multiarmed bandit problems and study the possibilities and limitations of strategies that explore sequentially the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simple regrets. The latter are related to the (expected) gains of the decisions that the strategies would recommend for a new oneshot instance of the same multiarmed bandit problem. Here, exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when cumulative regrets are considered and when exploitation needs to be performed at the same time. We start by indicating the links between simple and cumulative regrets. A small cumulative regret entails a small simple regret but too small a cumulative regret prevents the simple regret from decreasing exponentially towards zero, its optimal distributiondependent rate. We therefore introduce specific strategies, for which we prove both distributiondependent and distributionfree bounds. A concluding experimental study puts these theoretical bounds in perspective and shows the interest of nonuniform exploration of the arms.
Online optimization in Xarmed bandits
 In Advances in Neural Information Processing Systems 22
, 2008
"... We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space and the meanpayoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm ..."
Abstract

Cited by 49 (11 self)
 Add to MetaCart
(Show Context)
We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space and the meanpayoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the meanpayoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √ n, i.e., the rate of the growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm for the class of problems considered. 1 Introduction and
Improved algorithms for linear stochastic bandits
 In Advances in Neural Information Processing Systems
, 2011
"... We improve the theoretical analysis and empirical performance of algorithms for the stochastic multiarmed bandit problem and the linear stochastic multiarmed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability consta ..."
Abstract

Cited by 40 (1 self)
 Add to MetaCart
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multiarmed bandit problem and the linear stochastic multiarmed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vectorvalued martingales. 1
Integrating Samplebased Planning and Modelbased Reinforcement Learning
"... Recent advancements in modelbased reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes ..."
Abstract

Cited by 38 (5 self)
 Add to MetaCart
Recent advancements in modelbased reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these overmatched planners with a class of samplebased planners—whose computation time is independent of the number of states—without sacrificing the sampleefficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a samplebased planner to be used in such a learning system and analyze two popular samplebased approaches from the literature. We also introduce our own samplebased planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.
XArmed Bandits
, 2010
"... We consider a generalization of stochastic bandits where the set of arms, ..."
Abstract

Cited by 28 (7 self)
 Add to MetaCart
We consider a generalization of stochastic bandits where the set of arms,
Optimistic planning of deterministic systems
 EUROPEAN WORKSHOP ON REINFORCEMENT LEARNING, FRANCE
, 2008
"... If one possesses a model of a controlled deterministic system, then from any state, one may consider the set of all possible reachable states starting from that state and using any sequence of actions. This forms a tree whose size is exponential in the planning time horizon. Here we ask the questio ..."
Abstract

Cited by 28 (11 self)
 Add to MetaCart
(Show Context)
If one possesses a model of a controlled deterministic system, then from any state, one may consider the set of all possible reachable states starting from that state and using any sequence of actions. This forms a tree whose size is exponential in the planning time horizon. Here we ask the question: given finite computational resources (e.g. CPU time), which may not be known ahead of time, what is the best way to explore this tree, such that once all resources have been used, the algorithm would be able to propose an action (or a sequence of actions) whose performance is as close as possible to optimality? The performance with respect to optimality is assessed in terms of the regret (with respect to the sum of discounted future rewards) resulting from choosing the action returned by the algorithm instead of an optimal action. In this paper we investigate an optimistic exploration of the tree, where the most promising states are explored first, and compare this approach to a naive uniform exploration. Bounds on the regret are derived both for uniform and optimistic exploration strategies. Numerical simulations illustrate the benefit of optimistic planning.
Open Loop Optimistic Planning
"... We consider the problem of planning in a stochastic and discounted environment with a limited numerical budget. More precisely, we investigate strategies exploring the set of possible sequences of actions, so that, once all available numerical resources (e.g. CPU time, number of calls to a generativ ..."
Abstract

Cited by 22 (8 self)
 Add to MetaCart
We consider the problem of planning in a stochastic and discounted environment with a limited numerical budget. More precisely, we investigate strategies exploring the set of possible sequences of actions, so that, once all available numerical resources (e.g. CPU time, number of calls to a generative model) have been used, one returns a recommendation on the best possible immediate action to follow based on this exploration. The performance of a strategy is assessed in terms of its simple regret, that is the loss in performance resulting from choosing the recommended action instead of an optimal one. We first provide a minimax lower bound for this problem, and show that a uniform planning strategy matches this minimax rate (up to a logarithmic factor). Then we propose a UCB (Upper Confidence Bounds)based planning algorithm, called OLOP (OpenLoop Optimistic Planning), which is also minimax optimal, and prove that it enjoys much faster rates when there is a small proportion of nearoptimal sequences of actions. Finally, we compare our results with the regret bounds one can derive for our setting with bandits algorithms designed for an infinite number of arms. 1
THE PARALLELIZATION OF MONTECARLO PLANNING
 ICINCO, MADEIRA: PORTUGAL
, 2008
"... Since their impressive successes in various areas of largescale parallelization, recent techniques like UCT and other MonteCarlo planning variants (Kocsis and Szepesvari, 2006a) have been extensively studied (Coquelin and Munos, 2007; Wang and Gelly, 2007). We here propose and compare various form ..."
Abstract

Cited by 21 (11 self)
 Add to MetaCart
Since their impressive successes in various areas of largescale parallelization, recent techniques like UCT and other MonteCarlo planning variants (Kocsis and Szepesvari, 2006a) have been extensively studied (Coquelin and Munos, 2007; Wang and Gelly, 2007). We here propose and compare various forms of parallelization of banditbased treesearch, in particular for our computergo algorithm XYZ.
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
"... We consider a global optimization problem of a deterministic functionf in a semimetric space, given a finite budget ofnevaluations. The functionf is assumed to be locally smooth (around one of its global maxima) with respect to a semimetric ℓ. We describe two algorithms based on optimistic explorat ..."
Abstract

Cited by 20 (4 self)
 Add to MetaCart
We consider a global optimization problem of a deterministic functionf in a semimetric space, given a finite budget ofnevaluations. The functionf is assumed to be locally smooth (around one of its global maxima) with respect to a semimetric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finitesample performance bound in terms of a measure of the quantity of nearoptimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semimetric ℓ under which f is smooth, and whose performance is almost as good as DOO optimallyfitted. 1