Results 1-10 of 506
Finite-time analysis of the multiarmed bandit problem
Machine Learning, 2002
Cited by 804 (15 self)
Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy’s success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support. Keywords: bandit problems, adaptive allocation rules, finite horizon regret
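The abstract above does not spell out the "simple and efficient policies" it refers to. As a rough illustration only, the Python sketch below implements an upper-confidence-bound index rule in the style of the UCB1 policy associated with this paper: play every arm once, then always play the arm that maximises its empirical mean plus sqrt(2 ln n / n_j), where n is the number of plays so far and n_j the number of plays of arm j. The function names, the `pull` callback and the Bernoulli test arms are assumptions made for this example, not part of the listing.

```python
import math
import random


def ucb1(pull, n_arms, n_rounds):
    """Run a UCB1-style index policy.

    `pull(arm)` is a hypothetical callback that must return a reward in [0, 1].
    Returns the per-arm play counts and empirical mean rewards.
    """
    counts = [0] * n_arms    # number of times each arm was played
    means = [0.0] * n_arms   # empirical mean reward of each arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1      # initialisation: play every arm once
        else:
            plays_so_far = t - 1
            # Index = empirical mean + confidence radius sqrt(2 ln n / n_j).
            arm = max(
                range(n_arms),
                key=lambda j: means[j]
                + math.sqrt(2.0 * math.log(plays_so_far) / counts[j]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the running mean for the played arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def make_bernoulli_arms(probs, seed=0):
    """Illustrative environment: arm j pays 1 with probability probs[j], else 0."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < probs[arm] else 0.0


if __name__ == "__main__":
    counts, means = ucb1(make_bernoulli_arms([0.2, 0.5, 0.8]), 3, 2000)
    print(counts, [round(m, 3) for m in means])
```

In a run like this the arm with the highest success probability should end up with most of the plays, while the other arms are still sampled occasionally; that residual sampling is the exploration the regret bound accounts for.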
The Nonstochastic Multiarmed Bandit Problem
SIAM JOURNAL ON COMPUTING, 2002
Cited by 492 (34 self)
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the tradeoff between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/2})$. We show by a matching lower bound that this is the best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the strategy at the rate $O((\log N)^{1/2} T^{-1/2})$. Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate $O(T^{-1/2})$.
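The abstract above states rates but not the algorithm itself. The Python sketch below shows an exponential-weights scheme in the style of Exp3, the algorithm commonly associated with this paper: sample an arm from a mixture of the normalised weights and the uniform distribution, form an importance-weighted reward estimate for the played arm, and update that arm's weight multiplicatively. The `reward_fn` callback, the fixed `gamma`, and the weight rescaling (added only for numerical stability) are assumptions of this example.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Exp3-style play against an arbitrary reward sequence.

    `reward_fn(t, arm)` is a hypothetical callback returning a reward in [0, 1];
    it may depend on the round t in any way, i.e. no stochastic model is assumed.
    Returns the average per-round payoff and the last sampling distribution.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    probs = [1.0 / n_arms] * n_arms
    total = 0.0
    for t in range(n_rounds):
        w_sum = sum(weights)
        # Mix the weight distribution with uniform exploration of size gamma.
        probs = [(1.0 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total += reward
        # Importance-weighted estimate: unbiased for the played arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the largest weight is 1; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    average = total / n_rounds if n_rounds else 0.0
    return average, probs


if __name__ == "__main__":
    # Illustrative reward sequence: only arm 2 ever pays.
    avg, probs = exp3(lambda t, arm: 1.0 if arm == 2 else 0.0, 3, 2000)
    print(round(avg, 3), [round(p, 3) for p in probs])
```

Because every arm keeps probability at least gamma / K, the estimate stays bounded, which is what lets the analysis go through without any assumption on how the payoffs are generated.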
Bandit based Monte-Carlo Planning
In: ECML-06. Number 4212 in LNCS, 2006
Cited by 433 (7 self)
Abstract. For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
Gambling in a rigged casino: The adversarial multi-armed bandit problem
1995
Cited by 244 (7 self)
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the tradeoff between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/2})$, and we give an improved rate of conver...
Universal prediction
IEEE TRANSACTIONS ON INFORMATION THEORY, 1998
Cited by 186 (16 self)
This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the self-information loss function, which is directly related to the theory of universal data compression. Both the probabilistic setting and the deterministic setting of the universal prediction problem are described with emphasis on the analogy and the differences between results in the two settings.
Using Confidence Bounds for Exploitation-Exploration Trade-offs
Journal of Machine Learning Research, 2002
Cited by 177 (4 self)
We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process.
A Contextual-Bandit Approach to Personalized News Article Recommendation
Cited by 170 (16 self)
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are both fast in learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are threefold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data gets more scarce.
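The abstract above mentions offline evaluation on recorded random traffic without giving the procedure. The Python sketch below shows one plausible reading, a replay-style estimator: walk through the logged events, keep only those where the policy under test picks the same article that was actually served, and average the observed clicks over the kept events. It assumes the logged article was chosen uniformly at random and independently of the context; the event layout `(context, logged_arm, reward)` and the `policy(history, context)` signature are hypothetical.

```python
def replay_evaluate(policy, logged_events):
    """Replay-style offline estimate of a policy's per-trial reward.

    `logged_events` is an iterable of (context, logged_arm, reward) tuples,
    assumed to be recorded under uniformly random serving.
    `policy(history, context)` returns the arm the policy would serve, given
    the list of events it has been allowed to learn from so far.
    Returns (estimated reward per trial, number of matched events).
    """
    history = []        # events the policy is allowed to learn from
    matched = 0
    total_reward = 0.0
    for context, logged_arm, reward in logged_events:
        if policy(history, context) == logged_arm:
            # The log happens to contain feedback for the policy's choice,
            # so this event counts as one simulated trial.
            matched += 1
            total_reward += reward
            history.append((context, logged_arm, reward))
        # Otherwise the outcome of the policy's choice is unknown: skip.
    estimate = total_reward / matched if matched else 0.0
    return estimate, matched


if __name__ == "__main__":
    # Tiny made-up log; a policy that always serves article 1.
    log = [("u1", 0, 0.0), ("u2", 1, 1.0), ("u3", 1, 0.0),
           ("u4", 2, 1.0), ("u5", 1, 1.0)]
    print(replay_evaluate(lambda history, context: 1, log))
```

With K articles served uniformly at random, only about one logged event in K is retained, so an estimator of this kind needs a large log to produce a tight estimate.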
Nearly tight bounds for the continuum-armed bandit problem
Advances in Neural Information Processing Systems 17, 2005
Cited by 121 (7 self)
In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of $\mathbb{R}^d$, and the cost functions are continuous. In the $d = 1$ case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also consider the case where $d > 1$ and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem.
A survey of Monte Carlo tree search methods
IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI, 2012
Cited by 101 (17 self)
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.