Finite-time analysis of the multiarmed bandit problem, Machine Learning (2002)

by P Auer, N Cesa-Bianchi, P Fischer
Results 1 - 10 of 816

A Contextual-Bandit Approach to Personalized News Article Recommendation

by Lihong Li, Wei Chu, John Langford, Robert E. Schapire
"... Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamic ..."
Abstract - Cited by 178 (16 self) - Add to MetaCart
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are both fast in learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are three-fold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data gets more scarce.
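The replay-style offline evaluation described in this abstract is simple enough to sketch. Below is a minimal Python version, assuming the logged events were collected under a uniformly random policy as the paper requires; the `policy.select`/`policy.update` interface is a hypothetical stand-in, not the paper's API.

```python
# Minimal sketch of replay-based offline evaluation, assuming the log
# was collected under a uniformly random article-selection policy.
# `policy.select` / `policy.update` are hypothetical interface names.

def replay_evaluate(policy, logged_events):
    """logged_events: iterable of (context, shown_arm, reward) triples."""
    total_reward, matches = 0.0, 0
    for context, shown_arm, reward in logged_events:
        chosen = policy.select(context)
        if chosen == shown_arm:       # an event is usable only on a match
            total_reward += reward
            matches += 1
            policy.update(context, chosen, reward)
    return total_reward / max(matches, 1)   # estimated per-trial reward
```

Because the logging policy is uniform, retained events are an unbiased sample of what the candidate policy would have seen, which is why this simple filter yields a reliable estimate.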

Citation Context

...ing traffic. This strategy, with random exploration on an ε fraction of the traffic and greedy exploitation on the rest, is known as ε-greedy. Advanced exploration approaches such as EXP3 [8] or UCB1 [7] could be applied as well. Intuitively, we need to distribute more traffic to new content to learn its value more quickly, and fewer users to track temporal changes of existing content. Recently, pers...
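For concreteness, here is a minimal ε-greedy arm selector in Python matching the strategy described in the snippet (uniform random exploration with probability ε, greedy exploitation otherwise); the names are illustrative, not from either paper.

```python
import random

def epsilon_greedy(means, epsilon=0.1):
    """means[i] is the empirical mean reward of arm i."""
    if random.random() < epsilon:
        return random.randrange(len(means))              # explore uniformly
    return max(range(len(means)), key=means.__getitem__)  # exploit best arm
```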

Combining Online and Offline Knowledge in UCT

by Sylvain Gelly, David Silver - In Zoubin Ghahramani, editor, Proceedings of the International Conference on Machine Learning (ICML 2007), 2007
"... The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is ..."
Abstract - Cited by 140 (7 self) - Add to MetaCart
The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 × 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world’s strongest 9 × 9 Go program. Each technique significantly improves MoGo’s playing strength.

Citation Context

... estimated for each state and action in the tree by Monte-Carlo simulation. The policy used by UCT is designed to balance exploration with exploitation, based on the multi-armed bandit algorithm UCB (Auer et al., 2002). If all actions from the current state s are represented in the tree, ∀a ∈ A(s), (s, a) ∈ T, then UCT selects the action that maximises an upper confidence bound on the action value, Q⊕_UCT(s, a) = ...
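The selection rule quoted above is truncated at the equals sign, but the standard UCT form adds a UCB1-style bonus to each action value. A rough sketch, using generic dict-based node statistics rather than MoGo's actual data structures:

```python
import math

def uct_select(Q, N_sa, N_s, state, actions, c=1.0):
    """Pick the action maximising Q(s,a) + c * sqrt(ln N(s) / N(s,a)).

    Q, N_sa: dicts keyed by (state, action); N_s: dict keyed by state.
    Assumes every action has been tried at least once, so N_sa > 0.
    """
    return max(
        actions,
        key=lambda a: Q[(state, a)]
        + c * math.sqrt(math.log(N_s[state]) / N_sa[(state, a)]),
    )
```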

Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design

by Niranjan Srinivas, Andreas Krause, Matthias Seeger
"... Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multiarmed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regre ..."
Abstract - Cited by 125 (13 self) - Add to MetaCart
Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multiarmed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.

Citation Context

...arametric setting, which imply convergence rates for GP optimization. In particular, we analyze the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm, a simple and intuitive Bayesian method (Auer et al., 2002; Auer, 2002; Dani et al., 2008). While objectives are different in the multi-armed bandit and experimental design paradigm, our results draw a close technical connection between them: our regret boun...
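The GP-UCB rule itself is one line once a GP posterior is available: pick the point maximising μ(x) + √β · σ(x). A minimal sketch over a finite candidate set, using scikit-learn's GaussianProcessRegressor as a stand-in posterior (the paper does not prescribe a library, and β is set here by hand rather than by the paper's theoretical schedule):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_ucb_step(gp: GaussianProcessRegressor, candidates, beta=2.0):
    """Return the candidate maximising mu(x) + sqrt(beta) * sigma(x).

    candidates: 2D array of shape (n_points, n_dims).
    """
    mu, sigma = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]

# usage sketch: refit on observations so far, then query the next point
# gp = GaussianProcessRegressor().fit(X_observed, y_observed)
# x_next = gp_ucb_step(gp, candidate_grid)
```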

Nearly tight bounds for the continuum-armed bandit problem

by Robert Kleinberg - Advances in Neural Information Processing Systems 17, 2005
"... In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when th ..."
Abstract - Cited by 118 (8 self) - Add to MetaCart
In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of R^d, and the cost functions are continuous. In the d = 1 case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also consider the case where d > 1 and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem.

Citation Context

...ne. The inner loop requires a subroutine MAB which should implement a finite-armed bandit algorithm appropriate for the cost model under consideration. For example, MAB could be the algorithm UCB1 of [2] in the random case, or the algorithm Exp3 of [3] in the adversarial case. The semantics of MAB are as follows: it is initialized with a finite set of strategies; subsequently it recommends strategies...

Monte-Carlo planning in large POMDPs

by David Silver, Joel Veness - In Proc. Neural Inf. Process. Syst., 2010
"... Abstract This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent's belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte ..."
Abstract - Cited by 112 (8 self) - Add to MetaCart
This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent's belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.

Citation Context

...e second stage. The UCT algorithm [8] improves the greedy action selection in MCTS. Each state of the search tree is viewed as a multi-armed bandit, and actions are chosen by using the UCB1 algorithm [1]. The value of an action is augmented by an exploration bonus that is highest for rarely tried actions, Q⊕(s, a) = Q(s, a) + c √(log N(s) / N(s, a)). The scalar constant c determines the relative ratio ...

Evolutionary function approximation for reinforcement learning

by Shimon Whiteson - Journal of Machine Learning Research, 2006
"... Ø�ÓÒ�ÔÔÖÓÜ�Ñ�Ø�ÓÒ�ÒÓÚ�Ð�ÔÔÖÓ��ØÓ�ÙØÓÑ�Ø��ÐÐÝ× � Ø�ÓÒ�Ð���×�ÓÒ×Ì��ר��×�×�ÒÚ�ר���Ø�×�ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒ �Ò�ÓÖ�Ñ�ÒØÐ��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ö�Ø��×Ù�×�ØÓ�Ø��×�Ø�×� × ÁÒÑ�ÒÝÑ���Ò�Ð��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ò���ÒØÑÙרÐ��ÖÒ Ñ�ÒØ���Òר�ÒØ��Ø�ÓÒÓ��ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒØ�ÓÒ�ÔÔÖÓÜ�Ñ � Ù�Ðר��Ø�Ö���ØØ�Ö��Ð�ØÓÐ��ÖÒÁÔÖ�×�ÒØ��ÙÐÐÝ�ÑÔÐ � Ø�Ó ..."
Abstract - Cited by 110 (17 self) - Add to MetaCart
Ø�ÓÒ�ÔÔÖÓÜ�Ñ�Ø�ÓÒ�ÒÓÚ�Ð�ÔÔÖÓ��ØÓ�ÙØÓÑ�Ø��ÐÐÝ× � Ø�ÓÒ�Ð���×�ÓÒ×Ì��ר��×�×�ÒÚ�ר���Ø�×�ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒ �Ò�ÓÖ�Ñ�ÒØÐ��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ö�Ø��×Ù�×�ØÓ�Ø��×�Ø�×� × ÁÒÑ�ÒÝÑ���Ò�Ð��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ò���ÒØÑÙרÐ��ÖÒ Ñ�ÒØ���Òר�ÒØ��Ø�ÓÒÓ��ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒØ�ÓÒ�ÔÔÖÓÜ�Ñ � Ù�Ðר��Ø�Ö���ØØ�Ö��Ð�ØÓÐ��ÖÒÁÔÖ�×�ÒØ��ÙÐÐÝ�ÑÔÐ � Ø�ÓÒÛ���ÓÑ��Ò�ׯ��Ì�Ò�ÙÖÓ�ÚÓÐÙØ�ÓÒ�ÖÝÓÔØ�Ñ�Þ � Ð�Ø�Ò��ÙÒØ�ÓÒ�ÔÔÖÓÜ�Ñ�ØÓÖÖ�ÔÖ�×�ÒØ�Ø�ÓÒר��Ø�Ò��Ð� Ø�ÓÒØ��Ò�ÕÙ�Û�Ø�ÉÐ��ÖÒ�Ò��ÔÓÔÙÐ�ÖÌ�Ñ�Ø�Ó�Ì� � �Æ��ÒØ�Ò��Ú��Ù�ÐÐ��ÖÒ�Ò�Ì��×Ñ�Ø�Ó��ÚÓÐÚ�×�Ò��Ú� � ÓÔØ�Ñ�Þ�Ø�ÓÒ��ÐÐ�ÒØ��×�Ø��ÓÖÝ��Ú�ÐÓÔ�Ò��«�Ø�Ú�Ö��Ò �ÓÖÁÒר����ØÖ���Ú�×ÓÒÐÝÔÓ×�Ø�Ú��Ò�Ò���Ø�Ú�Ö�Û�Ö� × ÔÖÓ�Ð�Ñ××Ù��×ÖÓ�ÓØÓÒØÖÓÐ��Ñ�ÔÐ�Ý�Ò��Ò�×Ýר�Ñ �ÒÛ���Ø�����ÒØÒ�Ú�Ö×��×�Ü�ÑÔÐ�×Ó�ÓÖÖ�Ø����Ú 1.

A survey of Monte Carlo tree search methods

by Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Stephen Tavener, Diego Perez, Spyridon Samothrakis, Simon Colton, et al. - IEEE Transactions on Computational Intelligence and AI in Games, 2012
"... Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a ra ..."
Abstract - Cited by 104 (18 self) - Add to MetaCart
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

Citation Context

...that currently appear suboptimal but may turn out to be superior in the long run. A K-armed bandit is defined by random variables X_{i,n} for 1 ≤ i ≤ K and n ≥ 1, where i indicates the arm of the bandit [13], [119], [120]. Successive plays of bandit i yield X_{i,1}, X_{i,2}, ..., which are independently and identically distributed according to an unknown law with unknown expectation µ_i. The K-armed bandit pro...
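Since this snippet states the bandit model of the Auer et al. (2002) paper whose citations are listed here, the UCB1 policy from that paper is worth sketching: play each arm once, then always play the arm maximising the empirical mean plus √(2 ln n / n_i). A minimal Python version, with the reward source left abstract as a `pull(i)` callable:

```python
import math

def ucb1(pull, K, horizon):
    """UCB1 of Auer et al. (2002); pull(i) returns a reward in [0, 1]."""
    counts = [0] * K
    sums = [0.0] * K
    for i in range(K):                  # initialisation: play each arm once
        sums[i] += pull(i)
        counts[i] = 1
    for n in range(K + 1, horizon + 1): # n = total plays so far
        i = max(range(K), key=lambda j: sums[j] / counts[j]
                + math.sqrt(2 * math.log(n) / counts[j]))
        sums[i] += pull(i)
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]   # empirical means
```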

Learning diverse rankings with multi-armed bandits

by Filip Radlinski, Robert Kleinberg, Thorsten Joachims - In Proceedings of the 25th ICML, 2008
"... Algorithms for learning to rank Web documents usually assume a document’s relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We presen ..."
Abstract - Cited by 102 (7 self) - Add to MetaCart
Algorithms for learning to rank Web documents usually assume a document’s relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We present two online learning algorithms that directly learn a diverse ranking of documents based on users’ clicking behavior. We show that these algorithms minimize abandonment, or alternatively, maximize the probability that a relevant document is found in the top k positions of a ranking. Moreover, one of our algorithms asymptotically achieves optimal worst-case performance even if users’ interests change.

Citation Context

...lled one-armed bandits). The goal of standard MAB algorithms is to select the optimal sequence of slot machines to play to maximize the expected total reward collected. For further details, refer to (Auer et al., 2002a). The ranked bandits algorithm runs an MAB instance MAB_i for each rank i. Each of the k copies of the multi-armed bandit algorithm maintains a value (or index) for every document. When selecting the...
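The ranked-bandits construction in this snippet composes k independent MAB instances, one per rank. A schematic selection step, with `mab.select` as a hypothetical interface name rather than the paper's API:

```python
def ranked_bandits_select(mabs, documents):
    """Build a top-k ranking, one bandit instance per rank (k = len(mabs)).

    Each mabs[i] proposes a document for rank i; if it repeats an earlier
    pick, fall back to an arbitrary document not yet in the ranking.
    """
    ranking = []
    for mab in mabs:
        choice = mab.select(documents)
        if choice in ranking:   # duplicate: substitute any unranked document
            choice = next(d for d in documents if d not in ranking)
        ranking.append(choice)
    return ranking
```

In the paper's feedback rule, the bandit at rank i is credited only when its own proposed document is the one clicked; clicks on fallback substitutes count as zero reward for that rank.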

Stochastic linear optimization under bandit feedback

by Varsha Dani, Thomas P. Hayes, Sham M. Kakade - In submission, 2008
"... In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the alg ..."
Abstract - Cited by 100 (8 self) - Add to MetaCart
In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer [2002]), we view the arms as vectors in R^n, and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms, k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general though, the optimal regret is Θ*(√T); our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of “upper confidence bounds.” The first, due to Auer [2002], but not fully analyzed, obtains a regret whose dependence on both n and T is essentially optimal, but which may be computationally intractable when the decision set is a polytope. The second version can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of √n in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.
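The "upper confidence bounds" idea carries over to the linear setting roughly as follows: maintain a regularised least-squares estimate of the unknown parameter and widen each arm's predicted value by its statistical uncertainty. A minimal sketch of one selection step (phrased for rewards rather than costs, with the confidence width β taken as a tunable constant rather than the paper's exact confidence radius):

```python
import numpy as np

def lin_ucb_step(A, b, arms, beta=1.0):
    """One step of a linear-bandit upper-confidence-bound rule.

    A: d x d design matrix (lambda * I plus sum of x x^T over past plays)
    b: d-vector (sum of reward * x over past plays)
    arms: list of d-dimensional arm vectors to choose among
    """
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                    # ridge-regression estimate
    scores = [x @ theta + beta * np.sqrt(x @ A_inv @ x) for x in arms]
    return int(np.argmax(scores))

# after observing reward r for the chosen arm x:
# A += np.outer(x, x); b += r * x
```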

Near-optimal Regret Bounds for Reinforcement Learning

by Peter Auer, Thomas Jaksch, Ronald Ortner
"... For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s ..."
Abstract - Cited by 98 (11 self) - Add to MetaCart
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm.