Pure exploration in multi-armed bandits problems
 In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory (ALT 2009)
, 2009
Abstract

Cited by 79 (16 self)
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that sequentially explore the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simple regrets. The latter are related to the (expected) gains of the decisions that the strategies would recommend for a new one-shot instance of the same multi-armed bandit problem. Here, exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when cumulative regrets are considered and exploitation needs to be performed at the same time. We start by indicating the links between simple and cumulative regrets: a small cumulative regret entails a small simple regret, but too small a cumulative regret prevents the simple regret from decreasing exponentially towards zero, its optimal distribution-dependent rate. We therefore introduce specific strategies, for which we prove both distribution-dependent and distribution-free bounds. A concluding experimental study puts these theoretical bounds in perspective and shows the interest of non-uniform exploration of the arms.
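The distinction between simple and cumulative regret can be made concrete with a small sketch (not from the paper; the arm means, budget, and round-robin exploration rule are illustrative assumptions): explore uniformly for a fixed budget, recommend the empirically best arm, and score only the quality of that final recommendation.

```python
import random

def simple_regret_uniform(means, budget, seed=0):
    """Explore arms uniformly for `budget` rounds (Bernoulli rewards),
    then recommend the arm with the highest empirical mean.
    Returns the simple regret: best mean minus the recommended arm's mean."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    for t in range(budget):
        arm = t % len(means)            # round-robin uniform exploration
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
    recommended = max(range(len(means)),
                      key=lambda i: sums[i] / counts[i])
    return max(means) - means[recommended]

# With a generous budget the empirical best is usually the true best arm,
# so the simple regret is typically zero regardless of cumulative loss.
regret = simple_regret_uniform([0.5, 0.6, 0.8], budget=3000)
```

Note that this pure-exploration strategy pays no attention to its cumulative reward during the budget, which is exactly the freedom the abstract contrasts with the classical setting.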
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms
 In WSDM
, 2011
Abstract

Cited by 73 (15 self)
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences, but it is very challenging due to their "partial-label" nature. Common practice is to create a simulator of the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
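The core of the replay idea fits in a few lines; a hedged sketch (the function names and the toy log are illustrative, and the unbiasedness guarantee assumes the logged actions were chosen uniformly at random):

```python
def replay_evaluate(policy, logged_events):
    """Offline replay evaluation sketch.  `logged_events` is a list of
    (context, logged_action, reward) triples collected by a *uniformly
    random* logging policy.  Only events where the evaluated policy agrees
    with the logged action are kept; the average reward over those matched
    events is an unbiased estimate of the policy's per-round reward."""
    total, matched = 0.0, 0
    history = []                       # events the evaluated policy "saw"
    for context, logged_action, reward in logged_events:
        if policy(context, history) == logged_action:
            matched += 1
            total += reward
            history.append((context, logged_action, reward))
    return total / matched if matched else 0.0

# Toy log over two actions: the action equal to the context is rewarded.
log = [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 1.0), (1, 0, 0.0)]
always_match = lambda context, history: context
estimate = replay_evaluate(always_match, log)
```

Because mismatched events are simply discarded, no simulator of the environment is needed, which is the method's advantage over simulator-based evaluation described above.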
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
 Journal of Machine Learning Research
, 2004
Abstract

Cited by 66 (3 self)
We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given n arms, a total of O((n/ε²) log(1/δ)) trials suffices in order to find an ε-optimal arm with probability at least 1 − δ. We establish a matching lower bound of Ω((n/ε²) log(1/δ)) on the expected number of trials under any sampling policy. We furthermore generalize the lower bound and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not is also discussed. For this case, we provide a lower bound of Ω(·) on the expected number of trials, as well as a sampling policy with a matching upper bound. If, instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish matching upper and lower bounds of the form Θ(·). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.
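The upper bound the abstract refers to is achievable by the naive strategy of sampling every arm equally; a hedged sketch (constants and the Bernoulli reward model are illustrative assumptions, with Hoeffding's inequality supplying the guarantee):

```python
import math, random

def naive_pac_best_arm(pull, n_arms, eps, delta):
    """Naive PAC arm selection: pull every arm
    m = ceil((2 / eps**2) * log(2 * n_arms / delta)) times and return the
    empirically best arm.  By Hoeffding's inequality the returned arm is
    eps-optimal with probability at least 1 - delta (rewards in [0, 1])."""
    m = math.ceil((2.0 / eps ** 2) * math.log(2.0 * n_arms / delta))
    means = [sum(pull(i) for _ in range(m)) / m for i in range(n_arms)]
    return max(range(n_arms), key=lambda i: means[i]), m

rng = random.Random(0)
true_means = [0.2, 0.5, 0.9]
pull = lambda i: 1.0 if rng.random() < true_means[i] else 0.0
best, pulls_per_arm = naive_pac_best_arm(pull, n_arms=3, eps=0.1, delta=0.05)
```

This naive scheme spends O((n/ε²) log(n/δ)) pulls in total; the paper's lower bounds show how much of that cost is unavoidable.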
Multi-armed bandit algorithms and empirical evaluation
 In European Conference on Machine Learning
, 2005
Abstract

Cited by 63 (0 self)
The multi-armed bandit problem for a gambler is to decide which arm of a K-slot machine to pull to maximize his total reward in a series of trials. Many real-world learning and optimization problems can be modeled in this way. Several strategies or algorithms have been proposed as a solution to this problem in the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms. This paper provides a preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm, Poker (Price Of Knowledge and Estimated Reward), whose performance compares favorably to that of other existing algorithms in several experiments. One remarkable outcome of our experiments is that the most naive approach, the ε-greedy strategy, often proves hard to beat.
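The ε-greedy strategy the abstract highlights is only a few lines; a sketch under illustrative assumptions (Bernoulli rewards, a fixed seed, and forced exploration until every arm has been tried once):

```python
import random

def epsilon_greedy(means, horizon, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit strategy: with probability epsilon pull a
    uniformly random arm (explore), otherwise pull the arm with the best
    empirical mean so far (exploit).  Rewards are Bernoulli(means[arm]).
    Returns the total reward collected over `horizon` rounds."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(len(means))      # explore
        else:
            arm = max(range(len(means)), key=lambda i: sums[i] / counts[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

reward = epsilon_greedy([0.3, 0.7], horizon=2000)
```

Its simplicity is the point: despite ignoring confidence intervals entirely, this baseline is the one the paper found hard to beat in practice.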
PAC bounds for multi-armed bandit and Markov decision processes
 In Fifteenth Annual Conference on Computational Learning Theory (COLT)
, 2002
Abstract

Cited by 61 (2 self)
The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound. We show how, given an algorithm for the PAC-model multi-armed bandit problem, one can derive a batch learning algorithm for Markov decision processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
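The algorithm behind the O((n/ε²) log(1/δ)) bound is Median Elimination: sample every surviving arm, discard the empirically worse half, then tighten the accuracy and confidence parameters. A compact sketch (the Bernoulli reward model and toy instance are illustrative; the schedule follows the usual statement of the algorithm):

```python
import math, random

def median_elimination(pull, arms, eps, delta):
    """Median Elimination sketch: in each round, sample every surviving arm
    ceil((1 / (eps_l/2)**2) * log(3 / delta_l)) times, keep the better half,
    and update eps_l <- 3*eps_l/4, delta_l <- delta_l/2.  Since the arm set
    halves each round, the total pull count is O((n / eps**2) * log(1 / delta)),
    avoiding the log(n) factor of naive uniform sampling."""
    arms = list(arms)
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        m = math.ceil((1.0 / (eps_l / 2.0) ** 2) * math.log(3.0 / delta_l))
        means = {a: sum(pull(a) for _ in range(m)) / m for a in arms}
        arms.sort(key=lambda a: means[a], reverse=True)
        arms = arms[: max(1, len(arms) // 2)]    # keep the better half
        eps_l, delta_l = 0.75 * eps_l, 0.5 * delta_l
    return arms[0]

rng = random.Random(1)
true_means = [0.1, 0.2, 0.3, 0.9]
best = median_elimination(lambda a: 1.0 if rng.random() < true_means[a] else 0.0,
                          arms=range(4), eps=0.4, delta=0.1)
```

The key design choice is that later rounds demand higher accuracy but face exponentially fewer arms, so the per-round costs form a convergent series.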
Linearly Parameterized Bandits
, 2008
Abstract

Cited by 57 (0 self)
We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector Z ∈ R^r, where r ≥ 2. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. We propose a policy based on least-squares estimation and uncertainty ellipsoids, which generalizes the upper confidence index approach pioneered by Lai and Robbins (1985). The cumulative regret and Bayes risk under our proposed policy admit an upper bound of the form r√T log^{3/2} T, which is linear in the dimension r and independent of the number of arms. We also establish Ω(r√T) lower bounds on the regret and risk, showing that our proposed policy is nearly optimal.
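A least-squares/ellipsoid policy of this general kind can be sketched briefly; this toy version fixes r = 2 so the matrix inverse can be written by hand, and the arm set, noise model, and confidence scaling `alpha` are illustrative assumptions, not the paper's exact policy:

```python
import math, random

def linucb(arm_features, pull, horizon, alpha=1.0, lam=1.0, seed=0):
    """Uncertainty-ellipsoid bandit sketch in dimension r = 2.  Maintains a
    ridge-regression estimate theta of the unknown reward vector and pulls
    the arm x maximizing  <theta, x> + alpha * sqrt(x^T A^{-1} x),
    where A = X^T X + lam * I accumulates the pulled feature vectors."""
    rng = random.Random(seed)
    A = [[lam, 0.0], [0.0, lam]]          # X^T X + lam * I
    b = [0.0, 0.0]                        # X^T y
    total = 0.0
    for _ in range(horizon):
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        Ainv = [[A[1][1] / det, -A[0][1] / det],
                [-A[1][0] / det, A[0][0] / det]]
        theta = [Ainv[0][0] * b[0] + Ainv[0][1] * b[1],
                 Ainv[1][0] * b[0] + Ainv[1][1] * b[1]]
        def ucb(x):
            mean = theta[0] * x[0] + theta[1] * x[1]
            width = math.sqrt(x[0] * (Ainv[0][0] * x[0] + Ainv[0][1] * x[1])
                              + x[1] * (Ainv[1][0] * x[0] + Ainv[1][1] * x[1]))
            return mean + alpha * width
        x = max(arm_features, key=ucb)
        r = pull(x, rng)
        for i in range(2):
            b[i] += r * x[i]
            for j in range(2):
                A[i][j] += x[i] * x[j]
        total += r
    return total, theta

# Toy instance: true reward vector (1.0, 0.2) with Gaussian noise.
arms = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
noisy = lambda x, rng: x[0] * 1.0 + x[1] * 0.2 + rng.gauss(0.0, 0.1)
total, theta_hat = linucb(arms, noisy, horizon=500)
```

The ellipsoidal width term is what makes the regret depend on the dimension r rather than on the (possibly infinite) number of arms.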
Robbing the Bandit: Less Regret in Online Geometric Optimization Against an Adaptive Adversary
 In Proceedings of the 17th ACMSIAM Symposium on Discrete Algorithms (SODA
, 2006
Abstract

Cited by 55 (5 self)
We consider “online bandit geometric optimization,” a problem of iterated decision making in a largely unknown and constantly changing environment. The goal is to minimize “regret,” defined as the difference between the actual loss of an online decision-making procedure and that of the best single decision in hindsight. “Geometric optimization” refers to a generalization of the well-known multi-armed bandit problem, in which the decision space is some bounded subset of R^d, the adversary is restricted to linear loss functions, and regret bounds should depend on the dimensionality d, rather than the total number of possible decisions. “Bandit” refers to the setting in which the algorithm is only told its loss on each round, rather than the entire loss function. McMahan and Blum [10] presented the best known algorithm in this setting, and proved that its expected additive regret is O(poly(d) T^{3/4}). We simplify and improve their analysis of this algorithm to obtain regret O(poly(d) T^{2/3}). We also prove that, for a large class of full-information online optimization problems, the optimal regret against an adaptive adversary is the same as against a non-adaptive adversary.
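The regret notion used here is concrete enough to compute directly; a small sketch of the definition (the decision set, loss vectors, and played sequence are illustrative):

```python
def regret(decisions, loss_vectors, decision_set):
    """Regret in online geometric optimization: the total linear loss of
    the played decisions minus the loss of the best single decision in
    hindsight from `decision_set` (decisions and losses are vectors in R^d,
    and the loss of playing x against loss vector l is <x, l>)."""
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    incurred = sum(dot(x, l) for x, l in zip(decisions, loss_vectors))
    best_fixed = min(sum(dot(x, l) for l in loss_vectors) for x in decision_set)
    return incurred - best_fixed

# Two rounds in R^2 with two available decisions.
S = [(1.0, 0.0), (0.0, 1.0)]
losses = [(0.9, 0.1), (0.9, 0.1)]
played = [(1.0, 0.0), (0.0, 1.0)]   # played each decision once
r = regret(played, losses, S)       # incurred 1.0 vs best fixed 0.2
```

In the bandit version, the algorithm observes only `dot(x, l)` for the decision it played, never the full loss vector, which is what drives the gap between the T^{2/3} and full-information rates.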
Best Arm Identification in Multi-Armed Bandits
Abstract

Cited by 55 (11 self)
We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal, since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and is also independent of the scaling of the rewards. As a byproduct of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) ∑_i 1/∆_i², where the sum is over the suboptimal arms and ∆_i represents the difference between the mean reward of the best arm and that of arm i. This generalizes the well-known fact that one needs of order 1/∆² samples to differentiate the means of two distributions with gap ∆.
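The successive rejects policy admits a short sketch (the Bernoulli reward model and toy instance are illustrative; the phase lengths follow the usual log-bar schedule): split the budget into K − 1 phases, sample every surviving arm equally in each phase, and reject the empirically worst arm at each phase end.

```python
import math, random

def successive_rejects(pull, n_arms, budget, seed=0):
    """Successive Rejects sketch: run n_arms - 1 phases; in phase k each
    surviving arm is brought up to n_k total pulls, then the arm with the
    lowest empirical mean is rejected.  The schedule uses
    log_bar(K) = 1/2 + sum_{i=2..K} 1/i, making the policy parameter-free
    (no knowledge of the gaps Delta_i is needed)."""
    rng = random.Random(seed)
    K = n_arms
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    surviving = list(range(K))
    counts = [0] * K
    sums = [0.0] * K
    n_prev = 0
    for k in range(1, K):
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for arm in surviving:
            for _ in range(n_k - n_prev):       # top up to n_k pulls total
                counts[arm] += 1
                sums[arm] += pull(arm, rng)
        n_prev = n_k
        worst = min(surviving, key=lambda a: sums[a] / counts[a])
        surviving.remove(worst)
    return surviving[0]

means = [0.3, 0.4, 0.5, 0.9]
best = successive_rejects(lambda a, rng: 1.0 if rng.random() < means[a] else 0.0,
                          n_arms=4, budget=4000)
```

Unlike the highly exploring UCB policy, nothing here depends on the unobservable hardness quantity ∑_i 1/∆_i², or on the scale of the rewards.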
Contextual Bandits with Similarity Information
 In 24th Annual Conference on Learning Theory
, 2011
Abstract

Cited by 53 (8 self)
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature has considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where, before each round, an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs, which bounds from above the difference between the respective expected payoffs. Prior work
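A natural baseline against which similarity-based contextual bandits can be understood is uniform discretization: partition the context space into fixed bins and run an independent UCB instance in each. This sketch, including the Lipschitz toy payoff, is illustrative and is not the paper's algorithm (which refines the partition adaptively):

```python
import math, random

def binned_contextual_ucb(pull, n_arms, n_bins, horizon, seed=0):
    """Uniform-discretization baseline for contextual bandits with
    similarity information: split the context space [0, 1] into n_bins
    equal bins and run a separate UCB1 instance in each.  If payoffs are
    Lipschitz in the context, contexts in one bin have similar expected
    rewards, so bin width trades approximation error against per-bin
    learning speed."""
    rng = random.Random(seed)
    counts = [[0] * n_arms for _ in range(n_bins)]
    sums = [[0.0] * n_arms for _ in range(n_bins)]
    total = 0.0
    for t in range(1, horizon + 1):
        context = rng.random()
        b = min(int(context * n_bins), n_bins - 1)
        if 0 in counts[b]:
            arm = counts[b].index(0)          # try each arm in the bin once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[b][a] / counts[b][a]
                      + math.sqrt(2.0 * math.log(t) / counts[b][a]))
        r = pull(context, arm, rng)
        counts[b][arm] += 1
        sums[b][arm] += r
        total += r
    return total

# Lipschitz toy payoff: arm 0 pays ~context, arm 1 pays ~(1 - context).
pull = lambda c, a, rng: (c if a == 0 else 1.0 - c) + rng.gauss(0.0, 0.05)
total = binned_contextual_ucb(pull, n_arms=2, n_bins=4, horizon=2000)
```

The similarity distance mentioned above is exactly what licenses grouping nearby context-arm pairs like this, since it upper-bounds the payoff difference within a bin.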