Results 1–10 of 11
Thompson sampling: An asymptotically optimal finite-time analysis
 In Algorithmic Learning Theory
Abstract

Cited by 35 (5 self)
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that had been lacking in the literature until now for the Bernoulli case.
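The Bernoulli Thompson sampling scheme analysed in this paper is simple to state: keep a Beta posterior per arm, sample a mean from each posterior, and play the arm with the largest sample. A minimal simulation sketch (the function name and experimental setup are ours, not from the paper):

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors.

    Returns the cumulative (pseudo-)regret after `horizon` pulls."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # observed reward-1 counts per arm
    failures = [0] * k   # observed reward-0 counts per arm
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Sample a mean from each arm's Beta posterior; play the argmax.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret
```

On a two-armed instance with a 0.4 gap, the cumulative regret grows logarithmically rather than linearly, which is the behaviour the finite-time analysis quantifies.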
Unimodal bandits
, 2011
Abstract

Cited by 13 (2 self)
We consider multi-armed bandit problems where the expected reward is unimodal over partially ordered arms. In particular, the arms may belong to a continuous interval or correspond to vertices in a graph, where the graph structure represents similarity in rewards. The unimodality assumption has an important advantage: we can determine whether a given arm is optimal by sampling the possible directions around it. This property allows us to quickly and efficiently find the optimal arm and detect abrupt changes in the reward distributions. For the case of bandits on graphs, we incur a regret proportional to the maximal degree and the diameter of the graph, instead of the total number of vertices.
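The neighbour-sampling idea the abstract relies on can be illustrated directly: under unimodality a local maximum is the global maximum, so checking whether an arm is optimal only requires sampling that arm and its graph neighbours, i.e. deg(arm) + 1 arms. A toy sketch (function name and sampling budget are our own choices):

```python
def is_local_max(sample, arm, neighbors, pulls=2000):
    """Decide whether `arm` looks optimal by comparing its empirical mean
    against the empirical means of its graph neighbours only.

    `sample(i)` draws one stochastic reward from arm i. Under the
    unimodality assumption, `arm` beating every neighbour implies it is
    globally optimal, so no other arm ever needs to be pulled."""
    def mean(i):
        return sum(sample(i) for _ in range(pulls)) / pulls
    m = mean(arm)
    return all(m >= mean(j) for j in neighbors)
```

On a path graph with unimodal means, the check accepts the peak arm and rejects the others, which is what lets the regret scale with the degree and diameter rather than the vertex count.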
Bounded regret in stochastic multi-armed bandits
 JMLR: Workshop and Conference Proceedings (2013), 1–13
, 2013
Abstract

Cited by 5 (2 self)
We study the stochastic multi-armed bandit problem when one knows the value µ⋆ of an optimal arm, as well as a positive lower bound on the smallest positive gap ∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows ∆, and that bounded regret of order 1/∆ is not possible if one only knows µ⋆.
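The paper's policy is not spelled out in the snippet, but a toy rule makes the role of the prior knowledge concrete: exploit the empirical leader as long as it looks consistent with µ⋆, and explore uniformly otherwise, so exploration eventually stops for good. This is purely illustrative and is not the policy proposed in the paper:

```python
import random

def play_with_known_mu_star(true_means, mu_star, gap, horizon, seed=0):
    """Toy illustration of bounded regret with known mu* and gap bound:
    exploit the empirically best arm while its mean stays above
    mu* - gap/2, otherwise explore uniformly. NOT the paper's policy.

    Returns the cumulative (pseudo-)regret."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Optimistic mean 1.0 for arms never pulled.
        means = [sums[i] / counts[i] if counts[i] else 1.0 for i in range(k)]
        leader = max(range(k), key=lambda i: means[i])
        if means[leader] >= mu_star - gap / 2:
            arm = leader             # leader is plausibly optimal: exploit
        else:
            arm = rng.randrange(k)   # no arm looks optimal: explore uniformly
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
    return regret
```

Because a suboptimal leader's empirical mean concentrates strictly below µ⋆ − ∆/2, exploration is triggered only finitely often, which is the intuition behind a time-uniform regret bound in this setting.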
MULTI-ARMED BANDIT PROBLEMS UNDER DELAYED FEEDBACK
, 2012
Abstract

Cited by 4 (0 self)
In this thesis, the multi-armed bandit (MAB) problem in online learning is studied when the feedback information is not observed immediately but rather after arbitrary, unknown, random delays. In the “stochastic” setting, when the rewards come from a fixed distribution, an algorithm is given that uses a non-delayed MAB algorithm as a black box. We also give a method to generalize the theoretical guarantees of non-delayed UCB-type algorithms to the delayed stochastic setting. Assuming the delays are independent of the rewards, we upper bound the penalty in the performance of these algorithms (measured by “regret”) by an additive term depending on the delays. When the rewards are chosen in an adversarial manner, we give a black-box style algorithm using multiple instances …
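The stochastic-setting reduction described above (a non-delayed MAB algorithm used as a black box) can be sketched as a wrapper that buffers each reward until its delay elapses and only then feeds it to the base algorithm. The class and method names below are our own, and this is a simplified sketch, not the thesis's exact construction:

```python
import heapq

class DelayedBanditWrapper:
    """Runs a non-delayed bandit algorithm under delayed feedback: each
    pulled arm's reward is buffered in a min-heap keyed by arrival time
    and handed to the base algorithm only once its delay has elapsed.

    `base` must expose select() -> arm and update(arm, reward)."""

    def __init__(self, base, delay_fn):
        self.base = base
        self.delay_fn = delay_fn     # draws a random delay for each pull
        self.pending = []            # min-heap of (arrival_time, arm, reward)

    def step(self, t, pull):
        # Deliver every buffered reward whose delay has elapsed by time t.
        while self.pending and self.pending[0][0] <= t:
            _, arm, reward = heapq.heappop(self.pending)
            self.base.update(arm, reward)
        # Let the base algorithm choose from the feedback it has so far.
        arm = self.base.select()
        reward = pull(arm)
        heapq.heappush(self.pending, (t + self.delay_fn(), arm, reward))
        return arm
```

The base algorithm simply sees a slower reward stream, which is why the performance penalty appears as an additive term depending on the delays.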
Building Bridges: Viewing Active Learning from the Multi-Armed Bandit Lens
Abstract

Cited by 4 (1 self)
In this paper we propose a multi-armed bandit inspired, pool-based active learning algorithm for the problem of binary classification. By carefully constructing an analogy between active learning and multi-armed bandits, we utilize ideas such as lower confidence bounds and self-concordant regularization from the multi-armed bandit literature to design our proposed algorithm. Our algorithm is sequential: in each round it assigns a sampling distribution over the pool, samples one point from this distribution, and queries the oracle for the label of this sampled point. The design of this sampling distribution is also inspired by the analogy between active learning and multi-armed bandits. We show how to derive the lower confidence bounds required by our algorithm. Experimental comparisons to previously proposed active learning algorithms show superior performance on some standard UCI datasets.
An optimal algorithm for the Thresholding Bandit Problem
Maurilio Gutzeit
Abstract
We study a specific combinatorial pure exploration stochastic bandit problem where the learner aims at finding the set of arms whose means are above a given threshold, up to a given precision, and for a fixed time horizon. We propose a parameter-free algorithm based on an original heuristic, and prove that it is optimal for this problem by deriving matching upper and lower bounds. To the best of our knowledge, this is the first non-trivial pure exploration setting with fixed budget for which optimal strategies are constructed.
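One way to picture an anytime, parameter-free heuristic for this problem is an index rule that keeps pulling the arm whose classification relative to the threshold τ is least certain. The sketch below follows that spirit; the exact index and details of the paper's algorithm may differ:

```python
import math
import random

def apt_threshold(sample, k, tau, budget, eps=0.0, seed=0):
    """Sketch of an index rule for the thresholding bandit: repeatedly
    pull the arm minimising sqrt(T_i) * (|mean_i - tau| + eps), i.e. the
    arm we are least sure about relative to the threshold tau. After the
    budget is spent, return the arms whose empirical means exceed tau.

    `sample(i, rng)` draws one stochastic reward from arm i."""
    rng = random.Random(seed)
    counts = [0] * k
    sums = [0.0] * k
    for i in range(k):                # initialise: pull every arm once
        sums[i] += sample(i, rng)
        counts[i] = 1
    for _ in range(budget - k):
        def index(i):
            return math.sqrt(counts[i]) * (abs(sums[i] / counts[i] - tau) + eps)
        arm = min(range(k), key=index)
        sums[arm] += sample(arm, rng)
        counts[arm] += 1
    return {i for i in range(k) if sums[i] / counts[i] >= tau}
```

Arms whose means sit close to τ automatically receive more of the budget, which is the intuition behind matching upper and lower bounds in the fixed-budget regime.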
Risk-Aversion in Multi-armed Bandits
 In NIPS: Twenty-Sixth Annual Conference on Neural Information Processing Systems, 2012 (author manuscript)
, 2011
Abstract
This paper is devoted to regret lower bounds in the classical model of the stochastic multi-armed bandit. A well-known result of Lai and Robbins, later extended by Burnetas and Katehakis, established a logarithmic lower bound for all consistent policies. We relax the notion of consistency and exhibit a generalisation of the logarithmic bound. We also show the non-existence of a logarithmic bound in the general case of Hannan consistency. To obtain these results, we study variants of popular Upper Confidence Bound (UCB) policies. As a by-product, we prove that it is impossible to design an adaptive policy that would select the better of two algorithms by taking advantage of the properties of the environment.
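The UCB policies whose variants the paper studies share a common template: play the arm maximising an empirical mean plus a confidence bonus. A standard UCB1-style sketch (the simulation setup is ours):

```python
import math
import random

def ucb1(true_means, horizon, seed=0):
    """UCB1-style policy for Bernoulli bandits: play the arm maximising
    the empirical mean plus a sqrt(2 ln t / T_i) confidence bonus.

    Returns the cumulative (pseudo-)regret after `horizon` pulls."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(true_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1              # pull each arm once to initialise
        else:
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
    return regret
```

The logarithmic lower bound discussed in the abstract says no consistent policy, including any tuning of this bonus, can do better than logarithmic regret in the horizon.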