Results 1–10 of 17
lil’ UCB: an optimal exploration algorithm for multi-armed bandits
, 2013
Abstract

Cited by 13 (2 self)
The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm we avoid a union bound over the arms that has been observed in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and also show through simulations that it provides superior performance with respect to the state-of-the-art.
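As a rough illustration of the approach this abstract describes, the Python sketch below pairs an LIL-inspired confidence radius with a stopping rule that fires when one arm has received far more pulls than the rest combined. The constants, the exact form of `lil_radius`, and the factor 9 in the stopping test are illustrative assumptions of this sketch, not the paper's exact specification.

```python
import math
import random

def lil_radius(n, delta, sigma=0.5, eps=0.01):
    # LIL-inspired confidence radius: grows like sqrt(log log n / n).
    # Constants here are illustrative, not the paper's tuned values.
    t = (1 + eps) * n
    inner = max(math.log(max(t, math.e)) / delta, math.e)
    return (1 + math.sqrt(eps)) * math.sqrt(
        2 * sigma**2 * (1 + eps) * math.log(inner) / n)

def lil_ucb(arms, delta=0.05, max_pulls=20000):
    # arms: list of zero-argument callables returning a noisy reward.
    k = len(arms)
    counts = [1] * k
    means = [arm() for arm in arms]  # pull each arm once to initialise
    for _ in range(max_pulls):
        # stop when one arm dominates the total pull count (factor 9 is
        # an illustrative choice for the rule's parameter)
        for i in range(k):
            if counts[i] > 1 + 9 * (sum(counts) - counts[i]):
                return i
        # otherwise pull the arm with the highest upper confidence bound
        j = max(range(k), key=lambda i: means[i] + lil_radius(counts[i], delta))
        r = arms[j]()
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
    return max(range(k), key=lambda i: means[i])
```

Note how the radius depends only on an arm's own pull count, which is what lets this style of algorithm avoid a union bound over the arms.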
Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing
, 2014
Abstract

Cited by 2 (0 self)
We study the problem of selecting K arms with the highest expected rewards in a stochastic N-armed bandit game. Instead of using existing evaluation metrics (e.g., misidentification probability (Bubeck et al., 2013) or the metric in Explore-K (Kalyanakrishnan & Stone, 2010)), we propose to use the aggregate regret, which is defined as the gap between the average reward of the optimal solution and that of our solution. Besides being a natural metric by itself, we argue that in many applications, such as our motivating example from crowdsourcing, the aggregate regret bound is more suitable. We propose a new PAC algorithm, which, with probability at least 1 − δ, identifies a set of K arms with regret at most . We provide a detailed analysis of the sample complexity of our algorithm. To complement this, we establish a lower bound on the expected number of samples for Bernoulli bandits and show that the sample complexity of our algorithm matches the lower bound. Finally, we report experimental results on both synthetic and real data sets, which demonstrate the superior performance of the proposed algorithm.
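The aggregate regret this abstract proposes is simple to state in code. The sketch below assumes the true arm means are known, so it is only the evaluation metric (what an experimenter would compute), not anything the learner can see; the function name is made up here.

```python
def aggregate_regret(means, selected):
    # Average reward of the best K arms minus that of the K selected arms,
    # where K = len(selected). Zero iff the selection is optimal on average.
    k = len(selected)
    best = sorted(means, reverse=True)[:k]
    return (sum(best) - sum(means[i] for i in selected)) / k
```

Because the gaps are averaged over K, one badly chosen arm among many good ones is penalized less than under a misidentification-probability metric, which is the abstract's argument for crowdsourcing-style applications.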
On identifying good options under combinatorially structured feedback in finite noisy environments
 in Proceedings of the International Conference on Machine Learning (ICML),
, 2015
Abstract

Cited by 1 (0 self)
We consider the problem of identifying a good option out of a finite set of options under combinatorially structured, noisy feedback about the quality of the options in a sequential process: in each round, a subset of the options, from an available set of subsets, can be selected to receive noisy information about the quality of the options in the chosen subset. The goal is to identify the highest quality option, or a group of options of the highest quality, with a small error probability, while using the smallest number of measurements. The problem generalizes best-arm identification problems. By extending previous work, we design new algorithms that are shown to be able to exploit the combinatorial structure of the problem in a non-trivial fashion, while being unimprovable in special cases. The algorithms call a set multi-covering oracle, hence their performance and efficiency are strongly tied to whether the associated set multi-covering problem can be efficiently solved.
Minimizing simple and cumulative regret in Monte-Carlo tree search
 in Proceedings of the Computer Games Workshop at the 21st European Conference on Artificial Intelligence
, 2014
Abstract

Cited by 1 (0 self)
Regret minimization is important in both the Multi-Armed Bandit problem and Monte-Carlo Tree Search (MCTS). Recently, simple regret, i.e., the regret of not recommending the best action, has been proposed as an alternative to cumulative regret in MCTS, i.e., regret accumulated over time. Each type of regret is appropriate in different contexts. Although the majority of MCTS research applies the UCT selection policy for minimizing cumulative regret in the tree, this paper introduces a new MCTS variant, Hybrid MCTS (H-MCTS), which minimizes both types of regret in different parts of the tree. H-MCTS uses SHOT, a recursive version of Sequential Halving, to minimize simple regret near the root, and UCT to minimize cumulative regret when descending further down the tree. We discuss the motivation for this new search technique, and show the performance of H-MCTS in six distinct ...
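Sequential Halving, the fixed-budget routine that SHOT builds on, can be sketched as follows. This is the generic flat-bandit version, not H-MCTS itself, and the even split of the budget across rounds is a simplification.

```python
import math
import random

def sequential_halving(arms, budget):
    # arms: list of zero-argument callables returning noisy rewards.
    # Repeatedly sample all survivors equally, then discard the worse half.
    survivors = list(range(len(arms)))
    rounds = math.ceil(math.log2(len(arms)))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        per_arm = max(1, budget // (len(survivors) * rounds))
        means = {i: sum(arms[i]() for _ in range(per_arm)) / per_arm
                 for i in survivors}
        survivors.sort(key=lambda i: means[i], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Because later rounds have fewer survivors, each remaining arm is sampled more heavily exactly when the comparison gets harder, which is what gives the method its good simple-regret behaviour near the root of a search tree.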
Simple regret for infinitely many armed bandits. arXiv:1505.04627, http://arxiv.org/abs/1505.04627
, 2015
Abstract

Cited by 1 (0 self)
We consider a stochastic bandit problem with infinitely many arms. In this setting, the learner has no chance of trying all the arms even once and has to dedicate its limited number of samples only to a certain number of arms. All previous algorithms for this setting were designed for minimizing the cumulative regret of the learner. In this paper, we propose an algorithm aiming at minimizing the simple regret. As in the cumulative regret setting of infinitely many armed bandits, the rate of the simple regret will depend on a parameter β characterizing the distribution of the near-optimal arms. We prove that depending on β, our algorithm is minimax optimal either up to a multiplicative constant or up to a log(n) factor. We also provide extensions to several important cases: when β is unknown, in a natural setting where the near-optimal arms have a small variance, and in the case of unknown time horizon.
Sparse Dueling Bandits
Abstract

Cited by 1 (0 self)
The dueling bandit problem is a variation of the classical multi-armed bandit in which the allowable actions are noisy comparisons between pairs of arms. This paper focuses on a new approach for finding the “best” arm according to the Borda criterion using noisy comparisons. We prove that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse squared gaps between the Borda scores of each suboptimal arm and the best arm. We explore this dependence further and consider structural constraints on the pairwise comparison matrix (a particular form of sparsity natural to this problem) that can significantly reduce the sample complexity. This motivates a new algorithm called Successive Elimination with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner using fewer samples than standard algorithms. We also evaluate the new algorithm experimentally with synthetic and real data. The results show that the sparsity model and the new algorithm can provide significant improvements over standard approaches.
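The Borda criterion mentioned here reduces each arm's pairwise win probabilities to a single score, and the Borda winner is the arm maximizing that score. A minimal sketch, assuming the preference matrix is known (in the bandit problem it must of course be estimated from noisy duels):

```python
def borda_scores(P):
    # P[i][j] = probability that arm i wins a duel against arm j; the
    # diagonal is ignored. The Borda score of arm i is its average win
    # probability against a uniformly random opponent.
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

def borda_winner(P):
    scores = borda_scores(P)
    return max(range(len(P)), key=lambda i: scores[i])
```

The gaps the abstract refers to are differences between these scores, so arms whose average win rate is close to the winner's are the expensive ones to rule out.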
On Finding the Largest Mean Among Many
, 2013
Abstract

Cited by 1 (1 self)
Sampling from distributions to find the one with the largest mean arises in a broad range of applications, and it can be mathematically modeled as a multi-armed bandit problem in which each distribution is associated with an arm. This paper studies the sample complexity of identifying the best arm (largest mean) in a multi-armed bandit problem. Motivated by large-scale applications, we are especially interested in identifying situations where the total number of samples that are necessary and sufficient to find the best arm scales linearly with the number of arms. We present a single-parameter multi-armed bandit model that spans the range from linear to superlinear sample complexity. We also give a new algorithm for best arm identification, called PRISM, with linear sample complexity for a wide range of mean distributions. The algorithm, like most exploration procedures for multi-armed bandits, is adaptive in the sense that the next arms to sample are selected based on previous samples. We compare the sample complexity of adaptive procedures with simpler non-adaptive procedures using new lower bounds. For many problem instances, the increased sample complexity required by non-adaptive procedures is a polynomial factor of the number of arms.
PAC Lower Bounds and Efficient Algorithms for The Max K-Armed Bandit Problem
Abstract
We consider the Max K-Armed Bandit problem, where a learning agent is faced with several stochastic arms, each a source of i.i.d. rewards of unknown distribution. At each time step the agent chooses an arm, and observes the reward of the obtained sample. Each sample is considered here as a separate item with the reward designating its value, and the goal is to find an item with the highest possible value. Our basic assumption is a known lower bound on the tail function of the reward distributions. Under the PAC framework, we provide a lower bound on the sample complexity of any (ε, δ)-correct algorithm, and propose an algorithm that attains this bound up to logarithmic factors. We provide an analysis of the robustness of the proposed algorithm to the model assumptions, and further compare its performance to the simple non-adaptive variant, in which the arms are chosen randomly at each stage.
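The non-adaptive baseline the abstract compares against is easy to state: spread the budget over the arms and return the single largest sample observed. The sketch below uses an even round-robin split rather than random arm choices, a simplification, and the function name is made up here.

```python
def max_k_naive(arms, n):
    # arms: list of zero-argument callables returning i.i.d. rewards.
    # Pull each arm n // len(arms) times and keep the maximum sample:
    # the objective is the highest single value seen, not the best mean.
    per_arm = max(1, n // len(arms))
    return max(arm() for arm in arms for _ in range(per_arm))
```

This makes the contrast with ordinary best-arm identification concrete: what matters per arm is its upper tail, not its mean, which is why the paper's assumption is a lower bound on the tail functions.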
Top Arm Identification in Multi-Armed Bandits with Batch Arm Pulls
Abstract
We introduce a new multi-armed bandit (MAB) problem in which arms must be sampled in batches, rather than one at a time. This is motivated by applications in social media monitoring and biological experimentation where such batch constraints naturally arise. This paper develops and analyzes algorithms for batch MABs and top arm identification, for both fixed confidence and fixed budget settings. Our main theoretical results show that the batch constraint does not significantly affect the sample complexity of top arm identification compared to unconstrained MAB algorithms. Alternatively, if one views a batch as the fundamental sampling unit, then the results can be interpreted as showing that the sample complexity of batch MABs can be significantly less than traditional MABs. We demonstrate the new batch MAB algorithms with simulations and in two interesting real-world applications: (i) microwell array experiments for identifying genes that are important in virus replication and (ii) finding the most active users in Twitter on a specific topic.
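One way to picture the batch constraint: every round commits to a full batch of pulls before any of that round's feedback can be used. The elimination-style sketch below is a generic illustration under that constraint, not the paper's algorithm; the confidence radius is a standard Hoeffding-style choice assumed here.

```python
import math
import random

def batch_top_arm(arms, batch_size, num_batches, delta=0.05):
    # arms: list of zero-argument callables returning noisy rewards.
    # Each round fills a whole batch from the surviving arms, then drops
    # arms whose upper confidence bound falls below the leader's lower one.
    k = len(arms)
    counts = [0] * k
    means = [0.0] * k
    active = list(range(k))
    for _ in range(num_batches):
        # commit to the full batch up front: round-robin over survivors
        batch = [active[t % len(active)] for t in range(batch_size)]
        for i in batch:
            r = arms[i]()
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
        def radius(i):
            # Hoeffding-style confidence radius (illustrative constants)
            return math.sqrt(math.log(4 * counts[i] ** 2 / delta)
                             / (2 * counts[i]))
        leader = max(active, key=lambda i: means[i] - radius(i))
        active = [i for i in active
                  if means[i] + radius(i) >= means[leader] - radius(leader)]
    return max(active, key=lambda i: means[i])
```

Since eliminations only take effect between batches, the worst case wastes at most one batch of pulls per elimination, which is the intuition behind the paper's claim that the batch constraint costs little in sample complexity.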