Results 1–10 of 78
Best Arm Identification in Multi-Armed Bandits
Abstract

Cited by 58 (9 self)
We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and is also independent of the scaling of the rewards. As a byproduct of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) ∑_i 1/∆_i², where the sum is over the suboptimal arms and ∆_i represents the difference between the mean reward of the best arm and that of arm i. This generalizes the well-known fact that one needs of order 1/∆² samples to differentiate the means of two distributions with gap ∆.
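The successive rejects procedure this abstract describes can be sketched in a few lines. The phase schedule below follows the paper's n_k recipe; the Bernoulli test arms and the `pull` callback are purely illustrative, not part of the original:

```python
import math
import random

def successive_rejects(pull, K, budget):
    """Best-arm identification by successive rejects: split the budget
    into K-1 phases and drop the empirically worst arm after each one."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    # n[k] = cumulative pulls per surviving arm by the end of phase k
    n = [0] + [math.ceil((budget - K) / (log_bar * (K + 1 - k)))
               for k in range(1, K)]
    active = list(range(K))
    sums, counts = [0.0] * K, [0] * K
    for k in range(1, K):
        for i in active:
            for _ in range(n[k] - n[k - 1]):
                sums[i] += pull(i)
                counts[i] += 1
        # reject the arm with the lowest empirical mean
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0]

random.seed(0)
means = [0.2, 0.3, 0.9, 0.4]
best = successive_rejects(lambda i: float(random.random() < means[i]),
                          K=4, budget=2000)
```

With means this well separated and a budget of 2000 pulls, the returned index is the best arm (here arm 2) with overwhelming probability.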
X-Armed Bandits
, 2010
Abstract

Cited by 28 (7 self)
We consider a generalization of stochastic bandits where the set of arms, ...
Regret bounds for Gaussian process bandit problems
 In AISTATS
, 2010
Abstract

Cited by 24 (3 self)
Bandit algorithms are concerned with trading off exploration against exploitation: a number of options are available, but we can only learn their quality by experimenting with them. We consider the scenario in which the reward distribution for arms is modelled by a Gaussian process and there is no noise in the observed reward. Our main result bounds the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, based on benign assumptions about the covariance function defining the Gaussian process. We further complement these upper bounds with corresponding lower bounds for particular covariance functions, demonstrating that in general there is at most a logarithmic looseness in our upper bounds.
Portfolio Allocation for Bayesian Optimization
Abstract

Cited by 23 (14 self)
Bayesian optimization with Gaussian processes has become an increasingly popular tool in the machine learning community. It is efficient and can be used when very little is known about the objective function, making it popular in expensive black-box optimization scenarios. It uses Bayesian methods to sample the objective efficiently using an acquisition function which incorporates the posterior estimate of the objective. However, there are several different parameterized acquisition functions in the literature, and it is often unclear which one to use. Instead of using a single acquisition function, we adopt a portfolio of acquisition functions governed by an online multi-armed bandit strategy. We propose several portfolio strategies, the best of which we call GP-Hedge, and show that this method outperforms the best individual acquisition function. We also provide a theoretical bound on the algorithm's performance.
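The portfolio mechanism can be sketched independently of the GP machinery: each round, Hedge samples one acquisition function with probability proportional to exp(η·gain), plays its nominee, and then credits every acquisition function with the estimated value of its own nominee. This is a simplified sketch; the `reward_of` callback is a stand-in for the GP posterior mean used in the paper:

```python
import math
import random

def hedge_portfolio_step(gains, eta, nominees, reward_of, rng):
    """One round of a Hedge portfolio over acquisition functions.

    nominees[j] is the point proposed by acquisition function j;
    reward_of(x) estimates the value of a point (e.g. posterior mean).
    """
    # softmax of the cumulative gains (shifted for numerical stability)
    m = max(gains)
    w = [math.exp(eta * (g - m)) for g in gains]
    probs = [x / sum(w) for x in w]
    # sample which acquisition function to follow this round
    r, acc, j = rng.random(), 0.0, 0
    for j, p in enumerate(probs):
        acc += p
        if r < acc:
            break
    chosen = nominees[j]
    # full-information Hedge update: every strategy is credited with
    # the estimated reward of its own nominee
    new_gains = [g + reward_of(x) for g, x in zip(gains, nominees)]
    return chosen, new_gains

rng = random.Random(0)
chosen, gains = hedge_portfolio_step(
    [0.0, 0.0], 1.0, ["x1", "x2"], {"x1": 1.0, "x2": 0.0}.get, rng)
```

Because every acquisition function is scored on its own nominee each round, the weights concentrate over time on whichever acquisition function keeps proposing high-value points.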
Open Loop Optimistic Planning
Abstract

Cited by 22 (8 self)
We consider the problem of planning in a stochastic and discounted environment with a limited numerical budget. More precisely, we investigate strategies exploring the set of possible sequences of actions, so that, once all available numerical resources (e.g. CPU time, number of calls to a generative model) have been used, one returns a recommendation on the best possible immediate action to follow based on this exploration. The performance of a strategy is assessed in terms of its simple regret, that is, the loss in performance resulting from choosing the recommended action instead of an optimal one. We first provide a minimax lower bound for this problem, and show that a uniform planning strategy matches this minimax rate (up to a logarithmic factor). Then we propose a UCB (Upper Confidence Bounds)-based planning algorithm, called OLOP (Open-Loop Optimistic Planning), which is also minimax optimal, and prove that it enjoys much faster rates when there is a small proportion of near-optimal sequences of actions. Finally, we compare our results with the regret bounds one can derive for our setting with bandit algorithms designed for an infinite number of arms.
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
Abstract

Cited by 20 (4 self)
We consider a global optimization problem of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally fitted.
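A minimal one-dimensional rendering of the SOO idea (illustrative only; the paper works with arbitrary hierarchical partitionings and reuses parent evaluations): in each sweep over depths, expand the best cell at depth d only if its value beats every best cell seen at shallower depths, so no semi-metric is ever consulted:

```python
def soo_maximize(f, budget, max_depth):
    """Simultaneous Optimistic Optimization on [0, 1], splitting each
    cell into three children and evaluating f at cell centres."""
    leaves = [(0, 0.0, 1.0, f(0.5))]      # (depth, lo, hi, value)
    evals = 1
    best_x, best_v = 0.5, leaves[0][3]
    while evals < budget:
        v_max = float("-inf")
        for d in range(max_depth + 1):
            at_d = [l for l in leaves if l[0] == d]
            if not at_d:
                continue
            leaf = max(at_d, key=lambda l: l[3])
            if leaf[3] <= v_max:
                continue                   # a shallower cell looked better
            v_max = leaf[3]
            leaves.remove(leaf)            # expand this cell
            _, lo, hi, _ = leaf
            w = (hi - lo) / 3.0
            for k in range(3):
                x = lo + (k + 0.5) * w
                v = f(x)
                evals += 1
                leaves.append((d + 1, lo + k * w, lo + (k + 1) * w, v))
                if v > best_v:
                    best_x, best_v = x, v
                if evals >= budget:
                    return best_x, best_v
    return best_x, best_v

best_x, best_v = soo_maximize(lambda x: -(x - 0.7) ** 2, 100, 8)
```

On this smooth unimodal test function the returned point lands close to the true maximizer 0.7 well within the 100-evaluation budget.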
Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization
 In Proc. International Conference on Machine Learning
, 2012
Abstract

Cited by 20 (4 self)
How can we take advantage of opportunities for experimental parallelization in exploration-exploitation tradeoffs? In many experimental scenarios, it is often desirable to execute experiments simultaneously or in batches, rather than only performing one at a time. Additionally, observations may be both noisy and expensive. We introduce Gaussian Process Batch Upper Confidence Bound (GP-BUCB), an upper confidence bound-based algorithm, which models the reward function as a sample from a Gaussian process and which can select batches of experiments to run in parallel. We prove a general regret bound for GP-BUCB, as well as the surprising result that for some common kernels, the asymptotic average regret can be made independent of the batch size. The GP-BUCB algorithm is also applicable in the related case of a delay between initiation of an experiment and observation of its results, for which the same regret bounds hold. We also introduce Gaussian Process Adaptive Upper Confidence Bound (GP-AUCB), a variant of GP-BUCB which can exploit parallelism in an adaptive manner. We evaluate GP-BUCB and GP-AUCB on several simulated and real data sets. These experiments show that GP-BUCB and GP-AUCB are competitive with state-of-the-art heuristics.
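The hallucination trick at the heart of GP-BUCB can be shown with a deliberately simplified posterior: independent arms instead of a full GP covariance, and an assumed observation-noise parameter `noise_var` (both are simplifications of mine, not the paper's setup). Within a batch the mean is frozen, but each selection shrinks the chosen point's variance as if it had already been observed, which pushes later picks elsewhere:

```python
import math

def gp_bucb_batch(mu, var, noise_var, beta, batch_size):
    """Pick a batch by repeatedly maximising the UCB, hallucinating
    each chosen point as observed so the batch spreads out."""
    var = list(var)
    batch = []
    for _ in range(batch_size):
        i = max(range(len(mu)),
                key=lambda j: mu[j] + math.sqrt(beta * var[j]))
        batch.append(i)
        # hallucinated update: mean unchanged, variance shrinks as if
        # one noisy observation had been made at point i
        var[i] = 1.0 / (1.0 / var[i] + 1.0 / noise_var)
    return batch

batch = gp_bucb_batch(mu=[1.0, 0.9], var=[1.0, 1.0],
                      noise_var=0.1, beta=1.0, batch_size=2)
```

Without the variance update both picks would land on arm 0; with it, the second pick moves to arm 1, which is exactly the diversity the batch needs.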
ǫ–First Policies for Budget–Limited Multi–Armed Bandits
 Long Tran-Thanh
Abstract

Cited by 18 (3 self)
We introduce the budget-limited multi-armed bandit (MAB), which captures situations where a learner's actions are costly and constrained by a fixed budget that is incommensurable with the rewards earned from the bandit machine, and then describe a first algorithm for solving it. Since the learner has a budget, the problem's duration is finite. Consequently, an optimal exploitation policy is not to pull the optimal arm repeatedly, but to pull the combination of arms that maximises the agent's total reward within the budget. As such, the rewards for all arms must be estimated, because any of them may appear in the optimal combination. This difference from existing MABs means that new approaches to maximising the total reward are required. To this end, we propose an ǫ-first algorithm, in which the first ǫ of the budget is used solely to learn the arms' rewards (exploration), while the remaining 1−ǫ is used to maximise the received reward based on those estimates (exploitation). We derive bounds on the algorithm's loss for generic and uniform exploration methods, and compare its performance with traditional MAB algorithms under various distributions of rewards and costs, showing that it outperforms the others by up to 50%.
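The two-phase structure can be sketched directly. The greedy reward-to-cost exploitation rule below is one simple choice I've assumed for illustration; the paper analyses exploitation as a budgeted maximisation over arm combinations:

```python
def epsilon_first(pull, costs, budget, epsilon):
    """ǫ-first policy for the budget-limited MAB: spend ǫ·budget on a
    uniform sweep of the arms, then exploit greedily by pulling the arm
    with the best estimated reward-to-cost ratio that still fits in the
    remaining budget."""
    K = len(costs)
    sums, counts = [0.0] * K, [0] * K
    total, spent, i = 0.0, 0.0, 0
    # exploration phase: cycle through the arms until ǫ·budget is spent
    while spent + costs[i % K] <= epsilon * budget:
        a = i % K
        r = pull(a)
        sums[a] += r
        counts[a] += 1
        spent += costs[a]
        total += r
        i += 1
    # exploitation phase: greedy by estimated reward density
    remaining = budget - spent
    while True:
        feasible = [a for a in range(K)
                    if counts[a] > 0 and costs[a] <= remaining]
        if not feasible:
            return total
        a = max(feasible, key=lambda a: sums[a] / counts[a] / costs[a])
        total += pull(a)
        remaining -= costs[a]

# deterministic toy arms: pull(a) always returns arm a's mean reward
payout = epsilon_first(lambda a: [0.1, 1.0][a], costs=[1.0, 1.0],
                       budget=10.0, epsilon=0.2)
```

With ǫ = 0.2 and budget 10, two pulls go to exploration (one per arm) and the remaining eight all go to the better arm, for a total payout of 9.1 on these deterministic toy arms.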
Almost optimal exploration in multi-armed bandits
 In Proceedings of the 30th International Conference on Machine Learning, June 2013
Abstract

Cited by 17 (1 self)
We study the problem of exploration in stochastic Multi-Armed Bandits. Even in the simplest setting of identifying the best arm, there remains a logarithmic multiplicative gap between the known lower and upper bounds for the number of arm pulls required for the task. This extra logarithmic factor is quite meaningful in today's large-scale applications. We present two novel, parameter-free algorithms for identifying the best arm, in two different settings: given a target confidence and given a target budget of arm pulls, for which we prove upper bounds whose gap from the lower bound is only doubly logarithmic in the problem parameters. We corroborate our theoretical results with experiments demonstrating that our algorithm outperforms the state of the art and scales better as the size of the problem increases.
Efficient Bayes-adaptive reinforcement learning using sample-based search
 In Neural Information Processing Systems
, 2012
Abstract

Cited by 17 (2 self)
Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty. In this setting, a Bayes-optimal policy captures the ideal trade-off between exploration and exploitation. Unfortunately, finding Bayes-optimal policies is notoriously taxing due to the enormous search space in the augmented belief-state MDP. In this paper we exploit recent advances in sample-based planning, based on Monte Carlo tree search, to introduce a tractable method for approximate Bayes-optimal planning. Unlike prior work in this area, we avoid expensive applications of Bayes' rule within the search tree by lazily sampling models from the current beliefs. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems.
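The key device, lazily sampling a model at the root of each simulation instead of applying Bayes' rule inside the tree, can be sketched in miniature. Here the search tree is flattened to a single UCB1 decision over root actions, and `posterior_sample` / `simulate` are illustrative stand-ins for a real posterior and rollout:

```python
import math

def bamcp_choose(posterior_sample, simulate, actions, n_sims):
    """Root sampling: each simulation draws one model from the current
    posterior and uses it for the whole rollout, so no belief updates
    are needed inside the search tree."""
    counts = {a: 0 for a in actions}
    values = {a: 0.0 for a in actions}
    for t in range(1, n_sims + 1):
        model = posterior_sample()           # lazy sampling at the root
        # UCB1 over root actions (untried actions first)
        a = max(actions, key=lambda b:
                float("inf") if counts[b] == 0
                else values[b] / counts[b]
                     + math.sqrt(2.0 * math.log(t) / counts[b]))
        values[a] += simulate(model, a)
        counts[a] += 1
    # recommend the most-simulated root action
    return max(actions, key=lambda a: counts[a])

# toy posterior that is already certain: each "model" maps action -> value
action = bamcp_choose(lambda: {0: 0.1, 1: 0.9},
                      lambda model, a: model[a], [0, 1], 200)
```

On this degenerate posterior the simulations concentrate on the higher-value action, which is then recommended.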