Results 1–10 of 22
Combinatorial Bandits
Abstract

Cited by 46 (6 self)
We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1}^d and suffers a loss that is the sum of the losses of those vector components that are equal to one. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in the class. We consider the “bandit” setting in which the forecaster only has access to the losses of the chosen vectors. We introduce a new general forecaster achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln |S|), where n is the time horizon. This is not improvable in general and is better than previously known bounds. We also point out that computationally efficient implementations exist for various interesting choices of S.
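To make the loss model concrete, here is a minimal pure-Python sketch of the setting; the action set, per-component losses, and horizon are all illustrative, and this is not the paper's forecaster:

```python
import random
from itertools import combinations

random.seed(0)
d, n = 4, 1000

# Hypothetical action set S: all binary vectors in {0,1}^d with exactly two ones.
S = [tuple(1 if i in c else 0 for i in range(d)) for c in combinations(range(d), 2)]

# Per-component losses for each round (random here, purely for illustration).
losses = [[random.random() for _ in range(d)] for _ in range(n)]

def loss(v, l):
    # The loss of playing v is the sum of l over the components of v equal to one.
    return sum(li for vi, li in zip(v, l) if vi == 1)

# Cumulative loss of one fixed vector vs. the best vector in hindsight.
total = sum(loss(S[0], l) for l in losses)
best = min(sum(loss(v, l) for l in losses) for v in S)
regret = total - best
```

In the bandit setting the forecaster would only see `loss(v, l)` for the vector it played, not the component losses themselves.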
Combinatorial Multi-Armed Bandit: General Framework, Results and Applications
Abstract

Cited by 29 (4 self)
We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where simple arms with unknown distributions form super arms. In each round, a super arm is played and the outcomes of its related simple arms are observed, which helps the selection of super arms in future rounds. The reward of the super arm depends on the outcomes of the played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an (α, β)-approximation oracle that takes the means of the distributions of arms and outputs a super arm that, with probability β, generates an α fraction of the optimal expected reward. The objective of a CMAB algorithm is to minimize the (α, β)-approximation regret, which is the difference in total expected reward between the αβ fraction of the expected reward of always playing the optimal super arm and the expected reward of playing super arms according to the algorithm. We provide the CUCB algorithm, which achieves O(log n) regret, where n is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound for the classical MAB problem up to a constant factor, and it significantly improves the regret bound of earlier work.
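A minimal sketch of a CUCB-style loop, assuming Bernoulli base arms and using an exact top-k selector as a stand-in for the (α, β)-approximation oracle; all numbers are illustrative, not from the paper:

```python
import math
import random

random.seed(1)
m, k, n_rounds = 5, 2, 2000
true_means = [0.2, 0.4, 0.5, 0.7, 0.9]  # unknown to the learner (toy values)

def oracle(mu):
    # Stand-in for the (α, β)-approximation oracle: here an exact top-k selector.
    return sorted(range(m), key=lambda i: -mu[i])[:k]

counts = [0] * m
means = [0.0] * m
for t in range(1, n_rounds + 1):
    # Upper confidence bound for each base arm (unplayed arms get +inf).
    ucb = [means[i] + math.sqrt(1.5 * math.log(t) / counts[i]) if counts[i] else float("inf")
           for i in range(m)]
    super_arm = oracle(ucb)
    # Play the super arm; the outcomes of all its base arms are observed.
    for i in super_arm:
        x = 1.0 if random.random() < true_means[i] else 0.0
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
```

Over time the oracle is fed increasingly accurate confidence bounds and the two best base arms dominate the play counts.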
From bandits to experts: A tale of domination and independence
NIPS, 2013
Abstract

Cited by 11 (1 self)
We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.
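The core mechanism, importance-weighting each observed loss by the probability of observing it, can be sketched as a toy Exp3 variant on an undirected observability graph; the graph and the (deterministic) expected losses are made up for illustration:

```python
import math
import random

random.seed(0)
K, T, eta = 3, 500, 0.05
# Hypothetical undirected observability graph as neighbor sets with self-loops:
# playing any neighbor of arm i reveals the loss of arm i.
nbrs = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
# Deterministic per-round expected losses (toy values; arm 2 is best).
mean_loss = [0.6, 0.6, 0.1]

w = [1.0] * K
for t in range(T):
    total = sum(w)
    p = [wi / total for wi in w]
    arm = random.choices(range(K), weights=p)[0]
    for i in range(K):
        if arm in nbrs[i]:
            # Importance-weight by the probability that arm i was observed.
            obs_prob = sum(p[j] for j in nbrs[i])
            w[i] *= math.exp(-eta * mean_loss[i] / obs_prob)
```

The importance weights keep the loss estimates unbiased even though only some arms are observed each round.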
Leveraging Side Observations in Stochastic Bandits
Abstract

Cited by 10 (3 self)
This paper considers stochastic bandits with side observations, a model that accounts for both the exploration/exploitation dilemma and relationships between arms. In this setting, after pulling an arm i, the decision maker also observes the rewards for some other actions related to i. We will see that this model is suited to content recommendation in social networks, where users’ reactions may be endorsed or not by their friends. We provide efficient algorithms based on upper confidence bounds (UCBs) to leverage this additional information and derive new bounds improving on standard regret guarantees. We also evaluate these policies in the context of movie recommendation in social networks: experiments on real datasets show substantial learning-rate speedups ranging from 2.2× to 14× on dense networks.
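A UCB-style sketch of the side-observation idea: pulling an arm also reveals the rewards of its related arms, so their statistics are updated for free. The relationship graph and the mean rewards are invented for illustration and this is not the paper's exact policy:

```python
import math
import random

random.seed(2)
K, T = 4, 1500
mu = [0.2, 0.3, 0.8, 0.5]  # unknown mean rewards (toy values)
# Hypothetical relationship graph: pulling arm a also reveals side_obs[a].
side_obs = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}

counts = [0] * K
means = [0.0] * K
pulls = [0] * K
for t in range(1, T + 1):
    ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i]) if counts[i] else float("inf")
           for i in range(K)]
    a = ucb.index(max(ucb))
    pulls[a] += 1
    # Side observations: the rewards of related arms are revealed as well.
    for i in side_obs[a]:
        r = 1.0 if random.random() < mu[i] else 0.0
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
```

Because each arm's estimate is refined by its neighbors' pulls, the confidence intervals shrink faster than in the standard bandit setting.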
Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.
Journal of Machine Learning Research, 2016
Abstract

Cited by 5 (3 self)
We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played, and the base arms contained in the super arm are played and their outcomes are observed. We further consider the extension in which more base arms could be probabilistically triggered based on the outcomes of already triggered arms. The reward of the super arm depends on the outcomes of all played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an offline (α, β)-approximation oracle that takes the means of the outcome distributions of arms and outputs a super arm that, with probability β, generates an α fraction of the optimal expected reward. The objective of an online learning algorithm for CMAB is to minimize the (α, β)-approximation regret, which is the difference in total expected reward between the αβ fraction of the expected reward of always playing the optimal super arm and the expected reward of playing super arms according to the algorithm. We provide the CUCB algorithm, which achieves O(log n) distribution-dependent regret, where n is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound of the UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits. A preliminary version of this paper appeared in ICML.
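The probabilistic-triggering process can be illustrated on its own, separately from the learning loop. The triggering structure and probabilities below are toy values, not from the paper:

```python
import random

random.seed(3)
# Toy triggering structure: playing base arm 0 may trigger arm 2, and a
# triggered arm 2 may in turn trigger arm 3 (probabilities are illustrative).
trigger = {0: [(2, 0.5)], 1: [], 2: [(3, 0.3)], 3: []}

def play(super_arm):
    """Return the set of all base arms played or probabilistically triggered."""
    frontier, played = list(super_arm), set(super_arm)
    while frontier:
        i = frontier.pop()
        for j, prob in trigger[i]:
            if j not in played and random.random() < prob:
                played.add(j)
                frontier.append(j)
    return played

rounds = [play({0, 1}) for _ in range(1000)]
frac2 = sum(2 in s for s in rounds) / len(rounds)
```

A learner in this setting observes the outcomes of every triggered arm, not just the arms in the chosen super arm, which is exactly the extra feedback the extension exploits.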
MULTI-ARMED BANDIT PROBLEMS UNDER DELAYED FEEDBACK
2012
Abstract

Cited by 4 (0 self)
In this thesis, the multi-armed bandit (MAB) problem in online learning is studied when the feedback information is not observed immediately but rather after arbitrary, unknown, random delays. In the “stochastic” setting, when the rewards come from a fixed distribution, an algorithm is given that uses a non-delayed MAB algorithm as a black box. We also give a method to generalize the theoretical guarantees of non-delayed UCB-type algorithms to the delayed stochastic setting. Assuming the delays are independent of the rewards, we upper bound the penalty in the performance of these algorithms (measured by “regret”) by an additive term depending on the delays. When the rewards are chosen in an adversarial manner, we give a black-box style algorithm using multiple instances of a non-delayed algorithm.
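The black-box idea in the stochastic setting can be sketched by buffering undelivered rewards around a base algorithm. Epsilon-greedy stands in for the base MAB algorithm here, and the delay distribution and means are illustrative, not the thesis's construction:

```python
import heapq
import random

random.seed(4)
K, T = 3, 3000
mu = [0.3, 0.5, 0.7]          # unknown mean rewards (toy values)
counts = [0] * K
means = [0.0] * K
pending = []                  # min-heap of (arrival_round, arm, reward)

for t in range(T):
    # Deliver any feedback whose random delay has elapsed.
    while pending and pending[0][0] <= t:
        _, a, r = heapq.heappop(pending)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
    # Base algorithm (epsilon-greedy here) runs on the delivered rewards only.
    if random.random() < 0.1 or not any(counts):
        a = random.randrange(K)
    else:
        a = max(range(K), key=lambda i: means[i] if counts[i] else 1.0)
    r = 1.0 if random.random() < mu[a] else 0.0
    delay = random.randrange(0, 50)   # arbitrary bounded random delay
    heapq.heappush(pending, (t + delay, a, r))
```

Because rewards only arrive late, the learner acts on stale statistics, which is the source of the additive delay-dependent regret penalty.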
Online learning with feedback graphs: Beyond bandits
arXiv preprint arXiv:1502.07617, 2015
Abstract

Cited by 3 (1 self)
We study a general class of online learning problems where the feedback is specified by a graph. This class includes online prediction with expert advice and the multi-armed bandit problem, but also several learning problems where the online player does not necessarily observe his own loss. We analyze how the structure of the feedback graph controls the inherent difficulty of the induced T-round learning problem. Specifically, we show that any feedback graph belongs to one of three classes: strongly observable graphs, weakly observable graphs, and unobservable graphs. We prove that the first class induces learning problems with Θ(α^(1/2) T^(1/2)) minimax regret, where α is the independence number of the underlying graph; the second class induces problems with Θ(δ^(1/3) T^(2/3)) minimax regret, where δ is the domination number of a certain portion of the graph; and the third class induces problems with linear minimax regret. Our results subsume much of the previous work on learning with feedback graphs and reveal new connections to partial monitoring games. We also show how the regret is affected if the graphs are allowed to vary with time.
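The three-way classification follows directly from the definitions: a vertex is observable if some vertex (possibly itself) has an edge into it, and strongly observable if it has a self-loop or receives edges from all other vertices. A small sketch, with an illustrative edge set:

```python
def classify(V, E):
    """Classify a directed feedback graph into the three regimes above.
    E is a set of directed edges (u, v): playing u reveals the loss of v."""
    def in_nbrs(v):
        return {u for (u, w) in E if w == v}
    if not all(in_nbrs(v) for v in V):
        return "unobservable"        # linear minimax regret
    if all((v, v) in E or in_nbrs(v) >= V - {v} for v in V):
        return "strongly observable" # Θ(α^(1/2) T^(1/2)) minimax regret
    return "weakly observable"       # Θ(δ^(1/3) T^(2/3)) minimax regret

# Example: self-loops on 0 and 1, plus an edge revealing vertex 2's loss.
result = classify({0, 1, 2}, {(0, 0), (1, 1), (0, 2)})
```

The bandit problem corresponds to a graph with only self-loops (strongly observable), and the experts problem to the complete graph with self-loops.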
Cascading bandits: Learning to rank in the cascade model
 In Proceedings of the 32nd International Conference on Machine Learning
Abstract

Cited by 3 (1 self)
A search engine usually outputs a list of K web pages. The user examines this list, from the first web page to the last, and chooses the first attractive page. This model of user behavior is known as the cascade model. In this paper, we propose cascading bandits, a learning variant of the cascade model where the objective is to identify the K most attractive items. We formulate our problem as a stochastic combinatorial partial monitoring problem. We propose two algorithms for solving it, CascadeUCB1 and CascadeKL-UCB. We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits. The lower bound matches the upper bound of CascadeKL-UCB up to a logarithmic factor. We experiment with our algorithms on several problems. The algorithms perform surprisingly well even when our modeling assumptions are violated.
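A CascadeUCB1-style loop can be sketched as follows, assuming Bernoulli attraction with toy probabilities (not the authors' code). The key point is the cascade feedback: items up to and including the click are observed, items after it are not:

```python
import math
import random

random.seed(5)
num_items, K, T = 6, 2, 3000
attr = [0.7, 0.6, 0.3, 0.3, 0.2, 0.1]  # unknown attraction probabilities (toy)

n = [0] * num_items
w_hat = [0.0] * num_items
for t in range(1, T + 1):
    ucb = [w_hat[e] + math.sqrt(1.5 * math.log(t) / n[e]) if n[e] else float("inf")
           for e in range(num_items)]
    ranked = sorted(range(num_items), key=lambda e: -ucb[e])[:K]
    # The user scans the list and clicks the first attractive item, if any.
    click = next((k for k, e in enumerate(ranked) if random.random() < attr[e]), None)
    # Cascade feedback: items up to and including the click are observed.
    upto = (click + 1) if click is not None else K
    for k in range(upto):
        e = ranked[k]
        r = 1.0 if k == click else 0.0
        n[e] += 1
        w_hat[e] += (r - w_hat[e]) / n[e]
```

Items examined before a click are known to be unattractive in that round, which keeps the attraction estimates unbiased despite the partial feedback.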
Reducing Dueling Bandits to Cardinal Bandits
2014
Abstract

Cited by 3 (1 self)
We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form “A is preferred to B” (as opposed to cardinal feedback like “A has value 2.5”), giving it wide applicability in learning from implicit user feedback and revealed and stated preferences. In contrast to existing algorithms for the Dueling Bandits problem, our reductions – named Doubler, MultiSBM and Sparring – provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting. For Doubler and MultiSBM we prove regret upper bounds in both finite and infinite settings, and we conjecture about the performance of Sparring, which empirically outperforms the other two as well as previous algorithms in our experiments. In addition, we provide the first almost optimal regret bound in terms of second-order terms, such as the differences between the values of the arms.
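The Sparring reduction can be sketched with two independent cardinal bandit instances dueling each other; each treats its binary duel outcome as an ordinary bandit reward. The utilities and the linear preference link below are assumptions for illustration:

```python
import math
import random

random.seed(6)
K, T = 3, 10000
u = [0.1, 0.4, 0.9]   # hypothetical utilities behind the preference oracle

def beats(a, b):
    # P(a beats b) via a linear link (an assumption for illustration).
    return random.random() < 0.5 + (u[a] - u[b]) / 2

class UCB1:
    def __init__(self, K):
        self.n = [0] * K
        self.m = [0.0] * K
        self.t = 0
    def pick(self):
        self.t += 1
        return max(range(len(self.n)),
                   key=lambda i: (self.m[i] + math.sqrt(2 * math.log(self.t) / self.n[i])
                                  if self.n[i] else float("inf")))
    def update(self, i, r):
        self.n[i] += 1
        self.m[i] += (r - self.m[i]) / self.n[i]

# Sparring: the left player duels the right player; neither sees utilities,
# only who won each duel.
left, right = UCB1(K), UCB1(K)
for _ in range(T):
    a, b = left.pick(), right.pick()
    win = beats(a, b)
    left.update(a, 1.0 if win else 0.0)
    right.update(b, 0.0 if win else 1.0)
```

Each instance faces a non-stationary opponent, which is why the paper only conjectures about Sparring's regret while proving bounds for Doubler and MultiSBM.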