Results 1–10 of 40
Combinatorial Bandits
Abstract

Cited by 46 (6 self)
We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1}^d and suffers a loss that is the sum of the losses of those vector components that equal one. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in the class. We consider the "bandit" setting in which the forecaster has access only to the losses of the chosen vectors. We introduce a new general forecaster achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln |S|), where n is the time horizon. This is not improvable in general and is better than previously known bounds. We also point out that computationally efficient implementations exist for various interesting choices of S.
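The paper's forecaster exploits the structure of S; as a point of contrast, the naive baseline that treats every vector in S as a separate arm of an Exp3-style forecaster can be sketched as follows (the toy action set, horizon, learning rate, and component losses are all assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy action set S: choose exactly one of d = 3 components.
S = np.eye(3, dtype=int)
n, eta = 2000, 0.05                       # horizon and learning rate (assumed)
comp_means = np.array([0.2, 0.5, 0.8])    # mean loss of each component (hypothetical)

log_w = np.zeros(len(S))                  # log-weights over actions in S
total_loss = 0.0
for t in range(n):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    k = rng.choice(len(S), p=p)
    # Bandit feedback: only the loss of the chosen vector is observed.
    comp_losses = (rng.random(3) < comp_means).astype(float)
    loss = float(S[k] @ comp_losses)
    total_loss += loss
    # Importance-weighted estimate charges the loss only to the chosen action.
    log_w[k] -= eta * loss / p[k]
```

Treating each vector as an independent arm ignores shared components, which is why such a baseline pays for the size of S directly, while the paper's forecaster attains the √(nd ln |S|) rate.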
Thompson Sampling for Contextual Bandits with Linear Payoffs.
, 2013
Abstract

Cited by 28 (3 self)
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has better empirical performance than state-of-the-art methods. However, many questions regarding its theoretical performance remain open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandit problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high-probability regret bound that is the best achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of √d (or log(N)) of the information-theoretic lower bound for this problem.
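A minimal sketch of Thompson Sampling with linear payoffs in the spirit of the algorithm analyzed here, using a Gaussian posterior over the unknown parameter (the fixed arm set, noise level, and scale v below are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, v = 2, 500, 0.5                    # dimension, horizon, posterior scale (assumed)
theta_star = np.array([1.0, 0.0])        # unknown parameter (hypothetical)
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # fixed contexts for simplicity

B = np.eye(d)        # regularized design matrix
f = np.zeros(d)      # running sum of reward-weighted contexts
regret = 0.0
for t in range(T):
    mu_hat = np.linalg.solve(B, f)       # ridge estimate of theta
    # Sample a parameter from the Gaussian posterior N(mu_hat, v^2 B^-1).
    theta_tilde = rng.multivariate_normal(mu_hat, v ** 2 * np.linalg.inv(B))
    x = arms[int(np.argmax(arms @ theta_tilde))]   # greedy w.r.t. the sample
    reward = float(x @ theta_star) + 0.1 * rng.standard_normal()
    B += np.outer(x, x)
    f += reward * x
    regret += float((arms @ theta_star).max() - x @ theta_star)
```

The posterior sample, rather than an explicit confidence bonus, drives the exploration: arms that are plausibly optimal under the current posterior keep getting sampled until the data rules them out.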
Learning to Optimize Via Posterior Sampling
, 2013
Abstract

Cited by 18 (8 self)
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the margin dimension, which measures the degree of dependence among action rewards. Compared to Bayes risk bounds for UCB algorithms on specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into the performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
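For independent Bernoulli arms with Beta priors, the posterior sampling scheme discussed here reduces to the classic Thompson Sampling loop (the arm means and horizon below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.3, 0.5, 0.7])    # unknown Bernoulli arm means (hypothetical)
T = 3000
alpha = np.ones(3)                   # Beta(alpha, beta) posterior per arm
beta = np.ones(3)
pulls = np.zeros(3, dtype=int)
for t in range(T):
    theta = rng.beta(alpha, beta)    # one posterior sample per arm
    k = int(np.argmax(theta))        # act greedily w.r.t. the sample
    r = int(rng.random() < means[k]) # observe a Bernoulli reward
    alpha[k] += r                    # conjugate posterior update
    beta[k] += 1 - r
    pulls[k] += 1
```

A UCB algorithm would instead pull the arm maximizing an explicit optimistic index; the paper's connection between the two views is what lets UCB-style regret analyses transfer to this sampling rule.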
Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit
, 2012
Abstract

Cited by 13 (0 self)
We consider a linear stochastic bandit problem where the dimension K of the unknown parameter θ is larger than the sampling budget n. In such cases, it is in general impossible to derive sublinear regret bounds, since usual linear bandit algorithms have a regret in O(K√n). In this paper we assume that θ is S-sparse, i.e., has at most S nonzero components, and that the space of arms is the unit ball for the ℓ2 norm. We combine ideas from Compressed Sensing and Bandit Theory and derive algorithms with regret bounds in O(S√n). We detail an application to the problem of optimizing a function that depends on many variables, of which only a small (initially unknown) number are relevant.
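The explore-then-exploit idea behind this setting can be sketched with hard-thresholded least squares standing in for the paper's compressed-sensing recovery step (the dimensions, sparse parameter, and phase lengths are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
K, S = 50, 2                      # ambient dimension and sparsity (assumed)
theta = np.zeros(K)
theta[[3, 17]] = [0.8, 0.6]       # S-sparse unknown parameter (hypothetical)

# Explore: play random arms near the unit ball, observe noisy linear rewards.
n_explore = 400
X = rng.standard_normal((n_explore, K)) / np.sqrt(K)
y = X @ theta + 0.1 * rng.standard_normal(n_explore)

# Recover the support by hard-thresholding a least-squares estimate
# (a simple stand-in for the compressed-sensing recovery in the paper).
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
support = np.argsort(-np.abs(theta_hat))[:S]

# Exploit: the best arm on the unit ball is theta/||theta||; use the estimate
# restricted to the recovered support.
v = np.zeros(K)
v[support] = theta_hat[support]
x_exploit = v / np.linalg.norm(v)
per_round = float(x_exploit @ theta)   # close to ||theta|| when support is right
```

Once the S relevant coordinates are identified, the problem behaves like an S-dimensional bandit, which is the intuition behind regret scaling with S rather than K.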
Spectral bandits for smooth graph functions
 in Proc. Intern. Conf. Mach. Learning (ICML)
, 2014
Abstract

Cited by 8 (5 self)
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node, and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
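The smoothness assumption can be made concrete with a Laplacian-regularized estimate: penalizing fᵀLf pulls the estimated payoffs of neighboring nodes together, so a few observed nodes already pin down the whole function. A toy sketch on a path graph (the graph, observation set, and regularization weight are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
A = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)  # path-graph adjacency
L = np.diag(A.sum(axis=1)) - A                                 # graph Laplacian
f_true = np.linspace(0.0, 1.0, N)     # smooth payoff: f^T L f is small

obs = np.array([0, 3, 6, 9])          # nodes with observed (noisy) payoffs
y = f_true[obs] + 0.05 * rng.standard_normal(len(obs))

# Minimize sum_{i in obs} (f_i - y_i)^2 + lam * f^T L f, in closed form.
lam = 0.1
Sel = np.zeros((N, N))
Sel[obs, obs] = 1.0                   # selects the observed coordinates
rhs = np.zeros(N)
rhs[obs] = y
f_hat = np.linalg.solve(Sel + lam * L, rhs)
```

With 4 of 10 nodes observed, the remaining entries of f_hat interpolate smoothly along the graph, which is the mechanism that lets a few node evaluations estimate preferences over many items.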
On the complexity of bandit and derivative-free stochastic convex optimization
 CoRR
Abstract

Cited by 8 (0 self)
The problem of stochastic convex optimization with bandit feedback (in the learning community) or without knowledge of gradients (in the optimization community) has received much attention in recent years, in the form of algorithms and performance upper bounds. However, much less is known about the inherent complexity of these problems, and there are few lower bounds in the literature, especially for nonlinear functions. In this paper, we investigate the attainable error/regret in the bandit and derivative-free settings, as a function of the dimension d and the available number of queries T. We provide a precise characterization of the attainable performance for strongly convex and smooth functions, which also implies a nontrivial lower bound for more general problems. Moreover, we prove that in both the bandit and derivative-free settings, the required number of queries must scale at least quadratically with the dimension. Finally, we show that on the natural class of quadratic functions, it is possible to obtain a "fast" O(1/T) error rate in terms of T, under mild assumptions, even without having access to gradients. To the best of our knowledge, this is the first such rate in a derivative-free stochastic setting, and it holds despite previous results which seem to imply the contrary.
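One reason quadratics are special: a two-point function-value estimate of the directional derivative is exactly unbiased for a quadratic, so derivative-free updates behave like noisy gradient descent. A hedged sketch of such a scheme (the objective, step sizes, and noise level are assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, delta = 5, 2000, 0.1               # dimension, queries, perturbation (assumed)
x_star = np.ones(d)                      # unknown minimizer (hypothetical)

def f(x):
    # Strongly convex quadratic observed with a little noise.
    return float(np.sum((x - x_star) ** 2)) + 0.01 * rng.standard_normal()

x = np.zeros(d)
for t in range(1, T + 1):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)               # random direction on the unit sphere
    # Two-point gradient estimate from function values only (two queries).
    g = (f(x + delta * u) - f(x - delta * u)) * (d / (2 * delta)) * u
    x -= g * 0.5 / (t + 20)              # decaying step size, damped early on
err = float(np.sum((x - x_star) ** 2))
```

On non-quadratic functions the same estimator carries an O(delta) bias, which is one way to see why the general derivative-free rates are slower.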
Learning-based optimization of cache content in a small cell base station,” arXiv preprint arXiv:1402.3247
, 2014
Abstract

Cited by 6 (0 self)
Optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity is studied. The sBS has a large cache memory and provides content-level selective offloading by delivering high data rate contents to users in its coverage area. The goal of the sBS content controller (CC) is to store the most popular contents in the sBS cache memory, such that the maximum amount of data can be fetched directly from the sBS without relying on the limited backhaul resources during peak traffic periods. If the popularity profile is known in advance, the problem reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known by the CC, and it can only observe the instantaneous demand for the cached content. Hence, the cache content placement is optimised based on the demand history. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while exploiting the limited cache capacity in the best way possible. Three algorithms are studied for this cache content placement problem, leading to different exploitation-exploration tradeoffs. We provide extensive numerical simulations in order to study the time-evolution of these algorithms, and the impact of system parameters such as the number of files, the number of users, the cache size, and the skewness of the popularity profile on the performance. It is shown that the proposed algorithms quickly learn the popularity profile for a wide range of system parameters.
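The exploitation-exploration tradeoff in cache refreshing can be illustrated with a simple ε-greedy refresh rule (this is a generic sketch, not one of the paper's three algorithms; the Zipf-like profile and all constants are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
F, C, rounds = 20, 5, 400         # files, cache slots, refresh intervals (assumed)
eps, reqs = 0.1, 100              # exploration rate, requests per interval

# Zipf-like popularity profile, unknown to the content controller.
pop = 1.0 / np.arange(1, F + 1)
pop = rng.permutation(pop / pop.sum())

counts = np.zeros(F)              # demand observed while each file was cached
seen = np.ones(F)                 # intervals each file spent cached (smoothed)
hits = 0
for t in range(rounds):
    if rng.random() < eps:
        cache = rng.choice(F, size=C, replace=False)   # explore: random cache
    else:
        cache = np.argsort(-counts / seen)[:C]         # exploit: estimated top-C
    demand = rng.multinomial(reqs, pop)                # this interval's requests
    hits += int(demand[cache].sum())                   # served from the cache
    counts[cache] += demand[cache]                     # demand is observed only
    seen[cache] += 1                                   # for cached files
```

The key constraint from the abstract is visible in the update: demand is observed only for cached files, so a purely greedy rule could permanently ignore a popular file it has never cached.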
Dynamic pricing and learning: historical origins, current research, and new directions. Working paper. Available at http://ssrn.com/abstract=2334429
Abstract

Cited by 5 (2 self)
The topic of dynamic pricing and learning has received a considerable amount of attention in recent years from different scientific communities. We survey these literature streams: we provide a brief introduction to the historical origins of quantitative research on pricing and demand estimation, point to different subfields in the area of dynamic pricing, and provide an in-depth overview of the available literature on dynamic pricing and learning. Our focus is on the operations research and management science literature, but we also discuss relevant contributions from marketing, economics, econometrics, and computer science. We discuss relations with methodologically related research areas, and identify directions for future research.
Thompson Sampling for Complex Online Problems
Abstract

Cited by 5 (0 self)
We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms' rewards, and the feedback observed may not necessarily be the per-arm reward. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely supported priors over the parameter space without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that nontrivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters to compute posterior distributions that lack an explicit closed form, we present numerical results for the performance of Thompson sampling for subset selection and job
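A minimal sketch of Thompson sampling with a discretely supported prior and MAX-of-subset feedback, in the spirit of the setting described here (the grid of candidate parameters, arm means, and subset size are illustrative assumptions):

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(6)
n_arms, K, T = 4, 2, 1500
mu_true = np.array([0.2, 0.3, 0.6, 0.7])    # Bernoulli arm means (hypothetical)

# Discretely supported prior: a grid of candidate mean vectors.
grid = np.array(list(product([0.2, 0.5, 0.8], repeat=n_arms)))
log_post = np.zeros(len(grid))               # uniform prior, kept in log space
subsets = list(combinations(range(n_arms), K))

def p_max(mu, s):
    # P(max of independent Bernoulli rewards over subset s equals 1).
    return 1.0 - float(np.prod(1.0 - mu[list(s)]))

reward_sum = 0
for t in range(T):
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta = grid[rng.choice(len(grid), p=post)]        # sample a parameter
    s = max(subsets, key=lambda ss: p_max(theta, ss))  # best subset under sample
    y = int(rng.random() < p_max(mu_true, s))          # observe only the MAX
    reward_sum += y
    # Posterior update via the likelihood of the observed max (no closed form
    # per-arm posterior is needed: the discrete prior is reweighted directly).
    q = np.array([p_max(g, s) for g in grid])
    lik = q if y else 1.0 - q
    log_post += np.log(np.clip(lik, 1e-12, None))
```

For larger parameter spaces the exact reweighting above becomes infeasible, which is where the particle-filter approximation mentioned in the abstract comes in.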
Eluder dimension and the sample complexity of optimistic exploration
 In Advances in Neural Information Processing Systems
, 2013
Abstract

Cited by 5 (1 self)
This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models.