Results 1  10
of
42
Contextual Bandits with Similarity Information
 24TH ANNUAL CONFERENCE ON LEARNING THEORY
, 2011
"... In a multiarmed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a timeinvariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work ha ..."
Abstract

Cited by 57 (9 self)
 Add to MetaCart
In a multiarmed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a timeinvariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the contextarm pairs which bounds from above the difference between the respective expected payoffs. Prior work
Combinatorial Bandits
"... We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1} d and suffers a loss that is the sum of the losses of those vector components that equal to one. The goal of the forecaster is to achieve that, in the l ..."
Abstract

Cited by 46 (6 self)
 Add to MetaCart
(Show Context)
We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1} d and suffers a loss that is the sum of the losses of those vector components that equal to one. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in the class. We consider the “bandit ” setting in which the forecaster has only access to the losses of the chosen vectors. We introduce a new general forecaster achieving a regret bound that, for a variety of concrete choices of S, is of order √ nd ln S  where n is the time horizon. This is not improvable in general and is better than previously known bounds. We also point out that computationally efficient implementations for various interesting choices of S exist. 1
Improved algorithms for linear stochastic bandits
 In Advances in Neural Information Processing Systems
, 2011
"... We improve the theoretical analysis and empirical performance of algorithms for the stochastic multiarmed bandit problem and the linear stochastic multiarmed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability consta ..."
Abstract

Cited by 40 (1 self)
 Add to MetaCart
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multiarmed bandit problem and the linear stochastic multiarmed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vectorvalued martingales. 1
Thompson Sampling for Contextual Bandits with Linear Payoffs.
, 2013
"... Abstract Thompson Sampling is one of the oldest heuristics for multiarmed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the stateofthear ..."
Abstract

Cited by 28 (3 self)
 Add to MetaCart
(Show Context)
Abstract Thompson Sampling is one of the oldest heuristics for multiarmed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the stateoftheart methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multiarmed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of , which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of √ d (or log(N )) of the informationtheoretic lower bound for this problem.
Optimistic Bayesian sampling in contextualbandit problems
, 2011
"... In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the explorationexploitation dilemma in a general setting encompassing both standard and con ..."
Abstract

Cited by 20 (1 self)
 Add to MetaCart
In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the explorationexploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise clickthrough rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout.
Contextual Gaussian Process Bandit Optimization
"... How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at ..."
Abstract

Cited by 15 (2 self)
 Add to MetaCart
(Show Context)
How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the contextaction space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint contextaction space, and develop CGPUCB, an intuitive upperconfidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGPUCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that contextsensitive optimization outperforms no or naive use of context. 1
Linear Submodular Bandits and their Application to Diversified Retrieval
"... Diversified retrieval and online learning are two core research areas in the design of modern information retrieval systems. In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of featurerich submodular utility models f ..."
Abstract

Cited by 14 (2 self)
 Add to MetaCart
(Show Context)
Diversified retrieval and online learning are two core research areas in the design of modern information retrieval systems. In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of featurerich submodular utility models for diversified retrieval. We present an algorithm, called LSBGREEDY, and prove that it efficiently converges to a nearoptimal model. As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. In a live user study, we found that LSBGREEDY significantly outperforms existing online learning approaches. 1
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits
, 2014
"... We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes an action in response to the observed context, observing the reward only for that action. Our method assumes access to an oracle for solving costsensitive classification problems and achieves t ..."
Abstract

Cited by 9 (2 self)
 Add to MetaCart
(Show Context)
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes an action in response to the observed context, observing the reward only for that action. Our method assumes access to an oracle for solving costsensitive classification problems and achieves the statistically optimal regret guarantee with only Õ( √ T) oracle calls across all T rounds. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes. We further conduct a proofofconcept experiment which demonstrates the excellent computational and prediction performance of (an online variant of) our algorithm relative to several baselines.
Spectral bandits for smooth graph functions
 in Proc. Intern. Conf. Mach. Learning (ICML
, 2014
"... Abstract Smooth functions on graphs have wide applications in manifold and semisupervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as contentbased re ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
(Show Context)
Abstract Smooth functions on graphs have wide applications in manifold and semisupervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as contentbased recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret with respect to the optimal policy would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in realworld graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on realworld content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of nodes evaluations.
Online Learning for Linearly Parametrized Control Problems
, 2012
"... Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of ..."
Abstract

Cited by 6 (5 self)
 Add to MetaCart
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author’s prior written permission.