Results 1–10 of 30
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms
In WSDM, 2011
Cited by 79 (18 self)
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences, but very challenging due to their "partial-label" nature. Common practice is to create a simulator that simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
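The replay idea described in this abstract can be sketched in a few lines. The function name and data layout below are illustrative assumptions, not the authors' code, and the estimate is unbiased only when the logged actions were chosen uniformly at random:

```python
def replay_evaluate(choose_action, logged_events):
    """Estimate a policy's per-round reward from logged bandit data.

    A sketch of the replay idea (not the authors' implementation):
    logged_events is a list of (context, logged_action, reward) triples
    collected under uniformly random action selection.  An event counts
    only when the policy's choice matches the logged action; because
    logging was uniform, the matching events form an unbiased sample of
    the policy's own interaction stream.
    """
    total, matched = 0.0, 0
    for context, logged_action, reward in logged_events:
        if choose_action(context) == logged_action:
            total += reward
            matched += 1
    return total / matched if matched else 0.0
```

For example, on a log where the reward is 1 exactly when the action equals the context, the identity policy evaluates to 1.0 while a constant policy evaluates to its true average.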
Efficient Optimal Learning for Contextual Bandits
Cited by 30 (3 self)
We address the problem of learning in an online setting where the learner repeatedly observes features x, selects among K actions, and receives reward r for the action taken. We provide the first efficient algorithm with optimal regret. Our algorithm uses an oracle which returns an optimal policy given rewards for all actions for each x. The algorithm has running time polylog(N), where N is the number of policies that we compete with. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive, rather than multiplicative, in the feedback delay, as in all previous work.
Learning to Optimize Via Posterior Sampling
2013
Cited by 18 (8 self)
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the margin dimension, which measures the degree of dependence among action rewards. Compared to UCB-algorithm Bayes risk bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
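The posterior sampling algorithm analyzed in this paper can be illustrated in its simplest Beta-Bernoulli form (a minimal sketch under standard assumptions; the paper's setting is far more general):

```python
import random

def thompson_sampling(pull, n_arms, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling (illustrative sketch).

    `pull(arm)` returns a 0/1 reward.  Each arm keeps a
    Beta(wins + 1, losses + 1) posterior over its reward probability;
    each round we draw one sample from every posterior and play the arm
    with the largest draw, so exploration falls out of posterior
    uncertainty rather than explicit confidence bounds.  Returns the
    cumulative reward.
    """
    rng = random.Random(seed)
    wins = [0] * n_arms
    losses = [0] * n_arms
    total = 0
    for _ in range(n_rounds):
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = pull(arm)
        total += reward
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return total
```

On a two-armed instance where one arm always pays 1 and the other always pays 0, the posterior of the losing arm collapses quickly and nearly all later pulls go to the winner.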
PAC-Bayesian inequalities for martingales
In IEEE Transactions on Information Theory, 2012
Cited by 14 (4 self)
We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian (probably approximately correct) analysis in learning theory from the i.i.d. setting to martingales, opening the way for its application to importance-weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the [0, 1] interval by the expectation of the same function of independent Bernoulli random variables. This inequality is applied to derive a tighter analog of the Hoeffding–Azuma inequality. Index Terms—Bernstein's inequality, Hoeffding–Azuma inequality, martingales, PAC-Bayesian bounds.
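For context, the classical Hoeffding–Azuma inequality that this comparison inequality tightens can be stated as follows (the standard textbook form, not the paper's refined bound): for a martingale difference sequence \(X_1, \dots, X_n\) with \(|X_i| \le c_i\) almost surely,

```latex
\Pr\!\left( \left| \sum_{i=1}^{n} X_i \right| \ge \varepsilon \right)
  \;\le\; 2 \exp\!\left( - \frac{\varepsilon^2}{2 \sum_{i=1}^{n} c_i^2} \right).
```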
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits
2014
Cited by 9 (2 self)
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes an action in response to the observed context, observing the reward only for that action. Our method assumes access to an oracle for solving cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only Õ(√T) oracle calls across all T rounds. By doing so, we obtain the most practical contextual bandit learning algorithm among approaches that work for general policy classes. We further conduct a proof-of-concept experiment which demonstrates the excellent computational and prediction performance of (an online variant of) our algorithm relative to several baselines.
Learning hurdles for sleeping experts
In Innovations in Theoretical Computer Science, 2012
Cited by 7 (1 self)
We study the online decision problem in which the set of available actions varies over time, also called the sleeping experts problem. We consider the setting where the performance comparison is made with respect to the best ordering of actions in hindsight. In this paper, both the payoff function and the availability of actions are adversarial. Kleinberg et al. (2008) gave a computationally efficient no-regret algorithm in the setting where payoffs are stochastic. Kanade et al. (2009) gave an efficient no-regret algorithm in the setting where action availability is stochastic. However, the question of whether there exists a computationally efficient no-regret algorithm in the adversarial setting was posed as an open problem by Kleinberg et al. (2008). We show that such an algorithm would imply an algorithm for PAC learning DNF, a long-standing open problem. We also consider the setting where the number of available actions is restricted, and study its relation to agnostically learning monotone disjunctions over examples with bounded Hamming weight.
PAC-Bayesian Analysis of Contextual Bandits
Cited by 6 (3 self)
We derive an instantaneous (per-round) data-dependent regret bound for stochastic multi-armed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N · I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows one to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it actually uses. We also present an algorithm for multi-armed bandits with side information with O(K) computational complexity per game round.
Eluder dimension and the sample complexity of optimistic exploration
In Advances in Neural Information Processing Systems, 2013
Cited by 5 (1 self)
This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB-algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models.
PAC-Bayesian analysis of martingales and multi-armed bandits
http://arxiv.org/abs/1105.2416, 2011
Cited by 4 (2 self)
We present two alternative ways to apply PAC-Bayesian analysis to sequences of dependent random variables. The first is based on a new lemma that makes it possible to bound expectations of convex functions of certain dependent random variables by expectations of the same functions of independent Bernoulli random variables. This lemma provides an alternative to the Hoeffding–Azuma inequality for bounding the concentration of martingale values. Our second approach is based on integration of the Hoeffding–Azuma inequality with PAC-Bayesian analysis. We also introduce a way to apply PAC-Bayesian analysis in situations of limited feedback. We combine the new tools to derive PAC-Bayesian generalization and regret bounds for the multi-armed bandit problem. Although our regret bound is not yet as tight as state-of-the-art regret bounds based on other well-established techniques, our results significantly expand the range of potential applications of PAC-Bayesian analysis and introduce a new analysis tool to reinforcement learning and many other fields where martingales and limited feedback are encountered.
PAC-Bayes-Bernstein inequality for martingales and its application to multi-armed bandits
In JMLR Workshop and Conference Proceedings
Cited by 4 (2 self)
We develop a new tool for data-dependent analysis of the exploration-exploitation trade-off in learning under limited feedback. Our tool is based on two main ingredients. The first ingredient is a new concentration inequality that makes it possible to control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. The second ingredient is an application of this inequality to the exploration-exploitation trade-off via importance-weighted sampling. We apply the new tool to the stochastic multi-armed bandit problem; however, the main importance of this paper is the development and understanding of the new tool rather than improvement of existing algorithms for stochastic multi-armed bandits. In follow-up work we demonstrate that the new tool can improve over the state of the art in structurally richer problems, such as stochastic multi-armed bandits with side information (Seldin et al., 2011a).
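The importance-weighted sampling ingredient mentioned in this abstract can be illustrated by the standard inverse-propensity reward estimate for bandits (a generic sketch of the technique, not the paper's construction; the function name and data layout are assumptions):

```python
def importance_weighted_estimates(history, n_arms):
    """Inverse-propensity estimates of each arm's cumulative reward.

    history: list of (arm, reward, prob) triples, where `prob` is the
    probability with which the played arm was selected on that round.
    Dividing each observed reward by that probability makes every
    round's contribution an unbiased estimate of that arm's reward,
    so each arm's total estimates the reward it would have accumulated
    had it been played every round, even though only one arm's reward
    is observed per round.
    """
    estimates = [0.0] * n_arms
    for arm, reward, prob in history:
        estimates[arm] += reward / prob
    return estimates
```

Dividing by small selection probabilities is exactly what inflates the variance of these weighted averages, which is why martingale concentration inequalities like the one above are needed to control them.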