Results 1–10 of 177
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms
In WSDM, 2011
Cited by 79 (18 self)
Abstract
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
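The replay estimator this abstract describes can be sketched in a few lines: stream logged events collected under a uniformly random logging policy, and count an event only when the candidate algorithm picks the same arm that was actually logged. This is an illustrative reconstruction, not the paper's code; the `policy` object with `select`/`update` methods is an assumed interface.

```python
def replay_evaluate(policy, logged_events):
    """Replay evaluation sketch: logged_events is a stream of
    (context, logged_arm, reward) triples recorded under a uniformly
    random logging policy. An event is usable only when the candidate
    policy chooses the same arm that was actually shown; only then is
    the reward revealed and the policy updated."""
    total_reward, matched = 0.0, 0
    for context, logged_arm, reward in logged_events:
        chosen = policy.select(context)
        if chosen == logged_arm:          # event matches: reward is revealed
            policy.update(context, chosen, reward)
            total_reward += reward
            matched += 1
    return total_reward / max(matched, 1)  # average reward over matched events
```

Because the logging policy is uniform, matches occur at a rate independent of the candidate policy, which is what makes the per-match average an unbiased estimate of online performance.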
An Empirical Evaluation of Thompson Sampling
Cited by 72 (6 self)
Abstract
Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.
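As the abstract notes, the heuristic is very easy to implement. A minimal sketch for Bernoulli-reward arms (Beta(1,1) priors assumed; the `pull` callback standing in for the environment is an illustrative interface, not from the paper):

```python
import random

def thompson_bernoulli(pull, n_arms, horizon, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1,1) priors:
    sample a plausible mean for each arm from its posterior, play the
    argmax, then update that arm's posterior with the observed reward."""
    rng = random.Random(seed)
    succ = [1] * n_arms   # Beta alpha parameters (successes + 1)
    fail = [1] * n_arms   # Beta beta parameters (failures + 1)
    total = 0.0
    for _ in range(horizon):
        samples = [rng.betavariate(succ[a], fail[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        r = pull(arm)               # observed 0/1 reward from the environment
        succ[arm] += r
        fail[arm] += 1 - r
        total += r
    return total
```

Arms with poor posteriors are sampled less and less often, so exploration fades automatically as evidence accumulates.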
Knows What It Knows: A Framework For Self-Aware Learning
Cited by 71 (20 self)
Abstract
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and open problems.
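A toy illustration of the KWIK protocol (my example, not from the paper): a memorization learner for an arbitrary deterministic function on a finite input set. It may answer "⊥" ("I don't know"), after which the true label is revealed; any non-⊥ answer must be correct, and the KWIK bound is the number of ⊥ outputs, here at most the number of distinct inputs.

```python
BOTTOM = "⊥"  # the KWIK "I don't know" symbol

class KWIKMemorizer:
    """Toy KWIK learner: answers only when certain (the input has been
    seen before with its label), otherwise outputs ⊥ and then receives
    the true label via observe()."""
    def __init__(self):
        self.memory = {}

    def predict(self, x):
        return self.memory.get(x, BOTTOM)

    def observe(self, x, y):
        # The label is revealed only after a ⊥ prediction.
        self.memory[x] = y
```

The interesting KWIK classes in the paper have bounds far smaller than the input space; this sketch only illustrates the protocol itself.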
Combinatorial Bandits
Cited by 46 (6 self)
Abstract
We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1}^d and suffers a loss that is the sum of the losses of those vector components that equal one. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in the class. We consider the "bandit" setting, in which the forecaster only has access to the losses of the chosen vectors. We introduce a new general forecaster achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln |S|), where n is the time horizon. This is not improvable in general and is better than previously known bounds. We also point out that computationally efficient implementations for various interesting choices of S exist.
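A forecaster in this spirit can be sketched as follows (my reconstruction with uniform exploration mixing; the parameter values are illustrative, not the paper's tuned choices): exponential weights over S, where the per-component loss estimate is recovered from the single observed scalar via the pseudo-inverse of the co-occurrence matrix P = E[V Vᵀ].

```python
import numpy as np

def comband_sketch(S, losses, horizon, eta=0.05, gamma=0.1, seed=0):
    """Exponential weights over a set S of binary vectors under bandit
    feedback: only the total loss of the played vector is observed, and
    an unbiased per-component estimate is reconstructed through the
    pseudo-inverse of the co-occurrence matrix of the sampling law."""
    rng = np.random.default_rng(seed)
    S = np.asarray(S, dtype=float)              # |S| x d action matrix
    m, d = S.shape
    w = np.ones(m)
    for t in range(horizon):
        p = (1 - gamma) * w / w.sum() + gamma / m   # mix in uniform exploration
        k = rng.choice(m, p=p)
        v = S[k]
        total_loss = float(losses[t] @ v)            # only this scalar is observed
        P = (S * p[:, None]).T @ S                   # E[V V^T] under p
        loss_est = total_loss * (np.linalg.pinv(P) @ v)  # component-wise estimate
        w *= np.exp(-eta * (S @ loss_est))
    return w
```

When S is the set of standard basis vectors this reduces to ordinary importance-weighted exponential weights for the multi-armed bandit.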
Contextual Bandits with Linear Payoff Functions
Cited by 42 (4 self)
Abstract
In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d-dimensional feature vectors, we prove an O(√(T d ln³(K T ln(T)/δ))) regret bound that holds with probability 1 − δ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of Ω(√(T d)) for this setting, matching the upper bound up to logarithmic factors.
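The upper confidence bound algorithm referred to here belongs to the LinUCB family. A hedged sketch (ridge regression per arm plus an uncertainty bonus; the fixed `alpha` is an illustrative exploration parameter, not the paper's exact confidence schedule):

```python
import numpy as np

class LinUCBSketch:
    """Per-arm ridge regression with an upper-confidence exploration
    bonus: play the arm whose optimistic payoff estimate is largest."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # d x d design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]  # reward-weighted features

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge estimate of payoffs
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty of estimate
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The bonus shrinks as an arm's design matrix accumulates observations, so exploration is automatically directed at arms whose payoff estimates are still uncertain.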
Improved algorithms for linear stochastic bandits
In Advances in Neural Information Processing Systems, 2011
Cited by 40 (1 self)
Abstract
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer's UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), and Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.
Multi-domain collaborative filtering
In UAI, 2010
Cited by 28 (1 self)
Abstract
In this paper, we study collaborative filtering (CF) in an interactive setting, in which a recommender system continuously recommends items to individual users and receives interactive feedback. Whilst users enjoy sequential recommendations, the recommendation predictions are constantly refined using up-to-date feedback on the recommended items. Bringing the interactive mechanism back to the CF process is fundamental, because the ultimate goal of a recommender system is the discovery of interesting items for individual users, and yet users' personal preferences and contexts evolve over time during their interactions with the system. This requires us not to distinguish between the stages of collecting information to construct the user profile and making recommendations, but to seamlessly integrate these stages during the interactive process, with the goal of maximizing the overall recommendation accuracy throughout the interactions. This mechanism naturally addresses the cold-start problem, as any user can immediately receive sequential recommendations without providing ratings beforehand. We formulate interactive CF within the probabilistic matrix factorization (PMF) framework, and leverage several exploitation-exploration algorithms to select items, including empirical Thompson sampling and upper-confidence-bound-based algorithms. We conduct our experiments on cold-start users as well as warm-start users with drifting taste. Results show that the proposed methods have significant improvements over several strong baselines on the MovieLens, EachMovie and Netflix datasets.
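One step of the exploitation-exploration loop can be sketched as follows (a simplified stand-in for the paper's method: user factors are held fixed, each item's latent vector gets a Gaussian posterior via Bayesian linear regression, and the full PMF machinery is omitted):

```python
import numpy as np

def ts_recommend(u, item_means, item_covs, rng):
    """Thompson-sampling step: draw one plausible latent vector per item
    from its Gaussian posterior and recommend the item with the highest
    sampled rating u @ v for the user with (fixed) factors u."""
    samples = [rng.multivariate_normal(m, C) for m, C in zip(item_means, item_covs)]
    return int(np.argmax([u @ v for v in samples]))

def update_item_posterior(mean, cov, u, rating, noise_var=1.0):
    """Bayesian linear-regression update of one item's Gaussian posterior
    after observing `rating` from a user with factors u."""
    prec = np.linalg.inv(cov) + np.outer(u, u) / noise_var
    new_cov = np.linalg.inv(prec)
    new_mean = new_cov @ (np.linalg.inv(cov) @ mean + rating * u / noise_var)
    return new_mean, new_cov
```

Sampling from the posterior rather than recommending the posterior-mean best item is what drives exploration for cold-start users.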
Learning from logged implicit exploration data
In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, 2010
Cited by 24 (8 self)
Abstract
We provide a sound and consistent foundation for the use of non-random exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, with which the "offline" data is logged, is not explicitly known. Prior solutions require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
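A minimal sketch of the idea (my illustration, not the paper's exact estimator): estimate the unknown logging policy's propensities from per-context frequency counts, then importance-weight the events where the new policy agrees with the logged action, clipping small propensities at a threshold τ to control variance.

```python
from collections import defaultdict

def offline_value(policy, logged, tau=0.05):
    """Estimate a new policy's value from non-random logged data:
    logged is a list of (context, action, reward) triples. The logging
    policy's propensity for each (context, action) is estimated from
    counts, and matched events are inverse-propensity weighted."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, a, r in logged:
        counts[x][a] += 1
    total = 0.0
    for x, a, r in logged:
        if policy(x) == a:                               # new policy agrees
            prop = counts[x][a] / sum(counts[x].values())  # estimated propensity
            total += r / max(prop, tau)                  # clip to bound variance
    return total / len(logged)
```

The clipping threshold τ trades a small bias for much lower variance when the estimated propensities are tiny.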
From Bandits to Experts: On the Value of Side-Observations
Cited by 22 (0 self)
Abstract
We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the rewards he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known "experts" setting, where the decision maker can view all rewards, and the multi-armed bandit setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially matching lower bounds.
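The interpolation can be sketched with an exponential-weights forecaster (an illustrative variant, not the paper's exact algorithm): playing node i also reveals the losses of its neighbors, and each revealed loss is importance-weighted by the probability it would have been observed under the current sampling law.

```python
import math, random

def exp3_graph(neighbors, losses, horizon, eta=0.1, seed=0):
    """Exponential weights with graph side-observations: neighbors[i] is
    the set of nodes whose losses are revealed when i is played (besides
    i itself). A complete graph recovers full information ("experts");
    an empty graph recovers the bandit setting."""
    rng = random.Random(seed)
    n = len(neighbors)
    w = [1.0] * n
    for t in range(horizon):
        total_w = sum(w)
        p = [wi / total_w for wi in w]
        i = rng.choices(range(n), weights=p)[0]
        observed = {i} | neighbors[i]
        for j in observed:
            # probability that j's loss would be revealed under p
            q = sum(p[k] for k in range(n) if j == k or j in neighbors[k])
            w[j] *= math.exp(-eta * losses[t][j] / q)
    return w
```

Denser observation graphs make the importance weights 1/q smaller, which is the mechanism behind the graph-dependent regret guarantees.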