Results 1–10 of 164
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms
In WSDM, 2011
Abstract

Cited by 73 (15 self)
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences, but it is very challenging due to their “partial-label” nature. Common practice is to create a simulator of the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually introduced unavoidably. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from the Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
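The replay idea described in this abstract fits in a few lines. The following is a minimal illustrative sketch, not the paper's code; the `policy` object with `select`/`update` methods is a hypothetical interface:

```python
def replay_evaluate(policy, logged_events):
    """Replay-style offline evaluation of a bandit policy.

    `logged_events` is a stream of (context, logged_arm, reward) triples
    collected under a uniformly random logging policy.  An event counts
    only when the evaluated policy agrees with the logged arm; under
    uniform logging this yields an unbiased per-trial reward estimate.
    """
    total_reward, matched = 0.0, 0
    for context, logged_arm, reward in logged_events:
        if policy.select(context) == logged_arm:
            total_reward += reward
            matched += 1
            policy.update(context, logged_arm, reward)  # learn online
    return total_reward / matched if matched else 0.0
```

Because unmatched events are simply skipped, the estimator is data-driven and needs no model of the environment, which is the paper's central point.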
An Empirical Evaluation of Thompson Sampling
Abstract

Cited by 70 (6 self)
Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. Since this heuristic is very easy to implement, we argue that it should be part of the standard set of baselines to compare against.
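The ease of implementation the abstract argues for is easy to demonstrate. A minimal sketch for Bernoulli arms with Beta posteriors (an illustration, not the paper's code):

```python
import random

def thompson_arm(successes, failures, rng=random):
    """Draw one sample from each arm's Beta posterior and play the argmax."""
    samples = [rng.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_bernoulli_thompson(true_probs, rounds=2000, seed=0):
    """Simulate Thompson sampling on Bernoulli arms with the given means."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes, failures = [0] * k, [0] * k
    for _ in range(rounds):
        arm = thompson_arm(successes, failures, rng)
        if rng.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

The whole algorithm is the posterior draw plus the argmax, which is why it makes such a convenient baseline.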
Knows What It Knows: A Framework for Self-Aware Learning
Abstract

Cited by 67 (20 self)
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and open problems.
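The defining KWIK contract — every prediction made must be accurate, but the learner may answer "I don't know" — can be illustrated with the simplest KWIK-learnable class, memorization of a deterministic target on a finite input set (a minimal hypothetical sketch, not from the paper):

```python
class MemorizationKWIK:
    """KWIK learner for a deterministic target on a finite input set X.

    It outputs None (the KWIK "don't know" symbol) on unseen inputs and
    the stored label otherwise, so every non-None prediction is correct.
    The KWIK bound is |X|: at most one "don't know" per distinct input.
    """
    def __init__(self):
        self._memory = {}

    def predict(self, x):
        return self._memory.get(x)  # None encodes "don't know"

    def observe(self, x, y):
        self._memory[x] = y
```

The interesting classes cataloged in the paper obey the same contract with far fewer "don't know" outputs than exhaustive memorization.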
Combinatorial Bandits
Abstract

Cited by 46 (7 self)
We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1}^d and suffers a loss that is the sum of the losses of those vector components equal to one. The goal of the forecaster is to ensure that, in the long run, the accumulated loss is not much larger than that of the best possible vector in the class. We consider the “bandit” setting, in which the forecaster has access only to the losses of the chosen vectors. We introduce a new general forecaster achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln |S|), where n is the time horizon. This is not improvable in general and is better than previously known bounds. We also point out that computationally efficient implementations exist for various interesting choices of S.
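To make the feedback model concrete — only the chosen vector's summed loss is observed — here is a naive baseline that treats every vector in S as a separate EXP3 arm. It does not achieve the paper's √(nd ln |S|) bound (that requires exploiting the component structure of S), and `component_losses` is a hypothetical callback:

```python
import math
import random

def exp3_over_vectors(S, component_losses, n, eta=0.05, seed=0):
    """Naive EXP3 baseline for the combinatorial bandit setting.

    Each binary vector in S is treated as one arm.  At round t the
    forecaster picks a vector and observes only its total loss: the sum
    of the per-component losses at the positions set to one.
    """
    rng = random.Random(seed)
    weights = [1.0] * len(S)
    total_loss = 0.0
    for t in range(n):
        z = sum(weights)
        probs = [w / z for w in weights]
        i = rng.choices(range(len(S)), weights=probs)[0]
        losses = component_losses(t)  # per-component losses in [0, 1]
        loss = sum(l for l, bit in zip(losses, S[i]) if bit)
        total_loss += loss
        # importance-weighted exponential update: only the chosen arm changes
        weights[i] *= math.exp(-eta * loss / probs[i])
    return total_loss
```

Treating vectors as unrelated arms wastes the shared-component information; the paper's forecaster is precisely a way to recover it.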
Contextual Bandits with Linear Payoff Functions
Abstract

Cited by 37 (4 self)
In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d-dimensional feature vectors, we prove an O(√(T d ln³(KT ln(T)/δ))) regret bound that holds with probability 1 − δ for the simplest known (both conceptually and computationally) efficient upper-confidence-bound algorithm for this problem. We also prove a lower bound of Ω(√(T d)) for this setting, matching the upper bound up to logarithmic factors.
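The upper-confidence-bound rule analyzed here scores each arm by an estimated payoff plus an exploration bonus. A minimal numpy sketch of a LinUCB-style rule (an illustration of the idea, not the paper's exact algorithm or constants):

```python
import numpy as np

def linucb_choose(stats, contexts, alpha=1.0):
    """Pick the arm maximizing theta_a.x + alpha * sqrt(x' A_a^-1 x),
    where each arm keeps A_a = I + sum x x' and b_a = sum r x."""
    scores = []
    for (A, b), x in zip(stats, contexts):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b  # ridge-regression payoff estimate
        scores.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    """Rank-one update of an arm's statistics after observing a reward."""
    A += np.outer(x, x)
    b += reward * x
```

The bonus term √(x′A⁻¹x) shrinks along directions that have been explored, which is what drives the √T-type regret guarantees discussed in the abstract.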
Improved algorithms for linear stochastic bandits
In Advances in Neural Information Processing Systems, 2011
Abstract

Cited by 35 (0 self)
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves, with high probability, constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), and Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.
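The smaller confidence sets come from a self-normalized tail inequality, and the resulting ellipsoid radius can be written down directly. A sketch of a radius of this form (assuming R bounds the noise scale and S bounds ‖θ*‖, both taken as known here):

```python
import numpy as np

def confidence_radius(V, lam, delta, R=1.0, S=1.0):
    """Confidence-ellipsoid radius of the self-normalized form used for
    linear stochastic bandits: with V = lam * I + sum x x',
        beta = R * sqrt(2 * ln(sqrt(det V / det(lam I)) / delta)) + sqrt(lam) * S.
    Shrinking this radius is what shaves a log factor off the regret."""
    d = V.shape[0]
    _, logdet_V = np.linalg.slogdet(V)  # stable log-determinant
    log_det_ratio = 0.5 * (logdet_V - d * np.log(lam))
    return R * np.sqrt(2.0 * (log_det_ratio + np.log(1.0 / delta))) + np.sqrt(lam) * S
```

The determinant ratio grows only logarithmically in the amount of data, so the ellipsoid widens very slowly even as observations accumulate.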
Multi-Domain Collaborative Filtering
Abstract

Cited by 26 (0 self)
Collaborative filtering is an effective recommendation approach in which the preference of a user for an item is predicted based on the preferences of other users with similar interests. A big challenge in using collaborative filtering methods is the data sparsity problem, which often arises because each user typically rates only very few items, so the rating matrix is extremely sparse. In this paper, we address this problem by considering multiple collaborative filtering tasks in different domains simultaneously and exploiting the relationships between domains. We refer to this as the multi-domain collaborative filtering (MCF) problem. To solve the MCF problem, we propose a probabilistic framework that uses probabilistic matrix factorization to model the rating problem in each domain and allows knowledge to be adaptively transferred across different domains by automatically learning the correlation between domains. We also introduce a link function for different domains to correct their biases. Experiments conducted on several real-world applications demonstrate the effectiveness of our methods compared with some representative methods.
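The single-domain building block, probabilistic matrix factorization, reduces at the MAP estimate to regularized squared-error factorization of the observed entries; the cross-domain correlation learning is the paper's contribution and is omitted here. A minimal single-domain SGD sketch (illustrative, not the paper's model):

```python
import numpy as np

def pmf_sgd(R, mask, k=2, lr=0.05, reg=0.01, epochs=1000, seed=0):
    """MAP training of probabilistic matrix factorization by SGD:
    minimize sum over observed (i, j) of (R[i,j] - U[i].V[j])**2 plus
    L2 penalties on U and V (the Gaussian priors)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    rows, cols = np.nonzero(mask)  # observed entries only
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - U[i] @ V[j]
            grad_u = err * V[j] - reg * U[i]
            grad_v = err * U[i] - reg * V[j]
            U[i] += lr * grad_u
            V[j] += lr * grad_v
    return U, V
```

Because only observed entries contribute gradients, the same routine works however sparse the rating matrix is, which is exactly the regime the abstract describes.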
Learning from logged implicit exploration data
In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, 2010
Abstract

Cited by 23 (8 self)
We provide a sound and consistent foundation for the use of non-random exploration data in “contextual bandit” or “partially labeled” settings, where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, under which the “offline” data was logged, is not explicitly known. Prior solutions require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
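Once propensities for the unknown logging policy are estimated, the standard evaluation tool is importance weighting with clipped weights. A minimal sketch (hypothetical interface; `propensity` returns the estimated probability that the logging policy took the logged action):

```python
def offline_policy_value(policy, logged, propensity, max_weight=10.0):
    """Clipped importance-weighted estimate of a deterministic policy's
    expected reward from logged (context, action, reward) triples.
    Clipping the weights at max_weight trades a little bias for variance
    control when estimated propensities are small."""
    total = 0.0
    for context, action, reward in logged:
        if policy(context) == action:
            w = min(1.0 / max(propensity(context, action), 1e-12), max_weight)
            total += w * reward
    return total / len(logged)
```

With exact propensities and no clipping this estimator is unbiased; the paper's contribution is making the estimated-propensity case sound.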
Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval
In Information Retrieval, 2012