Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms
In WSDM, 2011
Abstract
Cited by 73 (15 self)
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
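The replay idea in this abstract can be sketched in a few lines: stream logged (context, arm, reward) triples collected by a uniformly random logging policy, and score the candidate policy only on the rounds where its choice happens to match the logged arm. The `replay_evaluate` helper and the synthetic log below are an illustrative sketch under that uniform-logging assumption, not the paper's code.

```python
import random

def replay_evaluate(policy, logged_events):
    """Replay evaluation in the spirit of the abstract above: the
    candidate policy is scored only on logged events where its choice
    matches the logged arm, which is unbiased when the logging policy
    chose arms uniformly at random."""
    history, total_reward, matches = [], 0.0, 0
    for context, logged_arm, reward in logged_events:
        chosen = policy(context, history)
        if chosen == logged_arm:              # event accepted by the replayer
            history.append((context, chosen, reward))
            total_reward += reward
            matches += 1
    return total_reward / max(matches, 1)

# Hypothetical synthetic log: 3 arms logged uniformly at random,
# rewards in [0, 1); the evaluated "policy" always picks arm 0.
random.seed(0)
log = [(None, random.randrange(3), random.random()) for _ in range(3000)]
print(replay_evaluate(lambda ctx, hist: 0, log))
```

Roughly a third of the events match and contribute to the estimate; the rest are discarded, which is the price of the method's unbiasedness.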
Motivating innovation
AFA 2007 Chicago Meetings Paper, Hudson Institute Research Paper 0801, http://ssrn.com/abstract=891514
Abstract
Cited by 63 (6 self)
Forthcoming in the Journal of Finance. Motivating innovation is important in many incentive problems. This paper shows that the optimal innovation-motivating incentive scheme exhibits substantial tolerance (or even reward) for early failure and reward for long-term success. Moreover, commitment to a long-term compensation plan, job security, and timely feedback on performance are essential to motivate innovation. In the context of managerial compensation, the optimal innovation-motivating incentive scheme can be implemented via a combination of stock options with long vesting periods, option repricing, golden parachutes, and managerial entrenchment.
Reinforcement learning is direct adaptive optimal control
In Proceedings of the American Control Conference, 1991
Cognitive Medium Access: Exploration, Exploitation and Competition
2007
Abstract
Cited by 58 (5 self)
This paper establishes the equivalence between cognitive medium access and the competitive multi-armed bandit problem. First, the scenario in which a single cognitive user wishes to opportunistically exploit the availability of empty frequency bands in the spectrum with multiple bands is considered. In this scenario, the availability probability of each channel is unknown to the cognitive user a priori. Hence efficient medium access strategies must strike a balance between exploring the availability of other free channels and exploiting the opportunities identified thus far. By adopting a Bayesian approach for this classical bandit problem, the optimal medium access strategy is derived and its underlying recursive structure is illustrated via examples. To avoid the prohibitive computational complexity of the optimal strategy, a low complexity asymptotically optimal strategy is developed. The proposed strategy does not require any prior statistical knowledge about the traffic pattern on the different channels. Next, the multi-cognitive-user scenario is considered and low complexity medium access protocols, which strike the optimal balance between exploration and exploitation in such competitive environments, are developed. Finally, this formalism is extended to the case in which each cognitive user is capable of sensing and using multiple channels simultaneously.
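The paper derives Bayesian-optimal and asymptotically optimal access strategies; as a simpler illustration of the same exploration/exploitation trade-off for the single-user case, the sketch below runs the standard UCB1 index over channels with unknown Bernoulli availability. The channel probabilities and horizon are hypothetical.

```python
import math
import random

def ucb1_channel_access(availability_probs, horizon, seed=0):
    """UCB1 index policy for opportunistic channel selection (a standard
    bandit strategy in the spirit of the abstract above, not the paper's
    own Bayesian-optimal strategy).  Each channel is free with an unknown
    Bernoulli probability; the sensed outcome is the reward."""
    rng = random.Random(seed)
    n = len(availability_probs)
    counts = [0] * n       # times each channel was sensed
    successes = [0] * n    # times it was found free
    total_free = 0
    for t in range(1, horizon + 1):
        if t <= n:
            ch = t - 1     # sense each channel once to initialize
        else:
            ch = max(range(n), key=lambda i: successes[i] / counts[i]
                     + math.sqrt(2 * math.log(t) / counts[i]))
        free = rng.random() < availability_probs[ch]
        counts[ch] += 1
        successes[ch] += free
        total_free += free
    return total_free / horizon

# With one clearly better channel, the policy should mostly use it.
print(ucb1_channel_access([0.2, 0.8, 0.3], horizon=5000))
```

The fraction of free slots found approaches the best channel's availability as the horizon grows, which is the single-user guarantee the abstract alludes to.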
Linearly Parameterized Bandits
2008
Abstract
Cited by 57 (0 self)
We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector Z ∈ R^r, where r ≥ 2. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. We propose a policy based on least squares estimation and uncertainty ellipsoids, which generalizes the upper confidence index approach pioneered by Lai and Robbins (1985). The cumulative regret and Bayes risk under our proposed policy admit an upper bound of the form r√T log^(3/2) T, which is linear in the dimension r and independent of the number of arms. We also establish Ω(r√T) lower bounds on the regret and risk, showing that our proposed policy is nearly optimal.
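The least-squares-plus-ellipsoid idea can be sketched with a minimal LinUCB-style loop: maintain a regularized Gram matrix, estimate the parameter by ridge regression, and pick the arm whose estimate plus ellipsoid width is largest. The exploration constant `alpha`, the noise level, and the toy arm set are illustrative assumptions, not the paper's constants.

```python
import numpy as np

def linear_ucb(arms, theta_star, horizon, alpha=1.0, lam=1.0, seed=0):
    """Least-squares + uncertainty-ellipsoid policy for linearly
    parameterized arms (a LinUCB-style sketch of the idea; the paper's
    exact policy and confidence radii differ).  Each arm is a vector
    x in R^r with expected reward <theta_star, x>."""
    rng = np.random.default_rng(seed)
    r = arms.shape[1]
    A = lam * np.eye(r)           # regularized Gram matrix
    b = np.zeros(r)
    regret = 0.0
    means = arms @ theta_star
    for _ in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        # optimistic index: point estimate plus ellipsoid width per arm
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + alpha * widths)]
        reward = x @ theta_star + rng.normal(0, 0.1)   # noisy observation
        A += np.outer(x, x)
        b += reward * x
        regret += means.max() - x @ theta_star
    return regret

arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(linear_ucb(arms, theta_star=np.array([0.4, 0.6]), horizon=500))
```

Note the index never enumerates arm statistics individually; all arms share the single r-dimensional estimate, which is why the bound scales with r rather than with the number of arms.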
Bayesian Learning in Normal Form Games
Games and Economic Behavior 3, 60-81 (1991)
Abstract
Cited by 54 (2 self)
This paper studies myopic Bayesian learning processes for finite-player, finite-strategy normal form games. Initially, each player is presumed to know his own payoff function but not the payoff functions of the other players. Assuming that the common prior distribution of payoff functions satisfies independence across players, it is proved that the conditional distributions on strategies converge to a set of Nash equilibria with probability one. Under a further assumption that the prior distributions are sufficiently uniform, convergence to a set of Nash equilibria is proved for every profile of payoff functions, that is, every normal form game.
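The paper's learners update Bayesian posteriors over opponents' unknown payoffs; a much simpler relative of the same myopic dynamic is fictitious play, where beliefs are just empirical frequencies of past opponent play. The toy simulation below (not the paper's model) shows the characteristic convergence to a Nash equilibrium in a pure coordination game.

```python
def fictitious_play(payoff_a, payoff_b, rounds=200):
    """Fictitious play in a two-player, two-strategy normal form game,
    a simpler cousin of the myopic Bayesian learning in the abstract:
    each player best-responds to the empirical mix of the opponent's
    past actions (seeded with one pseudo-count per action)."""
    counts_a = [1, 1]   # A's counts of B's past actions
    counts_b = [1, 1]   # B's counts of A's past actions
    history = []
    for _ in range(rounds):
        pa = max(range(2), key=lambda i: sum(payoff_a[i][j] * counts_a[j]
                                             for j in range(2)))
        pb = max(range(2), key=lambda j: sum(payoff_b[i][j] * counts_b[i]
                                             for i in range(2)))
        counts_a[pb] += 1
        counts_b[pa] += 1
        history.append((pa, pb))
    return history

# Pure coordination game: both players prefer to match;
# (0,0) and (1,1) are the pure Nash equilibria.
coord = [[1, 0], [0, 1]]
print(fictitious_play(coord, coord)[-1])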
Learning from Neighbors
, 1996
"... When payoffs from different actions are unknown, agents use their own past experience as well as the experience of their neighbors to guide their current decision making. This paper develops a general framework to study the relationship between the structure of information flows and the process of s ..."
Abstract

Cited by 53 (3 self)
 Add to MetaCart
When payoffs from different actions are unknown, agents use their own past experience as well as the experience of their neighbors to guide their current decision making. This paper develops a general framework to study the relationship between the structure of information flows and the process of social learning. We show that in a connected society, local learning ensures that all agents obtain the same utility, in the long run. We develop conditions under which this utility is the maximal attainable, i.e. optimal actions are adopted. This analysis identifies a structural property of information structures  local independence  which greatly facilitates social learning. Our analysis also suggests that there exists a negative relationship between the degree of social integration and the likelihood of diversity. Simulations of the model generate spatial and temporal patterns of adoption that are consistent with empirical work. Key Words: Connected societies, conformism, social integ...
Contextual Bandits with Similarity Information
 24TH ANNUAL CONFERENCE ON LEARNING THEORY
, 2011
"... In a multiarmed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a timeinvariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work ha ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
(Show Context)
In a multiarmed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a timeinvariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the contextarm pairs which bounds from above the difference between the respective expected payoffs. Prior work
Exploration of MultiState Environments: Local Measures and BackPropagation of Uncertainty
, 1998
"... . This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Qlearning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The rst is to dene a ..."
Abstract

Cited by 52 (1 self)
 Add to MetaCart
. This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Qlearning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The rst is to dene a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular congurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of assimilating the local measures of uncertainty to rewards, and backpropagating them with the dynamic programming or temporal dierence mechanisms. This allows reproducing globalscale reasoning about the uncertainty, using only local measures of it. Numerical simulations clearly show the eciency of these propositions. Keywords: ...