Results 1  10
of
157
Reinforcement Learning I: Introduction
, 1998
"... In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search ..."
Abstract

Cited by 5500 (120 self)
 Add to MetaCart
In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search) plus learning (association, memory). We argue that RL is the only field that seriously addresses the special features of the problem of learning from interaction to achieve longterm goals.
A ContextualBandit Approach to Personalized News Article Recommendation
"... Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamic ..."
Abstract

Cited by 170 (16 self)
 Add to MetaCart
(Show Context)
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are both fast in learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its articleselection strategy based on userclick feedback to maximize total user clicks. The contributions of this work are threefold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5 % click lift compared to a standard contextfree bandit algorithm, and the advantage becomes even greater when data gets more scarce.
An Empirical Evaluation of Thompson Sampling
"... Thompson sampling is one of oldest heuristic to address the exploration / exploitation tradeoff, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heur ..."
Abstract

Cited by 70 (6 self)
 Add to MetaCart
(Show Context)
Thompson sampling is one of oldest heuristic to address the exploration / exploitation tradeoff, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1
A Bayesian Sampling Approach to Exploration in Reinforcement Learning
"... We present a modular approach to reinforcement learning that uses a Bayesian representation of the uncertainty over models. The approach, BOSS (Best of Sampled Set), drives exploration by sampling multiple models from the posterior and selecting actions optimistically. It extends previous work by pr ..."
Abstract

Cited by 68 (9 self)
 Add to MetaCart
(Show Context)
We present a modular approach to reinforcement learning that uses a Bayesian representation of the uncertainty over models. The approach, BOSS (Best of Sampled Set), drives exploration by sampling multiple models from the posterior and selecting actions optimistically. It extends previous work by providing a rule for deciding when to resample and how to combine the models. We show that our algorithm achieves nearoptimal reward with high probability with a sample complexity that is low relative to the speed at which the posterior distribution converges during learning. We demonstrate that BOSS performs quite favorably compared to stateoftheart reinforcementlearning approaches and illustrate its flexibility by pairing it with a nonparametric model that generalizes across states. 1
Linearly Parameterized Bandits
, 2008
"... We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an rdimensional random vector Z ∈ Rr, where r ≥ 2. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. W ..."
Abstract

Cited by 57 (0 self)
 Add to MetaCart
(Show Context)
We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an rdimensional random vector Z ∈ Rr, where r ≥ 2. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. We propose a policy based on least squares estimation and uncertainty ellipsoids, which generalizes the upper confidence index approach pioneered by Lai and Robbins (1985). The cumulative regret and Bayes risk under our proposed policy admits an upper bound of the form r √ T log 3/2 T, which is linear in the dimension r, and independent of the number of arms. We also establish Ω(r √ T) lower bounds on the regret and risk, showing that our proposed policy is nearly optimal.
Bayesian sparse sampling for online reward optimization
 In ICML ’05: Proceedings of the 22nd international conference on Machine learning
, 2005
"... We present an efficient “sparse sampling ” technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making whil ..."
Abstract

Cited by 53 (5 self)
 Add to MetaCart
(Show Context)
We present an efficient “sparse sampling ” technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making while controlling computational cost. The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior—rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information). The outcome is a flexible, practical technique for improving action selection in simple reinforcement learning scenarios. 1.
Explorationexploitation tradeoff using variance estimates in multiarmed bandits
, 2009
"... ..."
(Show Context)
Multitask reinforcement learning: A hierarchical bayesian approach
 In: ICML ’07: Proceedings of the 24th international conference on Machine learning
, 2007
"... We consider the problem of multitask reinforcement learning, where the agent needs to solve a sequence of Markov Decision Processes (MDPs) chosen randomly from a fixed but unknown distribution. We model the distribution over MDPs using a hierarchical Bayesian infinite mixture model. For each novel ..."
Abstract

Cited by 46 (3 self)
 Add to MetaCart
(Show Context)
We consider the problem of multitask reinforcement learning, where the agent needs to solve a sequence of Markov Decision Processes (MDPs) chosen randomly from a fixed but unknown distribution. We model the distribution over MDPs using a hierarchical Bayesian infinite mixture model. For each novel MDP, we use the previously learned distribution as an informed prior for modelbased Bayesian reinforcement learning. The hierarchical Bayesian framework provides a strong prior that allows us to rapidly infer the characteristics of new environments based on previous environments, while the use of a nonparametric model allows us to quickly adapt to environments we have not encountered before. In addition, the use of infinite mixtures allows for the model to automatically learn the number of underlying MDP components. We evaluate our approach and show that it leads to significant speedups in convergence to an optimal policy after observing only a small number of tasks. 1.
Bayes’ bluff: Opponent modelling in poker
 In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI
, 2005
"... Poker is a challenging problem for artificial intelligence, with nondeterministic dynamics, partial observability, and the added difficulty of unknown adversaries. Modelling all of the uncertainties in this domain is not an easy task. In this paper we present a Bayesian probabilistic model for a br ..."
Abstract

Cited by 40 (3 self)
 Add to MetaCart
(Show Context)
Poker is a challenging problem for artificial intelligence, with nondeterministic dynamics, partial observability, and the added difficulty of unknown adversaries. Modelling all of the uncertainties in this domain is not an easy task. In this paper we present a Bayesian probabilistic model for a broad class of poker games, separating the uncertainty in the game dynamics from the uncertainty of the opponent’s strategy. We then describe approaches to two key subproblems: (i) inferring a posterior over opponent strategies given a prior distribution and observations of their play, and (ii) playing an appropriate response to that distribution. We demonstrate the overall approach on a reduced version of poker using Dirichlet priors and then on the full game of Texas hold’em using a more informed prior. We demonstrate methods for playing effective responses to the opponent, based on the posterior. 1