The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits
Abstract

Cited by 78 (9 self)
We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^(2/3) S^(1/3)) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning.
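The epoch structure described above is easy to illustrate. The following sketch (not the authors' implementation) alternates one uniform exploration step per epoch with a growing block of exploitation steps; the toy two-context Bernoulli problem and the hypothesis class (empirical best arm per context, fit on exploration data only) are assumptions made for the example.

```python
import random
from collections import defaultdict

def run_epoch_greedy(n_epochs=100, seed=0):
    """Toy Epoch-Greedy: in epoch l, take one uniform exploration step,
    refit the hypothesis on exploration data only, then exploit for l steps."""
    # Assumed toy problem: 2 contexts, 2 arms, Bernoulli rewards.
    means = {0: [0.2, 0.8], 1: [0.9, 0.1]}   # mean reward per context/arm (assumption)
    rng = random.Random(seed)
    sums = defaultdict(float)
    counts = defaultdict(int)

    def best_arm(x):
        # Hypothesis class for this sketch: empirical best arm per context.
        est = [sums[(x, a)] / counts[(x, a)] if counts[(x, a)] else 0.0
               for a in range(2)]
        return max(range(2), key=lambda a: est[a])

    total_reward = 0.0
    for epoch in range(1, n_epochs + 1):
        # One exploration step: uniform random arm, recorded for learning.
        x, a = rng.randrange(2), rng.randrange(2)
        sums[(x, a)] += 1.0 if rng.random() < means[x][a] else 0.0
        counts[(x, a)] += 1
        # Exploit the current hypothesis for `epoch` steps (epochs grow),
        # so no horizon T needs to be known in advance.
        for _ in range(epoch):
            x = rng.randrange(2)
            total_reward += 1.0 if rng.random() < means[x][best_arm(x)] else 0.0
    return best_arm(0), best_arm(1), total_reward
```

Because the exploitation blocks lengthen as the hypothesis improves, exploration stays a vanishing fraction of play, which is what drives the T^(2/3)-type regret.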
Best Arm Identification in Multi-Armed Bandits
Abstract

Cited by 55 (11 self)
We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and also independent of the scaling of the rewards. As a byproduct of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) ∑_i 1/∆_i², where the sum is over the suboptimal arms and ∆_i represents the difference between the mean reward of the best arm and that of arm i. This generalizes the well-known fact that one needs of the order of 1/∆² samples to differentiate the means of two distributions with gap ∆.
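The successive-rejects scheme is simple to sketch: with a budget of n pulls and K arms, run K−1 phases, top each surviving arm up to a prescribed pull count in every phase, and permanently discard the worst empirical arm at the end of the phase. The Bernoulli simulator and the concrete means below are assumptions for the example.

```python
import math
import random

def successive_rejects(means, budget, rng):
    """Return the index of the (empirically) best Bernoulli arm using
    the successive-rejects phase schedule with a fixed budget of pulls."""
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))  # \overline{log}(K)
    active = list(range(K))
    pulls, sums = [0] * K, [0.0] * K
    n_prev = 0
    for k in range(1, K):                      # K - 1 elimination phases
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for arm in active:                     # top each surviving arm up to n_k pulls
            for _ in range(n_k - n_prev):
                sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
                pulls[arm] += 1
        # Permanently reject the arm with the worst empirical mean.
        active.remove(min(active, key=lambda a: sums[a] / pulls[a]))
        n_prev = n_k
    return active[0]

# Four assumed Bernoulli arms; arm 0 is best by a wide margin.
best = successive_rejects([0.9, 0.5, 0.4, 0.3], budget=2000, rng=random.Random(1))
```

Note the algorithm never needs a gap parameter or the reward scale: only empirical rankings within each phase matter, which is the parameter-freeness claimed above.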
Exploration scavenging
 In Proceedings of the International Conference on Machine Learning, 2008
Abstract

Cited by 42 (8 self)
We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide nontrivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
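The core idea above admits a minimal sketch: when the logging policy chose actions without looking at the side information, the value of a new policy can be estimated by keeping only the logged records whose action matches the new policy and reweighting each by the estimated logging frequency of that action. This is an illustration of the idea, not the paper's exact estimator, and the tiny hand-made log is an assumption.

```python
from collections import Counter

def estimate_value(log, policy):
    """Estimate the per-step value of `policy` from logged records (x, a, r),
    assuming the logging policy chose a independently of the context x."""
    n = len(log)
    freq = Counter(a for _, a, _ in log)      # empirical action frequencies
    total = 0.0
    for x, a, r in log:
        if policy(x) == a:                    # keep records matching the policy
            total += r / (freq[a] / n)        # reweight by 1 / Pr_hat(a)
    return total / n

# Hand-made toy log (assumption): contexts and actions in {0, 1}.
log = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0),
       (1, 1, 1.0), (0, 0, 1.0), (1, 1, 1.0)]
value = estimate_value(log, policy=lambda x: x)   # evaluate "action = context"
```

The independence assumption is exactly what makes freq[a]/n a valid stand-in for the logging probability; if the logger had used the context, this estimator could be arbitrarily biased, which is the impossibility result mentioned above.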
Rollout sampling approximate policy iteration
 Machine Learning, 2008
Abstract

Cited by 30 (4 self)
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which addresses the core sampling problem in evaluating a policy through simulation as a multi-armed bandit machine. The resulting algorithm offers performance comparable to the previous algorithm, achieved, however, with significantly less computational effort. An order-of-magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM
Abstract

Cited by 30 (5 self)
ABSTRACT. In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · (K log T)/∆, where ∆ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · (K log(T∆²))/∆.
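A simplified arm-elimination sketch in the spirit of this modified UCB algorithm (the sampling counts, confidence radius, and stopping details below are illustrative assumptions, not the paper's constants): arms are sampled in rounds against a halving guess ∆̃ of the gap, and arms whose upper confidence bound falls below the best lower confidence bound are eliminated.

```python
import math
import random

def improved_ucb(means, horizon, rng):
    """Arm elimination with a halving gap guess (illustrative constants)."""
    K = len(means)
    active = list(range(K))
    pulls, sums = [0] * K, [0.0] * K
    delta, spent = 1.0, 0
    while len(active) > 1 and spent < horizon:
        # Sample each active arm up to n_m pulls for the current gap guess.
        n_m = math.ceil(2.0 * math.log(max(horizon * delta * delta, math.e))
                        / (delta * delta))
        for arm in active:
            while pulls[arm] < n_m and spent < horizon:
                sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
                pulls[arm] += 1
                spent += 1
        mu = {a: sums[a] / pulls[a] for a in active}
        rad = math.sqrt(math.log(max(horizon * delta * delta, math.e))
                        / (2.0 * n_m))
        best_lcb = max(mu[a] - rad for a in active)
        # Eliminate every arm whose UCB lies below the best LCB.
        active = [a for a in active if mu[a] + rad >= best_lcb]
        delta /= 2.0
    return max(active, key=lambda a: sums[a] / pulls[a])
```

The log(T∆²) (rather than log T) dependence comes from the fact that an arm with gap ∆ is only sampled while the guess ∆̃ still exceeds roughly ∆/2.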
The K-armed Dueling Bandits Problem
2009
Abstract

Cited by 29 (7 self)
We study a partial-information online-learning problem where actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure (e.g., user-perceived quality of a set of retrieval results, taste of food, product attractiveness), but where pairwise comparisons are easy to make. We propose a novel regret formulation in this setting, as well as present an algorithm that achieves (almost) information-theoretically optimal regret bounds (up to a constant factor).
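The feedback model is easy to sketch, even though the following is not the paper's algorithm: a simple sequential-elimination tournament over noisy pairwise comparisons, where the comparison probabilities (derived here from assumed hidden utilities the learner never observes) are the only signal available.

```python
import random

def beats(u_i, u_j, rng):
    """Noisy comparison: i beats j with probability 0.5 + (u_i - u_j) / 4.
    The link between utilities and win probability is an assumption."""
    return rng.random() < 0.5 + (u_i - u_j) / 4.0

def tournament(utilities, n_duels, rng):
    """Keep a champion; a challenger replaces it by winning a majority of duels."""
    champ = 0
    for ch in range(1, len(utilities)):
        wins = sum(beats(utilities[ch], utilities[champ], rng)
                   for _ in range(n_duels))
        if wins > n_duels / 2:
            champ = ch
    return champ

# Assumed hidden utilities; index 1 is best, and no absolute reward is ever seen.
winner = tournament([0.1, 0.9, 0.3], n_duels=401, rng=random.Random(4))
```

Note that the learner only ever observes binary outcomes of duels, which is precisely the relative-feedback restriction described above.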
Efficient Optimal Learning for Contextual Bandits
Abstract

Cited by 26 (2 self)
We address the problem of learning in an online setting where the learner repeatedly observes features x, selects among K actions, and receives reward r for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses an oracle which returns an optimal policy given rewards for all actions for each x. The algorithm has running time polylog(N), where N is the number of policies that we compete with. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.
PAC Subset Selection in Stochastic Multi-armed Bandits
Abstract

Cited by 22 (2 self)
We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This “subset selection” problem finds application in a variety of areas. In the authors’ previous work (Kalyanakrishnan & Stone, 2010), this problem is framed under a PAC setting (denoted “Explore-m”), and corresponding sampling algorithms are analyzed. Whereas the formal analysis therein is restricted to the worst case sample complexity of algorithms, in this paper, we design and analyze an algorithm (“LUCB”) with improved expected sample complexity. Interestingly, LUCB bears a close resemblance to the well-known UCB algorithm for regret minimization. The expected sample complexity bound we show for LUCB is novel even for single-arm selection (Explore-1). We also give a lower bound on the worst case sample complexity of PAC algorithms for Explore-m.
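A sketch of the LUCB-style sampling rule (the confidence radius below is a generic Hoeffding-type assumption, not the paper's exact exploration rate): at each round, pull the weakest-looking member of the current top-m and the strongest-looking challenger outside it, stopping once their confidence intervals separate to within ε.

```python
import math
import random

def lucb(means, m, eps, delta, rng):
    """Return m arm indices whose means are (w.h.p.) within eps of the top m."""
    K = len(means)
    pulls, sums = [0] * K, [0.0] * K

    def pull(a):
        sums[a] += 1.0 if rng.random() < means[a] else 0.0
        pulls[a] += 1

    for a in range(K):          # one initial pull per arm
        pull(a)
    t = K
    while True:
        mu = [sums[a] / pulls[a] for a in range(K)]
        # Generic confidence radius (assumption; not the paper's constants).
        rad = [math.sqrt(math.log(4.0 * K * t * t / delta) / (2.0 * pulls[a]))
               for a in range(K)]
        top = sorted(range(K), key=lambda a: mu[a], reverse=True)[:m]
        rest = [a for a in range(K) if a not in top]
        lo = min(top, key=lambda a: mu[a] - rad[a])    # weakest of the top-m (LCB)
        hi = max(rest, key=lambda a: mu[a] + rad[a])   # strongest challenger (UCB)
        if (mu[hi] + rad[hi]) - (mu[lo] - rad[lo]) <= eps:
            return sorted(top)
        pull(lo)
        pull(hi)
        t += 2

# Five assumed Bernoulli arms; the top-2 subset is {0, 1} by a wide margin.
subset = lucb([0.9, 0.8, 0.3, 0.2, 0.1], m=2, eps=0.1, delta=0.1,
              rng=random.Random(5))
```

The resemblance to UCB mentioned in the abstract is visible in the `hi` selection: the challenger is chosen by exactly a UCB index, only here it drives subset identification rather than regret minimization.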
Contextual Multi-Armed Bandits
Abstract

Cited by 15 (0 self)
We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions so as to maximize the total payoff of the chosen actions. The payoff depends on both the action chosen and the context. In contrast, context-free multi-armed bandit problems, a focus of much previous research, model situations where no side information is available and the payoff depends only on the action chosen. Our problem is motivated by sponsored web search, where the task is to display ads to a user of an Internet search engine based on her search query so as to maximize the click-through rate (CTR) of the ads displayed. We cast this problem as a contextual multi-armed bandit problem where queries and ads form metric spaces and the payoff function is Lipschitz with respect to both metrics. For any ε > 0 we present an algorithm with regret O(T^((a+b+1)/(a+b+2)+ε)) where a, b are the covering dimensions of the query space and the ad space respectively. We prove a lower bound Ω(T^((ã+b̃+1)/(ã+b̃+2)−ε)) for the regret of any algorithm, where ã, b̃ are packing dimensions of the query space and the ad space respectively. For finite spaces or convex bounded subsets of Euclidean spaces, this gives almost matching upper and lower bounds.
Multi-Bandit Best Arm Identification
Abstract

Cited by 15 (2 self)
We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.
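The gap-based index can be sketched for a single bandit as follows (the exploration parameter a, the Bernoulli rewards, and the exact index form are assumptions for the example; GapE-V's variance term and the adaptive-complexity variants are omitted): pull the arm maximizing −∆̂_i + sqrt(a / T_i), where ∆̂_i is arm i's estimated gap to the best other arm and T_i its pull count.

```python
import math
import random

def gape(means, budget, a, rng):
    """Toy GapE for one bandit: pull the arm maximizing -gap + sqrt(a / pulls)."""
    K = len(means)
    pulls = [1] * K
    sums = [1.0 if rng.random() < means[i] else 0.0 for i in range(K)]  # init pulls
    for _ in range(budget - K):
        mu = [sums[i] / pulls[i] for i in range(K)]

        def index(i):
            # Estimated gap to the best *other* arm; arms with small (or, for the
            # empirical best arm, negative) gap receive more pulls.
            gap = max(mu[j] for j in range(K) if j != i) - mu[i]
            return -gap + math.sqrt(a / pulls[i])   # exploration parameter (assumption)

        i = max(range(K), key=index)
        sums[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1
    mu = [sums[i] / pulls[i] for i in range(K)]
    return max(range(K), key=lambda i: mu[i])

# Three assumed Bernoulli arms; arm 0 is best.
guess = gape([0.9, 0.3, 0.2], budget=600, a=2.0, rng=random.Random(6))
```

The dependence on a is the tuning issue raised in the abstract: a good value of the exploration parameter depends on the (unknown) gaps, motivating the online complexity-estimation variants.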