Results 1–10 of 173

Reinforcement Learning I: Introduction, 1998
"... In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search ..."
Abstract

Cited by 5500 (120 self)
 Add to MetaCart
In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs from and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, and control theory. Intuitively, RL is trial and error (variation and selection, search) plus learning (association, memory). We argue that RL is the only field that seriously addresses the special features of the problem of learning from interaction to achieve long-term goals.
A Contextual-Bandit Approach to Personalized News Article Recommendation
"... Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamic ..."
Abstract

Cited by 170 (16 self)
 Add to MetaCart
(Show Context)
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web services feature dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are fast in both learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are threefold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data are scarcer.
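The selection rule this abstract motivates can be sketched compactly. Below is a minimal, hypothetical linear-UCB-style contextual bandit: each arm keeps a ridge-regression estimate of expected clicks as a linear function of the context features, and the arm with the highest optimistic estimate is served. The class name, per-arm feature setup, and the single exploration parameter alpha are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

# Hypothetical sketch of a linear-UCB contextual bandit; names and the
# exploration parameter alpha are assumptions, not the paper's exact method.
class LinearUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        # Serve the arm with the highest optimistic payoff for context x.
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            width = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores.append(x @ theta + width)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Observe the click feedback (reward) for the served arm.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

On each event one would call select(x) to choose the article and update(arm, x, reward) with the observed click, so exploration narrows as the per-arm estimates sharpen.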
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design
"... Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multiarmed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regre ..."
Abstract

Cited by 118 (11 self)
 Add to MetaCart
Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence-based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.
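As a hedged illustration of the GP-UCB rule named above: fit a GP posterior to the evaluations so far and query the candidate point maximizing the posterior mean plus a scaled posterior standard deviation. The candidate grid, RBF kernel, noise level, and the logarithmic beta schedule are assumptions made for this sketch, not the paper's exact choices.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb_step(X_obs, y_obs, candidates, t, delta=0.1):
    # X_obs, y_obs: points queried so far (at least one); candidates: 2-D
    # array of points to score; t: round index starting at 1.
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
    gpr.fit(X_obs, y_obs)
    mu, sigma = gpr.predict(candidates, return_std=True)
    # A beta schedule of the logarithmic form used for finite candidate sets.
    beta = 2.0 * np.log(len(candidates) * t**2 * np.pi**2 / (6 * delta))
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]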
The jackknife: a review, Biometrika, 1974
"... Interleukin (IL)33 is a new member of the IL1 superfamily of cytokines that is expressed by mainly stromal cells, such as epithelial and endothelial cells, and its expression is upregulated following proinflammatory stimulation. IL33 can function both as a traditional cytokine and as a nuclear f ..."
Abstract

Cited by 101 (0 self)
 Add to MetaCart
(Show Context)
Interleukin (IL)-33 is a new member of the IL-1 superfamily of cytokines that is expressed mainly by stromal cells, such as epithelial and endothelial cells, and its expression is upregulated following pro-inflammatory stimulation. IL-33 can function both as a traditional cytokine and as a nuclear factor regulating gene transcription. It is thought to function as an 'alarmin' released following cell necrosis to alert the immune system to tissue damage or stress. It mediates its biological effects via interaction with the receptors ST2 (IL-1RL1) and IL-1 receptor accessory protein (IL-1RAcP), both of which are widely expressed, particularly by innate immune cells and T helper 2 (Th2) cells. IL-33 strongly induces Th2 cytokine production from these cells and can promote the pathogenesis of Th2-related diseases such as asthma, atopic dermatitis and anaphylaxis. However, IL-33 has shown various protective effects in cardiovascular diseases such as atherosclerosis, obesity, type 2 diabetes and cardiac remodeling. Thus, the effects of IL-33 are either pro- or anti-inflammatory depending on the disease and the model. In this review the role of IL-33 in the inflammation of several disease pathologies is discussed, with particular emphasis on recent advances.
Stochastic linear optimization under bandit feedback, In submission, 2008
"... In the classical stochastic karmed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the alg ..."
Abstract

Cited by 98 (8 self)
 Add to MetaCart
In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer [2002]), we view the arms as vectors in R^n, and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms, k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general, though, the optimal regret is Θ*(√T); our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of "upper confidence bounds." The first, due to Auer [2002] but not fully analyzed, obtains regret whose dependence on n and T is essentially optimal, but it may be computationally intractable when the decision set is a polytope. The second version can be efficiently implemented when the decision set is a polytope (given as an intersection of halfspaces), but gives up a factor of √n in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.
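A minimal sketch of the "upper confidence bounds" idea in this linear-cost setting, assuming a shared regularized least-squares estimate over R^n and a caller-supplied confidence radius beta; the paper's exact radius and its efficient polytope variant are not reproduced here.

import numpy as np

def ucb_linear_round(decisions, A, b, beta):
    # decisions: (k, n) array of candidate arm vectors; A, b: accumulated
    # regularized design matrix and cost-weighted sums; beta: assumed radius.
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # regularized least-squares estimate
    widths = np.sqrt(np.einsum('ij,jk,ik->i', decisions, A_inv, decisions))
    # The abstract is phrased in costs, so optimism here means playing the
    # arm with the lowest plausible cost inside the confidence ellipsoid.
    return int(np.argmin(decisions @ theta_hat - np.sqrt(beta) * widths))

def linear_update(A, b, x, cost):
    # Fold the played vector and its observed cost into the statistics.
    return A + np.outer(x, x), b + cost * x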
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, In WSDM, 2011
"... Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging du ..."
Abstract

Cited by 73 (15 self)
 Add to MetaCart
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences, but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
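The replay methodology lends itself to a very short sketch. Assuming the logged arms were chosen uniformly at random and a hypothetical policy interface with select and update methods, the evaluator keeps only the logged events whose arm matches the policy's own choice and averages their rewards; the matching rate, not a simulator, determines the effective sample size.

def replay_evaluate(policy, logged_events):
    # logged_events: iterable of (context, logged_arm, reward) triples from a
    # uniformly random logging policy (the unbiasedness assumption).
    total_reward, matches = 0.0, 0
    for context, logged_arm, reward in logged_events:
        if policy.select(context) == logged_arm:   # event accepted by replay
            policy.update(logged_arm, context, reward)
            total_reward += reward
            matches += 1
    return total_reward / max(matches, 1)          # per-impression estimate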
Knows What It Knows: A Framework For Self-Aware Learning
"... We introduce a learning framework that combines elements of the wellknown PAC and mistakebound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is ..."
Abstract

Cited by 67 (20 self)
 Add to MetaCart
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and open problems.
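A hedged toy instance of the KWIK protocol the abstract outlines: the learner must either return a prediction it can guarantee is accurate or admit ignorance (None stands in for the usual "I don't know" symbol), and it receives a label only after admitting ignorance. Memorizing a deterministic function over a finite input set is a standard illustrative KWIK-learnable class, not the paper's full framework.

class MemorizationKWIK:
    # KWIK learner for an arbitrary deterministic target over a finite set.
    def __init__(self):
        self.table = {}

    def predict(self, x):
        # Return a guaranteed-correct answer, or None ("I don't know").
        return self.table.get(x)

    def observe(self, x, y):
        # The label arrives only on rounds where predict(x) returned None.
        self.table[x] = y

Its KWIK bound is immediate: the learner answers None at most once per distinct input, so the number of "I don't know" responses is bounded by the size of the input set.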
Minimizing regret with label efficient prediction, IEEE Trans. Inform. Theory, 2005
"... Abstract. We investigate label efficient prediction, a variant of the problem of prediction with expert advice, proposed by Helmbold and Panizza, in which the forecaster does not have access to the outcomes of the sequence to be predicted unless he asks for it, which he can do for a limited number o ..."
Abstract

Cited by 58 (8 self)
 Add to MetaCart
(Show Context)
We investigate label efficient prediction, a variant of the problem of prediction with expert advice, proposed by Helmbold and Panizza, in which the forecaster does not have access to the outcomes of the sequence to be predicted unless it asks for them, which it can do only a limited number of times. We determine matching upper and lower bounds for the best possible excess error when the number of allowed queries is a constant. We also prove that a query rate of order (ln n)(ln ln n)^2 / n is sufficient for achieving Hannan consistency, a fundamental property in game-theoretic prediction models. Finally, we apply the label efficient framework to pattern classification and prove a label efficient mistake bound for a randomized variant of Littlestone's zero-threshold Winnow algorithm.
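A hedged sketch of the label efficient setting over a pool of experts: follow an exponentially weighted average, but query the true outcome only with small probability each round and compensate with importance-weighted losses on the queried rounds. The constant query rate eps, the squared loss, and the expert interface are illustrative assumptions; the paper's query budgets and rates are sharper.

import numpy as np

def label_efficient_forecast(experts, outcomes, eps=0.05, eta=0.5):
    # experts: callables mapping a round index to a real prediction;
    # outcomes: the sequence to predict. Yields one prediction per round.
    rng = np.random.default_rng(0)
    w = np.ones(len(experts))
    for t, y in enumerate(outcomes):
        p = w / w.sum()
        advice = np.array([e(t) for e in experts])
        yield p @ advice                      # exponentially weighted prediction
        if rng.random() < eps:                # query the label this round
            losses = (advice - y) ** 2 / eps  # importance-weighted losses
            w = w * np.exp(-eta * losses)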
Linearly Parameterized Bandits, 2008
"... We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an rdimensional random vector Z ∈ Rr, where r ≥ 2. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. W ..."
Abstract

Cited by 57 (0 self)
 Add to MetaCart
(Show Context)
We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector Z ∈ R^r, where r ≥ 2. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. We propose a policy based on least squares estimation and uncertainty ellipsoids, which generalizes the upper confidence index approach pioneered by Lai and Robbins (1985). The cumulative regret and Bayes risk under our proposed policy admit an upper bound of the form r√T log^{3/2} T, which is linear in the dimension r and independent of the number of arms. We also establish Ω(r√T) lower bounds on the regret and risk, showing that our proposed policy is nearly optimal.
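As a hedged rendering of the uncertainty-ellipsoid index such a policy plays, with assumed notation (the least-squares estimate, the design matrix after t rounds, and a confidence radius) rather than the paper's own:

% Assumed notation: \hat{\theta}_t least-squares estimate, A_t regularized
% design matrix after t rounds, \beta_t confidence radius, \mathcal{U} arms.
\[
  u_t \in \arg\max_{u \in \mathcal{U}}
    \Big( u^{\top}\hat{\theta}_t + \sqrt{\beta_t}\,\|u\|_{A_t^{-1}} \Big),
  \qquad \|u\|_{A_t^{-1}} := \sqrt{u^{\top} A_t^{-1} u}.
\]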
Contextual Bandits with Similarity Information, 24th Annual Conference on Learning Theory, 2011
"... In a multiarmed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a timeinvariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work ha ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with that alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature has considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context, a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between context-arm pairs which bounds from above the difference between the respective expected payoffs. Prior work ...
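The similarity condition described in the abstract admits a one-line statement. In assumed notation, with μ(x, y) the expected payoff of arm y in context x and D the given similarity distance on context-arm pairs:

\[
  \bigl|\,\mu(x, y) - \mu(x', y')\,\bigr| \;\le\; D\bigl((x, y), (x', y')\bigr)
  \qquad \text{for all context-arm pairs } (x, y), (x', y').
\]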