
## Learning to Optimize Via Information-Directed Sampling (2014)

Citations: 1 (0 self)

### Citations

812 | Finite-time analysis of the multiarmed bandit problem
- Auer, Cesa-Bianchi, et al.
- 2002
Citation Context: ...arms and a time horizon of 1000. We compared the performance of IDS to that of six other algorithms, and found that it had the lowest average regret of 18.16. The famous UCB1 algorithm of Auer et al. [9] selects the action a which maximizes the upper confidence bound θ̂_t(a) + √(2 log(t) / N_t(a)), where θ̂_t(a) is the empirical average reward from samples of action a and N_t(a) is the number of samples of a...
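The UCB1 rule quoted in this excerpt is a few lines of code; a minimal sketch (function and variable names are my own, and rewards are assumed to lie in [0, 1] as in Auer et al.):

```python
import math

def ucb1_index(mean_reward, t, n_a):
    """UCB1 index: empirical mean plus exploration bonus sqrt(2 log(t) / N_t(a))."""
    return mean_reward + math.sqrt(2 * math.log(t) / n_a)

def select_action(means, counts, t):
    """Pick the arm maximizing the UCB1 index; arms never tried yet
    get an infinite index, so each arm is sampled at least once."""
    best, best_idx = None, None
    for a, (m, n) in enumerate(zip(means, counts)):
        idx = float("inf") if n == 0 else ucb1_index(m, t, n)
        if best is None or idx > best:
            best, best_idx = idx, a
    return best_idx
```

With few samples the bonus dominates: `select_action([0.5, 0.6], [10, 2], 12)` picks arm 1 even though the means are close, because arm 1 has only been tried twice.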

509 | Asymptotically efficient adaptive allocation rules
- Lai, Robbins
- 1985
Citation Context: ...nt algorithm. This is particularly surprising for Bernoulli bandit problems, where Thompson sampling and UCB algorithms are known to be asymptotically optimal in the sense proposed by Lai and Robbins [39]. IDS is a stationary randomized policy. Randomized because the ratio between expected single-period regret and our measure of information gain can be smaller for a randomized action than for any deter...

442 | Bandit based Monte-Carlo planning
- Kocsis, Szepesvári
- 2006
Citation Context: ...te of regret for problems with dependent arms. Both UCB algorithms and Thompson sampling have been applied to other types of problems, like reinforcement learning [31, 42] and Monte Carlo tree search [10, 36]. We will describe both UCB algorithms and Thompson sampling in more detail in Section 8. In one of the first papers on multi-armed bandit problems with dependent arms, Agrawal et al. [3] consider a g...

392 | Convex optimization. Cambridge University Press
- Boyd, Vandenberghe
- 2004
Citation Context: ...s information about the optimum. D Proof of Proposition 1. Proof. First, we show the function Ψ : π ↦ (π^T Δ)² / π^T g is convex on {π ∈ R^K : π^T g > 0}. As shown in Chapter 3 of Boyd and Vandenberghe [11], f : (x, y) ↦ x²/y is convex over {(x, y) ∈ R² : y > 0}. The function h : π ↦ (π^T Δ, π^T g) ∈ R² is affine. Since convexity is preserved under composition with an affine function, the function Ψ = ...
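The convexity claim in this excerpt is easy to spot-check numerically; a sketch (the dimension, vectors, and sample points below are arbitrary illustrations):

```python
import random

def psi(pi, delta, g):
    """Psi(pi) = (pi . delta)^2 / (pi . g), defined where pi . g > 0."""
    num = sum(p * d for p, d in zip(pi, delta)) ** 2
    den = sum(p * gi for p, gi in zip(pi, g))
    return num / den

def midpoint_convex_check(trials=1000, k=4, seed=0):
    """Spot-check midpoint convexity of Psi on random points with pi . g > 0."""
    rng = random.Random(seed)
    delta = [rng.uniform(-1, 1) for _ in range(k)]
    g = [rng.uniform(0.1, 1) for _ in range(k)]  # positive, so pi . g > 0 for pi > 0
    for _ in range(trials):
        x = [rng.uniform(0.01, 1) for _ in range(k)]
        y = [rng.uniform(0.01, 1) for _ in range(k)]
        mid = [(a + b) / 2 for a, b in zip(x, y)]
        if psi(mid, delta, g) > (psi(x, delta, g) + psi(y, delta, g)) / 2 + 1e-9:
            return False
    return True
```

The check is a sanity test, not a proof; the excerpt's composition-with-an-affine-map argument is what establishes convexity in general. Note Psi is also positively homogeneous of degree one: scaling pi by c scales Psi(pi) by c.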

332 | Entropy and Information Theory
- Gray
- 1990
Citation Context: ...ndence on the conditional probability measure P(·|F_{t−1}). To reduce notation, we define the information gain from an action a to be g_t(a) := I_t(A*; Y_t(a)). As shown for example in Lemma 5.5.6 of Gray [29], this is equal to the expected reduction in entropy of the posterior distribution of A* due to observing Y_t(a): g_t(a) = E[H(α_t) − H(α_{t+1}) | F_{t−1}, A_t = a], (5) which plays a crucial role in our results...
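Equation (5) says the information gain g_t(a) is the expected drop in posterior entropy of A*. For a discrete posterior this can be computed directly; a sketch (the two-outcome likelihoods are hypothetical):

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_entropy_reduction(prior, likelihoods):
    """g(a) = H(prior) - E_y[H(posterior | y)]: the expected reduction in
    entropy of A* from observing Y(a), i.e. the mutual information I(A*; Y(a)).

    prior[i] = P(A* = i); likelihoods[i][y] = P(Y(a) = y | A* = i)."""
    outcomes = range(len(likelihoods[0]))
    gain = entropy(prior)
    for y in outcomes:
        p_y = sum(prior[i] * likelihoods[i][y] for i in range(len(prior)))
        if p_y == 0:
            continue
        post = [prior[i] * likelihoods[i][y] / p_y for i in range(len(prior))]
        gain -= p_y * entropy(post)
    return gain
```

A fully revealing observation recovers the whole prior entropy (log 2 for a uniform prior over two candidates), while an uninformative one yields zero gain.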

323 | Multi-armed Bandit Allocation Indices
- Gittins
- 1989
Citation Context: ...nting out that, although Gittins’ indices characterize the Bayes optimal policy for infinite horizon discounted problems, the finite horizon formulation considered here is computationally intractable [24]. A similar index policy [41] designed for finite horizon problems could be applied as a heuristic in this setting. However, a specialized algorithm is required to compute these indices, and this proc...

128 | Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
- Bubeck, Cesa-Bianchi
- 2012
Citation Context: ...uct an example in which the decision maker plays m bandit games in parallel, each with d/m actions. Using that example, and the standard bandit lower bound (see Theorem 3.5 of Bubeck and Cesa-Bianchi [14]), the agent's regret from each component must be at least √((d/m)·T), and hence her overall expected regret is lower bounded by a term of order m·√((d/m)·T) = √(mdT). Both classes of algorithms experim...
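The algebra behind the √(mdT) bound is just m·√((d/m)·T) = √(mdT); a one-line numeric check (the concrete numbers are arbitrary):

```python
import math

def overall_lower_bound(d, m, T):
    """m parallel games, each with d/m actions: summing the per-game
    lower bounds of order sqrt((d/m) * T) gives m * sqrt((d/m) * T),
    which simplifies to sqrt(m * d * T)."""
    return m * math.sqrt((d / m) * T)
```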

99 | Stochastic linear optimization under bandit feedback
- Dani, Hayes, et al.
- 2008
Citation Context: ...t of the stochastic multi-armed bandit literature that treats problems with dependent arms. UCB algorithms have been applied to problems where the mapping from action to expected reward is a linear [1, 20, 44], generalized linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuo...

98 | Near-optimal Regret Bounds for Reinforcement Learning
- Jaksch, Ortner, et al.
- 2010
Citation Context: ...c frequentist bounds on the growth rate of regret for problems with dependent arms. Both UCB algorithms and Thompson sampling have been applied to other types of problems, like reinforcement learning [31, 42] and Monte Carlo tree search [10, 36]. We will describe both UCB algorithms and Thompson sampling in more detail in Section 8. In one of the first papers on multi-armed bandit problems with dependent ...

91 | A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
- Brochu, Cora, et al.
- 2009
Citation Context: ...e objective function, reflecting the belief that “smoother” objective functions are more plausible than others. This approach is often called “Bayesian optimization” in the machine learning community [12]. Both Villemonteix et al. [54] and Hennig and Schuler [30] propose selecting each sample to maximize the mutual information between the next observation and the optimal solution. Several features dis...

86 | Multi-Armed Bandits in Metric Spaces
- Kleinberg, Slivkins, et al.
- 2008
Citation Context: ...zed linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuous) model [15, 35, 52]. Recently, an algorithm known as Thompson sampling has received a great deal of interest. Agrawal and Goyal [6] provided the first analysis for linear contextual bandit problems. Russo and Van Roy [4...

72 | An empirical evaluation of Thompson sampling
- Chapelle, Li
- 2011
Citation Context: ...ned, but uses slightly different confidence bounds. It is known to satisfy regret bounds for this problem that are minimax optimal up to a numerical constant factor. In previous numerical experiments [18, 33, 34, 50], Thompson sampling and Bayes UCB exhibited state-of-the-art performance for this problem. Each also satisfies strong theoretical guarantees, and is known to be asymptotically optimal in the sense def...

69 | Adaptive submodularity: Theory and applications in active learning and stochastic optimization
- Golovin, Krause
- 2011
Citation Context: ...ximize a measure of the information gain from the next sample. Jedynak et al. [32] and Waeber et al. [55] consider problem settings in which this greedy policy is optimal. Another recent line of work [25, 26] shows that measures of information gain sometimes satisfy a decreasing returns property known as adaptive submodularity, implying the greedy policy is competitive with the optimal policy. Our algori...

62 | On Bayesian methods for seeking the extremum - Mockus - 1974

60 | Linearly parameterized bandits
- Rusmevichientong, Tsitsiklis
- 2010
Citation Context: ...t of the stochastic multi-armed bandit literature that treats problems with dependent arms. UCB algorithms have been applied to problems where the mapping from action to expected reward is a linear [1, 20, 44], generalized linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuo...

58 | The price of bandit information for online optimization
- Dani, Hayes, et al.
- 2008
Citation Context: ...Combining this result with Proposition 2 shows E[Regret(T, π^IDS)] ≤ √((1/2) H(α_1) T). Further, a worst-case bound on the entropy of α_1 shows that E[Regret(T, π^IDS)] ≤ √((1/2) log(|A|) T). Dani et al. [19] show this bound is order optimal, in the sense that for any time horizon T and number of actions |A| there exists a prior distribution over p* under which inf_π E[Regret(T, π)] ≥ c_0 √(log(|A|) T), wher...
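The two regret bounds in this excerpt are straightforward to evaluate; a sketch (entropy measured in nats, and the example prior over actions is hypothetical):

```python
import math

def entropy_nats(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def ids_regret_bound(prior, T):
    """sqrt((1/2) H(alpha_1) T): the entropy-based bound quoted above."""
    return math.sqrt(0.5 * entropy_nats(prior) * T)

def worst_case_bound(num_actions, T):
    """sqrt((1/2) log(|A|) T): a uniform prior maximizes entropy, so this
    is the largest value the entropy-based bound can take."""
    return math.sqrt(0.5 * math.log(num_actions) * T)
```

A concentrated prior over the optimal action yields a strictly smaller bound than the worst case, which is the point of stating the bound in terms of H(α_1).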

51 | Adaptive treatment allocation and the multi-armed bandit problem
- Lai
- 1987
Citation Context: ...rically effective, and have strong theoretical guarantees. Specific UCB algorithms and Thompson sampling are known to be asymptotically efficient for multi-armed bandit problems with independent arms [5, 16, 33, 38, 39] and satisfy strong regret bounds for some problems with dependent arms [15, 20, 21, 27, 44, 46, 51]. In their formulation, the reward from selecting action a is ∑_{i∈a} θ_{t,i}, which is m times larger t...

48 | An informational approach to the global optimization of expensive-to-evaluate functions
- Villemonteix, Vazquez, et al.
- 2009
Citation Context: ...rkov chains and to problems with infinite parameter spaces. These papers provide results of fundamental importance, but seem to have been overlooked by much of the recent literature. Two other papers [30, 54] have used the mutual information between the optimal action and the next observation to guide action selection. Each focuses on optimization of expensive-to-evaluate black-box functions. Here, black-...

40 | Improved algorithms for linear stochastic bandits
- Abbasi-Yadkori, Pal, et al.
- 2011
Citation Context: ...t of the stochastic multi-armed bandit literature that treats problems with dependent arms. UCB algorithms have been applied to problems where the mapping from action to expected reward is a linear [1, 20, 44], generalized linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuo...

38 | Near-optimal Bayesian active learning with noisy observations
- Golovin, Krause, et al.
- 2010
Citation Context: ...ximize a measure of the information gain from the next sample. Jedynak et al. [32] and Waeber et al. [55] consider problem settings in which this greedy policy is optimal. Another recent line of work [25, 26] shows that measures of information gain sometimes satisfy a decreasing returns property known as adaptive submodularity, implying the greedy policy is competitive with the optimal policy. Our algori...

36 | The knowledge gradient algorithm for a general class of online learning problems
- Ryzhov, Powell, et al.
- 2012
Citation Context: ...reward from each action and an estimate p* ∈ R of the expected reward from the optimal action A*. C On the non-convergence of the knowledge gradient algorithm. Like IDS, the knowledge-gradient policy [48] selects actions by optimizing a single-period objective that encourages earning high expected reward and acquiring a lot of information. Define V_t = max_{a'∈A} E[R(Y_t(a')) | F_{t−1}] to be the expect...
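For intuition about the one-step lookahead V_t quantity in this excerpt, here is a knowledge-gradient computation for the special case of Bernoulli arms with Beta posteriors; this is an illustrative sketch under assumed conjugate structure, not the general formulation of [48]:

```python
def knowledge_gradient(arms):
    """One-step lookahead value of sampling each arm.

    arms: list of (a, b) Beta posterior parameters for Bernoulli arms.
    KG factor for arm i = E[max_j posterior mean of j after one sample
    of arm i] - max_j posterior mean of j now (i.e., V_{t+1} - V_t)."""
    v_now = max(a / (a + b) for a, b in arms)
    factors = []
    for i, (a, b) in enumerate(arms):
        p = a / (a + b)  # predictive probability that the sample is a success
        v_next = 0.0
        for outcome, prob in ((1, p), (0, 1 - p)):
            # Posterior means after the hypothetical observation: only arm i updates.
            means = [
                (aj + (outcome if j == i else 0)) / (aj + bj + (1 if j == i else 0))
                for j, (aj, bj) in enumerate(arms)
            ]
            v_next += prob * max(means)
        factors.append(v_next - v_now)
    return factors
```

Because posterior means form a martingale and the maximum is convex, each factor is nonnegative, and sampling a highly uncertain arm (e.g. Beta(1, 1)) yields a larger factor than sampling a well-estimated one (e.g. Beta(100, 100)).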

35 | A knowledge-gradient policy for sequential information collection
- Frazier, Powell, et al.
- 2008
Citation Context: ...the decision made by a greedy algorithm, which simply selects the action with highest posterior expected reward. This measure was proposed by Mockus et al. [40] and studied further by Frazier et al. [23] and Ryzhov et al. [48]. KG seems natural since it explicitly seeks information that improves decision quality. Computational studies suggest that for problems with Gaussian priors, Gaussian rewards...

35 | Thompson sampling: An asymptotically optimal finite time analysis
- Kaufmann, Korda, et al.
- 2012
Citation Context: ...rically effective, and have strong theoretical guarantees. Specific UCB algorithms and Thompson sampling are known to be asymptotically efficient for multi-armed bandit problems with independent arms [5, 16, 33, 38, 39] and satisfy strong regret bounds for some problems with dependent arms [15, 20, 21, 27, 44, 46, 51]. In their formulation, the reward from selecting action a is ∑_{i∈a} θ_{t,i}, which is m times larger t...

35 | A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise
- Kushner
- 1964
Citation Context: ...and more importantly, they focus only on problems with Gaussian process priors and continuous action spaces. For such problems, simpler approaches like UCB algorithms [51], probability of improvement [37], and expected improvement [40] are already extremely effective. As noted by Brochu et al. [12], each of these algorithms simply chooses points with “potentially high values of the objective function:...

33 | A modern Bayesian look at the multiarmed bandit
- Scott
- 2010
Citation Context: ...ned, but uses slightly different confidence bounds. It is known to satisfy regret bounds for this problem that are minimax optimal up to a numerical constant factor. In previous numerical experiments [18, 33, 34, 50], Thompson sampling and Bayes UCB exhibited state-of-the-art performance for this problem. Each also satisfies strong theoretical guarantees, and is known to be asymptotically optimal in the sense def...

29 | Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: Finite parameter space
- Agrawal, Teneketzis, et al.
- 1989
Citation Context: ...search [10, 36]. We will describe both UCB algorithms and Thompson sampling in more detail in Section 8. In one of the first papers on multi-armed bandit problems with dependent arms, Agrawal et al. [3] consider a general model in which the reward distribution associated with each action depends on a common unknown parameter. When the parameter space is finite, they provide a lower bound on the asym...

28 | Thompson sampling for contextual bandits with linear payoffs
- Agrawal, Goyal
- 2012
Citation Context: ...ernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuous) model [15, 35, 52]. Recently, an algorithm known as Thompson sampling has received a great deal of interest. Agrawal and Goyal [6] provided the first analysis for linear contextual bandit problems. Russo and Van Roy [46] consider a more general class of models, and show that standard analysis of upper confidence bound algorithms...

28 | X-armed bandits
- Bubeck, Munos, et al.
Citation Context: ...zed linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuous) model [15, 35, 52]. Recently, an algorithm known as Thompson sampling has received a great deal of interest. Agrawal and Goyal [6] provided the first analysis for linear contextual bandit problems. Russo and Van Roy [4...

28 | Kullback-Leibler upper confidence bounds for optimal sequential allocation
- Cappé, Garivier, et al.
- 2013
Citation Context: ...rically effective, and have strong theoretical guarantees. Specific UCB algorithms and Thompson sampling are known to be asymptotically efficient for multi-armed bandit problems with independent arms [5, 16, 33, 38, 39] and satisfy strong regret bounds for some problems with dependent arms [15, 20, 21, 27, 44, 46, 51]. In their formulation, the reward from selecting action a is ∑_{i∈a} θ_{t,i}, which is m times larger t...

28 | Entropy search for information-efficient global optimization
- Hennig, Schuler
- 2012
Citation Context: ...rkov chains and to problems with infinite parameter spaces. These papers provide results of fundamental importance, but seem to have been overlooked by much of the recent literature. Two other papers [30, 54] have used the mutual information between the optimal action and the next observation to guide action selection. Each focuses on optimization of expensive-to-evaluate black-box functions. Here, black-...

27 | On Bayesian upper confidence bounds for bandit problems
- Kaufmann, Cappé, et al.
- 2012
Citation Context: ...be optimal under one of the outcome distributions p ∈ P_t. An alternative method involves choosing B_t(a) to be a particular quantile of the posterior distribution of the action's mean reward under p* [34]. In each of the examples we construct, such an algorithm always chooses actions from the support of A* unless the quantiles are so low that max_{a∈A} B_t(a) < E[R(Y_t(A*))]. 8.1 Example: a revealing acti...
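The posterior-quantile rule of [34] (Bayes-UCB) can be sketched for Bernoulli arms with Beta posteriors. This is an illustrative sketch: the Monte Carlo quantile (the stdlib has no closed-form Beta quantile) and the quantile schedule below are assumed choices, not the authors' exact construction:

```python
import math
import random

def beta_quantile(a, b, q, n_samples=20000, seed=0):
    """Monte Carlo estimate of the q-quantile of a Beta(a, b) posterior,
    via the stdlib betavariate sampler (scipy's beta.ppf would be exact)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    return samples[min(int(q * n_samples), n_samples - 1)]

def bayes_ucb_choice(arms, t, horizon):
    """Pick the arm whose posterior quantile B_t(a) is largest.

    arms: list of (a, b) Beta posterior parameters. The schedule
    q = 1 - 1/(t * log(horizon)) pushes the quantile toward 1 over
    time, so the index behaves like an upper confidence bound."""
    q = 1 - 1 / (t * max(math.log(horizon), 1.0))
    indices = [beta_quantile(a, b, q) for a, b in arms]
    return max(range(len(arms)), key=lambda i: indices[i])
```

As the excerpt notes, such an index only leaves the support of A* when the quantile level is driven very low.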

26 | Information-theoretic regret bounds for Gaussian process optimization in the bandit setting
- Srinivas, Krause, et al.
- 2012
Citation Context: ...ithms have been applied to problems where the mapping from action to expected reward is a linear [1, 20, 44], generalized linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuous) model [15, 35, 52]. Recently, an algorithm known as Thompson sampling has received a great...

23 | Further Optimal Regret Bounds for Thompson Sampling
- Agrawal, Goyal
- 2013

23 | Dynamic assortment optimization with a multinomial logit choice model and capacity constraint
- Rusmevichientong, Shen, et al.
- 2010
Citation Context: ...ysis accommodate this level of generality, they can be specialized to problems that in the past have been studied individually, such as those involving pricing and assortment optimization (see, e.g., [13, 45, 49]), though in each case, developing an efficient version of IDS may require innovation. 2 Literature review. UCB algorithms are the primary approach considered in the segment of the stochastic multi-arm...

19 | Parametric Bandits: The Generalized Linear Case
- Filippi, Cappé, et al.
- 2010
Citation Context: ...bandit literature that treats problems with dependent arms. UCB algorithms have been applied to problems where the mapping from action to expected reward is a linear [1, 20, 44], generalized linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuous) model [15, 35, 52]. R...

18 | Learning to optimize via posterior sampling
- Russo, Van Roy
- 2014
Citation Context: ...2]. Recently, an algorithm known as Thompson sampling has received a great deal of interest. Agrawal and Goyal [6] provided the first analysis for linear contextual bandit problems. Russo and Van Roy [46] consider a more general class of models, and show that standard analysis of upper confidence bound algorithms leads to bounds on the expected regret of Thompson sampling. Very recent work of Gopalan ...

15 | Regret in online combinatorial optimization
- Audibert, Bubeck, et al.
- 2014
Citation Context: ...m^{−1} ∑_{i∈a} θ_{t,i}. It may be natural instead to assume that the outcome of each selected project (θ_{t,i} : i ∈ a) is observed. This type of observation structure is sometimes called “semi-bandit” feedback [8]. A naive application of Proposition 6 to address this problem would show Ψ*_t ≤ d/2. The next proposition shows that since the entire parameter vector (θ_{t,i} : i ∈ a) is observed upon selecting action ...

15 | Asymptotically efficient adaptive choice of control laws in controlled Markov chains
- Graves, Lai
- 1997
Citation Context: ...wth rate of the regret of any admissible policy as the time horizon tends to infinity and show that this bound is attainable. These results were later extended by Agrawal et al. [4] and Graves and Lai [28] to apply to the adaptive control of Markov chains and to problems with infinite parameter spaces. These papers provide results of fundamental importance, but seem to have been overlooked by much of t...

13 | Dynamic pricing under a general parametric choice model
- Broder, Rusmevichientong
- 2012
Citation Context: ...ysis accommodate this level of generality, they can be specialized to problems that in the past have been studied individually, such as those involving pricing and assortment optimization (see, e.g., [13, 45, 49]), though in each case, developing an efficient version of IDS may require innovation. 2 Literature review. UCB algorithms are the primary approach considered in the segment of the stochastic multi-arm...

13 | Stochastic Simultaneous Optimistic Optimization - Valko, Carpentier, et al. - 2013

10 | Online-to-confidence-set conversions and application to sparse stochastic bandits - Abbasi-Yadkori, Pal, et al. - 2012

10 | Paradoxes in learning and the marginal value of information. Decision Analysis
- Frazier, Powell
- 2010
Citation Context: ...ould never select action 2. Its cumulative regret over T time periods is therefore equal to T(E[max{.51, θ}] − .51), which grows linearly with T. Many-step lookahead. As noted by Frazier and Powell [22], the algorithm's poor performance in this case is due to the increasing returns to information. A single sample of action a2 provides no value, even though sampling the action several times could be ...

9 | Bisection search with noisy responses
- Waeber, Frazier, et al.
- 2013
Citation Context: ...review). Recent work has demonstrated the effectiveness of greedy or myopic policies that always maximize a measure of the information gain from the next sample. Jedynak et al. [32] and Waeber et al. [55] consider problem settings in which this greedy policy is optimal. Another recent line of work [25, 26] shows that measures of information gain sometimes satisfy a decreasing returns property known as...

6 | Computing a classic index for finite-horizon bandits
- Niño-Mora
- 2011
Citation Context: ...ins’ indices characterize the Bayes optimal policy for infinite horizon discounted problems, the finite horizon formulation considered here is computationally intractable [24]. A similar index policy [41] designed for finite horizon problems could be applied as a heuristic in this setting. However, a specialized algorithm is required to compute these indices, and this procedure is extremely computatio...

5 | Thompson sampling for complex online problems
- Gopalan, Mannor, et al.
- 2014
Citation Context: ...er a more general class of models, and show that standard analysis of upper confidence bound algorithms leads to bounds on the expected regret of Thompson sampling. Very recent work of Gopalan et al. [27] provides asymptotic frequentist bounds on the growth rate of regret for problems with dependent arms. Both UCB algorithms and Thompson sampling have been applied to other types of problems, like rein...

4 | An information-theoretic analysis of Thompson sampling
- Russo, Van Roy
- 2014
Citation Context: ...assesses information gain allows it to dramatically outperform UCB algorithms and Thompson sampling. Further, by leveraging the tools of our recent information theoretic analysis of Thompson sampling [47], we establish an expected regret bound for IDS that applies across a very general class of models and scales with the entropy of the optimal action distribution. We also specialize this bound to seve...

3 | Minimax policies for bandits games
- Audibert, Bubeck
- 2009
Citation Context: ...) is an upper bound on the variance of the reward distribution at action a. While this method dramatically outperforms UCB1, it is still outperformed by IDS. The MOSS algorithm of Audibert and Bubeck [7] is similar to UCB1 and UCB-Tuned, but uses slightly different confidence bounds. It is known to satisfy regret bounds for this problem that are minimax optimal up to a numerical constant factor. In p...

3 | Optimal dynamic assortment planning with demand learning
- Sauré, Zeevi
Citation Context: ...ysis accommodate this level of generality, they can be specialized to problems that in the past have been studied individually, such as those involving pricing and assortment optimization (see, e.g., [13, 45, 49]), though in each case, developing an efficient version of IDS may require innovation. 2 Literature review. UCB algorithms are the primary approach considered in the segment of the stochastic multi-arm...

2 | Bayesian mixture modelling and inference based Thompson sampling in Monte Carlo tree search
- Bai, Wu, et al.
- 2013
Citation Context: ...te of regret for problems with dependent arms. Both UCB algorithms and Thompson sampling have been applied to other types of problems, like reinforcement learning [31, 42] and Monte Carlo tree search [10, 36]. We will describe both UCB algorithms and Thompson sampling in more detail in Section 8. In one of the first papers on multi-armed bandit problems with dependent arms, Agrawal et al. [3] consider a g...

2 | Finite-Time Analysis of Kernelised Contextual Bandits
- Valko, Korda, et al.
- 2013
Citation Context: ...action to expected reward is a linear [1, 20, 44], generalized linear [21], or sparse linear [2] model; is sampled from a Gaussian process [51] or has small norm in a reproducing kernel Hilbert space [51, 53]; or is a smooth (e.g. Lipschitz continuous) model [15, 35, 52]. Recently, an algorithm known as Thompson sampling has received a great deal of interest. Agrawal and Goyal [6] provided the first analy...

1 | Bayesian experimental design: A review
- Chaloner, Verdinelli
- 1995
Citation Context: ...ppendix C, we define KG more formally, and provide some insight into why it can fail to converge to optimality. Our work also connects to a much larger literature on Bayesian experimental design (see [17] for a review). Recent work has demonstrated the effectiveness of greedy or myopic policies that always maximize a measure of the information gain from the next sample. Jedynak et al. [32] and Waeber ...

1 | Twenty questions with noise: Bayes optimal policies for entropy loss
- Jedynak, Frazier, et al.
Citation Context: ...design (see [17] for a review). Recent work has demonstrated the effectiveness of greedy or myopic policies that always maximize a measure of the information gain from the next sample. Jedynak et al. [32] and Waeber et al. [55] consider problem settings in which this greedy policy is optimal. Another recent line of work [25, 26] shows that measures of information gain sometimes satisfy a decreasing re...

1 | Optimal learning, volume 841
- Powell, Ryzhov
- 2012
Citation Context: ...[Table 1: Realized regret over 1000 trials in Bernoulli experiment] A somewhat different approach is the knowledge gradient (KG) policy of Powell and Ryzhov [43], which uses a one-step lookahead approximation to the value of information to guide experimentation. For reasons described in Appendix C, KG does not explore sufficiently to identify the optimal arm ...