Results 1–10 of 22
X-Armed Bandits
, 2010
Abstract

Cited by 28 (7 self)
We consider a generalization of stochastic bandits where the set of arms, …
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
Abstract

Cited by 20 (4 self)
We consider a global optimization problem of a deterministic function f in a semimetric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semimetric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semimetric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally fitted.
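The depth-sweep expansion rule of SOO sketched in this abstract can be illustrated compactly. Below is a minimal, illustrative 1-D version on [0, 1] (our own simplification, not the paper's general form: ternary splits, evaluation at cell midpoints, and expansion of the best leaf at each depth only if it beats the best leaf already expanded at a shallower depth in the same sweep):

```python
import math

def soo_maximize(f, n_evals, max_depth=10):
    """Sketch of Simultaneous Optimistic Optimization (SOO) on [0, 1]."""
    leaves = {0: [(0.0, 1.0)]}   # depth -> list of (lo, hi) cells
    values = {}                  # cell -> f(midpoint), cached
    evals = 0

    def value(cell):
        nonlocal evals
        if cell not in values:
            values[cell] = f((cell[0] + cell[1]) / 2.0)
            evals += 1
        return values[cell]

    value((0.0, 1.0))
    while evals < n_evals:
        v_max = -math.inf        # best value expanded so far in this sweep
        expanded = False
        for h in sorted(leaves):
            if h >= max_depth or not leaves[h] or evals >= n_evals:
                continue
            best = max(leaves[h], key=value)
            if value(best) >= v_max:
                v_max = value(best)
                leaves[h].remove(best)
                expanded = True
                lo, hi = best
                w = (hi - lo) / 3.0          # ternary split
                for child in ((lo, lo + w), (lo + w, hi - w), (hi - w, hi)):
                    leaves.setdefault(h + 1, []).append(child)
                    value(child)
        if not expanded:
            break
    best_cell = max(values, key=values.get)
    return (best_cell[0] + best_cell[1]) / 2.0
```

Note how no semimetric appears anywhere in the code: the comparison `value(best) >= v_max` is what lets SOO hedge across all scales simultaneously instead of committing to one smoothness assumption.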
Optimistic planning for sparsely stochastic systems
 In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning
, 2011
Abstract

Cited by 12 (4 self)
We propose an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.
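The optimistic expansion loop this abstract describes can be sketched in its simplest form. The code below is illustrative only and restricts to the deterministic special case (one successor per action) rather than the paper's sparse stochastic setting: each leaf is scored by its accumulated discounted reward plus the optimistic tail γ^d / (1 − γ) (rewards assumed in [0, 1]), and the most promising leaf is expanded next.

```python
import heapq

def optimistic_plan(s0, actions, step, gamma, budget):
    """Expand `budget` nodes optimistically; return the best first action.

    step(s, a) -> (next_state, reward) is a deterministic model."""
    tick = 0  # unique tie-breaker so heapq never compares states
    # frontier entries: (-b_value, tick, state, depth, return_so_far, first_action)
    frontier = [(-(1.0 / (1 - gamma)), tick, s0, 0, 0.0, None)]
    best_value, best_action = float("-inf"), None
    for _ in range(budget):
        if not frontier:
            break
        _, _, s, d, u, a0 = heapq.heappop(frontier)
        for a in actions:
            s2, r = step(s, a)
            u2 = u + gamma ** d * r               # discounted return of the path
            first = a if a0 is None else a0
            if u2 > best_value:
                best_value, best_action = u2, first
            b = u2 + gamma ** (d + 1) / (1 - gamma)  # optimistic upper bound
            tick += 1
            heapq.heappush(frontier, (-b, tick, s2, d + 1, u2, first))
    return best_action
```

In the sparsely stochastic case of the paper, an expansion would instead push all possible successor states of the chosen node, weighted by their transition probabilities.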
Lipschitz Bandits without the Lipschitz Constant
 In ALT 2011: 22nd International Conference on Algorithmic Learning Theory, Espoo, Finland
, 2011
Abstract

Cited by 12 (0 self)
We consider the setting of stochastic bandit problems with a continuum of arms indexed by [0,1]^d. We first point out that the strategies considered so far in the literature only provided theoretical guarantees of the form: given some tuning parameters, the regret is small with respect to a class of environments that depends on these parameters. This is however not the right perspective, as it is the strategy that should adapt to the specific bandit environment at hand, and not the other way round. Put differently, an adaptation issue is raised. We solve it for the special case of environments whose mean-payoff functions are globally Lipschitz. More precisely, we show that the minimax optimal order of magnitude L^{d/(d+2)} T^{(d+1)/(d+2)} of the regret bound over T time instances against an environment whose mean-payoff function f is Lipschitz with constant L can be achieved without knowing L or T in advance. This is in contrast to all previously known strategies, which require to some extent the knowledge of L to achieve this performance guarantee.
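The baseline the abstract contrasts with, a fixed discretization tuned using the (assumed known) Lipschitz constant, can be sketched for d = 1. This is an illustrative sketch, not the paper's adaptive strategy: K ≈ (L√T)^{2/3} arms balance the discretization error T·L/K against the ≈ √(KT) bandit regret, recovering the L^{1/3} T^{2/3} rate, but only because L and T are supplied up front.

```python
import math
import random

def discretized_ucb(mean_payoff, L, T, noise=0.05, seed=0):
    """Run UCB1 on K midpoint arms of [0, 1]; return the best empirical arm.

    K is tuned from L and T -- exactly the prior knowledge the paper removes."""
    rng = random.Random(seed)
    K = max(1, math.ceil((L * math.sqrt(T)) ** (2.0 / 3.0)))
    arms = [(i + 0.5) / K for i in range(K)]
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                      # initialization: pull each arm once
        else:
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = mean_payoff(arms[i]) + rng.gauss(0.0, noise)  # noisy payoff
        counts[i] += 1
        sums[i] += reward
    played = [j for j in range(K) if counts[j] > 0]
    return arms[max(played, key=lambda j: sums[j] / counts[j])]
```

The adaptation issue raised in the abstract is precisely that mis-specifying L in the line computing K silently degrades the guarantee.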
Simple regret optimization in online planning for Markov decision processes
 CoRR
, 2012
Abstract

Cited by 4 (1 self)
We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. Formally, the performance of algorithms for online planning is assessed in terms of simple regret, the agent’s expected performance loss when the chosen action, rather than an optimal one, is followed. To date, state-of-the-art algorithms for online planning in general MDPs are either best-effort, or guarantee only polynomial-rate reduction of simple regret over time. Here we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate and smooth reduction of simple regret. At a high level, BRUE is based on a simple yet nonstandard state-space sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different exploratory objectives. We further extend BRUE with a variant of “learning by forgetting.” The resulting parametrized algorithm, BRUE(α), exhibits even more attractive formal guarantees than BRUE. Our empirical evaluation shows that both BRUE and its generalization, BRUE(α), are also very effective in practice and compare favorably to the state-of-the-art.
Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
, 2013
Abstract

Cited by 3 (0 self)
Most learning-to-rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly take document similarity and ranking context into account. Our formulation is a nontrivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches.
Planning in Reward-Rich Domains via PAC Bandits
Abstract

Cited by 3 (1 self)
In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold r0. We present several algorithms and use them to identify reliable strategies for solving screens from the video games Infinite Mario and Pitfall! We show order-of-magnitude improvements in sample complexity over a natural approach that pulls each arm until a good estimate of its success probability is known.
Optimistic planning for belief-augmented Markov decision processes
 In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)
, 2013
Abstract

Cited by 3 (2 self)
This paper presents the Bayesian Optimistic Planning (BOP) algorithm, a novel model-based Bayesian reinforcement learning approach. BOP extends the planning approach of the Optimistic Planning for Markov Decision Processes (OP-MDP) algorithm [10], [9] to contexts where the transition model of the MDP is initially unknown and progressively learned through interactions with the environment. The knowledge about the unknown MDP is represented with a probability distribution over all possible transition models using Dirichlet distributions, and the BOP algorithm plans in the belief-augmented state space constructed by concatenating the original state vector with the current posterior distribution over transition models. We show that BOP becomes Bayesian optimal when the budget parameter increases to infinity. Preliminary empirical validations show promising performance.
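The belief component the abstract describes is simple to sketch on its own: one Dirichlet distribution per (state, action) pair over next states, updated from observed transitions. The class below is a toy model of that representation only (names are ours), not the BOP planning algorithm itself.

```python
class DirichletBelief:
    """Dirichlet posterior over transition models for a finite MDP."""

    def __init__(self, states, prior=1.0):
        self.states = list(states)
        self.prior = prior           # symmetric Dirichlet pseudo-count
        self.counts = {}             # (s, a) -> {s_next: observed count}

    def update(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        cell = self.counts.setdefault((s, a), {})
        cell[s_next] = cell.get(s_next, 0) + 1

    def posterior_mean(self, s, a):
        """Expected transition probabilities under the current posterior."""
        cell = self.counts.get((s, a), {})
        alpha = {sp: self.prior + cell.get(sp, 0) for sp in self.states}
        total = sum(alpha.values())
        return {sp: alpha[sp] / total for sp in self.states}
```

Concatenating the state with these posterior parameters yields the belief-augmented state space in which BOP plans.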
Optimistic planning for continuous-action deterministic systems
 In 2013 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-13)
, 2013
Abstract

Cited by 1 (0 self)
We consider the optimal control of systems with deterministic dynamics, continuous, possibly large-scale state spaces, and continuous, low-dimensional action spaces. We describe an online planning algorithm called SOOP, which like other algorithms in its class has no direct dependence on the state space structure. Unlike previous algorithms, SOOP explores the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. To this end, it borrows the principle of the simultaneous optimistic optimization method, and develops a nontrivial adaptation of this principle to the planning problem. Experiments on four problems show SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.