
## Optimized look-ahead tree policies: a bridge between look-ahead tree policies and direct policy search (2013)

Venue: International Journal of Adaptive Control and Signal Processing

Citations: 1 (1 self)

### Citations

5599 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context: ...y search Direct policy search is a widely used class of solutions that comes from reinforcement learning. For a general overview over the field of reinforcement learning, refer to one of the books of [48, 2, 41, 4]. Over the years, many DPS techniques have been proposed and giving credit to every one of them would be hardly possible. Individual techniques differentiate themselves by their policy parametrization...

1521 | A formal basis for heuristic determination of minimum path cost
- Hart, Nilsson, et al.
- 1968
Citation Context: ...2 Look-ahead tree search The algorithm studied in this paper can also be related to the larger field of tree-based planning and search. One of the most seminal works in this field is the A∗ algorithm [21] which uses a best-first search to find the shortest path from a source state/configuration to a goal state/configuration. The conceptual difference between A∗ and related methods on the one side and ...
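The A∗ idea described in this context (best-first search ordered by path cost plus an admissible heuristic) can be sketched in a few lines; the toy graph and heuristic values below are invented for illustration, not taken from the cited work:

```python
import heapq

def a_star(graph, h, start, goal):
    """Best-first search ordered by f(n) = g(n) + h(n), where g is the
    cost so far and h an admissible heuristic (never overestimates)."""
    frontier = [(h[start], 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for neighbor, cost in graph[node]:
            g2 = g + cost
            if g2 < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = g2
                heapq.heappush(
                    frontier, (g2 + h[neighbor], g2, neighbor, path + [neighbor])
                )
    return None  # goal unreachable

# Hypothetical toy graph; h underestimates the true remaining cost to D.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)], "D": []}
h = {"A": 2, "B": 2, "C": 1, "D": 0}
print(a_star(graph, h, "A", "D"))  # (3, ['A', 'B', 'C', 'D'])
```

With an admissible h, the first time the goal is popped its cost g is optimal, which is the property LRTA∗-style methods discussed below try to preserve while learning the heuristic.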

535 | Evolving neural networks through augmenting topologies
- Stanley, Miikkulainen
- 2002
Citation Context: ... issue in DPS is that the final performance strongly depends on the choice of an appropriate policy representation. Common policy representations include linear parametrizations [22], neural networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning (e.g. the number of hidden neurons). For a given problem, choosing an appropriate representation and tuning i...

512 | Dynamic Programming and
- Bertsekas
- 2005
Citation Context: ...y search Direct policy search is a widely used class of solutions that comes from reinforcement learning. For a general overview over the field of reinforcement learning, refer to one of the books of [48, 2, 41, 4]. Over the years, many DPS techniques have been proposed and giving credit to every one of them would be hardly possible. Individual techniques differentiate themselves by their policy parametrization...

505 | Real-time heuristic search
- Korf
- 1990
Citation Context: ... from this node to the goal (a so-called “admissible” heuristic). Several authors have sought to learn good admissible heuristics for the A∗ algorithm. For example, we can mention the LRTA∗ algorithm [28] which is a variant of A∗ and which learns over multiple trials an optimal admissible heuristic. More recent work for learning strategies to efficiently explore graphs have focused on the use of super...

358 | Lipschitzian optimization without the Lipschitz constant
- Jones, Stuckman
- 1993
Citation Context: ...h is expensive. Among the many alternatives that have been proposed in the past for this purpose, such as cross-entropy [44], various stochastic search alternatives [20], or Lipschitzian optimization [39], Gaussian process optimization (GPO) is considered to be one of the most efficient methods to optimize expensive functions [3]. The purpose of this section is to provide the required background in GP...

245 | The cross-entropy method: A unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning
- Rubinstein, Kroese
- 2004
Citation Context: ...nt; instead we have to simulate the system (or run real-world experiments), which is expensive. Among the many alternatives that have been proposed in the past for this purpose, such as cross-entropy [44], various stochastic search alternatives [20], or Lipschitzian optimization [39], Gaussian process optimization (GPO) is considered to be one of the most efficient methods to optimize expensive functi...

234 | A taxonomy of global optimization methods based on response surfaces
- Jones
- 2001
Citation Context: ...observations hyper-parameters ϑ and σ₀², and then obtain for any new point x the distribution over function values p(f(x)|X, ϑ, x, y) = N(µ(x), σ²(x)). 4.4 Choosing an acquisition function Early work [24, 35] suggested to take as acquisition function the probability of improving over the current maximum x⁺ := arg max_{x∈Dₙ} f(x) within the training data Dₙ. The resulting PI acquisition function is given by P...
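The truncated context describes the PI (probability of improvement) acquisition function under the GP posterior f(x) ~ N(µ(x), σ²(x)); the closed form below, Φ((µ(x) − f(x⁺))/σ(x)), is the standard textbook expression rather than a quotation from this paper:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best):
    """PI(x) = P(f(x) > f(x+)) = Phi((mu(x) - f(x+)) / sigma(x))
    for a Gaussian posterior with mean mu and standard deviation sigma."""
    if sigma <= 0.0:
        return float(mu > f_best)  # degenerate (noise-free, known) posterior
    return normal_cdf((mu - f_best) / sigma)

# A point whose posterior mean equals the incumbent best has PI = 0.5;
# a point predicted to be clearly worse has PI near 0.
print(probability_of_improvement(1.0, 0.5, 1.0))  # 0.5
print(probability_of_improvement(0.0, 0.1, 1.0))  # ~0.0
```

PI is known to exploit greedily, which is why later acquisition functions (e.g. expected improvement) or an added improvement margin are often preferred in practice.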

228 | Efficient selectivity and backup operators in Monte-Carlo Tree Search
- Coulom
- 2006
Citation Context: ...-nearest neighbors); for example, see [30, 34, 52, 21]. The objective of which node to best explore next also arises in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search [12]. In particular progressive strategies which widen up the actions (nodes considered for expansion) such as in [45, 12, 9] and which can also be used for continuous action spaces [43] could be a promis...
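The context points to Monte-Carlo tree search [12]; MCTS commonly selects which child node to explore next with the UCB1 rule (as in UCT). The sketch below shows that generic selection rule only, not the progressive-widening variants the context refers to:

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    """Return the index of the child maximizing
    mean reward + c * sqrt(ln(total visits) / child visits).
    children: list of (visit_count, total_reward) pairs."""
    total = sum(n for n, _ in children)
    def score(stats):
        n, w = stats
        if n == 0:
            return float("inf")  # unvisited children are tried first
        return w / n + c * math.sqrt(math.log(total) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))

# Child 1 has the best empirical mean, but the rarely visited child 2
# receives a large exploration bonus and is selected.
print(ucb1_select([(10, 3.0), (10, 7.0), (1, 0.5)]))  # 2
```

The logarithmic bonus shrinks as a child accumulates visits, so selection gradually concentrates on the empirically best actions.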

215 | Model predictive control: Past, present and future
- Morari, Lee
- 1999
Citation Context: ...tive control Model Predictive Control techniques have originally been introduced as ways to stabilize large-scale systems with constraints around equilibrium points (or around a reference trajectory) [36, 10, 15]. They exploit an explicitly formulated model of the problem and solve in a receding horizon manner a series of finite time open-loop deterministic optimal control problems. In such they are very much...

133 | The swing up control problem for the acrobot
- Spong
- 1995
Citation Context: ...ward in this domain, the policy has to both balance the poles and not collide with one of the walls. Low rewards usually occur because of such collisions. Acrobot Our third domain is the acrobot from [46]: a two-link robot that resembles a gymnast swinging up above a high bar (see Figure 5c). The acrobot freely swings around the first joint (the hands grasping the bar) and can exert force only at the ...

123 | Machine learning for fast quadrupedal locomotion.
- Kohl, Stone
- 2004
Citation Context: ...methods during the experimental evaluation of our look-ahead trees. Another example for a policy parametrization is to use domain-specific building blocks, such as motor primitives, as it was done in [27] and [29] to optimize the gait of the AIBO quadrupedal robot. A different kind of policy representation is used in [50] to learn a policy for the game of Ms. Pac-Man; here the policy is represented by...

105 | Evolutionary algorithms for reinforcement learning.
- Moriarty, Schultz, et al.
- 1999
Citation Context: ... issue in DPS is that the final performance strongly depends on the choice of an appropriate policy representation. Common policy representations include linear parametrizations [22], neural networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning (e.g. the number of hidden neurons). For a given problem, choosing an appropriate representation and tuning i...

96 | Progressive Strategies for Monte-Carlo Tree
- Chaslot, Winands, et al.
- 2008
Citation Context: ...es in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search [12]. In particular progressive strategies which widen up the actions (nodes considered for expansion) such as in [45, 12, 9] and which can also be used for continuous action spaces [43] could be a promising enhancement for the policy search technique presented herein. 6.3 Model predictive control Model Predictive Control t...

96 | The CMA evolution strategy: a comparing review
- Hansen
- 2006
Citation Context: ...r run real-world experiments), which is expensive. Among the many alternatives that have been proposed in the past for this purpose, such as cross-entropy [44], various stochastic search alternatives [20], or Lipschitzian optimization [39], Gaussian process optimization (GPO) is considered to be one of the most efficient methods to optimize expensive functions [3]. The purpose of this section is to pr...

91 | A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.
- Brochu, Cora, et al.
- 2009
Citation Context: ...rious stochastic search alternatives [20], or Lipschitzian optimization [39], Gaussian process optimization (GPO) is considered to be one of the most efficient methods to optimize expensive functions [3]. The purpose of this section is to provide the required background in GPO to the reader who is not aware of this. Note that in our experiments we will illustrate the optimization of look-ahead trees ...

91 | Reinforcement Learning and Dynamic Programming Using Function Approximators
- Busoniu, Babuska, et al.
- 2010
Citation Context: ...of optimal control and reinforcement learning. Two important classes of techniques that are known to work well on difficult problems characterized by large state spaces are direct policy search (DPS) [4] and look-ahead tree (LT) policies [23]. In DPS, a policy is seen as a parameterized function that maps states to actions. In order to identify the best settings of the parameters, DPS techniques rely...

84 | Pilco: A model-based and data-efficient approach to policy search.
- Deisenroth, Rasmussen
- 2011
Citation Context: ...ance strongly depends on the choice of an appropriate policy representation. Common policy representations include linear parametrizations [22], neural networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning (e.g. the number of hidden neurons). For a given problem, choosing an appropriate representation and tuning its hyper-parameters is a difficult...

60 | Approximate Dynamic Programming
- Powell
- 2011
Citation Context: ...features), whereas the latter would require a far more expressive parametrization (and it is well-known that value function approximation scales badly when the dimensionality of the state space grows [41]). 3.6 Summary: the algorithm Figure 2 presents a simple algorithm based on a sorted list to implement policies as parameterized look-ahead trees. The algorithm requires as input a state xₜ and the pa...

56 | The Bayesian approach to global optimization
- Mockus
- 1982
Citation Context: ...observations hyper-parameters ϑ and σ₀², and then obtain for any new point x the distribution over function values p(f(x)|X, ϑ, x, y) = N(µ(x), σ²(x)). 4.4 Choosing an acquisition function Early work [24, 35] suggested to take as acquisition function the probability of improving over the current maximum x⁺ := arg max_{x∈Dₙ} f(x) within the training data Dₙ. The resulting PI acquisition function is given by P...

55 | Learning Tetris Using the Noisy Cross-Entropy Method.
- Szita, Lorincz
- 2006
Citation Context: ...ope well with stochastic returns and hidden states. For certain domains in reinforcement learning such as Tetris, the best performing policies known today have been obtained by gradient-free DPS (see [49] and follow-up work). The weakness of this approach is that conceptually it is less sample-efficient than policy-gradient methods and thus will require a substantially higher number of policy evaluati...
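The context credits gradient-free DPS via the cross-entropy method for Tetris; as a generic illustration of that optimizer (on a toy quadratic objective, not the Tetris policy evaluation of the cited work):

```python
import random

def cross_entropy_method(f, dim, iters=50, pop=100, n_elite=10, seed=0):
    """Maximize f by repeatedly sampling a Gaussian population and
    refitting the mean/std to the top n_elite samples."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [1.0] * dim
    for _ in range(iters):
        samples = [[rng.gauss(mu[d], sigma[d]) for d in range(dim)]
                   for _ in range(pop)]
        samples.sort(key=f, reverse=True)
        elites = samples[:n_elite]
        for d in range(dim):
            vals = [e[d] for e in elites]
            mu[d] = sum(vals) / n_elite
            var = sum((v - mu[d]) ** 2 for v in vals) / n_elite
            sigma[d] = max(var ** 0.5, 1e-3)  # floor keeps some exploration
    return mu

# Toy objective: maximize -||x - (2, -1)||^2; the optimum is (2, -1).
target = [2.0, -1.0]
f = lambda x: -sum((xi - ti) ** 2 for xi, ti in zip(x, target))
print(cross_entropy_method(f, 2))  # converges close to [2.0, -1.0]
```

The noisy variant of [49] injects extra variance at each refit to delay premature convergence; the fixed standard-deviation floor above is a much cruder stand-in for that idea.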

48 | Active guidance for a finless rocket using neuroevolution
- Gomez, Miikkulainen
- 2003
Citation Context: ...nsisted of neural networks as policy representation, with the weights making up the policy parameters, and variants of genetic algorithms acting as global optimizer. Examples can be found in [37] and [18], where later work also considered optimization of the network structure [47], or using recurrent neural networks to better cope with hidden states [19]. A recent comparison of these methods can also ...

44 | Automatic gait optimization with Gaussian process regression.
- Lizotte, Wang, et al.
- 2007
Citation Context: ...ring function of optimized look-ahead trees can be learned using any derivative-free global optimizer. One such optimizer which was shown to be highly relevant to DPS is Gaussian process optimization [17, 29]. We give a brief (but for all practical purposes fully sufficient) summary of this approach in Section 4. Section 5 presents the results of extensive experimental evaluations, wherein we compare opti...

41 | Gaussian processes for global optimization.
- Osborne, Garnett, et al.
- 2009
Citation Context: ...rameters. While GPO is quite efficient for problems that have a reasonable number of parameters, this approach requires sophisticated approximations to scale to higher-dimensional problems (e.g., see [38]). In this paper we use a naive textbook implementation of GPO that is able to solve OLT optimization problems, but that suffers from scaling problems when the number of samples increases, which happe...

39 | Dynamic multidrug therapies for HIV: Optimal and STI approaches.
- Adams, Banks, et al.
- 2004
Citation Context: ... handstand), the discount factor is set to γ = 1, and each policy is evaluated for H = 500 steps. HIV drug treatment Our last problem domain is taken from a real-world application in medical control [1]. The aim is to optimize the treatment of a patient infected by HIV over a period of a few years using what is known as structured treatment interruption (STI). The treatment of the patient consists o...

28 | Optimistic planning of deterministic systems,”
- Hren, Munos
- 2008
Citation Context: ...earning. Two important classes of techniques that are known to work well on difficult problems characterized by large state spaces are direct policy search (DPS) [4] and look-ahead tree (LT) policies [23]. In DPS, a policy is seen as a parameterized function that maps states to actions. In order to identify the best settings of the parameters, DPS techniques rely on local or global optimization algori...

24 | Clinical data based optimal STI strategies for HIV: A reinforcement learning approach,”
- Ernst, Stan, et al.
- 2006
Citation Context: ...he patient’s health and thus their use should be kept to a minimum. Finding an optimal treatment strategy is considered a challenging optimal control problem with highly nonlinear transition dynamics [16]. The system is represented by a six-dimensional state vector x ≡ (T₁, T₂, T₁∗, T₂∗, V, E), where T₁ ≥ 0 and T₂ ≥ 0 is the count of healthy type-1 and type-2 cells, T₁∗ ≥ 0 and T₂∗ ≥ 0 is th...

23 | Learning to play using low-complexity rule-based policies: Illustrations through Ms. Pac-Man
- Szita, Lőrincz
- 2007
Citation Context: ...use domain-specific building blocks, such as motor primitives, as it was done in [27] and [29] to optimize the gait of the AIBO quadrupedal robot. A different kind of policy representation is used in [50] to learn a policy for the game of Ms. Pac-Man; here the policy is represented by a list of domain-specific parameterized rules. Section 4 described a third option for the global optimization part: Ga...

22 | Learning heuristic functions from relaxed plans
- Yoon, Fern, et al.
- 2006
Citation Context: ...e focused on the use of supervised regression techniques using various approximation structures to solve this problem (e.g., linear regression, neural networks, k-nearest neighbors); for example, see [30, 34, 52, 21]. The objective of which node to best explore next also arises in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search [12]. In particular progressive strategies which widen...

17 | Optimal active learning through billiards and upper confidence trees in continuous domains
- Rolet, Sebag, et al.
- 2009
Citation Context: ...Carlo tree search [12]. In particular progressive strategies which widen up the actions (nodes considered for expansion) such as in [45, 12, 9] and which can also be used for continuous action spaces [43] could be a promising enhancement for the policy search technique presented herein. 6.3 Model predictive control Model Predictive Control techniques have originally been introduced as ways to stabiliz...

15 | Reinforcement learning versus model predictive control: a comparison on a power system problem
- Ernst, Glavic, et al.
Citation Context: ...tive control Model Predictive Control techniques have originally been introduced as ways to stabilize large-scale systems with constraints around equilibrium points (or around a reference trajectory) [36, 10, 15]. They exploit an explicitly formulated model of the problem and solve in a receding horizon manner a series of finite time open-loop deterministic optimal control problems. In such they are very much...

14 | Using Gaussian processes to optimize expensive functions
- Frean, Boyle
- 2008
Citation Context: ...ring function of optimized look-ahead trees can be learned using any derivative-free global optimizer. One such optimizer which was shown to be highly relevant to DPS is Gaussian process optimization [17, 29]. We give a brief (but for all practical purposes fully sufficient) summary of this approach in Section 4. Section 5 presents the results of extensive experimental evaluations, wherein we compare opti...

14 | Variable metric reinforcement learning methods applied to the noisy mountain car problem
- Heidrich-Meisner, Igel
- 2008
Citation Context: ...fail. However, a major issue in DPS is that the final performance strongly depends on the choice of an appropriate policy representation. Common policy representations include linear parametrizations [22], neural networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning (e.g. the number of hidden neurons). For a given problem, choosing an appropria...

12 | Optimistic planning for sparsely stochastic systems
- Busoniu, Munos, et al.
- 2011
Citation Context: ...ebased search [45, 12, 9, 43]. Deterministic transitions can be relaxed to weakly stochastic transitions (weakly meaning that there is only a small number of possible successor states) as was done in [7] to extend the optimistic strategy from [23] to sparse stochastic systems. And finally, in application scenarios where a generative model is not available from the beginning, one could also try to int...

12 | Machine learning methods for planning
- Minton
- 1993
Citation Context: ...e focused on the use of supervised regression techniques using various approximation structures to solve this problem (e.g., linear regression, neural networks, k-nearest neighbors); for example, see [30, 34, 52, 21]. The objective of which node to best explore next also arises in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search [12]. In particular progressive strategies which widen...

12 | Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning
- Whiteson, Taylor, et al.
- 2010
Citation Context: ...r work also considered optimization of the network structure [47], or using recurrent neural networks to better cope with hidden states [19]. A recent comparison of these methods can also be found in [51] and [26]. As an alternative to genetic algorithms, more recent work started to explore the use of the cross-entropy method [44], or variants such as the covariance matrix adaptation evolution strateg...

10 | Learning exploration/exploitation strategies for single trajectory reinforcement learning
- Castronovo, Maes, et al.
- 2012
Citation Context: ...er by using the same kind of parameterizations (a simple linear function) and the same kind of optimizers (derivative-free global optimizers) as ours or by searching in a space of formulas. Reference [8] extends this idea to the exploration / exploitation dilemma that occurs in single-trajectory reinforcement learning. Here, the parameters are no more real-valued vectors, but rather small formulas th...

10 | Gaussian processes for sample efficient reinforcement learning with Rmax-like exploration
- Jung, Stone
- 2012
Citation Context: ...ent strategies even when they have very large budgets. On the inverted pendulum problem, we see that all LT policies achieve a near-optimal performance (which we can compute for this domain, e.g. see [25]) already with a budget of 5. It thus seems that there is little interest in learning a specific node scoring function for this problem. Illustration of the HIV policy. In order to allow a direct comp...

10 | Automatic discovery of ranking formulas for playing with multi-armed bandits
- Maes, Wehenkel, et al.
- 2011
Citation Context: ...upper confidence tree algorithm. Parameterized algorithms have also been shown to be relevant to solve various kinds of exploration / exploitation dilemma in a problem-driven way. References [33] and [31] propose to learn exploration / exploitation strategies for multi-armed bandit problems either by using the same kind of parameterizations (a simple linear function) and the same kind of optimizers (d...

9 | Cross-entropy optimization of control policies with adaptive basis functions
- Busoniu, Ernst, et al.
Citation Context: ...ance strongly depends on the choice of an appropriate policy representation. Common policy representations include linear parametrizations [22], neural networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning (e.g. the number of hidden neurons). For a given problem, choosing an appropriate representation and tuning its hyper-parameters is a difficult...

9 | Learning to play K-armed bandit problems
- Maes, Wehenkel, et al.
Citation Context: ...thin the upper confidence tree algorithm. Parameterized algorithms have also been shown to be relevant to solve various kinds of exploration / exploitation dilemma in a problem-driven way. References [33] and [31] propose to learn exploration / exploitation strategies for multi-armed bandit problems either by using the same kind of parameterizations (a simple linear function) and the same kind of opti...

8 | Multistage stochastic programming: A scenario tree based approach to planning under uncertainty.
- Defourny, Ernst, et al.
- 2012
Citation Context: ...nd has already inspired several authors. The system proposed in [5] places an optimisation layer on top of an approximate value iteration algorithm to optimize the location of its basis functions. In [13] (Section 5.3), the authors consider multi-stage stochastic programming techniques and optimize the scenario trees using Monte-Carlo methods. Closer to our work, it is proposed in [11] to parameterize...

4 | Characterizing reinforcement learning methods through parameterized learning problems
- Kalyanakrishnan, Stone
Citation Context: ...so considered optimization of the network structure [47], or using recurrent neural networks to better cope with hidden states [19]. A recent comparison of these methods can also be found in [51] and [26]. As an alternative to genetic algorithms, more recent work started to explore the use of the cross-entropy method [44], or variants such as the covariance matrix adaptation evolution strategy CMA-ES ...

3 | Fuzzy partition optimization for approximate fuzzy q-iteration
- Busoniu, Ernst, et al.
- 2008
Citation Context: ... DPS, i.e., through direct optimization of the algorithm performance. This methodology can be applied to a wide range of problem kinds and has already inspired several authors. The system proposed in [5] places an optimisation layer on top of an approximate value iteration algorithm to optimize the location of its basis functions. In [13] (Section 5.3), the authors consider multi-stage stochastic pro...

3 | Improving the exploration in upper confidence trees
- Couetoux, Teytaud, et al.
- 2012
Citation Context: ...s functions. In [13] (Section 5.3), the authors consider multi-stage stochastic programming techniques and optimize the scenario trees using Monte-Carlo methods. Closer to our work, it is proposed in [11] to parameterize a tree-search technique for decision-making: upper confidence trees. In this work, the parameters enable to control the simulation policy used to estimate long-term returns within the...

3 | Learning in Markov Decision Processes for Structured Prediction
- Maes
- 2009
Citation Context: ...e focused on the use of supervised regression techniques using various approximation structures to solve this problem (e.g., linear regression, neural networks, k-nearest neighbors); for example, see [30, 34, 52, 21]. The objective of which node to best explore next also arises in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search [12]. In particular progressive strategies which widen...

2 | Cruise control using model predictive control with constraints
- Coen, Anthonis, et al.
- 2008
Citation Context: ...tive control Model Predictive Control techniques have originally been introduced as ways to stabilize large-scale systems with constraints around equilibrium points (or around a reference trajectory) [36, 10, 15]. They exploit an explicitly formulated model of the problem and solve in a receding horizon manner a series of finite time open-loop deterministic optimal control problems. In such they are very much...

1 | Accelerated neuroevolution through cooperatively coevolved synapses
- Gomez, Schmidhuber, et al.
Citation Context: ... issue in DPS is that the final performance strongly depends on the choice of an appropriate policy representation. Common policy representations include linear parametrizations [22], neural networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning (e.g. the number of hidden neurons). For a given problem, choosing an appropriate representation and tuning i...

1 | Optimized look-ahead tree policies
- Maes, Wehenkel, et al.
- 2011
Citation Context: ...ent strategies depend on the characteristics of the problem (e.g. to which extent are rewards informative about the long-term goal?), on the available online ¹This technique was initially proposed in [32] and is here extended through a more mature exposition of the method and an extensive experimental study which covers much more aspects than the initial paper. Furthermore, we introduce the use of Gau...

1 | Q-learning with double progressive widening: application to robotics
- Sokolovska, Teytaud, et al.
- 2011
Citation Context: ...es in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search [12]. In particular progressive strategies which widen up the actions (nodes considered for expansion) such as in [45, 12, 9] and which can also be used for continuous action spaces [43] could be a promising enhancement for the policy search technique presented herein. 6.3 Model predictive control Model Predictive Control t...