
## Active policy learning for robot planning and exploration under uncertainty (2007)


### Download Links

- [www.cs.ubc.ca]
- [users.isr.ist.utl.pt]
- [webdiis.unizar.es]
- [www.roboticsproceedings.org]
- [robots.unizar.es]
- [roboticsproceedings.org]
- DBLP

### Other Repositories/Bibliography

Venue: In Proceedings of Robotics: Science and Systems

Citations: 39 (5 self)

### Citations

712 | Dynamic Programming and Optimal Control. Athena Scientific
- Bertsekas
- 1995
Citation Context: ...Moreover, in our domain, the robot only sees the landmarks within an observation gate. Since the models are not linear-Gaussian, one cannot use standard linear-quadratic-Gaussian (LQG) controllers [20] to solve our problem. Moreover, since the action and state spaces are large-dimensional and continuous, one cannot discretize the problem and use closed-loop control as suggested in [21]. That is, th...

687 | Predictive control: with constraints
- Maciejowski
- 2002
Citation Context: ...the planning horizon to recede. That is, as the robot moves, it keeps planning T steps ahead of its current position. This control framework is also known as receding-horizon model-predictive control [25]. In the following two subsections, we will describe a way of conducting the simulations to estimate the AMSE. The active policy update algorithm will be described in Section III. A. Simulation of the...
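The receding-horizon scheme quoted in this context can be sketched in a few lines. The `plan` and `step` functions below are hypothetical stand-ins for the paper's policy optimizer and simulator, not part of the cited work:

```python
def receding_horizon_control(x0, plan, step, n_iters=10, T=5):
    """Receding-horizon loop: from the current state, plan T steps
    ahead, execute only the first planned action, then replan, so the
    horizon 'recedes' as the robot moves. `plan(x, T)` returns a
    length-T action sequence; `step(x, a)` returns the next state.
    Both are problem-specific stand-ins."""
    x, trajectory = x0, [x0]
    for _ in range(n_iters):
        actions = plan(x, T)      # open-loop plan, T steps ahead
        x = step(x, actions[0])   # commit only to the first action
        trajectory.append(x)
    return trajectory

# Toy instance: drive a scalar state toward 0 with greedy unit steps.
sign = lambda v: (v > 0) - (v < 0)
plan = lambda x, T: [-sign(x)] * T
step = lambda x, a: x + 0.5 * a
traj = receding_horizon_control(3.0, plan, step)
```

The point of the pattern is that the full T-step plan is thrown away after one step, which keeps the controller reactive to new observations.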

481 | The Design and Analysis of Computer Experiments
- Santner, Williams, et al.
- 2003
Citation Context: ...at policies are likely to result in higher expected returns. The method effectively balances the goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous ...

456 | Efficient global optimization of expensive black-box functions
- Jones, Schonlau, et al.
- 1998
Citation Context: ...at policies are likely to result in higher expected returns. The method effectively balances the goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous ...

440 | Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Williams
- 1992
Citation Context: ...That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search method [23, 24]. In particular, the initial policy is set either randomly or using prior knowledge. Given this policy, we conduct simulations to estimate the AMSE. These simulations involve sampling states and obser...
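The simulation-based cost estimate described here can be illustrated with a hypothetical stand-in model: a scalar random-walk state, noisy observations, and an exponential-smoothing estimator whose gain plays the role of the policy parameter. None of these modelling choices come from the paper; they only show the Monte Carlo structure of the AMSE estimate:

```python
import numpy as np

def estimate_amse(policy_param, n_sims=500, T=20, rng=None):
    """Monte Carlo estimate of the average mean-squared error (AMSE)
    of a state estimator under a fixed policy: simulate many rollouts,
    sampling states and observations, and average the terminal squared
    estimation error. Model and estimator are illustrative stand-ins."""
    if rng is None:
        rng = np.random.default_rng(0)
    errors = []
    for _ in range(n_sims):
        x, x_hat = 0.0, 0.0
        for _ in range(T):
            x += rng.normal(0.0, 0.1)            # sample the next state
            y = x + rng.normal(0.0, 0.5)         # sample an observation
            x_hat += policy_param * (y - x_hat)  # update the estimate
        errors.append((x - x_hat) ** 2)
    return float(np.mean(errors))  # lower is better; drives policy search
```

A policy-search outer loop would call `estimate_amse` for candidate parameters and keep the one with the lowest estimated cost.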

411 | The Optimal Control of Partially Observable Markov Decision Processes over a Finite Horizon
- Smallwood, Sondik
- 1973
Citation Context: ...one cannot discretize the problem and use closed-loop control as suggested in [21]. That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search method [23, 24]. In particular, the initial policy is set either randomly or using prior knowledge. Given this policy, we condu...

352 | Lipschitzian optimization without the Lipschitz constant
- Jones, Perttunen, et al.
- 1993
Citation Context: ...merely that we can quickly locate a point that is likely to be as good as possible. To deal with this nonlinear constrained optimization problem, we adopted the DIvided RECTangles (DIRECT) algorithm [30, 31]. DIRECT is a deterministic, derivative-free sampling algorithm. It uses the existing samples of the objective function to decide how to proceed to divide the feasible space into finer rectangles. For...
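The rectangle-division idea can be sketched in one dimension. This is a simplified sketch, not the real algorithm: DIRECT proper refines all Pareto-optimal (size, value) intervals, while here we refine the best-valued interval within each size class:

```python
def direct_1d(f, lo, hi, n_rounds=10):
    """1-D sketch of the DIvided RECTangles idea: each interval is
    represented by a sample at its centre; every round, the interval
    with the best centre value in each size class is trisected and the
    two new centres are sampled. Deterministic and derivative-free."""
    intervals = [(lo, hi, f((lo + hi) / 2.0))]
    for _ in range(n_rounds):
        best_per_width = {}
        for iv in intervals:
            w = round(iv[1] - iv[0], 12)  # size class (rounded width)
            if w not in best_per_width or iv[2] < best_per_width[w][2]:
                best_per_width[w] = iv
        for a, b, fc in best_per_width.values():
            intervals.remove((a, b, fc))
            third = (b - a) / 3.0
            for l, r in ((a, a + third), (a + third, b - third), (b - third, b)):
                c = (l + r) / 2.0
                # the middle third inherits the already-sampled centre
                val = fc if abs(c - (a + b) / 2.0) < 1e-9 else f(c)
                intervals.append((l, r, val))
    a, b, fv = min(intervals, key=lambda iv: iv[2])
    return (a + b) / 2.0, fv
```

Refining one candidate per size class (rather than only the single best interval) is what keeps the search global: large intervals keep being subdivided even when a small interval currently holds the best sample.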

348 | Learning in Embedded Systems
- Kaelbling
- 1993
Citation Context: ...goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous spaces (infinite number of bandits) using locally weighted regression was proposed in [17...

256 | Pegasus: A policy search method for large mdps and pomdps
- Ng, Jordan
- 2000
Citation Context: ...That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search method [23, 24]. In particular, the initial policy is set either randomly or using prior knowledge. Given this policy, we conduct simulations to estimate the AMSE. These simulations involve sampling states and obser...

228 | A taxonomy of global optimization methods based on response surfaces
- Jones
- 2001
Citation Context: ...pply. We present an alternative approach to gradient-based optimization for continuous policy spaces. This approach, which we refer to as active policy learning, is based on experimental design ideas [27, 13, 28, 29]. Active policy learning is an any-time, “black-box” statistical optimization... [figure residue omitted: GP mean cost, GP variance, true cost, infill, and data points plotted against the policy parameter]

205 | Infinite-horizon policy-gradient estimation
- Baxter, Bartlett
- 2001
Citation Context: ...ed to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as exploration, constraints in the robot motion and control models make it hard, and often impossible, to compute derivatives of the cost function wi...

177 | Posterior Cramér-Rao bounds for discrete-time nonlinear filtering
- Tichavsky, Muravchik, et al.
- 1998
Citation Context: ...measurements and states are assumed random. It is defined as the inverse of the Fisher information matrix J and provides the following lower bound on the AMSE: C^π_AMSE ≥ C^π_PCRB = J^{−1}. Tichavský [35] derived the following Riccati-like recursion to compute the PCRB for any unbiased estimator: J_{t+1} = D_t − C_t′ (J_t + B_t)^{−1} C_t + A_{t+1}, (3) where A_{t+1} = E[−Δ_{x_{t+1},x_{t+1}} log p(y_{t+1}|x_{t+1})], B_t = E[−Δ_{x_t,x_t} ...
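The recursion in this snippet can be exercised on the linear-Gaussian special case, where the expectation terms A, B, C, D have closed forms. This specialization is an illustration only (the paper's setting is nonlinear, where the expectations must be approximated by sampling):

```python
import numpy as np

def pcrb_recursion(J0, F, Q, H, R, n_steps=10):
    """Posterior Cramér-Rao bound recursion
        J_{t+1} = D_t - C_t' (J_t + B_t)^{-1} C_t + A_{t+1}
    specialised to the linear-Gaussian model x_{t+1} = F x_t + w,
    y_t = H x_t + v, with w ~ N(0, Q) and v ~ N(0, R), for which the
    Fisher-information expectations are constant matrices."""
    Qi, Ri = np.linalg.inv(Q), np.linalg.inv(R)
    A = H.T @ Ri @ H   # E[-Δ_{x,x} log p(y|x)]
    B = F.T @ Qi @ F   # E[-Δ_{x,x} log p(x'|x)]
    C = -F.T @ Qi      # E[-Δ_{x,x'} log p(x'|x)]
    D = Qi
    J = J0
    for _ in range(n_steps):
        J = D - C.T @ np.linalg.inv(J + B) @ C + A
    return J  # J^{-1} lower-bounds the error covariance of any unbiased estimator
```

In this linear-Gaussian case the recursion reduces, via the matrix inversion lemma, to the Kalman information-filter update J_{t+1} = (Q + F J_t^{−1} F′)^{−1} + H′R^{−1}H, which provides an independent check.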

168 | Inverted autonomous helicopter flight via reinforcement learning
- Ng, Coates, et al.
- 2004
Citation Context: ...o replan a new path in the spirit of open-loop feedback control. I. INTRODUCTION The direct policy search method for reinforcement learning has led to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as expl...

145 | Policy gradient reinforcement learning for fast quadrupedal locomotion
- Kohl, Stone
Citation Context: ...o replan a new path in the spirit of open-loop feedback control. I. INTRODUCTION The direct policy search method for reinforcement learning has led to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as expl...

136 | Mobile Robot Localization and Map Building: A Multisensor Fusion Approach
- Castellanos, Tardós
- 2000
Citation Context: ...KF or particle filter) to compute the posterior mean state x̂^{(i)}_{1:T}. (In this paper, we adopt the EKF-SLAM algorithm to estimate the mean and covariance of this distribution. We refer the reader to [26] for implementation details.) The evaluation of the cost function is therefore extremely expensive. Moreover, since the model is nonlinear, it is hard to quantify the uncertainty introduced by the sub...

116 | Policy gradient methods for robotics
- Peters, Schaal
- 2006
Citation Context: ...o replan a new path in the spirit of open-loop feedback control. I. INTRODUCTION The direct policy search method for reinforcement learning has led to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as expl...

96 | Information gain-based exploration using RaoBlackwellized particle filters
- Stachniss, Grisetti, et al.
- 2005
Citation Context: ...We demonstrate the new approach on a hard robotics problem: planning and exploration under uncertainty. This problem plays a key role in simultaneous localization and mapping (SLAM), see for example [7, 8]. Mobile robots must maximize the size of the explored terrain, but, at the same time, they must ensure that localization errors are minimized. While exploration is needed to find new features, the ro...

75 | Kernel methods for missing variables
- Smola, Vishwanathan, et al.
- 2005
Citation Context: ...er motivating factor is that DIRECT’s implementation is easily available [32]. However, we conjecture that for large-dimensional spaces, sequential quadratic programming or concave-convex programming [33] might be better algorithm choices for infill optimization. A. Gaussian processes A Gaussian process, z(·) ∼ GP(m(·), K(·, ·)), is an infinite random process indexed by the vector θ, such that any re...
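The GP surrogate that active policy learning fits over policy parameters (a posterior mean cost plus a variance that drives infill) can be written out for a squared-exponential kernel. Kernel choice and hyperparameters here are illustrative assumptions, not the paper's:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, length=0.5, noise=1e-6):
    """Gaussian-process posterior mean and variance with a
    squared-exponential kernel, for 1-D inputs: the surrogate used to
    predict the cost of untried policy parameters together with an
    uncertainty estimate."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length) ** 2)

    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    Ks, Kss = k(X_train, X_test), k(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha                                   # posterior mean
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))    # posterior variance
    return mean, var
```

An infill criterion would then combine `mean` and `var` (e.g. mean minus a multiple of the standard deviation) and be minimized to pick the next policy to simulate.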

69 | Flexibility and Efficiency Enhancements for Constrained Global Design Optimization with Kriging Approximations
- Sasena
- 2002
Citation Context: ...pply. We present an alternative approach to gradient-based optimization for continuous policy spaces. This approach, which we refer to as active policy learning, is based on experimental design ideas [27, 13, 28, 29]. Active policy learning is an any-time, “black-box” statistical optimization... [figure residue omitted: GP mean cost, GP variance, true cost, infill, and data points plotted against the policy parameter]

54 | Stochastic Optimization
- Schneider, Kirkpatrick
- 2006
Citation Context: ...16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous spaces (infinite number of bandits) using locally weighted regression was proposed in [17]. Our paper presents richer criteria for active learning as well as suitable optimization objectives. This paper also presents posterior Cramér-Rao bounds to approximate the cost function in robot explor...

47 | Modifications of the DIRECT Algorithm
- Gablonsky
- 2001
Citation Context: ...merely that we can quickly locate a point that is likely to be as good as possible. To deal with this nonlinear constrained optimization problem, we adopted the DIvided RECTangles (DIRECT) algorithm [30, 31]. DIRECT is a deterministic, derivative-free sampling algorithm. It uses the existing samples of the objective function to decide how to proceed to divide the feasible space into finer rectangles. For...

41 | Global A-optimal robot exploration
- Sim, Roy
- 2005
Citation Context: ...We demonstrate the new approach on a hard robotics problem: planning and exploration under uncertainty. This problem plays a key role in simultaneous localization and mapping (SLAM), see for example [7, 8]. Mobile robots must maximize the size of the explored terrain, but, at the same time, they must ensure that localization errors are minimized. While exploration is needed to find new features, the ro...

39 | Multisensor resource deployment using posterior Cramér-Rao bounds
- Hernandez, Kirubarajan, et al.
- 2004
Citation Context: ...ocused on robot exploration and planning, our policy search framework extends naturally to other domains. Related problems appear in the fields of terrain-aided navigation [18, 9] and dynamic sensor nets [19, 6]. II. APPLICATION TO ROBOT EXPLORATION AND PLANNING Although the algorithm proposed in this paper applies to many sequential decision making settings, we will restrict attention to the robot explorati...

32 | DIRECT Optimization Algorithm User Guide
- Finkel
- 2003
Citation Context: ..., DIRECT provides a better solution than gradient approaches because the infill function tends to have many local optima. Another motivating factor is that DIRECT’s implementation is easily available [32]. However, we conjecture that for large-dimensional spaces, sequential quadratic programming or concave-convex programming [33] might be better algorithm choices for infill optimization. A. Gaussian p...

28 | A new method of locating the maximum of an arbitrary multi-peak curve in the presence of noise
- Kushner
- 1964
Citation Context: ...pply. We present an alternative approach to gradient-based optimization for continuous policy spaces. This approach, which we refer to as active policy learning, is based on experimental design ideas [27, 13, 28, 29]. Active policy learning is an any-time, “black-box” statistical optimization... [figure residue omitted: GP mean cost, GP variance, true cost, infill, and data points plotted against the policy parameter]

18 | Efficient gradient estimation for motor control learning
- Lawrence, Cowan, et al.
- 2003

16 | Optimal sensor trajectories in bearings-only tracking
- Hernandez
Citation Context: ...ns. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not rely on any of these assumptions. Our direct policy solution uses an any-time probabilistic active learning algorithm to predict what policies are likely t...

14 | Optimal estimation and Cramér-Rao bounds for partial non-Gaussian state space models
- Bergman, Doucet, et al.
- 2001
Citation Context: ...case in our setting and hence a potential source of error. An alternative PCRB approximation method that overcomes this shortcoming, in the context of jump Markov linear (JML) models, was proposed by [36]. We try both approximations in our experiments and refer to them as NL-PCRB and JML-PCRB respectively. The AMSE simulation approach of Section II-A using the EKF requires that we perform an expensive ...

13 | Simulation-based Optimal Sensor Scheduling with Application to Observer Trajectory Planning
- Singh, Kantas, et al.
- 2007
Citation Context: ...ed to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as exploration, constraints in the robot motion and control models make it hard, and often impossible, to compute derivatives of the cost function wi...

13 | Using reinforcement learning to improve exploration trajectories for error minimization
- Kollar, Roy
- 2006
Citation Context: ...ns. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not rely on any of these assumptions. Our direct policy solution uses an any-time probabilistic active learning algorithm to predict what policies are likely t...

13 | Optimal observer trajectory in bearings-only tracking for maneuvering sources
- Tremois, Le Cadre
- 1999
Citation Context: ...controllers [20] to solve our problem. Moreover, since the action and state spaces are large-dimensional and continuous, one cannot discretize the problem and use closed-loop control as suggested in [21]. That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search metho...

8 | Fast parameter optimization of large-scale electromagnetic objects using DIRECT with kriging metamodeling
- Siah, Sasena, et al.
Citation Context: ...at policies are likely to result in higher expected returns. The method effectively balances the goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous ...

4 | On the Cramer-Rao bound for terrain-aided navigation
- Bergman
- 1997
Citation Context: ...on. Although the discussion is focused on robot exploration and planning, our policy search framework extends naturally to other domains. Related problems appear in the fields of terrain-aided navigation [18, 9] and dynamic sensor nets [19, 6]. II. APPLICATION TO ROBOT EXPLORATION AND PLANNING Although the algorithm proposed in this paper applies to many sequential decision making settings, we will restrict ...

3 | Planification for terrain-aided navigation
- Paris, Le Cadre
- 2002
Citation Context: ...ution). Even a toy problem requires enormous computational effort. As a result, it is not surprising that most existing approaches relax the constraints. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not r...

3 | Trajectory planning for multiple robots in bearing-only target localisation
- Leung, Huang, et al.
- 2005
Citation Context: ...s computational effort. As a result, it is not surprising that most existing approaches relax the constraints. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not rely on any of these assumptions. Our dire...