PILCO: A Model-Based and Data-Efficient Approach to Policy Search (2011)

by M. Deisenroth, C. E. Rasmussen
Venue: In ICML

Results 1 - 10 of 84

Reinforcement Learning in Robotics: A Survey

by Jens Kober, J. Andrew Bagnell, Jan Peters
"... Reinforcement learning offers to robotics a framework and set oftoolsfor the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide both inspiration, impact, and validation for developments in reinforcement learning. The relationship between di ..."
Abstract - Cited by 39 (2 self) - Add to MetaCart
Reinforcement learning offers to robotics a framework and set oftoolsfor the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide both inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between modelbased and model-free as well as between value function-based and policy search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and

Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning

by Marc Peter Deisenroth, Carl Edward Rasmussen, Dieter Fox
"... Abstract—Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems th ..."
Abstract - Cited by 31 (13 self) - Add to MetaCart
Abstract—Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems that can interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful of trials—from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the work space of a Kinect-style depth camera, we use a model-based reinforcement learning technique. Our learning method is data efficient, reduces model bias, and deals with several noise sources in a principled way during long-term planning. We present a way of incorporating state-space constraints into the learning process and analyze the learning gain by exploiting the sequential structure of the stacking task. I.
(Show Context)

Citation Context

... (8), (9), (10), (11). This also involves computing the partial derivatives ∂µ_u/∂ψ and ∂Σ_u/∂ψ. We omit further lengthy details here, but point out that these derivatives are computed analytically [6, 7]. This allows for standard gradient-based non-convex optimization methods, e.g., CG or L-BFGS, which return an optimized parameter vector ψ*. D. Planning with State-Space Constraints In a classical ...
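
The optimization step this context describes can be sketched compactly: given a routine that returns the expected long-term cost and its analytic gradient with respect to the policy parameters ψ, a standard optimizer such as L-BFGS returns ψ*. In the minimal sketch below, the objective is a placeholder quadratic; PILCO's actual moment-matching rollout and its gradient are not reproduced here.

```python
# Minimal sketch: optimizing policy parameters psi with L-BFGS.
# `expected_cost_and_grad` is a hypothetical stand-in for PILCO's
# analytic long-term cost and gradient; here it is a toy quadratic.
import numpy as np
from scipy.optimize import minimize

def expected_cost_and_grad(psi):
    # Placeholder objective: squared distance of psi to a target vector.
    target = np.ones_like(psi)
    diff = psi - target
    cost = 0.5 * np.dot(diff, diff)
    grad = diff                      # analytic gradient, as in PILCO
    return cost, grad

psi0 = np.zeros(10)                  # initial policy parameters
res = minimize(expected_cost_and_grad, psi0, jac=True, method="L-BFGS-B")
psi_star = res.x                     # optimized parameter vector psi*
```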

Hierarchical relative entropy policy search

by Christian Daniel, Gerhard Neumann, Jan Peters - In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2012
"... Many real-world problems are inherently hi-erarchically structured. The use of this struc-ture in an agent’s policy may well be the key to improved scalability and higher per-formance. However, such hierarchical struc-tures cannot be exploited by current policy search algorithms. We will concentrate ..."
Abstract - Cited by 25 (9 self) - Add to MetaCart
Many real-world problems are inherently hi-erarchically structured. The use of this struc-ture in an agent’s policy may well be the key to improved scalability and higher per-formance. However, such hierarchical struc-tures cannot be exploited by current policy search algorithms. We will concentrate on a basic, but highly relevant hierarchy — the ‘mixed option ’ policy. Here, a gating network first decides which of the options to execute and, subsequently, the option-policy deter-mines the action. In this paper, we reformulate learning a hi-erarchical policy as a latent variable estima-tion problem and subsequently extend the Relative Entropy Policy Search (REPS) to the latent variable case. We show that our Hierarchical REPS can learn versatile solu-tions while also showing an increased perfor-mance in terms of learning speed and quality of the found policy in comparison to the non-hierarchical approach. 1
(Show Context)
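
As a concrete illustration of the ‘mixed option’ structure described above, the following sketch (a toy construction, not the paper's implementation) draws an option from a softmax gating network and then an action from the selected option's linear-Gaussian policy:

```python
# Toy sketch of a 'mixed option' policy: a gating network picks an
# option, then that option's (linear-Gaussian) policy picks the action.
# All shapes and parameters are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_options, state_dim, action_dim = 3, 4, 2

W_gate = rng.normal(size=(n_options, state_dim))         # gating weights
K = rng.normal(size=(n_options, action_dim, state_dim))  # option gains
sigma = 0.1                                              # action noise

def sample_action(s):
    logits = W_gate @ s
    p = np.exp(logits - logits.max())
    p /= p.sum()                                 # softmax gate p(o|s)
    o = rng.choice(n_options, p=p)               # gating network decides
    mean = K[o] @ s                              # option-policy mean
    return o, mean + sigma * rng.normal(size=action_dim)

option, action = sample_action(rng.normal(size=state_dim))
```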

Citation Context

...the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors. probabilistic modeling [11]. These policy search methods have been particularly successful in the domain of robot movement generation [12, 13, 14]. However, current methods cannot exploit the hierarchical structure inherent to ...

Data-Efficient Generalization of Robot Skills with Contextual Policy Search

by Andras Gabor Kupcsik, Marc Peter Deisenroth, Jan Peters, Gerhard Neumann
"... In robotics, controllers make the robot solve a task within a specific context. The context can describe the objectives of the robot or physical properties of the environment and is always specified before task execution. To generalize the controller to multiple contexts, we follow a hierarchical ap ..."
Abstract - Cited by 13 (2 self) - Add to MetaCart
In robotics, controllers make the robot solve a task within a specific context. The context can describe the objectives of the robot or physical properties of the environment and is always specified before task execution. To generalize the controller to multiple contexts, we follow a hierarchical approach for policy learning: A lower-level policy controls the robot for a given context and an upper-level policy generalizes among contexts. Current approaches for learning such upper-level policies are based on model-free policy search, which require an excessive number of interactions of the robot with its environment. More data-efficient policy search approaches are model based but, thus far, without the capability of learning hierarchical policies. We propose a new model-based policy search approach that can also learn contextual upper-level policies. Our approach is based on learning probabilistic forward models for long-term predictions. Using these predictions, we use information-theoretic insights to improve the upper-level policy. Our method achieves a substantial improvement in learning speed compared to existing methods on simulated and real robotic tasks.
(Show Context)
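
A minimal sketch of the hierarchy described above (illustrative only; the paper's actual policy update is information-theoretic and model-based): the upper-level policy maps a context c to parameters ω of a lower-level controller, e.g., via a context-conditioned Gaussian.

```python
# Toy sketch of a contextual hierarchy: an upper-level policy
# pi(omega | c) outputs lower-level controller parameters omega for a
# given context c; the lower-level policy then maps states to actions
# using omega. Shapes and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
context_dim, omega_dim, state_dim = 2, 3, 3

M = rng.normal(size=(omega_dim, context_dim))  # upper-level mean map
Sigma = 0.05 * np.eye(omega_dim)               # upper-level covariance

def upper_level(c):
    # Sample controller parameters for this context.
    return rng.multivariate_normal(M @ c, Sigma)

def lower_level(omega, s):
    # Trivial lower-level policy: scalar action from weighted state.
    return np.dot(omega, s)

c = np.array([0.5, -1.0])                      # context, fixed per episode
omega = upper_level(c)
action = lower_level(omega, rng.normal(size=state_dim))
```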

Citation Context

...controls u_t for the robot. Policy search methods are divided into model-free (Williams 1992; Peters and Schaal 2008; Kober and Peters 2010) and model-based (Bagnell and Schneider 2001; Abbeel et al. 2006; Deisenroth and Rasmussen 2011) approaches. Model-free policy search methods update the policy by executing roll-outs on the real system and collecting the corresponding rewards. Model-free policy search does not need an elaborate ...

Variational policy search via trajectory optimization

by Sergey Levine, Vladlen Koltun, 2013
"... In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when co ..."
Abstract - Cited by 12 (6 self) - Add to MetaCart
In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strat-egy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization al-gorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algo-rithm can outperform prior methods on two challenging locomotion tasks.
(Show Context)
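
A loose toy sketch of the alternation the abstract describes (a 1-D construction of our own; the paper uses differential dynamic programming within a variational framework): optimize each trajectory step in closed form to trade off task cost against deviation from the current policy, then refit the policy by supervised regression.

```python
# Toy alternation of trajectory optimization and supervised policy
# fitting on the scalar system x' = x + u. Illustrative only.
import numpy as np

T, lam, target = 20, 1.0, 1.0
theta = 0.0                          # linear policy: u = theta * x
xs, us = np.zeros(T), np.zeros(T)

for iteration in range(50):
    # (1) Trajectory optimization: per-step closed-form minimizer of
    #     (x + u - target)^2 + lam * (u - theta*x)^2 over u, trading
    #     task cost against staying close to the current policy.
    x = 0.5
    for t in range(T):
        us[t] = ((target - x) + lam * theta * x) / (1.0 + lam)
        xs[t] = x
        x = x + us[t]
    # (2) Supervised learning: least-squares refit of the policy
    #     to the optimized trajectory (regress u on x).
    theta = xs @ us / (xs @ xs + 1e-8)
```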

Citation Context

... to apply model-free trajectory optimization techniques [7], which would avoid the need for a model of the system dynamics, or to learn the dynamics from data, for example by using Gaussian processes [2]. It would also be straightforward to use multiple trajectories optimized from different initial states to learn a single policy that is able to succeed under a variety of initial conditions. Overall, ...
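
Since reference [2] here is PILCO itself, a one-step GP dynamics model is the relevant ingredient. Below is a minimal GP regression sketch for predicting the next state from (state, action) inputs, using a squared-exponential kernel with hand-set hyperparameters; PILCO optimizes them by evidence maximization, which is omitted here, and the training data below is synthetic.

```python
# Minimal GP one-step dynamics sketch: predict x_{t+1} from (x_t, u_t)
# with a squared-exponential kernel. Hyperparameters are hand-set here;
# PILCO would learn them by maximizing the marginal likelihood.
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                      # inputs (x_t, u_t)
y = np.sin(X[:, 0]) + 0.01 * rng.normal(size=50)  # next-state targets

noise_var = 0.01
K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
alpha = np.linalg.solve(K, y)

def predict(x_star):
    k_star = sq_exp_kernel(x_star[None, :], X)[0]
    mean = k_star @ alpha                         # GP posterior mean
    var = (sq_exp_kernel(x_star[None, :], x_star[None, :])[0, 0]
           - k_star @ np.linalg.solve(K, k_star))  # posterior variance
    return mean, var

mu, var = predict(np.zeros(3))
```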

RTMBA: A Real-Time Model-Based Reinforcement Learning Architecture for Robot Control

by Todd Hester, Michael Quinlan, Peter Stone
"... Abstract—Reinforcement Learning (RL) is a paradigm for learning decision-making tasks that could enable robots to learn and adapt to their situation on-line. For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-tim ..."
Abstract - Cited by 11 (3 self) - Add to MetaCart
Abstract—Reinforcement Learning (RL) is a paradigm for learning decision-making tasks that could enable robots to learn and adapt to their situation on-line. For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-time. Existing model-based RL methods learn in relatively few samples, but typically take too much time between each action for practical on-line learning. In this paper, we present a novel parallel architecture for model-based RL that runs in real-time by 1) taking advantage of sample-based approximate planning methods and 2) parallelizing the acting, model learning, and planning processes in a novel way such that the acting process is sufficiently fast for typical robot control cycles. We demonstrate thatalgorithmsusingthisarchitectureperformnearlyaswellas methods using the typical sequential architecture when both are given unlimited time, and greatly out-perform these methods on tasks that require real-time actions such as controlling an autonomous vehicle. I.
(Show Context)
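
A skeletal illustration (with invented names and placeholder data structures, not the paper's implementation) of the parallel decomposition the abstract describes: acting, model learning, and planning each run in their own thread, sharing the latest model and policy through thread-safe state so the actor never blocks on learning or planning.

```python
# Skeletal sketch of the parallel decomposition: separate acting,
# model-learning, and planning threads. All names are invented for
# illustration; the real RTMBA architecture is more involved.
import threading, queue, time

experience = queue.Queue()           # transitions: actor -> learner
lock = threading.Lock()
model, policy = {}, {}               # shared placeholder structures

def acting_thread():
    for step in range(100):          # fixed control cycle
        with lock:
            action = policy.get("best", 0)       # act on latest plan
        experience.put(("state", action, "next_state", 0.0))
        time.sleep(0.01)             # e.g., 100 Hz control loop

def model_learning_thread():
    while True:
        transition = experience.get()
        if transition is None:
            break
        with lock:
            model["n"] = model.get("n", 0) + 1   # update model

def planning_thread(stop):
    while not stop.is_set():
        with lock:
            policy["best"] = model.get("n", 0) % 3  # replan from model
        time.sleep(0.05)             # planning runs at its own rate

stop = threading.Event()
threads = [threading.Thread(target=acting_thread),
           threading.Thread(target=model_learning_thread),
           threading.Thread(target=planning_thread, args=(stop,))]
for t in threads[1:]: t.start()
threads[0].start(); threads[0].join()
experience.put(None); stop.set()
for t in threads[1:]: t.join()
```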

Citation Context

...ning off-line, but for on-line robot control, actions must be given at a fixed, fast frequency. Some model-based methods that do take actions at this fast frequency have been applied to robots in the past [3], [5], but they perform learning off-line during pauses where they stop controlling the robot entirely. DYNA [6], which does run in real-time, uses a simplistic model and is not very sample efficient. Mode...

Learning neural network policies with guided policy search under unknown dynamics

by Sergey Levine, Pieter Abbeel - In Advances in Neural Information Processing Systems, 2014
"... We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These tra-jectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our ..."
Abstract - Cited by 7 (3 self) - Add to MetaCart
We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These tra-jectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our method fits time-varying linear dynamics models to speed up learning, but does not rely on learning a global model, which can be difficult when the dynamics are complex and discontinuous. We show that this hybrid approach requires many fewer samples than model-free methods, and can handle complex, nonsmooth dynamics that can pose a challenge for model-based techniques. We present experiments showing that our method can be used to learn complex neural network policies that successfully execute simulated robotic manipulation tasks in partially observed environments with nu-merous contact discontinuities and underactuation. 1
(Show Context)
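
To make the "time-varying linear dynamics" concrete, here is a bare-bones sketch (synthetic data, illustrative shapes; the paper's estimator adds priors and Gaussian noise terms) that fits, for each time step t, a model x_{t+1} ≈ A_t x_t + B_t u_t + c_t by least squares over sampled trajectories:

```python
# Bare-bones fit of time-varying linear dynamics from N sampled
# trajectories: for each t, solve x_{t+1} ~ A_t x_t + B_t u_t + c_t
# by least squares. Data here is random; in practice it comes from
# rollouts on the real system.
import numpy as np

rng = np.random.default_rng(3)
N, T, dx, du = 30, 10, 4, 2
X = rng.normal(size=(N, T + 1, dx))          # states
U = rng.normal(size=(N, T, du))              # actions

A, B, c = [], [], []
for t in range(T):
    # Regressors: [x_t, u_t, 1]; targets: x_{t+1}.
    Z = np.concatenate([X[:, t], U[:, t], np.ones((N, 1))], axis=1)
    W, *_ = np.linalg.lstsq(Z, X[:, t + 1], rcond=None)
    A.append(W[:dx].T)                       # (dx, dx)
    B.append(W[dx:dx + du].T)                # (dx, du)
    c.append(W[-1])                          # (dx,)
```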

Citation Context

...dynamics, which can be very difficult for complex systems, especially when the algorithm imposes restrictions on the dynamics representation to make the policy search efficient and numerically stable [5]. In this paper, we present a hybrid method that fits local, time-varying linear dynamics models, which are not accurate enough for standard model-based policy search. However, we can use these local ...

Learning Complex Neural Network Policies with Trajectory Optimization

by Sergey Levine, Vladlen Koltun
"... Direct policy search methods offer the promise of automatically learning controllers for com-plex, high-dimensional tasks. However, prior ap-plications of policy search often required spe-cialized, low-dimensional policy classes, limit-ing their generality. In this work, we introduce a policy search ..."
Abstract - Cited by 7 (3 self) - Add to MetaCart
Direct policy search methods offer the promise of automatically learning controllers for com-plex, high-dimensional tasks. However, prior ap-plications of policy search often required spe-cialized, low-dimensional policy classes, limit-ing their generality. In this work, we introduce a policy search algorithm that can directly learn high-dimensional, general-purpose policies, rep-resented by neural networks. We formulate the policy search problem as an optimization over trajectory distributions, alternating between opti-mizing the policy to match the trajectories, and optimizing the trajectories to match the policy and minimize expected cost. Our method can learn policies for complex tasks such as bipedal push recovery and walking on uneven terrain, while outperforming prior methods. 1.
(Show Context)

Citation Context

...optimization techniques best suited for policy search is a promising direction for future work. While we assume a known model of the dynamics, prior work has proposed learning the dynamics from data (Deisenroth & Rasmussen, 2011; Ross & Bagnell, 2012; Deisenroth et al., 2013), and using our method with learned models could allow for wider applications in the future. Our method also has several limitations that could be addre...

Deepmpc: Learning deep latent features for model predictive control

by Ian Lenz, Ashutosh Saxena - In RSS, 2015
"... Abstract—Designing controllers for tasks with complex non-linear dynamics is extremely challenging, time-consuming, and in many cases, infeasible. This difficulty is exacerbated in tasks such as robotic food-cutting, in which dynamics might vary both with environmental properties, such as material a ..."
Abstract - Cited by 4 (2 self) - Add to MetaCart
Abstract—Designing controllers for tasks with complex non-linear dynamics is extremely challenging, time-consuming, and in many cases, infeasible. This difficulty is exacerbated in tasks such as robotic food-cutting, in which dynamics might vary both with environmental properties, such as material and tool class, and with time while acting. In this work, we present DeepMPC, an online real-time model-predictive control approach designed to handle such difficult tasks. Rather than hand-design a dynamics model for the task, our approach uses a novel deep architecture and learning algorithm, learning controllers for complex tasks directly from data. We validate our method in experiments on a large-scale dataset of 1488 material cuts for 20 diverse classes, and in 450 real-world robotic experiments, demonstrating significant improvement over several other approaches. I.
(Show Context)
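
As a minimal illustration of model-predictive control with a learned model (a toy setup of our own; DeepMPC uses a deep recurrent dynamics model and a faster gradient-based online optimizer), the sketch below performs random-shooting MPC: sample candidate action sequences, roll them through the learned model, and execute the first action of the best sequence in a receding-horizon loop.

```python
# Toy random-shooting MPC with a stand-in "learned" dynamics model.
# Illustrative only; DeepMPC learns a deep model from data and
# optimizes actions with gradients in real time.
import numpy as np

rng = np.random.default_rng(4)
H, n_candidates = 10, 256            # horizon, candidate sequences
target = 1.0

def learned_model(x, u):
    # Stand-in for a model learned from data: x' = 0.9 x + 0.1 u.
    return 0.9 * x + 0.1 * u

def mpc_action(x0):
    U = rng.uniform(-1, 1, size=(n_candidates, H))
    costs = np.zeros(n_candidates)
    for i in range(n_candidates):
        x = x0
        for t in range(H):
            x = learned_model(x, U[i, t])
            costs[i] += (x - target) ** 2    # track the target
    return U[np.argmin(costs), 0]            # best first action

x = 0.0
for step in range(50):                       # receding-horizon loop
    x = learned_model(x, mpc_action(x))
```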

Citation Context

...re data and re-learn a new policy in an online learning approach. Levine and Abbeel [25] use a Gaussian mixture model (GMM) where linear models are fit to each cluster, while Deisenroth and Rasmussen [10] use a Gaussian process (GP). Experimentally, both these models gave less accurate predictions than ours for robotic food-cutting. The GP also had very long inference times (roughly 10^6 times longer t...

Sample-Based Information-Theoretic Stochastic Optimal Control

by Rudolf Lioutikov, Alexandros Paraschos, Jan Peters, Gerhard Neumann
"... proaches rely on samples to either obtain an estimate of the value function or a linearisation of the underlying system model. However, these approaches typically neglect the fact that the accuracy of the policy update depends on the closeness of the resulting trajectory distribution to these sample ..."
Abstract - Cited by 4 (2 self) - Add to MetaCart
proaches rely on samples to either obtain an estimate of the value function or a linearisation of the underlying system model. However, these approaches typically neglect the fact that the accuracy of the policy update depends on the closeness of the resulting trajectory distribution to these samples. The greedy operator does not consider such closeness constraint to the samples. Hence, the greedy operator can lead to oscillations or even instabilities in the policy updates. Such undesired behaviour is likely to result in an inferior performance of the estimated policy. We reuse inspiration from the reinforcement learning community and relax the greedy operator used in SOC with an information theoretic bound that limits the ‘distance ’ of two subsequent trajectory distributions in a policy update. The introduced bound ensures a smooth and stable policy update. Our method is also well suited for model-based reinforcement learning, where we estimate the system dynamics model from data. As this model is likely to be inaccurate, it might be dangerous to exploit the model greedily. Instead, our bound ensures that we generate new data in the vicinity of the current data, such that we can improve our estimate of the system dynamics model. We show that our approach outperforms several state of the art approaches on challenging simulated robot control tasks.
(Show Context)
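
The information-theoretic bound the abstract refers to is a REPS-style KL constraint. A compact sketch (toy returns, the standard REPS dual, not the paper's full stochastic-optimal-control derivation): solve the dual for the temperature η under a KL bound ε, then reweight samples by exp(R/η) for the policy update.

```python
# REPS-style KL-bounded reweighting sketch: given sample returns R,
# find the temperature eta for KL bound epsilon by minimizing the
# standard REPS dual, then compute sample weights exp(R/eta).
# Toy data; the paper embeds this idea in stochastic optimal control.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
R = rng.normal(size=100)             # returns of sampled trajectories
epsilon = 0.5                        # KL bound between updates

def dual(eta):
    eta = max(eta, 1e-6)
    # g(eta) = eta*eps + eta*log mean exp(R/eta), max-shifted for
    # numerical stability.
    shifted = (R - R.max()) / eta
    return eta * epsilon + eta * np.log(np.mean(np.exp(shifted))) + R.max()

res = minimize(lambda e: dual(e[0]), x0=[1.0], bounds=[(1e-6, None)])
eta = res.x[0]
w = np.exp((R - R.max()) / eta)      # weights for the policy update
w /= w.sum()
```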

Citation Context

... (Figure caption: For the real robot experiments we used a tendon-driven arm called BioRob. Its biologically inspired architecture makes controlling it a very challenging task.) ... for model-based reinforcement learning [7], [8], where we simultaneously estimate the system model from data. In this case, our bound provides us with the additional benefit that we avoid exploiting the possibly inaccurate model greedily. Instead, ...
