Results 1–10 of 15
Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 2014.
"... We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These trajectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
(Show Context)
We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These trajectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our method fits time-varying linear dynamics models to speed up learning, but does not rely on learning a global model, which can be difficult when the dynamics are complex and discontinuous. We show that this hybrid approach requires many fewer samples than model-free methods, and can handle complex, non-smooth dynamics that can pose a challenge for model-based techniques. We present experiments showing that our method can be used to learn complex neural network policies that successfully execute simulated robotic manipulation tasks in partially observed environments with numerous contact discontinuities and underactuation.
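The time-varying linear dynamics fit described in this abstract amounts to a separate least-squares regression per time step across sampled trajectories. The sketch below is a minimal illustration under that reading; the function name, array layout, and the small ridge term are my own assumptions, not the paper's code:

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next, reg=1e-6):
    """Fit x_{t+1} ~ A_t x_t + B_t u_t + c_t independently per time step.

    X, U, X_next: arrays of shape (N, T, dx), (N, T, du), (N, T, dx)
    holding N sampled trajectories of length T.
    Returns a list of (A_t, B_t, c_t) tuples for t = 0..T-1.
    """
    N, T, dx = X.shape
    du = U.shape[2]
    models = []
    for t in range(T):
        # Design matrix: [x_t, u_t, 1] for every sample at this time step.
        Z = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])
        # Ridge-regularized least squares: W maps [x; u; 1] -> x_{t+1}.
        W = np.linalg.solve(Z.T @ Z + reg * np.eye(dx + du + 1),
                            Z.T @ X_next[:, t])
        A, B, c = W[:dx].T, W[dx:dx + du].T, W[-1]
        models.append((A, B, c))
    return models
```

Fitting per time step is what makes the model "time-varying": no single global dynamics model has to explain contact discontinuities across the whole trajectory.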
Learning Complex Neural Network Policies with Trajectory Optimization
"... Direct policy search methods offer the promise of automatically learning controllers for complex, highdimensional tasks. However, prior applications of policy search often required specialized, lowdimensional policy classes, limiting their generality. In this work, we introduce a policy search ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
(Show Context)
Direct policy search methods offer the promise of automatically learning controllers for complex, high-dimensional tasks. However, prior applications of policy search often required specialized, low-dimensional policy classes, limiting their generality. In this work, we introduce a policy search algorithm that can directly learn high-dimensional, general-purpose policies, represented by neural networks. We formulate the policy search problem as an optimization over trajectory distributions, alternating between optimizing the policy to match the trajectories, and optimizing the trajectories to match the policy and minimize expected cost. Our method can learn policies for complex tasks such as bipedal push recovery and walking on uneven terrain, while outperforming prior methods.
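The alternation this abstract describes — fit the policy to the trajectories, then re-optimize the trajectories against the policy and the cost — has the flavor of block-coordinate descent. A toy sketch under that interpretation (the objective, the numerical gradients, and all names here are stand-ins, not the paper's method):

```python
import numpy as np

def alternating_optimization(cost, steps=200, lr=0.1):
    """Toy block-coordinate descent: alternately update a 'trajectory'
    variable tau and a 'policy' variable theta against a shared cost.

    cost(tau, theta) is any smooth scalar objective; gradients are
    taken numerically here purely for brevity.
    """
    tau, theta = np.zeros(2), np.zeros(2)

    def num_grad(f, x, eps=1e-5):
        g = np.zeros_like(x)
        for i in range(len(x)):
            d = np.zeros_like(x)
            d[i] = eps
            g[i] = (f(x + d) - f(x - d)) / (2 * eps)
        return g

    for _ in range(steps):
        # Trajectory step: lower the cost given the current policy.
        tau -= lr * num_grad(lambda t: cost(t, theta), tau)
        # Policy step: pull the policy toward the current trajectories.
        theta -= lr * num_grad(lambda th: cost(tau, th), theta)
    return tau, theta

# Stand-in objective: expected cost plus a policy/trajectory agreement term.
quadratic = lambda tau, theta: np.sum((tau - 1.0) ** 2) + np.sum((tau - theta) ** 2)
```

The agreement term plays the role of the "match the policy" constraint: at the fixed point the trajectory distribution and the policy coincide.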
Towards learning hierarchical skills for multi-phase manipulation tasks. In International Conference on Robotics and Automation (ICRA), 2015.
"... Abstract—Most manipulation tasks can be decomposed into a sequence of phases, where the robot’s actions have different effects in each phase. The robot can perform actions to transition between phases and, thus, alter the effects of its actions, e.g. grasp an object in order to then lift it. The rob ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
(Show Context)
Most manipulation tasks can be decomposed into a sequence of phases, where the robot’s actions have different effects in each phase. The robot can perform actions to transition between phases and, thus, alter the effects of its actions, e.g. grasp an object in order to then lift it. The robot can thus reach a phase that affords the desired manipulation. In this paper, we present an approach for exploiting the phase structure of tasks in order to learn manipulation skills more efficiently. Starting with human demonstrations, the robot learns a probabilistic model of the phases and the phase transitions. The robot then employs model-based reinforcement learning to create a library of motor primitives for transitioning between phases. The learned motor primitives generalize to new situations and tasks. Given this library, the robot uses a value function approach to learn a high-level policy for sequencing the motor primitives. The proposed method was successfully evaluated on a real robot performing a bimanual grasping task.
Trust region policy optimization. In ICML, 2015.
"... In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoreticallyjustified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for o ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
(Show Context)
In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
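The trust-region step at the heart of TRPO can be stated as a KL-constrained surrogate maximization; the notation below is mine, but the form is the standard published objective:

```latex
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\!\left[
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}
  \, A^{\pi_{\theta_{\text{old}}}}(s, a)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[
  D_{\mathrm{KL}}\!\big(
    \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
  \big)
\right] \le \delta
```

Here $A^{\pi_{\theta_{\text{old}}}}$ is the advantage under the old policy and $\delta$ is the trust-region radius; bounding the average KL divergence is what yields the (approximately) monotonic improvement the abstract mentions.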
A Physics-Based Model Prior for Object-Oriented MDPs
"... One of the key challenges in using reinforcement learning in robotics is the need for models that capture natural world structure. There are methods that formalize multiobject dynamics using relational representations, but these methods are not sufficiently compact for realworld robotics. We pres ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
(Show Context)
One of the key challenges in using reinforcement learning in robotics is the need for models that capture natural world structure. There are methods that formalize multi-object dynamics using relational representations, but these methods are not sufficiently compact for real-world robotics. We present a physics-based approach that exploits modern simulation tools to efficiently parameterize physical dynamics. Our results show that this representation can result in much faster learning, by virtue of its strong but appropriate inductive bias in physical environments.
A Friction-model-based Framework for Reinforcement Learning of Robotic Tasks in Non-rigid Environments. In ICRA, 2015.
"... Abstract—Learning motion tasks in a real environment with deformable objects requires not only a Reinforcement Learning (RL) algorithm, but also a good motion characterization, a preferably compliant robot controller, and an agent giving feedback for the rewards/costs in the RL algorithm. In this pa ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
(Show Context)
Learning motion tasks in a real environment with deformable objects requires not only a Reinforcement Learning (RL) algorithm, but also a good motion characterization, a preferably compliant robot controller, and an agent giving feedback for the rewards/costs in the RL algorithm. In this paper, we unify all these parts in a simple but effective way to properly learn safety-critical robotic tasks such as wrapping a scarf around the neck (so far, of a mannequin). We found that a suitable compliant controller ought to have a good Inverse Dynamic Model (IDM) of the robot. However, most approaches to build such a model do not consider the possibility of having hysteresis in the friction, which is the case for robots such as the Barrett WAM. For this reason, in order to improve the available IDM, we derived an analytical model of friction in the seven robot joints, whose parameters can be automatically tuned for each particular robot. This permits compliantly tracking diverse trajectories in the whole workspace. By using such a friction-aware controller, Dynamic Movement Primitives (DMP) as motion characterization, and visual/force feedback within the RL algorithm, experimental results demonstrate that the robot is consistently capable of learning tasks that could not be learned otherwise.
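As an illustration of the kind of analytical joint-friction model with auto-tuned parameters the abstract refers to, here is a minimal Coulomb-plus-viscous sketch. The paper's actual model also handles hysteresis; the form, function names, and parameters below are my own simplification:

```python
import numpy as np

def friction_torque(qdot, coulomb, viscous, eps=1e-3):
    """Per-joint friction torque: a Coulomb term (sign of velocity,
    smoothed near zero via tanh to avoid chattering) plus a viscous term.

    qdot:    joint velocities, shape (n_joints,)
    coulomb: Coulomb friction magnitudes, shape (n_joints,)
    viscous: viscous friction coefficients, shape (n_joints,)
    """
    qdot = np.asarray(qdot, dtype=float)
    return coulomb * np.tanh(qdot / eps) + viscous * qdot

def fit_friction_params(qdot, tau_friction):
    """Least-squares fit of (coulomb, viscous) for one joint from
    measured velocity / friction-torque pairs, assuming the model above
    away from zero velocity (where tanh ~ sign)."""
    Z = np.column_stack([np.sign(qdot), qdot])
    (c, v), *_ = np.linalg.lstsq(Z, tau_friction, rcond=None)
    return c, v
```

Subtracting such a per-joint friction estimate from the IDM torque is one straightforward way a compliant controller could compensate friction during tracking.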
Programming by Feedback
"... This paper advocates a new MLbased programming framework, called Programming by Feedback (PF), which involves a sequence of interactions between the active computer and the user. The latter only provides preference judgments on pairs of solutions supplied by the active computer. The active comp ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
(Show Context)
This paper advocates a new ML-based programming framework, called Programming by Feedback (PF), which involves a sequence of interactions between the active computer and the user. The latter only provides preference judgments on pairs of solutions supplied by the active computer. The active computer involves two components: the learning component estimates the user’s utility function and accounts for the user’s (possibly limited) competence; the optimization component explores the search space and returns the most appropriate candidate solution. A proof of principle of the approach is proposed, showing that PF requires a handful of interactions in order to solve some discrete and continuous benchmark problems.
Robust Trajectory Optimization: A Cooperative Stochastic Game Theoretic Approach
"... AbstractWe present a novel trajectory optimization framework to address the issue of robustness, scalability and efficiency in optimal control and reinforcement learning. Based on prior work in Cooperative Stochastic Differential Game (CSDG) theory, our method performs local trajectory optimizatio ..."
Abstract
 Add to MetaCart
(Show Context)
We present a novel trajectory optimization framework to address the issues of robustness, scalability, and efficiency in optimal control and reinforcement learning. Based on prior work in Cooperative Stochastic Differential Game (CSDG) theory, our method performs local trajectory optimization using cooperative controllers. The resulting framework is called Cooperative Game Differential Dynamic Programming (CG-DDP). Compared to related methods, CG-DDP exhibits improved performance in terms of robustness and efficiency. The proposed framework is also applied in a data-driven fashion for belief space trajectory optimization under learned dynamics. We present experiments showing that CG-DDP can be used for optimal control and reinforcement learning under external disturbances and internal model errors.
Learning Deep Neural Network Policies with Continuous Memory States
"... Abstract Policy learning for partially observed control tasks requires policies that can remember salient information from past observations. In this paper, we present a method for learning policies with internal memory for highdimensional, continuous systems, such as robotic manipulators. Our app ..."
Abstract
 Add to MetaCart
(Show Context)
Policy learning for partially observed control tasks requires policies that can remember salient information from past observations. In this paper, we present a method for learning policies with internal memory for high-dimensional, continuous systems, such as robotic manipulators. Our approach consists of augmenting the state and action space of the system with continuous-valued memory states that the policy can read from and write to. Learning general-purpose policies with this type of memory representation directly is difficult, because the policy must automatically figure out the most salient information to memorize at each time step. We show that, by decomposing this policy search problem into a trajectory optimization phase and a supervised learning phase through a method called guided policy search, we can acquire policies with effective memorization and recall strategies.
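The memory-state augmentation described here can be sketched as a thin wrapper around an environment: the observation is concatenated with continuous memory values, and the tail of the action vector writes the next memory. The Gym-style interface below is a generic assumption for illustration, not the paper's code:

```python
import numpy as np

class MemoryAugmentedEnv:
    """Wrap an environment so the policy reads and writes n_memory extra
    continuous memory states alongside the physical state and action."""

    def __init__(self, env, n_memory):
        self.env = env
        self.n_memory = n_memory
        self.memory = np.zeros(n_memory)

    def reset(self):
        obs = self.env.reset()
        self.memory = np.zeros(self.n_memory)
        return np.concatenate([obs, self.memory])

    def step(self, action):
        # Split the action: the tail writes the next memory state.
        motor, write = action[:-self.n_memory], action[-self.n_memory:]
        obs, cost, done = self.env.step(motor)
        self.memory = write  # policy-controlled memory update
        return np.concatenate([obs, self.memory]), cost, done
```

Because memory reads and writes are just extra state and action dimensions, the augmented system stays fully continuous and any trajectory optimizer for the base system applies unchanged.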
Model-Free Trajectory Optimization for Reinforcement Learning. Riad Akrour, Hany Abdulsamad
"... Abstract Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In t ..."
Abstract
 Add to MetaCart
(Show Context)
Many of the recent Trajectory Optimization algorithms alternate between local approximation of the dynamics and a conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local, quadratic, time-dependent Q-function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related Trajectory Optimization algorithms that linearize the dynamics.
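The closed-form update alluded to above follows a standard pattern for KL-constrained policy search; this is my sketch of the generic form, not the paper's exact derivation:

```latex
\pi_t^{\text{new}}(\boldsymbol{a} \mid \boldsymbol{s})
\;\propto\;
\pi_t^{\text{old}}(\boldsymbol{a} \mid \boldsymbol{s})\,
\exp\!\left( \frac{Q_t(\boldsymbol{s}, \boldsymbol{a})}{\eta} \right)
```

Here $\eta$ is the Lagrange multiplier of the KL constraint. When $\pi_t^{\text{old}}$ is Gaussian and $Q_t$ is quadratic in the action, the right-hand side is again a (normalizable) Gaussian, which is what makes the new policy parameters available in closed form without linearizing the dynamics.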