
## Learning Complex Neural Network Policies with Trajectory Optimization

Cited by: 7 (3 self)

### Citations

7459 | Convex Optimization - Boyd, Vandenberghe - 2004 |

195 | Learning attractor landscapes for learning motor primitives - Ijspeert, Nakanishi, et al. - 2002
Citation context: ...that a good policy can be found without falling into poor local optima. Research into new, specialized policy classes is an active area that has provided substantial improvements on real-world systems (Ijspeert et al., 2003; Paraschos et al., 2013). This specialization is necessary because most model-free policy search methods can only feasibly be applied to policies with a few hundred parameters (Deisenroth et al., ...

125 | A generalized iterative LQG method for locally optimal feedback control of constrained nonlinear stochastic systems - Todorov, Li - 2005
Citation context: ...e D_KL(q(τ)‖ρ(τ)), where ρ(τ) ∝ exp(ℓ(τ)), making q(τ) an I-projection of ρ(τ). In the absence of policy constraints, a Gaussian q(τ) can be approximately optimized by a variant of the iLQG algorithm (Todorov & Li, 2005), as described in previous work (Levine & Koltun, 2013b). In the next section, we derive a similar algorithm that also gradually enforces a constraint on the action conditionals q(u_t|x_t), to force th...
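The I-projection objective quoted in this snippet can be expanded as follows. This is a sketch under the snippet's own convention that ρ(τ) ∝ exp(ℓ(τ)), i.e. ℓ plays the role of an unnormalized log-density over trajectories:

```latex
D_{\mathrm{KL}}\big(q(\tau)\,\|\,\rho(\tau)\big)
  = \mathbb{E}_{q(\tau)}\!\big[\log q(\tau) - \log \rho(\tau)\big]
  = -\,\mathbb{E}_{q(\tau)}\big[\ell(\tau)\big]
    \;-\; \mathcal{H}\big(q(\tau)\big) \;+\; \mathrm{const}.
```

Minimizing this over a Gaussian q(τ) trades off expected ℓ against trajectory entropy, which is the objective the snippet says an iLQG-style solver can approximately optimize.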

123 | Reinforcement learning of motor skills with policy gradients. - Ghavamzadeh, Peters, et al. - 2008 |

84 | PILCO: A model-based and data-efficient approach to policy search - Deisenroth, Rasmussen - 2011
Citation context: ...optimization techniques best suited for policy search is a promising direction for future work. While we assume a known model of the dynamics, prior work has proposed learning the dynamics from data (Deisenroth & Rasmussen, 2011; Ross & Bagnell, 2012; Deisenroth et al., 2013), and using our method with learned models could allow for wider applications in the future. Our method also has several limitations that could be addre...
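The "learning the dynamics from data" idea mentioned in this snippet can be illustrated in its simplest form: fit a linear model x_{t+1} ≈ A x_t + B u_t to sampled transitions by least squares. (PILCO and the other cited methods use far richer models, e.g. Gaussian processes; this is only a minimal sketch of the basic idea.)

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """X: (N, dx) states, U: (N, du) actions, X_next: (N, dx) next states."""
    Z = np.hstack([X, U])                           # (N, dx + du) regressors
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # (dx + du, dx) solution
    dx = X.shape[1]
    return W[:dx].T, W[dx:].T                       # A: (dx, dx), B: (dx, du)

# Usage: recover known double-integrator dynamics from noiseless rollouts.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
X = rng.standard_normal((100, 2))
U = rng.standard_normal((100, 1))
A, B = fit_linear_dynamics(X, U, X @ A_true.T + U @ B_true.T)
assert np.allclose(A, A_true) and np.allclose(B, B_true)
```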

67 | A reduction of imitation learning and structured prediction to no-regret online learning - Ross, Gordon, et al. - 2011
Citation context: ...olicy on individual trajectories usually fails to produce effective policies, since a small error at each time step can quickly compound and place the policy in costly, unexplored parts of the space (Ross et al., 2011). To avoid compounding errors, the policy must be trained on data sampled from a distribution over states. The ideal distribution is the one induced by the optimal policy, but it is unknown. The init...
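The compounding-error effect this snippet describes can be seen in a toy calculation: under mildly unstable dynamics x_{t+1} = a·x_t + u_t (the dynamics and error size here are illustrative assumptions), a small constant per-step action error grows geometrically with the horizon rather than merely adding up, carrying the policy into states the training trajectories never visited.

```python
# Toy illustration of compounding error: a per-step action error eps under
# dynamics x_{t+1} = a * x_t + u_t with a > 1 grows geometrically in T.
def rollout_deviation(a=1.1, eps=0.01, T=50):
    dev = 0.0                  # deviation from the demonstrated trajectory
    for _ in range(T):
        dev = a * dev + eps    # dynamics amplify past error, then add today's
    return dev

# Ten times the horizon yields far more than ten times the deviation.
assert rollout_deviation(T=50) > 10 * rollout_deviation(T=5)
```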

57 | Reinforcement learning of motor skills in high dimensions: A path integral approach - Theodorou, Buchli, et al. - 2010
Citation context: ...Direct policy search offers the promise of automatically learning controllers for complex, high-dimensional tasks. It has seen applications in fields ranging from robotics (Peters & Schaal, 2008; Theodorou et al., 2010; Deisenroth et al., 2013; Kober et al., 2013) and autonomous flight (Ross et al., 2013) to energy generation (Kolter et al., 2012). However, existing policy search methods usually require the policy ...

57 | SIMBICON: Simple biped locomotion control - Yin, Loken, et al. - 2007 |

50 | Learning motor primitives for robotics - Kober, Peters - 2009
Citation context: ...y into agreement. We also compared to cost-weighted regression, which fits the policy to previous on-policy samples weighted by the exponential of their reward (negative cost) (Peters & Schaal, 2007; Kober & Peters, 2009). This approach is representative of a broad class of reinforcement learning methods, which use model-free random exploration and optimize the policy to increase the probability of low-cost samples (...
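The cost-weighted regression this snippet describes can be sketched minimally: weight each on-policy sample by the exponential of its reward (negative cost) and refit the policy by weighted least squares. The temperature `eta` and the linear deterministic policy class u = K x are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def cost_weighted_regression(X, U, costs, eta=1.0):
    w = np.exp(-(costs - costs.min()) / eta)   # exponentiated negative costs
    sw = np.sqrt(w)[:, None]                   # fold weights into least squares
    K, *_ = np.linalg.lstsq(sw * X, sw * U, rcond=None)
    return K.T                                 # policy gain: u = K @ x

# Usage: with equal costs it reduces to ordinary regression on the samples.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
K_true = np.array([[1.0, -2.0, 0.5]])
U = X @ K_true.T
K_fit = cost_weighted_regression(X, U, costs=np.zeros(50))
assert np.allclose(K_fit, K_true)
```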

47 | Optimal control as a graphical model inference problem. - Kappen, Gómez, et al. - 2012 |

39 | Reinforcement learning in robotics: A survey. - Kober, Bagnell, et al. - 2013 |

27 | MuJoCo: A physics engine for model-based control - Todorov, Erez, et al. - 2012
Citation context: ...unstable are off the scale, and are clamped to the maximum cost. Frequent vertical oscillations indicate a policy that oscillates between stable and unstable solutions. the MuJoCo physics simulator (Todorov et al., 2012). The policies were general-purpose neural networks that mapped joint angles directly to torques at each time step. The cost function for each task consisted of a sum of three terms: ℓ(x_t, u_t) = w_u‖u_t...
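The three-term cost the snippet begins to state, ℓ(x_t, u_t) = w_u‖u_t‖... , can be sketched as below. Only the torque penalty appears in the snippet; the two state-dependent terms (velocity and height tracking) and all the weights are hypothetical stand-ins for the truncated terms, not the paper's actual cost.

```python
import numpy as np

# Hedged sketch of a three-term running cost. The squared-norm form of the
# torque penalty, and both tracking terms, are assumptions.
def cost(x, u, w_u=1e-4, w_v=1.0, w_h=1.0, v_target=2.0, h_target=1.2):
    control = w_u * float(np.dot(u, u))        # torque penalty (from snippet)
    velocity = w_v * (x[0] - v_target) ** 2    # hypothetical tracking term
    height = w_h * (x[1] - h_target) ** 2      # hypothetical tracking term
    return control + velocity + height
```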

22 | On stochastic optimal control and reinforcement learning by approximate inference - Rawlik, Toussaint, et al. - 2012
Citation context: ...nd computational benefits (Todorov, 2006). The related area of stochastic optimal control has developed model-free reinforcement learning algorithms under a similar objective (Theodorou et al., 2010; Rawlik et al., 2012). Concurrently with our work, Mordatch and Todorov (2014) proposed another trajectory optimization algorithm for guiding policy search. Further research into trajectory optimization techniques best s...

21 | Probabilistic movement primitives - Paraschos, Daniel, et al. - 2013
Citation context: ...e found without falling into poor local optima. Research into new, specialized policy classes is an active area that has provided substantial improvements on real-world systems (Ijspeert et al., 2003; Paraschos et al., 2013). This specialization is necessary because most model-free policy search methods can only feasibly be applied to policies with a few hundred parameters (Deisenroth et al., Proceedings of the 31st In...

19 | Learning monocular reactive UAV control in cluttered natural environments - Ross, Melik-Barkhudarov, et al. - 2013
Citation context: ...x, high-dimensional tasks. It has seen applications in fields ranging from robotics (Peters & Schaal, 2008; Theodorou et al., 2010; Deisenroth et al., 2013; Kober et al., 2013) and autonomous flight (Ross et al., 2013) to energy generation (Kolter et al., 2012). However, existing policy search methods usually require the policy class to be chosen carefully, so that a good policy can be found without falling into p...

15 | A survey on policy search for robotics. Foundations and Trends in Robotics - Deisenroth, Neumann, et al. - 2013
Citation context: ...offers the promise of automatically learning controllers for complex, high-dimensional tasks. It has seen applications in fields ranging from robotics (Peters & Schaal, 2008; Theodorou et al., 2010; Deisenroth et al., 2013; Kober et al., 2013) and autonomous flight (Ross et al., 2013) to energy generation (Kolter et al., 2012). However, existing policy search methods usually require the policy class to be chosen carefu...

12 | Variational policy search via trajectory optimization. - Levine, Koltun - 2013 |

12 | Agnostic system identification for model-based reinforcement learning - Ross, Bagnell - 2012
Citation context: ...uited for policy search is a promising direction for future work. While we assume a known model of the dynamics, prior work has proposed learning the dynamics from data (Deisenroth & Rasmussen, 2011; Ross & Bagnell, 2012; Deisenroth et al., 2013), and using our method with learned models could allow for wider applications in the future. Our method also has several limitations that could be addressed in future work. O...

11 | Guided policy search. - Levine, Koltun - 2013 |

5 | Design, Analysis, and Learning Control of a Fully Actuated Micro Wind Turbine (under review) - Kolter, Jackowski, et al. - 2012
Citation context: ...plications in fields ranging from robotics (Peters & Schaal, 2008; Theodorou et al., 2010; Deisenroth et al., 2013; Kober et al., 2013) and autonomous flight (Ross et al., 2013) to energy generation (Kolter et al., 2012). However, existing policy search methods usually require the policy class to be chosen carefully, so that a good policy can be found without falling into poor local optima. Research into new, specia...

2 | Exploring deep and recurrent architectures for optimal control - Levine - 2013
Citation context: ...s, and tested on four other random terrains. The policies used 50 hidden units, for a total of 1206 parameters. We omitted the foot contact and forward kinematics features that we used in prior work (Levine, 2013), and instead trained policies that map the state vector directly to joint torques. This significantly increased the difficulty of the problem, to the point that prior methods could not discover a re...
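The "50 hidden units, for a total of 1206 parameters" figure quoted in this snippet can be sanity-checked with the standard parameter count for a single-hidden-layer network with biases. The 17-dimensional state and 6 torque outputs below are an assumption: one dimension assignment consistent with the quoted count, not values taken from the snippet.

```python
# Weights + biases for input->hidden and hidden->output layers of an MLP.
def mlp_param_count(n_in, n_hidden, n_out):
    return n_hidden * (n_in + 1) + n_out * (n_hidden + 1)

# 50 * (17 + 1) + 6 * (50 + 1) = 900 + 306 = 1206
assert mlp_param_count(17, 50, 6) == 1206
```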

2 | Combining the benefits of function approximation and trajectory optimization. - Mordatch, Todorov - 2014 |