
## Policy search for motor primitives in robotics (2009)

Venue: Advances in Neural Information Processing Systems 22 (NIPS 2008)

Citations: 116 (24 self)

### Citations

1193 | Reinforcement Learning
- Sutton, Barto
- 1998
Citation Context: ...nforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despite that many real-world motor learning tasks are essentially episodic [14], episodic reinforcement learning [1] is a largely undersubscribed topic. The resulting framework allows us to derive a new algorithm called Policy Learning by Weighting Exploration with the Returns (PoWER) which is particularly well-sui...
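
The PoWER algorithm named in this context weights exploration noise by the returns it produced. A minimal sketch of such a return-weighted parameter update on a toy episodic task (this is the simplified flat-exploration form; the function names and the toy task are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def power_update(theta, sigma, episode_return, n_rollouts=20):
    """One return-weighted update: sample exploration offsets eps,
    run an episode for each perturbed parameter vector, and average
    the offsets weighted by the (shifted, non-negative) returns."""
    eps = rng.normal(0.0, sigma, size=(n_rollouts, theta.size))
    returns = np.array([episode_return(theta + e) for e in eps])
    weights = returns - returns.min()   # weights must be non-negative
    if weights.sum() == 0.0:
        return theta
    return theta + weights @ eps / weights.sum()

# toy episodic task (illustrative): return peaks at theta* = [1, -2]
target = np.array([1.0, -2.0])
episode_return = lambda th: -float(np.sum((th - target) ** 2))

theta = np.zeros(2)
for _ in range(200):
    theta = power_update(theta, 0.3, episode_return)
print(theta)  # close to [1, -2]
```

Because better-performing perturbations receive larger weights, the update follows the return landscape without ever forming an explicit gradient estimate.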

440 | Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Williams
- 1992
Citation Context: ..., 3, 10, 18, 5, 6, 4]. In this paper, we will extend the previous work in [17, 18] from the immediate reward case to episodic reinforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despite that many real-world motor learning tasks are essentially episodic [14], episodic reinforcement learning [1] is a largely undersubscribed topic. The resulting framework allows us to derive a...

427 | Policy gradient methods for reinforcement learning with function approximation
- Sutton, McAllester, et al.
- 2000
Citation Context: ...t to show the suitability of our algorithm and show that it outperforms previous methods such as Finite Difference Gradient (FDG) methods [10], ‘Vanilla’ Policy Gradients (VPG) with optimal baselines [7, 8, 11, 10], the Episodic Natural Actor Critic (eNAC) [9, 10], and the episodic version of the Reward-Weighted Regression (RWR) algorithm [18]. For both tasks, we use the same rewards as in [10] but we use the n...
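
Of the baselines listed here, the finite-difference gradient estimator is the simplest: perturb the parameters, roll out, and regress the observed change in return on the perturbations. A minimal sketch (the toy return function and names are illustrative assumptions):

```python
import numpy as np

def fd_gradient(J, theta, delta=0.01, n_perturb=6, seed=1):
    """Finite-difference gradient estimate: regress the return
    differences J(theta + d) - J(theta) on the random
    perturbations d via least squares."""
    rng = np.random.default_rng(seed)
    d = rng.normal(0.0, delta, size=(n_perturb, theta.size))
    dJ = np.array([J(theta + di) - J(theta) for di in d])
    g, *_ = np.linalg.lstsq(d, dJ, rcond=None)
    return g

# toy return (illustrative) with known gradient 4 per dimension at 0
J = lambda th: -float(np.sum((th - 2.0) ** 2))
g = fd_gradient(J, np.zeros(3))
print(g)  # close to the analytic gradient [4, 4, 4]
```

Each gradient estimate costs `n_perturb` extra rollouts, which is why FDG tends to lose to likelihood-ratio methods on noisy, high-dimensional robot tasks.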

256 | Pegasus: A policy search method for large mdps and pomdps
- Ng, Jordan
- 2000
Citation Context: ...roach has previously proven successful as it allows the usage of domain-appropriate pre-structured policies, the straightforward integration of a teacher’s presentation as well as fast online learning [2, 3, 10, 18, 5, 6, 4]. In this paper, we will extend the previous work in [17, 18] from the immediate reward case to episodic reinforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despi...

191 | Learning attractor landscapes for learning motor primitives - Ijspeert, Nakanishi, et al. - 2003

176 | Adaptive probabilistic networks with hidden variables
- Binder, Koller, et al.
- 1997
Citation Context: ...updates which directly result from Section 2.2.1. First, we show that policy gradients [7, 8, 11, 10] can be derived from the lower bound Lθ(θ′) (as was to be expected from supervised learning, see [13]). Subsequently, we show that natural policy gradients can be seen as an additional constraint regularizing the change in the path distribution resulting from a policy update when improving the policy...
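
The derivation alluded to here can be sketched in one line: writing q_θ(τ) ∝ p_θ(τ)R(τ) for the return-weighted path distribution (notation loosely following the quoted text, and dropping θ′-independent terms), differentiating the lower bound at θ′ = θ recovers the likelihood-ratio ("vanilla") policy gradient:

```latex
L_{\theta}(\theta')
  = \mathbb{E}_{q_{\theta}}\!\left[\log p_{\theta'}(\tau)\right] + \text{const}
  \;\propto\; \mathbb{E}_{p_{\theta}}\!\left[R(\tau)\,\log p_{\theta'}(\tau)\right] + \text{const},
\qquad
\left.\nabla_{\theta'} L_{\theta}(\theta')\right|_{\theta'=\theta}
  \;\propto\; \mathbb{E}_{p_{\theta}}\!\left[R(\tau)\sum_{t} \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t, t)\right],
```

since log p_{θ′}(τ) decomposes into Σ_t log π_{θ′}(a_t|s_t, t) plus terms independent of θ′.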

116 | Policy gradient methods for robotics
- Peters, Schaal
- 2006
Citation Context: ...roach has previously proven successful as it allows the usage of domain-appropriate pre-structured policies, the straightforward integration of a teacher’s presentation as well as fast online learning [2, 3, 10, 18, 5, 6, 4]. In this paper, we will extend the previous work in [17, 18] from the immediate reward case to episodic reinforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despi...

95 | Optimal Control Theory
- Kirk
- 1970
Citation Context: ...less, in the analytically tractable cases, it has been studied deeply in the optimal control community where it is well-known that for a finite horizon problem, the optimal solution is non-stationary [15] and, in general, cannot be represented by a time-independent policy. The motor primitives based on dynamical systems [22, 23] are a particular type of time-variant policy representation as they have ...

69 | Policy search by dynamic programming - Bagnell, Kakade, et al. - 2004

62 | Covariant policy search
- Bagnell, Schneider
- 2003
Citation Context: ...quivalent to the policy gradient theorem [8] for θ′ → θ in the infinite horizon case where the dependence on time t can be dropped. The derivation results in the Natural Actor Critic as discussed in [9, 10] when adding an additional punishment to prevent large steps away from the observed path distribution. This can be achieved by restricting the amount of change in the path distribution and, subsequent...
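
The restriction on the path distribution described here corresponds to the standard derivation of the natural gradient: expanding the Kullback–Leibler divergence to second order turns the constraint into a quadratic form in the Fisher information matrix, so the constrained maximizer of the lower bound is the Fisher-preconditioned gradient. A sketch (notation follows the quoted text; F denotes the Fisher matrix of the path distribution):

```latex
\max_{\delta\theta}\; L_{\theta}(\theta + \delta\theta)
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left(p_{\theta+\delta\theta} \,\middle\|\, p_{\theta}\right)
  \approx \tfrac{1}{2}\,\delta\theta^{\top} F(\theta)\,\delta\theta \le \epsilon
\;\;\Longrightarrow\;\;
\delta\theta \propto F(\theta)^{-1}\,\nabla_{\theta} L_{\theta}(\theta),
\qquad
F(\theta) = \mathbb{E}_{p_{\theta}}\!\left[\nabla_{\theta}\log p_{\theta}(\tau)\,\nabla_{\theta}\log p_{\theta}(\tau)^{\top}\right].
```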

52 | Planning by probabilistic inference
- Attias
- 2003
Citation Context: ...e a new EM-inspired algorithm called Policy Learning by Weighting Exploration with the Returns (PoWER) in Section 2.3 and show how the general framework is related to policy gradient methods in Section 2.2. [12] extends the [17] algorithm to episodic reinforcement learning for discrete states; we use continuous states. Subsequently, we discuss how we can turn the parametrized motor primitives [22, 23] into e...

52 | Using expectation-maximization for reinforcement learning
- Dayan, Hinton
- 1997
Citation Context: ...priate pre-structured policies, the straightforward integration of a teacher’s presentation as well as fast online learning [2, 3, 10, 18, 5, 6, 4]. In this paper, we will extend the previous work in [17, 18] from the immediate reward case to episodic reinforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despite that many real-world motor learning tasks are essentially ...

52 | Using local trajectory optimizers to speed up global optimization in dynamic programming
- Atkeson
- 1994
Citation Context: ...the presented algorithm works well when employed in the context of learning dynamic motor primitives in four different settings, i.e., the two benchmark problems from [10], the Underactuated Swing-Up [21] and the complex task of Ball-in-a-Cup [24, 20]. Both the Underactuated Swing-Up as well as the Ball-in-a-Cup are achieved on a real Barrett WAM™ robot arm. Please also refer to the video on the fir...

51 | Reinforcement learning for imitating constrained reaching movements
- Guenter, Hersch, et al.
- 2007
Citation Context: ...roach has previously proven successful as it allows the usage of domain-appropriate pre-structured policies, the straightforward integration of a teacher’s presentation as well as fast online learning [2, 3, 10, 18, 5, 6, 4]. In this paper, we will extend the previous work in [17, 18] from the immediate reward case to episodic reinforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despi...

50 | The EM Algorithm and Extensions
- McLachlan, Krishnan
- 1997
Citation Context: ....1 Bounds on Policy Improvements Unlike in reinforcement learning, other machine learning branches have focused on optimizing lower bounds, e.g., resulting in expectation-maximization (EM) algorithms [16]. The reasons for this preference apply in policy learning: if the lower bound also becomes an equality for the sampling policy, we can guarantee that the policy will be improved by optimizing the l...
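
The lower bound referred to in this context can be sketched via Jensen's inequality (notation loosely following the quoted text: R(τ) is the non-negative return of path τ, p_θ(τ) the path distribution under parameters θ, and J(θ) the expected return):

```latex
\log J(\theta')
  = \log \int p_{\theta'}(\tau)\,R(\tau)\,\mathrm{d}\tau
  = \log \int q_{\theta}(\tau)\,\frac{p_{\theta'}(\tau)\,R(\tau)}{q_{\theta}(\tau)}\,\mathrm{d}\tau
  \;\ge\; \int q_{\theta}(\tau)\,\log \frac{p_{\theta'}(\tau)\,R(\tau)}{q_{\theta}(\tau)}\,\mathrm{d}\tau
  \;=:\; L_{\theta}(\theta'),
\qquad
q_{\theta}(\tau) = \frac{p_{\theta}(\tau)\,R(\tau)}{J(\theta)}.
```

At θ′ = θ the ratio inside the logarithm is the constant J(θ), so the bound is tight there: L_θ(θ) = log J(θ). Any θ′ that increases L_θ therefore provably increases the expected return, which is exactly the improvement guarantee the quoted passage invokes.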

40 | Dynamics systems vs. optimal control a unifying view
- Schaal, Mohajerian, et al.
- 2007
Citation Context: ...hich is particularly well-suited for learning of trial-based tasks in motor control. We are especially interested in a particular kind of motor control policies also known as dynamic motor primitives [22, 23]. In this approach, dynamical systems are being used in order to encode a policy, i.e., we have a special kind of parametrized policy which is well-suited for robotics problems. We show that the prese...
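
A dynamical-systems motor primitive of the kind referred to here can be sketched as a stable point attractor modulated by a learned forcing term. A minimal single-degree-of-freedom sketch in the style of the Ijspeert formulation (the gains and the basis-width heuristic are illustrative assumptions, not the paper's exact parameters):

```python
import numpy as np

def dmp_rollout(w, g, y0=0.0, dt=0.001, T=1.0,
                alpha=25.0, beta=6.25, alpha_x=8.0):
    """Single-DoF dynamic motor primitive: a critically damped
    attractor toward goal g, shaped by a forcing term built from
    Gaussian basis functions of the phase x (the phase decays to 0,
    so the attractor dominates at the end of the movement)."""
    n = len(w)
    c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n))   # basis centers
    h = n / c                                         # heuristic widths
    x, y, z = 1.0, y0, 0.0
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-h * (x - c) ** 2)
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)
        z += dt * (alpha * (beta * (g - y) - z) + f)  # transformed system
        y += dt * z
        x += dt * (-alpha_x * x)                      # canonical system
        traj.append(y)
    return np.array(traj)

# zero forcing weights leave a pure point attractor: y converges to g
traj = dmp_rollout(w=np.zeros(10), g=1.0)
print(round(float(traj[-1]), 3))  # settles near the goal g = 1.0
```

The learnable part is only the weight vector `w`, which is why such primitives pair naturally with the episodic policy-search methods discussed above.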

37 | Attention and motor skill learning
- Wulf
- 2007
Citation Context: ... immediate reward case to episodic reinforcement learning and show how it relates to policy gradient methods [7, 8, 11, 10]. Despite that many real-world motor learning tasks are essentially episodic [14], episodic reinforcement learning [1] is a largely undersubscribed topic. The resulting framework allows us to derive a new algorithm called Policy Learning by Weighting Exploration with the Returns (...

34 | Reinforcement learning by reward-weighted regression for operational space control
- Peters, Schaal
- 2007
Citation Context: ...on algorithms are well-known to avoid this problem in supervised learning while even yielding faster convergence [16]. Previously, similar ideas have been explored in immediate reinforcement learning [17, 18]. In general, an EM algorithm would choose the next policy parameters θn+1 such that θn+1 = argmaxθ′ Lθ(θ′). In the case where π(at|st, t) belongs to the exponential family, the next policy can be d...
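
For a linear-Gaussian policy, the M-step mentioned here reduces to a weighted regression, as in the reward-weighted regression algorithm cited in this context. A minimal sketch on a toy immediate-reward problem (names and the toy task are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def rwr_update(Phi, u, r):
    """Reward-weighted regression M-step: weighted least-squares fit
    of the executed actions u on the features Phi, with the
    (non-negative) rewards r as sample weights."""
    W = np.diag(r)
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ u)

# toy immediate-reward task (illustrative): best action is u* = 2 * phi
Phi = rng.normal(size=(500, 1))               # state features
u = 2.0 * Phi[:, 0] + rng.normal(size=500)    # exploratory actions
r = np.exp(-(u - 2.0 * Phi[:, 0]) ** 2)       # reward peaks at u*
theta = rwr_update(Phi, u, r)
print(theta)  # close to [2.0]
```

Because high-reward actions dominate the fit, each update pulls the policy mean toward the actions that worked, mirroring the EM improvement guarantee without any learning-rate tuning.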

30 | Learning perceptual coupling for motor primitives - Kober, Mohler, et al. - 2008

22 | State-Dependent Exploration for policy gradient methods - Rückstieß, Felder, et al. - 2008

18 | Efficient gradient estimation for motor control learning
- Lawrence, Cowan, et al.
- 2003
Citation Context: ...t to show the suitability of our algorithm and show that it outperforms previous methods such as Finite Difference Gradient (FDG) methods [10], ‘Vanilla’ Policy Gradients (VPG) with optimal baselines [7, 8, 11, 10], the Episodic Natural Actor Critic (eNAC) [9, 10], and the episodic version of the Reward-Weighted Regression (RWR) algorithm [18]. For both tasks, we use the same rewards as in [10] but we use the n...

14 | Bayesian policy learning with transdimensional MCMC
- Hoffman, Doucet, et al.
- 2007

13 | Probabilistic inference for structured planning in robotics
- Toussaint, Goerick
- 2007

11 | Teaching by showing in kendama based on optimization principle
- Kawato, Gandolfo, et al.
- 1994
Citation Context: ...loyed in the context of learning dynamic motor primitives in four different settings, i.e., the two benchmark problems from [10], the Underactuated Swing-Up [21] and the complex task of Ball-in-a-Cup [24, 20]. Both the Underactuated Swing-Up as well as the Ball-in-a-Cup are achieved on a real Barrett WAM™ robot arm. Please also refer to the video on the first author’s website. Looking at these tasks fro...