Results 1–10 of 304
An application of reinforcement learning to aerobatic helicopter flight
 In Advances in Neural Information Processing Systems 19
, 2007
Abstract

Cited by 129 (10 self)
Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: first, we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
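The LQR backbone that DDP extends can be sketched compactly: a backward Riccati recursion producing a linear state-feedback gain. The dynamics below are a toy double integrator with illustrative cost weights and horizon, not the learned helicopter model from the paper.

```python
import numpy as np

def lqr_gain(A, B, Q, R, horizon=100):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.

    Dynamics x' = A x + B u, stage cost x^T Q x + u^T R u.
    Returns the first-step feedback gain K, so the control is u = -K x.
    """
    P = Q.copy()
    for _ in range(horizon):
        # K = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update for the cost-to-go matrix
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Toy double-integrator dynamics (position, velocity), NOT a helicopter model.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)            # penalize state deviation
R = np.array([[0.1]])    # penalize control effort
K = lqr_gain(A, B, Q, R)
```

DDP applies essentially this backward pass to local linear-quadratic approximations of a nonlinear model along a trajectory, re-linearizing each iteration.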
Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods
 In International Conference on Robotics and Automation
, 2001
Abstract

Cited by 118 (1 self)
Many control problems in the robotics field can be cast as Partially Observed Markovian Decision Problems (POMDPs), an optimal control formalism. Finding optimal solutions to such problems in general, however, is known to be intractable. It has often been observed that in practice, simple structured controllers suffice for good suboptimal control, and recent research in the artificial intelligence community has focused on policy search methods as techniques for finding suboptimal controllers when such structured controllers do exist. Traditional model-based reinforcement learning algorithms make a certainty-equivalence assumption on their learned models and calculate optimal policies for a maximum-likelihood Markovian model. In this work, we consider algorithms that evaluate and synthesize controllers under distributions of Markovian models. Previous work has demonstrated that algorithms that maximize mean reward with respect to model uncertainty lead to safer and more robust controllers.
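The core idea of scoring a controller against a distribution over models, rather than the single maximum-likelihood model, can be sketched in a few lines. The one-dimensional dynamics, noise scale, and gain grid below are hypothetical illustrations, not the paper's helicopter setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout_return(a, k, x0=1.0, steps=30):
    """Scalar dynamics x' = a*x + u with linear policy u = -k*x;
    return is the (negative) accumulated squared state error."""
    x, total = x0, 0.0
    for _ in range(steps):
        x = a * x - k * x
        total -= x * x
    return total

# Distribution over models: uncertainty about the dynamics parameter a.
models = rng.normal(1.1, 0.2, size=200)

# Choose the gain maximizing MEAN return across sampled models,
# instead of the return under the single most likely model a = 1.1.
gains = np.linspace(0.0, 2.0, 101)
mean_returns = [np.mean([rollout_return(a, k) for a in models])
                for k in gains]
best_gain = gains[int(np.argmax(mean_returns))]
```

Averaging over sampled models penalizes gains that are good for the nominal model but unstable for nearby ones, which is the robustness argument the abstract makes.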
The jackknife—a review.
 Biometrika
, 1974
Abstract

Cited by 104 (0 self)
The Light Beyond, By Raymond A. Moody, Jr. with Paul Perry. New York, NY: Bantam Books, 1988, 161 pp., $18.95. In his foreword to this book, Andrew Greeley, a prominent priest and sociologist, introduces his comments with the following statement: "Raymond Moody has achieved a rare feat in the quest for human knowledge; he has created a paradigm." He then refers to Thomas Kuhn, who pointed out in The Structure of Scientific Revolutions that scientific revolutions occur when someone creates a new perspective, a new model, a new approach to reality. Although Greeley acknowledges that Moody did not discover the near-death experience (NDE), he contends that because Moody put a name to it in his previous bestseller Life After Life (1975), he therefore deserves credit for the new paradigm that has evolved. Greeley then refers to The Light Beyond as characterized by Moody's "openness, sensitivity and modesty." This he attributes to Moody's acknowledgement that the NDE does not represent proof of life after death; rather, it indicates only the existence and widespread prevalence of the NDE. I must question why Greeley does not comment more on the content of the book, and why Moody felt it was appropriate to be credited with creating a new paradigm. During the last fourteen years since Life
Exploration and apprenticeship learning in reinforcement learning
 In ICML
, 2005
Abstract

Cited by 102 (3 self)
We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E3 (Kearns and Singh, 2002) learn near-optimal policies by using “exploration policies” to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing “exploitation policies” that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.
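The exploit-only loop the abstract describes, fit a model to all data so far, compute a policy for the fitted model, execute it, and repeat, can be sketched on a toy tabular MDP. The two-state MDP, smoothing prior, and episode lengths below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP: action 1 tends to reach state 1,
# which pays reward 1; action 0 tends to reach state 0 (reward 0).
P_true = np.array([[[0.9, 0.1], [0.1, 0.9]],
                   [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
gamma = 0.9

def greedy_policy(P_hat, iters=100):
    """Value iteration on the estimated model; greedy action per state."""
    V = np.zeros(2)
    for _ in range(iters):
        Q = R[:, None] + gamma * P_hat @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

counts = np.ones((2, 2, 2))   # smoothed transition counts n(s, a, s')

# Seed the counts with a short teacher demonstration playing action 1.
s = 0
for _ in range(20):
    s2 = rng.choice(2, p=P_true[s, 1])
    counts[s, 1, s2] += 1
    s = s2

# Exploit-only loop: no explicit exploration, just refit and re-plan.
for _ in range(10):
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    pi = greedy_policy(P_hat)
    s = 0
    for _ in range(20):
        a = pi[s]
        s2 = rng.choice(2, p=P_true[s, a])
        counts[s, a, s2] += 1
        s = s2
```

The demonstration covers the states the good policy visits, so the refit model is accurate exactly where the exploitation policy needs it, which is the intuition behind the paper's no-explicit-exploration guarantee.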
Near-optimal Regret Bounds for Reinforcement Learning
Abstract

Cited by 98 (11 self)
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm.
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and . . .
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
Abstract

Cited by 82 (5 self)
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
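The elimination idea is easiest to see in the bandit case: pull every surviving arm once per round, and drop an arm once its confidence interval lies strictly below the leader's. The arm means, Bernoulli reward model, and Hoeffding-style radius below are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def eliminate_best_arm(means, delta=0.05, seed=0):
    """Successive elimination: sample all active arms each round and
    eliminate any arm whose empirical mean falls more than twice the
    confidence radius below the current leader."""
    rng = np.random.default_rng(seed)
    n = len(means)
    active = list(range(n))
    sums = np.zeros(n)
    t = 0
    while len(active) > 1 and t < 100_000:
        t += 1
        for a in active:
            sums[a] += float(rng.random() < means[a])   # Bernoulli reward
        mu = sums[np.array(active)] / t
        # Hoeffding-style radius with a union bound over arms and rounds.
        radius = np.sqrt(np.log(4.0 * n * t * t / delta) / (2.0 * t))
        best = mu.max()
        active = [a for a, m in zip(active, mu) if m + 2.0 * radius >= best]
    # All survivors were pulled t times each, so raw sums are comparable.
    return max(active, key=lambda a: sums[a])
```

Suboptimal arms stop being sampled as soon as they are provably bad, which is where the sample-complexity savings over uniform sampling come from.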
A theoretical analysis of modelbased interval estimation
 Proceedings of the Twenty-second International Conference on Machine Learning (ICML-05)
, 2005
Abstract

Cited by 82 (9 self)
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents the first theoretical analysis of MBIE, proving its efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less “online” cousins from the literature.
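MBIE proper maintains confidence intervals over the transition model; a commonly cited simplification (MBIE-EB) folds the interval into a per-state-action exploration bonus of the form β/√n(s, a). The sketch below shows that bonus form on toy visit counts; the shapes, bonus constant, and uniform fallback for unvisited pairs are illustrative assumptions.

```python
import numpy as np

def mbie_eb_values(counts, rewards, beta=0.5, gamma=0.9, iters=200):
    """Optimistic value iteration on the empirical model, with a per-(s,a)
    exploration bonus beta / sqrt(n(s,a)) standing in for MBIE's interval."""
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                        # visit counts n(s, a)
    P_hat = counts / np.maximum(n[:, :, None], 1) # empirical transitions
    P_hat[n == 0] = 1.0 / S                       # unvisited: assume uniform
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = rewards + beta / np.sqrt(np.maximum(n, 1)) + gamma * P_hat @ V
    return Q

# A rarely tried action keeps a larger bonus than a well-explored one.
counts = np.zeros((2, 2, 2))
counts[0, 0, 0] = 100        # (s=0, a=0) visited often
counts[0, 1, 0] = 1          # (s=0, a=1) visited once
Q = mbie_eb_values(counts, rewards=np.zeros((2, 2)))
```

The shrinking bonus makes under-sampled actions look optimistically good, so the greedy policy explores them until their value estimates tighten.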
Accelerating Reinforcement Learning through Implicit Imitation
 JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2003
Abstract

Cited by 79 (0 self)
Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments