Results 1–10 of 26
Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
 Artificial Intelligence
, 1999
"... Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We ..."
Abstract

Cited by 560 (37 self)
 Add to MetaCart
(Show Context)
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options: closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
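The abstract above says options can stand in for primitive actions in Q-learning. A minimal sketch of that idea is the SMDP Q-learning backup: after an option runs for k steps, the discount is applied k times. The `Option` container, the toy state/option names, and all numeric values below are illustrative assumptions, not the paper's own code.

```python
class Option:
    """An option: a closed-loop policy with a termination condition (sketch)."""
    def __init__(self, name, policy, terminate):
        self.name = name
        self.policy = policy        # maps state -> primitive action
        self.terminate = terminate  # maps state -> True when the option ends

def smdp_q_update(Q, s, o, cumulative_reward, k, s_next, option_names,
                  alpha=0.1, gamma=0.9):
    """One SMDP Q-learning backup after option o ran for k steps from s to s_next.

    Q is a dict mapping (state, option_name) -> value; cumulative_reward is the
    discounted reward accumulated while the option executed.
    """
    best_next = max(Q[(s_next, name)] for name in option_names)
    target = cumulative_reward + (gamma ** k) * best_next
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]
```

With a one-step primitive action (k = 1) this reduces to the ordinary Q-learning update, which is the interchangeability the paper emphasizes.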
Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
 Journal of Artificial Intelligence Research
, 2000
"... This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. Th ..."
Abstract

Cited by 439 (6 self)
 Add to MetaCart
This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consisten...
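The additive decomposition described above can be sketched as the mutually recursive MAXQ equations Q(parent, s, child) = V(child, s) + C(parent, s, child), where V is the value of completing the child subtask and C is the completion value of the parent afterwards. The tiny two-level hierarchy and the table values below are illustrative assumptions.

```python
def maxq_v(V, C, task, s, children):
    """V(task, s): stored directly for primitive tasks, else the best child Q-value."""
    kids = children.get(task)
    if not kids:                      # primitive action
        return V[(task, s)]
    return max(maxq_q(V, C, task, s, c, children) for c in kids)

def maxq_q(V, C, parent, s, child, children):
    """Q(parent, s, child) = V(child, s) + C(parent, s, child)."""
    return maxq_v(V, C, child, s, children) + C[(parent, s, child)]
```

The recursion bottoms out at primitive actions, mirroring the subroutine-hierarchy reading of the decomposition.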
A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies
 IEEE Trans. on Speech and Audio Processing
, 2000
"... ..."
(Show Context)
Computing factored value functions for policies in structured MDPs
 In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence
, 1999
"... Many large Markov decision processes (MDPs) can be represented compactly using a structured representation such as a dynamic Bayesian network. Unfortunately, the compact representation does not help standard MDP algorithms, because the value function for the MDP does not retain the structure of the ..."
Abstract

Cited by 101 (8 self)
 Add to MetaCart
Many large Markov decision processes (MDPs) can be represented compactly using a structured representation such as a dynamic Bayesian network. Unfortunately, the compact representation does not help standard MDP algorithms, because the value function for the MDP does not retain the structure of the process description. We argue that in many such MDPs, structure is approximately retained. That is, the value functions are nearly additive: closely approximated by a linear function over factors associated with small subsets of problem features. Based on this idea, we present a convergent, approximate value determination algorithm for structured MDPs. The algorithm maintains an additive value function, alternating dynamic programming steps with steps that project the result back into the restricted space of additive functions. We show that both the dynamic programming and the projection steps can be computed efficiently, despite the fact that the number of states is exponential in the numbe...
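The "nearly additive" value functions described above can be sketched as a sum of local value functions, each depending only on a small subset of the state variables. The feature names, scopes, and table values below are illustrative assumptions, not from the paper.

```python
def factored_value(state, factors):
    """Evaluate an additive value function.

    state:   dict mapping feature name -> value.
    factors: list of (scope, table) pairs, where scope is a tuple of feature
             names and table maps the projected sub-state to a local value.
    """
    total = 0.0
    for scope, table in factors:
        key = tuple(state[f] for f in scope)  # project the state onto the scope
        total += table[key]
    return total
```

Each table is exponential only in its own small scope, which is why the representation stays compact even when the full state space is exponentially large.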
Multiple model-based reinforcement learning
 Neural Computation
, 2002
"... We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple modelbased reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environme ..."
Abstract

Cited by 82 (5 self)
 Add to MetaCart
(Show Context)
We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The “responsibility signal,” which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL was demonstrated for the discrete case in a nonstationary hunting task in a grid world and for the continuous case in a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters.
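The responsibility signal described above can be sketched as a softmax over the modules' prediction errors: the module that best predicts the current dynamics dominates both the composed output and the learning updates. The Gaussian error scaling and the constant `sigma` are assumed free parameters in this sketch.

```python
import math

def responsibilities(prediction_errors, sigma=1.0):
    """Softmax responsibility weights from per-module prediction errors.

    Smaller error -> larger weight; weights sum to 1.
    """
    scores = [math.exp(-(e ** 2) / (2.0 * sigma ** 2)) for e in prediction_errors]
    z = sum(scores)
    return [s / z for s in scores]

def composed_output(module_outputs, prediction_errors, sigma=1.0):
    """Weight each module's output by its responsibility and sum."""
    lam = responsibilities(prediction_errors, sigma)
    return sum(l * u for l, u in zip(lam, module_outputs))
```

The same weights would also gate each module's learning rate, so only the modules responsible for the current regime adapt to it.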
Temporal Abstraction in Reinforcement Learning
, 2000
"... Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes highlevel decisions regarding what means of transportation to use, but also chooses lowlevel actions, such as the moveme ..."
Abstract

Cited by 64 (2 self)
 Add to MetaCart
Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions regarding what means of transportation to use, but also chooses low-level actions, such as the movements for getting into a car. The problem of picking an appropriate time scale for reasoning and learning has been explored in artificial intelligence, control theory and robotics. In this dissertation we develop a framework that allows novel solutions to this problem, in the context of Markov Decision Processes (MDPs) and reinforcement learning. In this dissertation, we present a general framework for prediction, control and learning at multipl...
An Overview of 3D Object Grasp Synthesis Algorithms
, 2011
"... This overview presents computational algorithms for generating 3D object grasps with autonomous multifingered robotic hands. Robotic grasping has been an active research subject for decades, and a great deal of effort has been spent on grasp synthesis algorithms. Existing papers focus on reviewing ..."
Abstract

Cited by 31 (3 self)
 Add to MetaCart
This overview presents computational algorithms for generating 3D object grasps with autonomous multi-fingered robotic hands. Robotic grasping has been an active research subject for decades, and a great deal of effort has been spent on grasp synthesis algorithms. Existing papers focus on reviewing the mechanics of grasping and the finger-object contact interactions [7] or robot hand design and control [1]. Robot grasp synthesis algorithms have been reviewed in [63], but since then important progress has been made toward applying learning techniques to the grasping problem. This overview focuses on analytical as well as empirical grasp synthesis approaches.
Emotion-triggered Learning in Autonomous Robot Control
 CYBERNETICS AND SYSTEMS
, 2001
"... The fact that emotions are considered to be essential to human reasoning suggests that they might play an important role in autonomous robots as well. In particular, the decision of when to interrupt ongoing behaviour is often associated with emotions in natural systems. The question under exami ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
The fact that emotions are considered to be essential to human reasoning suggests that they might play an important role in autonomous robots as well. In particular, the decision of when to interrupt ongoing behaviour is often associated with emotions in natural systems. The question under examination here is whether this role of emotions can be useful for a robot which adapts to its environment. For this purpose, an emotion model was developed and integrated in a reinforcement learning framework. Robot experiments were done to test an emotion-dependent mechanism for the automatic detection of the relevant events of a learning task against more traditional approaches. Experimental results are presented that confirm that emotions can be useful in this role, specifically by improving the efficiency of the learning algorithm.
Generating hierarchical structure in reinforcement learning from state variables
 Lecture Notes in Artificial Intelligence
, 2000
"... Abstract. This paper presents the CQ algorithm which decomposes and solves a Markov Decision Process (MDP) by automatically generating a hierarchy of smaller MDPs using state variables. The CQ algorithm uses a heuristic which is applicable for problems that can be modelled by a set of state variable ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
(Show Context)
Abstract. This paper presents the CQ algorithm, which decomposes and solves a Markov Decision Process (MDP) by automatically generating a hierarchy of smaller MDPs using state variables. The CQ algorithm uses a heuristic which is applicable to problems that can be modelled by a set of state variables that conform to a special ordering, defined in this paper as a “nested Markov ordering”. The benefits of this approach are: (1) the automatic generation of actions and termination conditions at all levels in the hierarchy, and (2) linear scaling with the number of variables under certain conditions. This approach draws heavily on Dietterich's MAXQ value function decomposition and on the region-based decomposition of MDPs by Hauskrecht, Meuleau, Kaelbling, Dean, Boutilier, and others. The CQ algorithm is described and its functionality illustrated using a four-room example. Different solutions are generated with different numbers of hierarchical levels to solve Dietterich's taxi tasks.
Sequential Decision Making Based on Direct Search
, 2001
"... Credit Assignment Hierarchical learning of macros and reusable subprograms is of interest but limited. Often there are nonhierarchical (nevertheless exploitable) regularities in solution space. For instance, suppose we can obtain solution B by replacing every action "turn(right)" in solut ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
Credit Assignment. Hierarchical learning of macros and reusable subprograms is of interest but limited. Often there are non-hierarchical (nevertheless exploitable) regularities in solution space. For instance, suppose we can obtain solution B by replacing every action "turn(right)" in solution A by "turn(left)". B will then be regular in the sense that it conveys little additional conditional algorithmic information, given A (Solomonoff, 1964; Kolmogorov, 1965; Chaitin, 1969; Li & Vitanyi, 1993); that is, there is a short algorithm computing B from A. Hence B should not be hard to learn by a smart RL system that already found A. While DPRL cannot exploit such regularities in any obvious manner, DS in general algorithm spaces does not encounter any fundamental problems in this context. For instance, all that is necessary to find B may be a modification of the parameter "right" of a single instruction "turn(right)" in a repetitive loop computing A (Schmidhuber et al., 1997b). 2.5 DS Advant...
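The regularity argument above can be made concrete with a toy example: solution B differs from solution A only by swapping every "turn(right)" for "turn(left)", so a very short program computes B from A (low conditional algorithmic information given A). The action strings and the sequence below are illustrative assumptions.

```python
def transform(solution):
    """The short program mapping A to B: flip every right turn to a left turn."""
    return ["turn(left)" if a == "turn(right)" else a for a in solution]

A = ["forward", "turn(right)", "forward", "turn(right)"]
B = transform(A)
```

The brevity of `transform` is the point: given A, describing B takes almost no additional information, so a learner searching program space should find B cheaply once it has A.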