Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales (1998)

by R S Sutton, D Precup, S Singh

Results 11 - 20 of 63 citing documents

Learning from Human Teachers with Socially Guided Exploration

by Cynthia Breazeal
"... Abstract — We present a learning mechanism, Socially Guided Exploration, in which a robot learns new tasks through a combination of self-exploration and social interaction. The system’s motivational drives (novelty, mastery), along with social scaffolding from a human partner, bias behavior to creat ..."
Abstract - Cited by 18 (1 self) - Add to MetaCart
Abstract — We present a learning mechanism, Socially Guided Exploration, in which a robot learns new tasks through a combination of self-exploration and social interaction. The system’s motivational drives (novelty, mastery), along with social scaffolding from a human partner, bias behavior to create learning opportunities for a Reinforcement Learning mechanism. The system is able to learn on its own, but can flexibly use the guidance of a human partner to improve performance. An experiment with non-expert human subjects shows a human is able to shape the learning process through suggesting actions and drawing attention to goal states. Human guidance results in a task set that is significantly more focused and efficient, while self-exploration results in a broader set.

Citation Context

...ask attempts made. B. Task and Goal Representation: Tasks and their goals are represented with Task Option Policies. This name reflects its similarity to the Options approach in Reinforcement Learning [21]. Each T ∈ Tasks is a Task Option Policy, and is defined by a variation of the three Options constructs: I, π, β. To define these we use two subsets of states related to the task. Let Stask ⊂ S be th...
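
The excerpt refers to the three Options constructs I, π, β from the cited paper. As a point of reference, here is a minimal sketch of that triple in Python; the class and field names are illustrative assumptions, not taken from either paper.

```python
# A minimal sketch of the options triple (I, pi, beta) from Sutton, Precup and
# Singh, which the Task Option Policies in the excerpt above adapt.
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    initiation_set: Set[State]             # I: states where the option may be invoked
    policy: Callable[[State], Action]      # pi: action to take while the option runs
    termination: Callable[[State], float]  # beta: probability of terminating in a state

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set
```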

Reusing Old Policies to Accelerate Learning on New MDPs

by Daniel S. Bernstein, 1999
"... We consider the reuse of policies for previous MDPs in learning on a new MDP, under the assumption that the vector of parameters of each MDP is drawn from a fixed probability distribution. We use the options framework, in which an option consists of a set of initiation states, a policy, and a te ..."
Abstract - Cited by 18 (0 self) - Add to MetaCart
We consider the reuse of policies for previous MDPs in learning on a new MDP, under the assumption that the vector of parameters of each MDP is drawn from a fixed probability distribution. We use the options framework, in which an option consists of a set of initiation states, a policy, and a termination condition. We use an option called a reuse option, for which the set of initiation states is the set of all states, the policy is a combination of policies from the old MDPs, and the termination condition is based on the number of time steps since the option was initiated. Given policies for m of the MDPs from the distribution, we construct reuse options from the policies and compare performance on an (m+1)st MDP both with and without various reuse options. We find that reuse options can speed initial learning of the (m+1)st task. We also present a distribution of MDPs for which reuse options can slow initial learning. We discuss reasons for this and suggest other ways to design reuse options.
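
A reuse option as described above might be assembled roughly as follows. This is a sketch under stated assumptions: the old policies are combined by uniform random choice and the termination condition is a hard step cutoff; the paper compares several designs, so neither detail should be read as its exact construction.

```python
# Sketch of a reuse option: it can be initiated in any state, its policy combines
# the policies learned on old MDPs (here by uniform random choice, an assumption),
# and it terminates once a fixed number of steps have elapsed since initiation.
import random
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]

def make_reuse_option(old_policies: List[Policy],
                      max_steps: int) -> Tuple[Callable, Callable, Callable]:
    """Build one reuse option; create a fresh instance each time it is initiated."""
    elapsed = {"steps": 0}  # steps taken since the option was initiated

    def initiation(state: State) -> bool:
        return True  # the initiation set is the set of all states

    def policy(state: State) -> Action:
        elapsed["steps"] += 1
        return random.choice(old_policies)(state)  # combine the old policies

    def termination(state: State) -> float:
        return 1.0 if elapsed["steps"] >= max_steps else 0.0

    return initiation, policy, termination
```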

Decision-Theoretic Control of Planetary Rovers

by Shlomo Zilberstein, Richard Washington, Daniel S. Bernstein, Abdel-Illah Mouaddib - Lecture Notes in Computer Science, 2002
"... Planetary rovers are small unmanned vehicles equipped with cameras and a variety of sensors used for scientific experiments. They must operate under tight constraints over such resources as operation time, power, storage capacity, and communication bandwidth. Moreover, the limited computational r ..."
Abstract - Cited by 17 (3 self) - Add to MetaCart
Planetary rovers are small unmanned vehicles equipped with cameras and a variety of sensors used for scientific experiments. They must operate under tight constraints over such resources as operation time, power, storage capacity, and communication bandwidth. Moreover, the limited computational resources of the rover limit the complexity of on-line planning and scheduling. We describe two decision-theoretic approaches to maximize the productivity of planetary rovers: one based on adaptive planning and the other on hierarchical reinforcement learning. Both approaches map the problem into a Markov decision problem and attempt to solve a large part of the problem off-line, exploiting the structure of the plan and independence between plan components. We examine the advantages and limitations of these techniques and their scalability.

Citation Context

...he bottleneck states. This can be beneficial for problems where only a simulator or actual experience is available. The algorithm fits into the category of hierarchical reinforcement learning (e.g., [32]) because it learns simultaneously at the state level and at the subprocess level. We note that other researchers have proposed methods for solving weakly-coupled MDPs [16, 20, 26], but very little wo...

acQuire-macros: An Algorithm for Automatically Learning Macro-actions

by Amy McGovern - In NIPS'98 Workshop on Abstraction and Hierarchy in Reinforcement Learning, 1998
"... ion and Hierarchy in Reinforcement Learning 1 acQuire-macros: An Algorithm for Automatically Learning Macro-actions Amy McGovern amy@cs.umass.edu Computer Science Department University of Massachusetts, Amherst Amherst, MA 01003 November 23, 1998 Abstract We present part of a new algorithm for a ..."
Abstract - Cited by 15 (1 self) - Add to MetaCart
We present part of a new algorithm for automatically growing macro-actions online in a reinforcement learning framework. We call this algorithm acQuire-macros. We present preliminary empirical results of using acQuire-macros in a simulated dynamical robot task where acQuire-macros enables the robot to discover useful macro-actions during learning. Much of the current research in reinforcement learning focuses on temporal abstraction, modularity, and hierarchy in learning. Although learning at the level of the most primitive actions allows the agent to discover the optimal policy, learning can be very slow. Temporal abstraction can enable the agent to improve performance more rapidly and to use this solution to reduce ...

Learning Multiple Models for Reward Maximization

by Dani Goldberg, Maja Mataric - In Seventeenth International Conference on Machine Learning, 2000
"... We present an approach to reward maximization in a non-stationary mobile robot environment. The approach works within the realistic constraints of limited local sensing and limited a priori knowledge of the environment. It is based on the use of augmented Markov models (AMMs), a general modeli ..."
Abstract - Cited by 13 (5 self) - Add to MetaCart
We present an approach to reward maximization in a non-stationary mobile robot environment. The approach works within the realistic constraints of limited local sensing and limited a priori knowledge of the environment. It is based on the use of augmented Markov models (AMMs), a general modeling tool we have developed. AMMs are essentially Markov chains having additional statistics associated with states and state transitions. We have developed an algorithm that constructs AMMs on-line and in real-time with little computational and space overhead, making it practical to learn multiple models of the interaction dynamics between a robot and its environment during the execution of a task. For the purposes of reward maximization in a non-stationary environment, these models monitor events at increasing intervals of time and provide statistics used to discard redundant or outdated information while reducing the probability of conforming to noise. We have successfully i...
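
Concretely, an AMM as characterized above might look something like the following sketch: a Markov chain whose states and transitions carry extra statistics and which is updated online in constant time per observation. The particular statistics kept here (visit counts, transition counts, per-state reward sums) are assumptions for illustration, not the paper's exact design.

```python
# Sketch of an augmented Markov model (AMM): a Markov chain with extra statistics
# attached to states and transitions, built incrementally from an observation stream.
from collections import defaultdict
from typing import Hashable, Optional

State = Hashable

class AMM:
    def __init__(self) -> None:
        self.visits = defaultdict(int)        # per-state visit counts
        self.transitions = defaultdict(int)   # (s, s') transition counts
        self.reward_sum = defaultdict(float)  # per-state cumulative reward
        self.prev: Optional[State] = None     # previously observed state

    def observe(self, state: State, reward: float = 0.0) -> None:
        """Fold one observation into the model, online and in O(1) per step."""
        self.visits[state] += 1
        self.reward_sum[state] += reward
        if self.prev is not None:
            self.transitions[(self.prev, state)] += 1
        self.prev = state

    def transition_prob(self, s: State, s_next: State) -> float:
        """Empirical probability of moving from s to s_next."""
        total = sum(count for (src, _), count in self.transitions.items() if src == s)
        return self.transitions.get((s, s_next), 0) / total if total else 0.0
```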

A time aggregation approach to Markov decision processes

by Xi-ren Cao, Zhiyuan Ren, Shalabh Bhatnagar, Michael Fu, Steven Marcus, 2001
"... We propose a time aggregation approach for the solution of infinite horizon average cost Markov decision processes via policy iteration. In this approach, policy update is only carried out when the process visits a subset of the state space. As in state aggregation, this approach leads to a reduce ..."
Abstract - Cited by 13 (4 self) - Add to MetaCart
We propose a time aggregation approach for the solution of infinite horizon average cost Markov decision processes via policy iteration. In this approach, policy update is only carried out when the process visits a subset of the state space. As in state aggregation, this approach leads to a reduced state space, which may lead to a substantial reduction in computational and storage requirements, especially for problems with certain structural properties. However, in contrast to state aggregation, which generally results in an approximate model due to the loss of Markov property, time aggregation suffers no loss of accuracy, because the Markov property is preserved. Single sample path-based estimation algorithms are developed that allow the time aggregation approach to be implemented online for practical systems. Some numerical and simulation examples are presented to illustrate the ideas and potential computational savings.
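
The key mechanism above, treating visits to a chosen subset of states as the only decision points and folding everything in between into one aggregated transition, might be sketched roughly as below; the function name and the convention for attributing costs are assumptions for illustration.

```python
# Sketch of time aggregation along a single sample path: only visits to `subset`
# mark decision points; each segment between two such visits is collapsed into one
# aggregated transition carrying its accumulated cost and elapsed time.
from typing import Hashable, Iterable, List, Set, Tuple

State = Hashable

def aggregate_path(path: Iterable[Tuple[State, float]],
                   subset: Set[State]) -> List[Tuple[State, State, float, int]]:
    """Collapse (state, cost) pairs into (from_state, to_state, total_cost, steps)."""
    segments: List[Tuple[State, State, float, int]] = []
    anchor: State = None
    cost_acc, steps = 0.0, 0
    for state, cost in path:
        if state in subset:
            if anchor is not None:
                segments.append((anchor, state, cost_acc, steps))
            anchor, cost_acc, steps = state, 0.0, 0
        cost_acc += cost
        steps += 1
    return segments
```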

Citation Context

...pace. Various approximation approaches have been proposed to attack this problem, including approximate policy iteration, aggregation, supervisory control and randomization [1], [3], [8], [11], [13], [14]. Approximate policy iteration can be carried out in a number of ways that involve suitable approximation techniques in the evaluation and improvement steps of policy iteration. More recently, this ap...

Bounding the Suboptimality of Reusing Subproblems

by Michael Bowling, Manuela Veloso, 1998
"... We are interested in the problem of determining a course of action to achieve a desired objective in a nondeterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and there are a number of algorithms that learn optimal policies w ..."
Abstract - Cited by 13 (5 self) - Add to MetaCart
We are interested in the problem of determining a course of action to achieve a desired objective in a nondeterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and there are a number of algorithms that learn optimal policies within this formulation. This framework has also been used to study state space abstraction, problem decomposition, and policy reuse. These techniques sacrifice optimality of their solution for improved learning speed. In this paper we examine the suboptimality of reusing policies that are solutions to subproblems. This is done within a restricted class of MDPs, namely those where non-zero reward is received only upon reaching a goal state. We introduce the definition of a subproblem within this class and provide motivation for how reuse of subproblem solutions can speed up learning. The contribution of this paper is the derivation of a tight bound on the loss in optimality from this reuse. We examine a bound that is based on Bellman error, which applies to all MDPs, but does not provide us with a tight bound. We contribute our own theoretical result that gives an empirically tight bound on this suboptimality.

Effective control knowledge transfer through learning skill and representation hierarchies

by Mehran Asadi, Manfred Huber - IJCAI’07: Proceedings of the 20th International Joint Conference on Artificial Intelligence
"... Learning capabilities of computer systems still lag far behind biological systems. One of the reasons can be seen in the inefficient re-use of control knowledge acquired over the lifetime of the artificial learning system. To address this deficiency, this paper presents a learning architecture which ..."
Abstract - Cited by 11 (0 self) - Add to MetaCart
Learning capabilities of computer systems still lag far behind biological systems. One of the reasons can be seen in the inefficient re-use of control knowledge acquired over the lifetime of the artificial learning system. To address this deficiency, this paper presents a learning architecture which transfers control knowledge in the form of behavioral skills and corresponding representation concepts from one task to subsequent learning tasks. The presented system uses this knowledge to construct a more compact state space representation for learning while assuring bounded optimality of the learned task policy by utilizing a representation hierarchy. Experimental results show that the presented method can significantly outperform learning on a flat state space representation and the MAXQ method for hierarchical reinforcement learning.

Event-Learning And Robust Policy Heuristics

by András Lörincz, Imre Pólik, István Szita, 2001
"... In this paper we introduce a novel form of reinforcement learning called event-learning or E-learning. Events are ordered pairs of consecutive states. We define the corresponding event-value function. Learning rules which are guaranteed to converge to the optimal event-value function are derived. C ..."
Abstract - Cited by 11 (5 self) - Add to MetaCart
In this paper we introduce a novel form of reinforcement learning called event-learning or E-learning. Events are ordered pairs of consecutive states. We define the corresponding event-value function. Learning rules which are guaranteed to converge to the optimal event-value function are derived. Combining our method with a known robust control method, the SDS algorithm, we introduce Robust Policy Heuristics (RPH). It is shown that RPH (a fast-adapting non-Markovian policy) is particularly useful for coarse models of the environment and for partially observed systems. Fast adaptation may make it possible to separate the time scale of learning to control a Markovian process from the time scale of adaptation of a non-Markovian policy. In our E-learning framework the definition of modules is straightforward. E-learning is well suited for policy switching and planning, whereas RPH alleviates the 'curse of dimensionality' problem. Computer simulations of a two-link pendulum with coarse discretization and a noisy controller are shown to demonstrate the underlying principle.
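
As a rough tabular analogue of the event-value idea above, the sketch below assumes a SARSA-style update over ordered pairs of consecutive states; it illustrates the concept rather than reproducing the paper's exact learning rule, and all names and hyperparameters are placeholders.

```python
# Sketch of a tabular event-value learner: values are stored for events, i.e.
# ordered pairs (current state, desired next state), and nudged toward a
# bootstrapped target. An assumed analogue of E-learning, not the paper's rule.
from collections import defaultdict
from typing import Hashable, Tuple

State = Hashable
Event = Tuple[State, State]  # (current state, desired next state)

class EventValueLearner:
    def __init__(self, alpha: float = 0.1, gamma: float = 0.95) -> None:
        self.E = defaultdict(float)  # event-value function E(s, s_desired)
        self.alpha, self.gamma = alpha, gamma

    def update(self, event: Event, reward: float, next_event: Event) -> None:
        """Move E(event) toward reward + gamma * E(next_event)."""
        target = reward + self.gamma * self.E[next_event]
        self.E[event] += self.alpha * (target - self.E[event])
```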

Generating hierarchical structure in reinforcement learning from state variables

by Bernhard Hengst - Lecture Notes in Artificial Intelligence, 2000
"... Abstract. This paper presents the CQ algorithm which decomposes and solves a Markov Decision Process (MDP) by automatically generating a hierarchy of smaller MDPs using state variables. The CQ algorithm uses a heuristic which is applicable for problems that can be modelled by a set of state variable ..."
Abstract - Cited by 9 (3 self) - Add to MetaCart
Abstract. This paper presents the CQ algorithm which decomposes and solves a Markov Decision Process (MDP) by automatically generating a hierarchy of smaller MDPs using state variables. The CQ algorithm uses a heuristic which is applicable for problems that can be modelled by a set of state variables that conform to a special ordering, defined in this paper as a “nested Markov ordering”. The benefits of this approach are: (1) the automatic generation of actions and termination conditions at all levels in the hierarchy, and (2) linear scaling with the number of variables under certain conditions. This approach draws heavily on Dietterich's MAXQ value function decomposition and on the region-based decomposition of MDPs by Hauskrecht, Meuleau, Kaelbling, Dean, Boutilier, and others. The CQ algorithm is described and its functionality illustrated using a four room example. Different solutions are generated with different numbers of hierarchical levels to solve Dietterich's taxi tasks.

Citation Context

...ould save itself the effort of relearning each sub-task for every situation in which it was required and help scale up RL problems. An example will help to clarify the issue. Sutton, Precup and Singh [10] discuss a traveller journeying to a distant city who needs to decide whether to fly, drive or take a taxi. Each of these possible actions is a sub-task that requires still smaller steps for its execu...
