Results 11 - 20 of 63
Learning from Human Teachers with Socially Guided Exploration
"... Abstract — We present a learning mechanism, Socially Guided Exploration, in which a robot learns new tasks through a combination of self-exploration and social interaction. The system’s motivational drives (novelty, mastery), along with social scaffolding from a human partner, bias behavior to creat ..."
Abstract
-
Cited by 18 (1 self)
We present a learning mechanism, Socially Guided Exploration, in which a robot learns new tasks through a combination of self-exploration and social interaction. The system’s motivational drives (novelty, mastery), along with social scaffolding from a human partner, bias behavior to create learning opportunities for a Reinforcement Learning mechanism. The system is able to learn on its own, but can flexibly use the guidance of a human partner to improve performance. An experiment with non-expert human subjects shows that a human is able to shape the learning process by suggesting actions and drawing attention to goal states. Human guidance results in a task set that is significantly more focused and efficient, while self-exploration results in a broader set.
Reusing Old Policies to Accelerate Learning on New MDPs
, 1999
"... We consider the reuse of policies for previous MDPs in learning on a new MDP, under the assumption that the vector of parameters of each MDP is drawn from a fixed probability distribution. We use the options framework, in which an option consists of a set of initiation states, a policy, and a te ..."
Abstract
-
Cited by 18 (0 self)
We consider the reuse of policies for previous MDPs in learning on a new MDP, under the assumption that the vector of parameters of each MDP is drawn from a fixed probability distribution. We use the options framework, in which an option consists of a set of initiation states, a policy, and a termination condition. We use an option called a reuse option, for which the set of initiation states is the set of all states, the policy is a combination of policies from the old MDPs, and the termination condition is based on the number of time steps since the option was initiated. Given policies for m of the MDPs from the distribution, we construct reuse options from the policies and compare performance on an (m+1)st MDP both with and without various reuse options. We find that reuse options can speed initial learning of the (m+1)st task. We also present a distribution of MDPs for which reuse options can slow initial learning. We discuss reasons for this and suggest other ways to design reuse options.
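The reuse-option construction described above is concrete enough to sketch in code. The following is a minimal illustration only, assuming a tabular setting; the class name ReuseOption and the random-mixture rule for combining old policies are illustrative choices, not taken from the paper.

import random

class ReuseOption:
    """Option whose initiation set is all states, whose policy mixes
    policies learned on earlier MDPs, and which terminates after a
    fixed number of time steps since initiation."""
    def __init__(self, old_policies, max_steps):
        self.old_policies = old_policies  # list of dicts: state -> action
        self.max_steps = max_steps
        self.steps = 0

    def initiate(self):
        # Can be initiated in any state, so initiation just resets the clock.
        self.steps = 0

    def act(self, state):
        # One simple way to "combine" the old policies: pick one at random
        # and follow its action in the current state.
        policy = random.choice(self.old_policies)
        self.steps += 1
        return policy[state]

    def terminate(self):
        # Termination depends only on elapsed time since initiation.
        return self.steps >= self.max_steps

A learner would treat such an option as one more (temporally extended) action alongside the primitives when learning the new MDP.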
Decision-Theoretic Control of Planetary Rovers
- Lecture Notes in Computer Science
, 2002
"... Planetary rovers are small unmanned vehicles equipped with cameras and a variety of sensors used for scientific experiments. They must operate under tight constraints over such resources as operation time, power, storage capacity, and communication bandwidth. Moreover, the limited computational r ..."
Abstract
-
Cited by 17 (3 self)
Planetary rovers are small unmanned vehicles equipped with cameras and a variety of sensors used for scientific experiments. They must operate under tight constraints on resources such as operation time, power, storage capacity, and communication bandwidth. Moreover, the rover's limited computational resources restrict the complexity of on-line planning and scheduling. We describe two decision-theoretic approaches to maximizing the productivity of planetary rovers: one based on adaptive planning and the other on hierarchical reinforcement learning. Both approaches map the problem into a Markov decision problem and attempt to solve a large part of the problem off-line, exploiting the structure of the plan and the independence between plan components. We examine the advantages and limitations of these techniques and their scalability.
acQuire-macros: An Algorithm for Automatically Learning Macro-actions
- In NIPS'98 Workshop on Abstraction and Hierarchy in Reinforcement Learning
, 1998
"... ion and Hierarchy in Reinforcement Learning 1 acQuire-macros: An Algorithm for Automatically Learning Macro-actions Amy McGovern amy@cs.umass.edu Computer Science Department University of Massachusetts, Amherst Amherst, MA 01003 November 23, 1998 Abstract We present part of a new algorithm for a ..."
Abstract
-
Cited by 15 (1 self)
We present part of a new algorithm for automatically growing macro-actions online in a reinforcement learning framework. We call this algorithm acQuire-macros. We present preliminary empirical results of using acQuire-macros in a simulated dynamical robot task where acQuire-macros enables the robot to discover useful macro-actions during learning. Much of the current research in reinforcement learning focuses on temporal abstraction, modularity, and hierarchy in learning. Although learning at the level of the most primitive actions allows the agent to discover the optimal policy, learning can be very slow. Temporal abstraction can enable the agent to improve performance more rapidly and to use this solution to reduce ...
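The abstract does not describe how acQuire-macros selects or builds its macros, so the sketch below only illustrates the general idea of a macro-action: a stored sequence of primitive actions executed as a single temporally extended action. All names, and the assumed env.step(action) -> (state, reward, done) interface, are hypothetical.

class MacroAction:
    """A fixed sequence of primitive actions executed as one unit."""
    def __init__(self, primitive_actions):
        self.primitive_actions = list(primitive_actions)

    def execute(self, env, state):
        """Run the whole sequence in the environment, returning the final
        state and the (undiscounted) sum of rewards collected along the way."""
        total_reward = 0.0
        for action in self.primitive_actions:
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        return state, total_reward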
Learning Multiple Models for Reward Maximization
- In Seventeenth International Conference on Machine Learning
, 2000
"... We present an approach to reward maximization in a non-stationary mobile robot environment. The approach works within the realistic constraints of limited local sensing and limited a priori knowledge of the environment. It is based on the use of augmented Markov models (AMMs), a general modeli ..."
Abstract
-
Cited by 13 (5 self)
We present an approach to reward maximization in a non-stationary mobile robot environment. The approach works within the realistic constraints of limited local sensing and limited a priori knowledge of the environment. It is based on the use of augmented Markov models (AMMs), a general modeling tool we have developed. AMMs are essentially Markov chains having additional statistics associated with states and state transitions. We have developed an algorithm that constructs AMMs on-line and in real-time with little computational and space overhead, making it practical to learn multiple models of the interaction dynamics between a robot and its environment during the execution of a task. For the purposes of reward maximization in a non-stationary environment, these models monitor events at increasing intervals of time and provide statistics used to discard redundant or outdated information while reducing the probability of conforming to noise. We have successfully i...
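As a rough illustration of the data structure described above, a Markov chain augmented with extra statistics and updated online in constant time per step, one might keep transition counts together with per-state reward statistics, as in the sketch below. The field names and the particular statistics tracked are assumptions for illustration, not the paper's definition of an AMM.

from collections import defaultdict

class AugmentedMarkovModel:
    """Markov chain estimated online, augmented with per-state and
    per-transition statistics (here: visit counts and mean reward)."""
    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))
        self.state_visits = defaultdict(int)
        self.state_reward_sum = defaultdict(float)

    def update(self, prev_state, state, reward):
        # Constant-time online update, so many models can be maintained
        # in parallel while the robot executes its task.
        self.transition_counts[prev_state][state] += 1
        self.state_visits[state] += 1
        self.state_reward_sum[state] += reward

    def transition_prob(self, s, s_next):
        total = sum(self.transition_counts[s].values())
        return self.transition_counts[s][s_next] / total if total else 0.0

    def mean_reward(self, s):
        n = self.state_visits[s]
        return self.state_reward_sum[s] / n if n else 0.0

Running several such models over different time intervals, and discarding those whose statistics have gone stale, is one way to read the abstract's strategy for coping with non-stationarity.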
A time aggregation approach to Markov decision processes
, 2001
"... We propose a time aggregation approach for the solution of infinite horizon average cost Markov decision processes via policy iteration. In this approach, policy update is only carried out when the process visits a subset of the state space. As in state aggregation, this approach leads to a reduce ..."
Abstract
-
Cited by 13 (4 self)
We propose a time aggregation approach for the solution of infinite horizon average cost Markov decision processes via policy iteration. In this approach, policy update is only carried out when the process visits a subset of the state space. As in state aggregation, this approach leads to a reduced state space, which may lead to a substantial reduction in computational and storage requirements, especially for problems with certain structural properties. However, in contrast to state aggregation, which generally results in an approximate model due to the loss of Markov property, time aggregation suffers no loss of accuracy, because the Markov property is preserved. Single sample path-based estimation algorithms are developed that allow the time aggregation approach to be implemented online for practical systems. Some numerical and simulation examples are presented to illustrate the ideas and potential computational savings.
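To make the time-aggregation idea concrete, the sketch below collapses a simulated sample path into transitions between successive visits to the chosen subset of states, recording the cost and time accumulated in between. This is only a simplified illustration of the embedded process on the reduced state space, not the paper's estimation algorithms; the function name and interface are assumptions.

def aggregate_segments(sample_path, costs, subset):
    """Collapse a sample path into aggregated transitions between
    successive visits to `subset`. Each segment records
    (entry state, next subset state, accumulated cost, elapsed steps)."""
    segments = []
    seg_cost, seg_len, last = 0.0, 0, None
    for state, cost in zip(sample_path, costs):
        if state in subset:
            if last is not None:
                segments.append((last, state, seg_cost, seg_len))
            last, seg_cost, seg_len = state, 0.0, 0
        seg_cost += cost
        seg_len += 1
    return segments

Policy iteration then only needs to reason about the subset states, using the per-segment costs and durations to recover average-cost quantities, which is where the computational and storage savings come from.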
Bounding the Suboptimality of Reusing Subproblems
, 1998
"... We are interested in the problem of determining a course of action to achieve a desired objective in a nondeterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and there are a number of algorithms that learn optimal policies w ..."
Abstract
-
Cited by 13 (5 self)
We are interested in the problem of determining a course of action to achieve a desired objective in a nondeterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and there are a number of algorithms that learn optimal policies within this formulation. This framework has also been used to study state space abstraction, problem decomposition, and policy reuse. These techniques sacrifice optimality of their solution for improved learning speed. In this paper we examine the suboptimality of reusing policies that are solutions to subproblems. This is done within a restricted class of MDPs, namely those where non-zero reward is received only upon reaching a goal state. We introduce the definition of a subproblem within this class and provide motivation for how reuse of subproblem solutions can speed up learning. The contribution of this paper is the derivation of a tight bound on the loss in optimality from this reuse. We examine a bound that is based on Bellman error, which applies to all MDPs, but does not provide us with a tight bound. We contribute our own theoretical result that gives an empirically tight bound on this suboptimality.
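For concreteness, the restricted class of MDPs referred to above (non-zero reward only upon reaching a goal state) can be written as follows; this is one standard formalization, with a constant goal reward $r_g$ assumed for simplicity, and the notation is not taken from the paper.

\[
R(s, a, s') = \begin{cases} r_g & \text{if } s' \in G, \\ 0 & \text{otherwise,} \end{cases}
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\,\gamma^{\,T_G - 1}\, r_g \mid s_0 = s\,\right],
\]

where $G$ is the set of goal states and $T_G$ is the (random) number of steps until a goal state is first reached under policy $\pi$ (with the value taken to be zero if the goal is never reached).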
Effective control knowledge transfer through learning skill and representation hierarchies
- IJCAI’07: Proceedings of the 20th International Joint Conference on Artificial Intelligence
"... Learning capabilities of computer systems still lag far behind biological systems. One of the reasons can be seen in the inefficient re-use of control knowledge acquired over the lifetime of the artificial learning system. To address this deficiency, this paper presents a learning architecture which ..."
Abstract
-
Cited by 11 (0 self)
Learning capabilities of computer systems still lag far behind those of biological systems. One reason is the inefficient re-use of control knowledge acquired over the lifetime of the artificial learning system. To address this deficiency, this paper presents a learning architecture which transfers control knowledge in the form of behavioral skills and corresponding representation concepts from one task to subsequent learning tasks. The presented system uses this knowledge to construct a more compact state space representation for learning while assuring bounded optimality of the learned task policy by utilizing a representation hierarchy. Experimental results show that the presented method can significantly outperform learning on a flat state space representation and the MAXQ method for hierarchical reinforcement learning.
Event-Learning And Robust Policy Heuristics
, 2001
"... In this paper we introduce a novel form of reinforcement learning called event-learning or E-learning. Events are ordered pairs of consecutive states. We define the corresponding event-value function. Learning rules which are guaranteed to converge to the optimal event-value function are derived. C ..."
Abstract
-
Cited by 11 (5 self)
In this paper we introduce a novel form of reinforcement learning called event-learning or E-learning. Events are ordered pairs of consecutive states. We define the corresponding event-value function. Learning rules which are guaranteed to converge to the optimal event-value function are derived. Combining our method with a known robust control method, the SDS algorithm, we introduce Robust Policy Heuristics (RPH). It is shown that RPH (a fast-adapting non-Markovian policy) is particularly useful for coarse models of the environment and for partially observed systems. Fast adaptation may make it possible to separate the time scale of learning to control a Markovian process from the time scale of adaptation of a non-Markovian policy. In our E-learning framework the definition of modules is straightforward. E-learning is well suited for policy switching and planning, whereas RPH alleviates the 'curse of dimensionality' problem. Computer simulations of a two-link pendulum with coarse discretization and a noisy controller are shown to demonstrate the underlying principle.
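The abstract defines events as ordered pairs of consecutive states and introduces an event-value function; the paper should be consulted for the exact definitions and learning rules. As an illustrative assumption on our part, the analogy to action-value functions can be sketched by a Bellman-style recursion in which a desired next state plays the role of the action:

\[
E^{*}(s, s^{+}) \;=\; r(s, s^{+}) \;+\; \gamma \max_{s^{++}} E^{*}\bigl(s^{+}, s^{++}\bigr),
\]

in direct analogy with the optimality equation for $Q^{*}(s, a)$ in ordinary Q-learning; presumably a lower-level controller (here, the SDS-based RPH) is then what attempts to realize the desired transition $s \to s^{+}$.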
Generating hierarchical structure in reinforcement learning from state variables
- Lecture Notes in Artificial Intelligence
, 2000
"... Abstract. This paper presents the CQ algorithm which decomposes and solves a Markov Decision Process (MDP) by automatically generating a hierarchy of smaller MDPs using state variables. The CQ algorithm uses a heuristic which is applicable for problems that can be modelled by a set of state variable ..."
Abstract
-
Cited by 9 (3 self)
This paper presents the CQ algorithm, which decomposes and solves a Markov Decision Process (MDP) by automatically generating a hierarchy of smaller MDPs using state variables. The CQ algorithm uses a heuristic which is applicable to problems that can be modelled by a set of state variables that conform to a special ordering, defined in this paper as a “nested Markov ordering”. The benefits of this approach are: (1) the automatic generation of actions and termination conditions at all levels in the hierarchy, and (2) linear scaling with the number of variables under certain conditions. This approach draws heavily on Dietterich's MAXQ value function decomposition and on the region-based decomposition of MDPs by Hauskrecht, Meuleau, Kaelbling, Dean, Boutilier, and others. The CQ algorithm is described and its functionality illustrated using a four-room example. Different solutions, with different numbers of hierarchical levels, are generated to solve Dietterich's taxi tasks.