Results 11 – 20 of 438
A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
, 2009
Abstract

Cited by 83 (11 self)
We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). We also present two detailed extensions of Bayesian optimization, with experiments—active user modelling with preferences, and hierarchical reinforcement learning—and a discussion of the pros and cons of Bayesian optimization based on our experiences.
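The utility-based selection step described in this abstract can be sketched with one common acquisition function, the upper confidence bound; this is an illustrative toy (the function and variable names are assumptions, not taken from the paper), assuming a posterior mean and standard deviation are already available over a discretized candidate set:

```python
# Minimal sketch of utility-based selection in Bayesian optimization.
# UCB trades off exploitation (high posterior mean) against exploration
# (high posterior uncertainty). Names here are illustrative assumptions.

def ucb_acquisition(mean, std, kappa=2.0):
    """Return the index of the next point to query: argmax of mean + kappa * std."""
    scores = [m + kappa * s for m, s in zip(mean, std)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy posterior over five candidate inputs:
mean = [0.2, 0.8, 0.5, 0.1, 0.4]   # predicted objective values
std  = [0.1, 0.05, 0.3, 0.6, 0.2]  # predictive uncertainties

next_point = ucb_acquisition(mean, std)       # favors the uncertain point
greedy_point = ucb_acquisition(mean, std, kappa=0.0)  # pure exploitation
```

With `kappa=2.0` the high-uncertainty candidate (index 3) wins despite its low mean, while `kappa=0.0` reduces to picking the highest posterior mean, illustrating the exploration/exploitation dial the abstract mentions.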
Using relative novelty to identify useful temporal abstractions in reinforcement learning
 In Proceedings of the Twenty-First International Conference on Machine Learning
, 2004
Abstract

Cited by 79 (12 self)
We present a new method for automatically creating useful temporal abstractions in reinforcement learning. We argue that states that allow the agent to transition to a different region of the state space are useful subgoals, and propose a method for identifying them using the concept of relative novelty. When such a state is identified, a temporally extended activity (e.g., an option) is generated that takes the agent efficiently to this state. We illustrate the utility of the method in a number of tasks.
Identifying useful subgoals in reinforcement learning by local graph partitioning
 In Proceedings of the Twenty-Second International Conference on Machine Learning
, 2005
Abstract

Cited by 69 (11 self)
We present a new subgoal-based method for automatically creating useful skills in reinforcement learning. Our method identifies subgoals by partitioning local state transition graphs—those that are constructed using only the most recent experiences of the agent. The local scope of our subgoal discovery method allows it to successfully identify the type of subgoals we seek—states that lie between two densely connected regions of the state space—while producing an algorithm with low computational cost.
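A rough way to convey the flavor of "states between two densely connected regions": in a small undirected transition graph, articulation points (states whose removal disconnects the graph) sit between regions. The paper itself partitions local graphs with a cut criterion; this stdlib-only sketch is a simpler proxy, not the paper's algorithm:

```python
# Illustrative proxy: articulation points of a small undirected
# state-transition graph, computed with the standard low-link DFS.
# graph: dict mapping node -> set of neighboring nodes.

def articulation_points(graph):
    visited, disc, low, cuts = set(), {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        visited.add(u)
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in visited:
                low[u] = min(low[u], disc[v])  # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)  # subtree of v cannot bypass u
        if parent is None and children > 1:
            cuts.add(u)  # root with multiple DFS children

    for node in graph:
        if node not in visited:
            dfs(node, None)
    return cuts

# Two dense clusters {0,1,2} and {4,5,6} joined through state 3:
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
```

Here states 2, 3, and 4 lie on the corridor between the two clusters, the kind of state the subgoal-discovery method seeks.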
Soar-RL: Integrating Reinforcement Learning with Soar
 Cognitive Systems
, 2005
Abstract

Cited by 66 (13 self)
In this paper, we describe an architectural modification to Soar that gives a Soar agent the opportunity to learn statistical information about the past success of its actions and utilize this information when selecting an operator. This mechanism serves the same purpose as production utilities in ACT-R, but the implementation is more directly tied to the standard definition of the reinforcement learning (RL) problem. The paper explains our implementation, gives a rationale for adding an RL capability to Soar, and shows results for Soar-RL agents' performance on two tasks.
Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales
 Journal of Artificial Intelligence Research
, 1998
Abstract

Cited by 63 (7 self)
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options—whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and as a basis for planning and learning methods using them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (in prep.) and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions
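The claim that options "may be used interchangeably with actions" can be sketched with the textbook SMDP Q-learning update, which treats an option like an action but discounts by the number of steps k the option ran. This is a generic sketch under standard conventions, not code from the paper:

```python
# SMDP Q-learning update for options (textbook form).
# Q: dict mapping (state, option) -> value.
# cumulative_reward: discounted reward accumulated while option o
# executed for k steps before terminating in s_next.

def smdp_q_update(Q, s, o, cumulative_reward, s_next, k,
                  options, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in options)
    target = cumulative_reward + gamma ** k * best_next  # gamma^k: option ran k steps
    q_old = Q.get((s, o), 0.0)
    Q[(s, o)] = q_old + alpha * (target - q_old)

# A primitive action is just the special case k = 1:
Q = {}
options = ['go-to-door']
smdp_q_update(Q, 0, 'go-to-door', 1.0, 1, 3, options)
```

Setting k = 1 recovers the ordinary one-step Q-learning rule, which is exactly the interchangeability the abstract describes.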
Theoretical results on reinforcement learning with temporally abstract options
 in: Proc. 10th European Conference on Machine Learning
, 1998
Abstract

Cited by 61 (10 self)
We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among "options", whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
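The Bellman equations for option models referred to in this abstract take a standard form; a sketch in conventional SMDP notation (not necessarily the paper's exact notation):

```latex
% Optimal value function over a set of options \mathcal{O}, where
% R(s,o) is the expected discounted reward accumulated while option o
% executes from s, and P(s' \mid s, o) absorbs the discount gamma^k
% for an execution of length k (the multi-time model convention).
V^{*}(s) \;=\; \max_{o \in \mathcal{O}}
  \Bigl[\, R(s, o) \;+\; \sum_{s'} P(s' \mid s, o)\, V^{*}(s') \,\Bigr]
```

Because the option model (R, P) has the same shape as a primitive-action model, value iteration and policy iteration can consume it unchanged, which is the interchangeability result the abstract states.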
Efficient structure learning in factored-state MDPs
, 2007
Abstract

Cited by 60 (9 self)
We consider the problem of reinforcement learning in factored-state MDPs in the setting in which learning is conducted in one long trial with no resets allowed. We show how to extend existing efficient algorithms that learn the conditional probability tables of dynamic Bayesian networks (DBNs) given their structure to the case in which DBN structure is not known in advance. Our method learns the DBN structures as part of the reinforcement-learning process and provably provides an efficient learning algorithm when combined with factored R-max.
Off-policy temporal-difference learning with function approximation
 Proceedings of the 18th International Conference on Machine Learning
, 2001
Abstract

Cited by 59 (12 self)
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.
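The key ingredient the abstract names, importance sampling, can be sketched in its simplest tabular form: reweight each TD update by the ratio of target-policy to behavior-policy probability, so that experience gathered under one policy evaluates another. This is the generic idea, not the paper's exact TD(λ) algorithm:

```python
# One importance-sampled TD(0) update (tabular sketch).
# V: dict state -> value estimate.
# pi, b: dicts (state, action) -> probability under the target and
# behavior policies; requires b[(s, a)] > 0 wherever pi[(s, a)] > 0.

def is_td_update(V, s, a, r, s_next, pi, b, alpha=0.1, gamma=0.9):
    rho = pi[(s, a)] / b[(s, a)]                     # importance-sampling ratio
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * rho * td_error    # reweighted update

# The behavior policy took action 'x' half the time, the target policy
# always takes it, so the update is doubled:
V = {}
pi = {(0, 'x'): 1.0}
b = {(0, 'x'): 0.5}
is_td_update(V, 0, 'x', 1.0, 1, pi, b)
```

The variance of ρ is exactly what the abstract's "variations designed to reduce variance" address: large ratios make the plain estimator noisy even though it is unbiased.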
Towards a Unified Theory of State Abstraction for MDPs
 In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics
, 2006
Abstract

Cited by 59 (5 self)
extensively studied in the fields of artificial intelligence and operations research. Instead of working in the ground state space, the decision maker usually finds solutions in the abstract state space much faster by treating groups of states as a unit and ignoring irrelevant state information. A number of abstractions have been proposed and studied in the reinforcement-learning and planning literatures, and positive and negative results are known. We provide a unified treatment of state abstraction for Markov decision processes. We study five particular abstraction schemes, some of which have been proposed in the past in different forms, and analyze their usability for planning and learning.
Probabilistic Policy Reuse in a Reinforcement Learning Agent
 In AAMAS '06: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems
, 2006
Abstract

Cited by 57 (3 self)
We contribute Policy Reuse as a technique to improve a reinforcement learning agent with guidance from similar policies learned in the past. Our method relies on using the past policies as a probabilistic bias where the learning agent faces three choices: the exploitation of the ongoing learned policy, the exploration of random unexplored actions, and the exploitation of past policies. We introduce the algorithm and its major components: an exploration strategy to include the new reuse bias, and a similarity function to estimate the similarity of past policies with respect to a new one. We provide empirical results demonstrating that Policy Reuse improves the learning performance over different strategies that learn without reuse. Interestingly, and almost as a side effect, Policy Reuse also identifies classes of similar policies, revealing a basis of core policies of the domain. We demonstrate that such a basis can be built incrementally, contributing to learning the structure of a domain.
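The three-way choice the abstract describes can be sketched as a single action-selection rule: with some probability reuse a past policy, otherwise act ε-greedily on the policy being learned. The parameter names (`psi`, `epsilon`) are illustrative assumptions, and the paper's π-reuse strategy additionally decays the reuse probability over time:

```python
# Sketch of a policy-reuse exploration strategy (illustrative, assumed
# parameter names; the paper's pi-reuse strategy also decays psi).
import random

def choose_action(s, past_policy, Q, actions, psi=0.5, epsilon=0.1, rng=random):
    """past_policy: dict state -> action from a previously learned policy.
    Q: dict (state, action) -> value for the policy being learned."""
    if rng.random() < psi:                 # 1) exploit a past policy
        return past_policy[s]
    if rng.random() < epsilon:             # 2) explore a random action
        return rng.choice(actions)
    # 3) exploit the ongoing learned policy
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

past = {0: 'left'}
Q = {(0, 'left'): 0.0, (0, 'right'): 1.0}
```

Setting `psi=0` recovers plain ε-greedy learning, so the reuse bias is a strict generalization of the baseline strategies the paper compares against.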