Results 11-20 of 141
Approximate linear programming for first-order MDPs
 In Proc. UAI05, 509–517
, 2005
Abstract

Cited by 37 (9 self)
We introduce a new approximate solution technique for first-order Markov decision processes (FOMDPs). Representing the value function linearly w.r.t. a set of first-order basis functions, we compute suitable weights by casting the corresponding optimization as a first-order linear program and show how off-the-shelf theorem prover and LP software can be effectively used. This technique allows one to solve FOMDPs independent of a specific domain instantiation; furthermore, it allows one to determine bounds on approximation error that apply equally to all domain instantiations. We apply this solution technique to the task of elevator scheduling with a rich feature space and multi-criteria additive reward, and demonstrate that it outperforms a number of intuitive, heuristically-guided policies.
Value Functions for RL-Based Behavior Transfer: A Comparative Study
 In Proceedings of the 20th National Conference on Artificial Intelligence
, 2005
Abstract

Cited by 36 (8 self)
Temporal difference (TD) learning methods (Sutton & Barto 1998) have become popular reinforcement learning techniques in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but have often been found slow in practice. This paper presents methods for further generalizing across tasks, thereby speeding up learning, via a novel form of behavior transfer. We compare learning on a complex task with three function approximators, a CMAC, a neural network, and an RBF, and demonstrate that behavior transfer works well with all three. Using behavior transfer, agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup-soccer keepaway domain.
On Using Guidance in Relational Reinforcement Learning
 Machine Learning
, 2004
Abstract

Cited by 36 (8 self)
Reinforcement learning, and Q-learning in particular, encounter two major problems when dealing with large state spaces. First, learning the Q-function in tabular form may be infeasible because of the excessive amount of memory needed to store the table and because the Q-function only converges after each state has been visited multiple times. Second, rewards in the state space may be so sparse that with random exploration they will only be discovered extremely slowly. The first problem is often solved by learning a generalization of the encountered examples (e.g., using a neural net or decision tree). Relational reinforcement learning (RRL) is such an approach; it makes Q-learning feasible in structural domains by incorporating a relational learner into Q-learning. To solve the second problem, the use of "reasonable policies" to provide guidance has been suggested. In this paper we investigate the best ways to provide guidance in two different domains.
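For readers unfamiliar with the tabular setting this abstract refers to, the following is a minimal sketch of the standard one-step Q-learning update, not the relational method of the paper. The function name `q_update`, the dictionary-based table, and the default values of `alpha` and `gamma` are illustrative assumptions. The table holds one entry per visited (state, action) pair, which is the memory cost the abstract describes.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    q is a mapping from (state, action) pairs to values; unseen pairs
    default to 0.0. All names here are hypothetical, chosen for illustration.
    """
    # Greedy value of the successor state (0.0 if it has no actions).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # Temporal-difference target and incremental update.
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# One table entry per (state, action) pair: this is what grows with the state space.
q_table = defaultdict(float)
first = q_update(q_table, "s0", "a", 1.0, "s1", ["a", "b"], alpha=0.5)
```

Each entry is only refined when its state is actually visited, which is why convergence requires repeated visits to every state.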
Analysis of a Classification-based Policy Iteration Algorithm
Abstract

Cited by 31 (9 self)
We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a trade-off between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.
Rollout sampling approximate policy iteration
 Machine Learning
, 2008
Abstract

Cited by 30 (4 self)
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which addresses the core sampling problem in evaluating a policy through simulation as a multi-armed bandit machine. The resulting algorithm offers performance comparable to that of the previous algorithm, but with significantly less computational effort. An order of magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
Practical linear value-approximation techniques for first-order MDPs
 Proc. of the Conference on Uncertainty in Artificial Intelligence
, 2006
Abstract

Cited by 27 (0 self)
Recent work on approximate linear programming (ALP) techniques for first-order Markov Decision Processes (FOMDPs) represents the value function linearly w.r.t. a set of first-order basis functions and uses linear programming techniques to determine suitable weights. This approach offers the advantage that it does not require simplification of the first-order value function, and allows one to solve FOMDPs independent of a specific domain instantiation. In this paper, we address several questions to enhance the applicability of this work: (1) Can we extend the first-order ALP framework to approximate policy iteration and if so, how do these two algorithms compare? (2) Can we automatically generate basis functions and evaluate their impact on value function quality? (3) How can we decompose intractable problems with universally quantified rewards into tractable subproblems? We propose answers to these questions along with a number of novel optimizations and provide a comparative empirical evaluation on problems from the ICAPS 2004 Probabilistic Planning Competition.
Planning with Noisy Probabilistic Relational Rules
Abstract

Cited by 26 (6 self)
Noisy probabilistic relational rules are a promising world model representation for several reasons. They are compact and generalize over world instantiations. They are usually interpretable and they can be learned effectively from the action experiences in complex worlds. We investigate reasoning with such rules in grounded relational domains. Our algorithms exploit the compactness of rules for efficient and flexible decision-theoretic planning. As a first approach, we combine these rules with the Upper Confidence Bounds applied to Trees (UCT) algorithm based on look-ahead trees. Our second approach converts these rules into a structured dynamic Bayesian network representation and predicts the effects of action sequences using approximate inference and beliefs over world states. We evaluate the effectiveness of our approaches for planning in a simulated complex 3D robot manipulation scenario with an articulated manipulator and realistic physics and in domains of the probabilistic planning competition. Empirical results show that our methods can solve problems where existing methods fail.
First order decision diagrams for relational MDPs
 In Proceedings of the International Joint Conference on Artificial Intelligence
, 2007
Abstract

Cited by 26 (10 self)
Markov decision processes capture sequential decision making under uncertainty, where an agent must choose actions so as to optimize long-term reward. The paper studies efficient reasoning mechanisms for Relational Markov Decision Processes (RMDP) where world states have an internal relational structure that can be naturally described in terms of objects and relations among them. Two contributions are presented. First, the paper develops First Order Decision Diagrams (FODD), a new compact representation for functions over relational structures, together with a set of operators to combine FODDs, and novel reduction techniques to keep the representation small. Second, the paper shows how FODDs can be used to develop solutions for RMDPs, where reasoning is performed at the abstract level and the resulting optimal policy is independent of domain size (number of objects) or instantiation. In particular, a variant of the value iteration algorithm is developed by using special operations over FODDs, and the algorithm is shown to converge to the optimal policy.
Practical solution techniques for first-order MDPs
 Artificial Intelligence
Abstract

Cited by 25 (1 self)
Many traditional solution approaches to relationally specified decision-theoretic planning problems (e.g., those stated in the probabilistic planning domain description language, or PPDDL) ground the specification with respect to a specific instantiation of domain objects and apply a solution approach directly to the resulting ground Markov decision process (MDP). Unfortunately, the space and time complexity of these grounded solution approaches are polynomial in the number of domain objects and exponential in the predicate arity and the number of nested quantifiers in the relational problem specification. An alternative to grounding a relational planning problem is to tackle the problem directly at the relational level. In this article, we propose one such approach that translates an expressive subset of the PPDDL representation to a first-order MDP (FOMDP) specification and then derives a domain-independent policy without grounding at any intermediate step. However, such generality does not come without its own set of challenges; the purpose of this article is to explore practical solution techniques for solving FOMDPs. To demonstrate the applicability of our techniques, we present proof-of-concept results of our first-order approximate linear programming (FOALP) planner on problems from the probabilistic track
On learning linear ranking functions for beam search
 In Z. Ghahramani (Ed.), Proceedings of the 24th International Conference on Machine Learning (ICML2007)
, 2007
Abstract

Cited by 24 (10 self)
Beam search is used to maintain tractability in large search spaces at the expense of completeness and optimality. We study supervised learning of linear ranking functions for controlling beam search. The goal is to learn ranking functions that allow for beam search to perform nearly as well as unconstrained search while gaining computational efficiency. We first study the computational complexity of the learning problem, showing that even for exponentially large search spaces the general consistency problem is in NP. We also identify tractable and intractable subclasses of the learning problem. Next, we analyze the convergence of recently proposed and modified online learning algorithms. We first provide a counterexample to an existing convergence result and then introduce alternative notions of “margin” that do imply convergence. Finally, we study convergence properties for ambiguous training data.
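As a rough illustration of the setting in this abstract, the sketch below runs breadth-first beam search in which nodes are ranked by a linear function of their features. It is a generic sketch, not the paper's learning algorithm: the weights are supplied by hand rather than learned, and every name (`beam_search`, `linear_rank`, `featurize`, the toy integer search space) is a hypothetical choice made for the example.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def linear_rank(weights: Sequence[float], features: Sequence[float]) -> float:
    """Score a node as the dot product of a weight vector and its features."""
    return sum(w * f for w, f in zip(weights, features))


def beam_search(start,
                successors: Callable[[object], Iterable],
                featurize: Callable[[object], Sequence[float]],
                weights: Sequence[float],
                is_goal: Callable[[object], bool],
                beam_width: int,
                max_depth: int) -> Optional[object]:
    """Return a goal node if one enters the beam within max_depth expansions, else None."""
    beam: List = [start]
    for _ in range(max_depth):
        for node in beam:
            if is_goal(node):
                return node
        # Expand every node in the beam, then keep only the top-ranked children.
        candidates = [child for node in beam for child in successors(node)]
        if not candidates:
            return None
        candidates.sort(key=lambda n: linear_rank(weights, featurize(n)),
                        reverse=True)
        beam = candidates[:beam_width]  # pruning is what sacrifices completeness
    return next((node for node in beam if is_goal(node)), None)


# Toy search space (an assumption for illustration): integers, goal is 5.
def toy_successors(n):
    return [n + 1, n + 2]


def toy_features(n):
    return [-abs(5 - n)]  # single feature: negative distance to the goal


found = beam_search(0, toy_successors, toy_features, [1.0],
                    lambda n: n == 5, beam_width=1, max_depth=10)
```

With the weight `[1.0]` the ranking prefers nodes closer to the goal, so a beam of width 1 suffices; flipping the sign of the weight makes the same search prune the goal path, which is the loss of completeness the abstract mentions.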