Results 1–10 of 66
Dynamic Programming for Partially Observable Stochastic Games
In Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004
Cited by 154 (25 self)

Abstract
We develop an exact dynamic programming algorithm for partially observable stochastic games (POSGs). The algorithm is a synthesis of dynamic programming for partially observable Markov decision processes (POMDPs) and iterated elimination of dominated strategies in normal form games.
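The game-theoretic half of this synthesis, iterated elimination of dominated strategies, can be sketched for a two-player normal-form game. A minimal illustration (the prisoner's-dilemma payoff matrices below are made up for the example, not taken from the paper):

```python
def iterated_elimination(p1, p2):
    """Iteratively remove strictly dominated pure strategies for both players.

    p1[i][j] / p2[i][j]: row / column player's payoff for (row i, column j).
    Returns the index sets of surviving rows and columns.
    """
    rows = set(range(len(p1)))
    cols = set(range(len(p1[0])))
    changed = True
    while changed:
        changed = False
        for i in list(rows):
            # row i is dominated if another surviving row beats it in every surviving column
            if any(all(p1[r][c] > p1[i][c] for c in cols) for r in rows if r != i):
                rows.discard(i)
                changed = True
        for j in list(cols):
            if any(all(p2[r][c] > p2[r][j] for r in rows) for c in cols if c != j):
                cols.discard(j)
                changed = True
    return rows, cols

# Prisoner's-dilemma payoffs: "defect" (index 1) strictly dominates "cooperate".
p1 = [[3, 0], [5, 1]]
p2 = [[3, 5], [0, 1]]
surviving = iterated_elimination(p1, p2)
```

In the POSG algorithm this pruning is applied to sets of policy trees rather than to a fixed payoff matrix, but the elimination loop has the same shape.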
Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010)
Cited by 73 (16 self)

Abstract
The simple but general formal theory of fun & intrinsic motivation & creativity (1990) is based on the concept of maximizing intrinsic reward for the active creation or discovery of novel, surprising patterns allowing for improved prediction or data compression. It generalizes the traditional field of active learning, and is related to old but less formal ideas in aesthetics theory and developmental psychology. It has been argued that the theory explains many essential aspects of intelligence, including autonomous development, science, art, music, and humor. This overview first describes theoretically optimal (but not necessarily practical) ways of implementing the basic computational principles on exploratory, intrinsically motivated agents or robots, encouraging them to provoke event sequences exhibiting previously unknown but learnable algorithmic regularities. Emphasis is put on the importance of limited computational resources for online prediction and compression. Discrete and continuous time formulations are given. Previous practical but non-optimal implementations (1991, 1995, 1997–2002) are reviewed, as well as several recent variants by others (2005). A simplified typology addresses current confusion concerning the precise nature of intrinsic motivation.
Knows What It Knows: A Framework For Self-Aware Learning
Cited by 70 (20 self)

Abstract
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and open problems.
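The defining KWIK behavior is that the learner answers either accurately or with an explicit "I don't know" (often written ⊥), and is charged only for the ⊥ responses. This can be illustrated with the simplest KWIK-learnable class, memorizing a deterministic function on a finite input set; the class and names below are illustrative, not from the paper:

```python
BOTTOM = None  # stands in for the "I don't know" symbol ⊥

class MemorizationKWIK:
    """KWIK learner for any deterministic function on a finite input set.

    On an unseen input it outputs ⊥ (and then observes the true label);
    on a seen input it predicts exactly. The number of ⊥ responses is
    therefore bounded by the size of the input set.
    """
    def __init__(self):
        self.memory = {}

    def predict(self, x):
        return self.memory.get(x, BOTTOM)

    def observe(self, x, y):
        self.memory[x] = y
```

Richer KWIK-learnable classes (noisy coin estimation, linear functions) follow the same accurate-or-⊥ contract with more interesting bounds.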
Efficient structure learning in factored-state MDPs
, 2007
Cited by 59 (9 self)

Abstract
We consider the problem of reinforcement learning in factored-state MDPs in the setting in which learning is conducted in one long trial with no resets allowed. We show how to extend existing efficient algorithms that learn the conditional probability tables of dynamic Bayesian networks (DBNs) given their structure to the case in which DBN structure is not known in advance. Our method learns the DBN structures as part of the reinforcement-learning process and provably provides an efficient learning algorithm when combined with factored R-max.
Online linear regression and its application to model-based reinforcement learning
In Advances in Neural Information Processing Systems 20 (NIPS-07), 2007
Cited by 47 (9 self)

Abstract
We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite-state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.
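As a rough illustration of the model-learning step, here is a minimal online least-squares regressor for scalar linear dynamics. This is a generic sketch, not the paper's multivariate, kernelized algorithm with its confidence machinery:

```python
class OnlineScalarRegressor:
    """Online least squares for a one-dimensional linear model y ≈ theta * x.

    Maintains running sufficient statistics so each update is O(1);
    the ridge term keeps the estimate defined before any data arrives.
    """
    def __init__(self, ridge=1e-6):
        self.sxx = ridge   # running sum of x * x (plus tiny ridge)
        self.sxy = 0.0     # running sum of x * y

    def update(self, x, y):
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x):
        return (self.sxy / self.sxx) * x

# Fit transitions from a hypothetical noiseless linear dynamics model s' = 2 * s.
reg = OnlineScalarRegressor()
for s in [1.0, 2.0, 3.0]:
    reg.update(s, 2.0 * s)
```

In a model-based learner, one such regressor per state dimension would supply next-state predictions to the planner as data streams in.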
An Analysis of Model-Based Interval Estimation for Markov Decision Processes
, 2007
Cited by 45 (5 self)

Abstract
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents a theoretical analysis of MBIE and a new variation called MBIE-EB, proving their efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less “online” cousins from the literature.
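The exploration-bonus variant can be summarized by its Bellman backup, which adds a count-based bonus to the empirical model's value estimate. A sketch assuming tabular empirical estimates maintained elsewhere (the function name, dictionary layout, and toy MDP are illustrative):

```python
import math

def bonus_backup(s, a, r_hat, t_hat, counts, Q, beta, gamma):
    """One MBIE-EB-style Bellman backup with an exploration bonus.

    Target = r_hat(s,a) + beta / sqrt(n(s,a)) + gamma * E_{s'~T_hat}[max_a' Q(s',a')].
    r_hat and t_hat are empirical reward/transition estimates; beta is the
    bonus coefficient, whose justified value comes from the paper's analysis.
    """
    bonus = beta / math.sqrt(max(counts[(s, a)], 1))
    future = sum(p * max(Q[s2].values()) for s2, p in t_hat[(s, a)].items())
    return r_hat[(s, a)] + bonus + gamma * future

# Toy single-state MDP: one state "s", one action "a", tried 4 times so far.
Q = {"s": {"a": 0.0}}
target = bonus_backup("s", "a",
                      r_hat={("s", "a"): 1.0},
                      t_hat={("s", "a"): {"s": 1.0}},
                      counts={("s", "a"): 4},
                      Q=Q, beta=1.0, gamma=0.5)
```

The bonus shrinks as the visit count grows, so under-explored state-action pairs look optimistically valuable until they have been tried enough.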
Potential-based shaping in model-based reinforcement learning
In Proceedings of the AAAI Conference on Artificial Intelligence, 2008
Cited by 24 (2 self)

Abstract
Potential-based shaping was designed as a way of introducing background knowledge into model-free reinforcement-learning algorithms. By identifying states that are likely to have high value, this approach can decrease experience complexity (the number of trials needed to find near-optimal behavior). An orthogonal way of decreasing experience complexity is to use a model-based learning approach, building and exploiting an explicit transition model. In this paper, we show how potential-based shaping can be redefined to work in the model-based setting to produce an algorithm that shares the benefits of both ideas.
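The classical shaping transformation the paper starts from is a one-liner: the reward is augmented by the discounted change in a state potential, which provably leaves the optimal policy unchanged in the model-free setting. A minimal sketch (the goal-distance potential is a made-up example):

```python
def shaped_reward(r, s, s_next, phi, gamma):
    """Potential-based shaping: R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s).

    phi is any state-potential function encoding background knowledge,
    e.g. an optimistic guess at each state's value.
    """
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: integer states, with states nearer a goal at 10
# assigned higher potential.
phi = lambda s: -abs(10 - s)
r_prime = shaped_reward(0.0, s=3, s_next=4, phi=phi, gamma=0.9)
```

A transition that moves toward the goal receives a positive shaping increment even when the raw reward is zero, which is the mechanism the paper transplants into the model-based setting.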
Markov decision processes with arbitrary reward processes
Mathematics of Operations Research, 2009
Cited by 20 (7 self)

Abstract
We consider a learning problem where the decision maker interacts with a standard Markov decision process, with the exception that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform as well, in hindsight, as every stationary policy. This generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm, in the spirit of reinforcement learning, that ensures that the agent’s average performance loss vanishes over time, provided that the environment is oblivious to the agent’s actions. Moreover, it is possible to modify the basic algorithm to cope with instances where reward observations are limited to the agent’s trajectory. We present further modifications that reduce the computational cost by using function approximation and that track the optimal policy through infrequent changes.
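The classical building block being generalized here is a no-regret forecaster for repeated games, e.g. exponentially weighted averaging over a fixed action set. A sketch of that building block only (it omits the MDP state dynamics that the paper's algorithm handles):

```python
import math

def exp_weights(reward_seqs, eta):
    """Exponentially weighted forecaster over a fixed set of actions.

    reward_seqs: per-round lists of rewards in [0, 1], one entry per action.
    Returns the forecaster's total expected reward; its regret against the
    best single action grows only logarithmically in the number of actions.
    """
    weights = [1.0] * len(reward_seqs[0])
    total = 0.0
    for rewards in reward_seqs:
        z = sum(weights)
        # play the mixed strategy proportional to current weights
        total += sum(w / z * r for w, r in zip(weights, rewards))
        # multiplicatively boost actions that earned high reward
        weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    return total

# Action 1 always pays 1, action 0 always pays 0: the forecaster's total
# expected reward over 50 rounds approaches the best action's total of 50.
total = exp_weights([[0.0, 1.0]] * 50, eta=0.5)
```

The paper's contribution is making this kind of hindsight guarantee hold when actions also drive the state of a Markov decision process.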
CORL: A continuous-state offset-dynamics reinforcement learner (Technical Report)
, 2008
Cited by 13 (4 self)

Abstract
Continuous state spaces and stochastic, switching dynamics characterize a number of rich, real-world domains, such as robot navigation across varying terrain. We describe a reinforcement-learning algorithm for learning in these domains and prove for certain environments the algorithm is probably approximately correct with a sample complexity that scales polynomially with the state-space dimension. Unfortunately, no optimal planning techniques exist in general for such problems; instead we use fitted value iteration to solve the learned MDP, and include the error due to approximate planning in our bounds. Finally, we report an experiment using a robotic car driving over varying terrain to demonstrate that these dynamics representations adequately capture real-world dynamics and that our algorithm can be used to efficiently solve such problems.
Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search
Cited by 11 (1 self)

Abstract
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
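FSSS builds on the sparse-sampling idea of estimating Q-values by recursive sampled lookahead through a generative model. A sketch of vanilla sparse sampling, without the bound-based pruning that FSSS adds (the generative-model interface `simulate(s, a) -> (reward, next_state)` is an assumption for the example):

```python
def sparse_sample_q(simulate, state, actions, depth, width, gamma):
    """Sparse-sampling estimate of Q-values via recursive sampled lookahead.

    For each action, draws `width` samples from the generative model and
    recurses to depth `depth`; the per-state cost is independent of the
    size of the state space.
    """
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):
            r, s2 = simulate(state, a)
            total += r + gamma * max(sparse_sample_q(simulate, s2, actions,
                                                     depth - 1, width, gamma).values())
        q[a] = total / width
    return q

# Hypothetical deterministic model: action "good" always pays 1, "bad" pays 0,
# and the state never changes.
q = sparse_sample_q(lambda s, a: (1.0 if a == "good" else 0.0, s),
                    "s0", ["good", "bad"], depth=2, width=1, gamma=0.5)
```

Applied to the belief-space MDP, this lookahead is what lets the agent plan approximately Bayes-optimally without enumerating the exponentially large belief space.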