Results 11 - 20
of
2,080
Near-optimal reinforcement learning in polynomial time
- Machine Learning
, 1998
"... We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the m ..."
Abstract
-
Cited by 174 (3 self)
- Add to MetaCart
We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off. 1
Probabilistic Algorithms in Robotics
- AI Magazine
, 2000
"... This article describes a methodology for programming robots known as probabilistic robotics. The probabilistic paradigm pays tribute to the inherent uncertainty in robot perception, relying on explicit representations of uncertainty when determining what to do. This article surveys some of the progr ..."
Abstract
-
Cited by 147 (7 self)
- Add to MetaCart
This article describes a methodology for programming robots known as probabilistic robotics. The probabilistic paradigm pays tribute to the inherent uncertainty in robot perception, relying on explicit representations of uncertainty when determining what to do. This article surveys some of the progress in the field, using in-depth examples to illustrate some of the nuts and bolts of the basic approach. Our central conjecture is that the probabilistic approach to robotics scales better to complex real-world applications than approaches that ignore a robot's uncertainty.
Actor-Critic Algorithms
- SIAM JOURNAL ON CONTROL AND OPTIMIZATION
, 2001
"... In this paper, we propose and analyze a class of actor-critic algorithms. These are two-time-scale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on in ..."
Abstract
-
Cited by 140 (1 self)
- Add to MetaCart
In this paper, we propose and analyze a class of actor-critic algorithms. These are two-time-scale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actorcritic algorithms for Markov decision processes with general state and action spaces. We state and prove two results regarding their convergence.
Probabilistic Algorithms and the Interactive Museum Tour-Guide Robot Minerva
, 2000
"... This paper describes Minerva, an interactive tour-guide robot that was successfully deployed in a Smithsonian museum. Minerva's software is pervasively probabilistic, relying on explicit representations of uncertainty in perception and control. This article describes ..."
Abstract
-
Cited by 128 (34 self)
- Add to MetaCart
This paper describes Minerva, an interactive tour-guide robot that was successfully deployed in a Smithsonian museum. Minerva's software is pervasively probabilistic, relying on explicit representations of uncertainty in perception and control. This article describes
A Stochastic Model of Human-Machine Interaction for learning dialog Strategies
- IEEE Transactions on Speech and Audio Processing
, 2000
"... Abstract—In this paper, we propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given ..."
Abstract
-
Cited by 122 (3 self)
- Add to MetaCart
Abstract—In this paper, we propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given application. We also show that any dialog system can be formally described as a sequential decision process in terms of its state space, action set, and strategy. With additional assumptions about the state transition probabilities and cost assignment, a dialog system can be mapped to a stochastic model known as Markov decision process (MDP). A variety of data driven algorithms for finding the optimal strategy (i.e., the one that optimizes the criterion) is available within the MDP framework, based on reinforcement learning. For an effective use of the available training data we propose a combination of supervised and reinforcement learning: the supervised learning is used to estimate a model of the user, i.e., the MDP parameters that quantify the user’s behavior. Then a reinforcement learning algorithm is used to estimate the optimal strategy while the system interacts with the simulated user. This approach is tested for learning the strategy in an air travel information system (ATIS) task. The experimental results we present in this paper show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups. Index Terms—Dialog systems, Markov decision process, reinforcement learning, sequential decision process, speech, spoken
Stochastic Dynamic Programming with Factored Representations
, 1997
"... Markov decision processes(MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propo ..."
Abstract
-
Cited by 120 (9 self)
- Add to MetaCart
Markov decision processes(MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propose new representational and computational techniques for MDPs that exploit certain types of problem structure. We use dynamic Bayesian networks (with decision trees representing the local families of conditional probability distributions) to represent stochastic actions in an MDP, together with a decision-tree representation of rewards. Based on this representation, we develop versions of standard dynamic programming algorithms that directly manipulate decision-tree representations of policies and value functions. This generally obviates the need for state-by-state computation, aggregating states at the leaves of these trees and requiring computations only for each aggregate state. The key to these algorithms is a decision-theoretic generalization of classic regression analysis, in which we determine the features relevant to predicting expected value. We demonstrate the method empirically on several planning problems,
Recent advances in hierarchical reinforcement learning
, 2003
"... A preliminary unedited version of this paper was incorrectly published as part of Volume ..."
Abstract
-
Cited by 119 (18 self)
- Add to MetaCart
A preliminary unedited version of this paper was incorrectly published as part of Volume
Infinite-horizon policy-gradient estimation
- Journal of Artificial Intelligence Research
, 2001
"... Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce � � , a si ..."
Abstract
-
Cited by 119 (5 self)
- Add to MetaCart
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce � � , a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes ( � s) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter � � (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of � � , and show how the correct choice of the parameter is related to the mixing time of the controlled �. We briefly describe extensions of � � to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by � � can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward. 1.
Improved Inference for Unlexicalized Parsing
, 2007
"... We present several improvements to unlexicalized parsing with hierarchically state-split PCFGs. First, we present a novel coarse-to-fine method in which a grammar’s own hierarchical projections are used for incremental pruning, including a method for efficiently computing projections of a grammar wi ..."
Abstract
-
Cited by 115 (17 self)
- Add to MetaCart
We present several improvements to unlexicalized parsing with hierarchically state-split PCFGs. First, we present a novel coarse-to-fine method in which a grammar’s own hierarchical projections are used for incremental pruning, including a method for efficiently computing projections of a grammar without a treebank. In our experiments, hierarchical pruning greatly accelerates parsing with no loss in empirical accuracy. Second, we compare various inference procedures for state-split PCFGs from the standpoint of risk minimization, paying particular attention to their practical tradeoffs. Finally, we present multilingual experiments which show that parsing with hierarchical state-splitting is fast and accurate in multiple languages and domains, even without any language-specific tuning.
Learning to Cooperate via Policy Search
, 2000
"... Cooperative games are those in which both agents share the same payoff structure. Valuebased reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Poli ..."
Abstract
-
Cited by 114 (4 self)
- Add to MetaCart
Cooperative games are those in which both agents share the same payoff structure. Valuebased reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Policy search methods are a reasonable alternative to value-based methods for partially observable environments. In this paper, we provide a gradient-based distributed policysearch method for cooperative games and compare the notion of local optimum to that of Nash equilibrium. We demonstrate the effectiveness of this method experimentally in a small, partially observable simulated soccer domain. 1 INTRODUCTION The interaction of decision makers who share an environment is traditionally studied in game theory and economics. The game theoretic formalism is very general, and analyzes the problem in terms of solution concepts such as Nash equilibrium [12], but usually works under the assu...

