Results 1–10 of 13
Perception, Action and Utility: The Tangled Skein
, 2011
Cited by 4 (0 self)
Normative theories of learning and decision-making are motivated by a computational-level analysis of the task facing an animal: what should the animal do to maximize future reward? However, much of the recent excitement in this field originates in how the animal arrives at its decisions and reward predictions: algorithmic questions about which the computational-level analysis is silent.
Efficient Inference in Markov Control Problems
Cited by 2 (2 self)
Markov control algorithms that perform smooth, non-greedy updates of the policy have been shown to be very general and versatile, with policy gradient and Expectation Maximisation algorithms being particularly popular. For these algorithms, marginal inference of the reward-weighted trajectory distribution is required to perform policy updates. We discuss a new exact inference algorithm for these marginals in the finite-horizon case that is more efficient than the standard approach based on classical forward-backward recursions. We also provide a principled extension to infinite-horizon Markov Decision Problems that explicitly accounts for an infinite horizon. This extension provides a novel algorithm for both policy gradients and Expectation Maximisation in infinite-horizon problems.

1 MARKOV DECISION PROBLEMS

A Markov Decision Problem (MDP) is described by an initial state distribution p_1(s_1), transition distributions p(s_{t+1} | s_t, a_t) and reward function R_t(s_t, a_t), where the state and action at time t are denoted by s_t and a_t respectively (Sutton and Barto, 1998). (To avoid cumbersome notation we also write z_t = {s_t, a_t} for a state-action pair, and use the bold typeface, z_t, to denote a vector.) The state and action spaces can be either discrete or continuous. For a discount factor γ ∈ [0, 1) the reward is defined as R_t(s_t, a_t) = γ^{t-1} R(s_t, a_t) for a stationary reward R(s_t, a_t). We assume a stationary policy, π, defined as a set of conditional distributions over the action space, π_{a,s} = p(a_t = a | s_t = s, π). The total expected reward of the MDP (the policy utility) is given by

  U(π) = Σ_{t=1}^{H} Σ_{s_t, a_t} R_t(s_t, a_t) p(s_t, a_t | π),   (1)

where H is the horizon, which can be either finite or infinite, and p(s_t, a_t | π) is the marginal of the joint state-action trajectory distribution

  p(s_{1:H}, a_{1:H} | π) = p(a_H | s_H, π) p_1(s_1) ∏_{t=1}^{H-1} p(s_{t+1} | s_t, a_t) p(a_t | s_t, π).   (2)
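As a concrete illustration of Eq. (1), the policy utility of a discrete finite-horizon MDP can be computed by propagating the state marginal forward in time. The sketch below is my own illustration (function and variable names are invented, not the paper's), assuming tabular p_1, transitions, rewards and policy as NumPy arrays:

```python
import numpy as np

def policy_utility(p1, P, R, pi, H, gamma):
    """Total expected reward U(pi) of a finite-horizon MDP, as in Eq. (1).

    p1: (S,)      initial state distribution p_1(s_1)
    P:  (S, A, S) transitions, P[s, a, s'] = p(s' | s, a)
    R:  (S, A)    stationary reward R(s, a)
    pi: (S, A)    stationary policy, pi[s, a] = p(a | s)
    """
    ps = p1.copy()                           # marginal p(s_t | pi)
    U = 0.0
    for t in range(1, H + 1):
        psa = ps[:, None] * pi               # joint p(s_t, a_t | pi)
        U += gamma ** (t - 1) * np.sum(psa * R)   # R_t = gamma^{t-1} R
        ps = np.einsum('sa,sap->p', psa, P)  # p(s_{t+1}) = sum_{s,a} p(s,a) p(s'|s,a)
    return U
```

For a one-state, one-action MDP with unit reward this reduces to the discounted sum Σ_{t=1}^{H} γ^{t-1}, which gives a quick sanity check.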
Solving deterministic policy (PO)MDPs using expectation-maximisation and antifreeze
 In European Conference on Machine Learning (LEMIR workshop)
, 2009
Cited by 2 (0 self)
Solving Markov Decision Processes and their partially observable extension amounts to finding policies that maximise the expected reward. We follow the rephrasing of this problem as learning in a related probabilistic model. Our trans-dimensional distribution formulation obtains results equivalent to previous work in the infinite-horizon case and also rigorously handles the finite-horizon case without discounting. In contrast to previous expositions, our framework elides auxiliary variables, simplifying the algorithm development. For any MDP an optimal deterministic policy exists, meaning that this important case needs to be dealt with explicitly. Whilst this case has been discussed by previous authors, their treatment has not been formally equivalent to an EM algorithm, but rather based on a fixed-point iteration analogous to policy iteration. In contrast, we derive a true EM approach for this case and show that it has a significantly faster convergence rate than non-deterministic EM. Our approach extends naturally to the POMDP
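A minimal sketch of the planning-as-inference idea this abstract builds on: for a one-step MDP (a bandit) with non-negative rewards, an EM-style update replaces the policy with the reward-weighted posterior over actions. This toy example is my own illustration, not the paper's algorithm:

```python
import numpy as np

def em_policy_update(pi, R):
    """One EM-style policy update for a one-step MDP (bandit):
    pi'(a) is proportional to pi(a) R(a), the reward-weighted posterior.
    Rewards are assumed non-negative. Iterating concentrates the policy
    on the best action, but only geometrically, never in one step, which
    is the slow convergence of non-deterministic EM the abstract contrasts
    with its deterministic variant.
    """
    w = pi * R
    return w / w.sum()
```

Starting from a uniform policy over rewards (1, 3), a single update gives (0.25, 0.75); repeated updates approach the deterministic policy that always picks the second action.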
, 2010
Cited by 2 (0 self)
We suggested recently that attention can be understood as inferring the level of uncertainty or precision during hierarchical perception. In this paper, we try to substantiate this claim using neuronal simulations of directed spatial attention and biased competition. These simulations assume that neuronal activity encodes a probabilistic representation of the world that optimizes free-energy in a Bayesian fashion. Because free-energy bounds surprise or the (negative) log-evidence for internal models of the world, this optimization can be regarded as evidence accumulation or (generalized) predictive coding. Crucially, both predictions about the state of the world generating sensory data and the precision of those data have to be optimized. Here, we show that if the precision depends on the states, one can explain many aspects of attention. We illustrate this in the context of the Posner paradigm, using the simulations to generate both psychophysical and electrophysiological responses. These simulated responses are consistent with attentional bias or gating, competition for attentional resources, attentional capture and associated speed-accuracy trade-offs. Furthermore, if we present both attended and non-attended stimuli simultaneously, biased competition for neuronal representation emerges as a principled and straightforward property of Bayes-optimal perception.
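The role this abstract assigns to precision can be illustrated with a toy single-level predictive-coding scheme: a latent estimate descends the gradient of free energy, and the sensory precision weights the corresponding prediction error, so raising it (as attention is proposed to do) biases inference toward the data. A hedged sketch, not the paper's generative model:

```python
def infer(y, mu_prior, pi_sensory, pi_prior, steps=200, lr=0.05):
    """Toy single-level predictive coding: infer a latent mu from datum y.

    Each step descends the free-energy gradient; pi_sensory weights the
    sensory prediction error and pi_prior the prior prediction error.
    The fixed point is the precision-weighted Gaussian posterior mean
    (pi_sensory * y + pi_prior * mu_prior) / (pi_sensory + pi_prior).
    """
    mu = mu_prior
    for _ in range(steps):
        eps_y = y - mu            # sensory prediction error
        eps_p = mu - mu_prior     # prior prediction error
        mu += lr * (pi_sensory * eps_y - pi_prior * eps_p)
    return mu
```

With a prior at 0 and a datum at 1, high sensory precision pulls the estimate toward the datum and low precision leaves it near the prior, which is the attentional-gating intuition in miniature.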
Inference strategies for solving semi-Markov decision processes
Cited by 1 (0 self)
Semi-Markov decision processes (SMDPs) generalize standard MDPs to domains where time is not discretized equally between every set of states and actions [3]. Instead we can define a jump-Markov process where the amount of time spent in each state is a random variable. This formulation gives us an intuitive way to reason about actions where it is also necessary to take into account how long these actions will take to perform. Formally, we can define an SMDP as a continuous-time controlled stochastic process (x(t), u(t)) consisting, respectively, of states and actions at every point in time t, where state transitions occur at random arrival times T_n. In particular, the process is stationary in between jumps, i.e. x(t) = x_n and u(t) = u_n
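A jump-Markov process of this kind is straightforward to simulate: the state is piecewise constant and jump times T_n arrive after random sojourns. The sketch below uses exponentially distributed sojourn times and invented helper names (`transition`, `sojourn_rate`, `policy`), purely for illustration, not as the paper's formulation:

```python
import random

def simulate_smdp(transition, sojourn_rate, policy, s0, horizon):
    """Sample one trajectory of a toy semi-Markov decision process.

    transition(s, a):   samples the next state at a jump
    sojourn_rate(s, a): rate of the exponential dwell time in (s, a)
    policy(s):          picks the action held until the next jump
    Returns the piecewise-constant path as a list of (T_n, x_n) pairs.
    """
    t, s = 0.0, s0
    path = [(t, s)]
    while True:
        a = policy(s)
        t += random.expovariate(sojourn_rate(s, a))  # dwell time in (s, a)
        if t >= horizon:
            break
        s = transition(s, a)
        path.append((t, s))
    return path
```

Between consecutive entries (T_n, x_n) and (T_{n+1}, x_{n+1}) the process holds x(t) = x_n, matching the stationarity-between-jumps property stated above.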
Lagrange Dual Decomposition for Finite Horizon Markov Decision Processes
Cited by 1 (0 self)
Solving finite-horizon Markov Decision Processes with stationary policies is a computationally difficult problem. Our dynamic dual decomposition approach uses Lagrange duality to decouple this hard problem into a sequence of tractable subproblems. The resulting procedure is a straightforward modification of standard non-stationary Markov Decision Process solvers and gives an upper bound on the total expected reward. The empirical performance of the method suggests that not only is it a rapidly convergent algorithm, but that it also performs favourably compared to standard planning algorithms such as policy gradients and lower-bound procedures such as Expectation Maximisation.
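For context, the tractable subproblems referred to here are of the kind solved by standard non-stationary finite-horizon dynamic programming. A minimal backward-induction sketch (this illustrates the subproblem solver, not the paper's dual decomposition itself):

```python
import numpy as np

def backward_induction(P, R, H):
    """Optimal non-stationary policy for a finite-horizon tabular MDP.

    P: (S, A, S) transitions, P[s, a, s'] = p(s' | s, a)
    R: (S, A)    rewards
    Returns (policy, V) where policy[t][s] is the optimal action at
    time step t and V[s] is the optimal value from the first step.
    """
    V = np.zeros(P.shape[0])
    policy = []
    for _ in range(H):
        Q = R + P @ V                # Q[s, a] = R[s, a] + sum_s' P[s,a,s'] V[s']
        policy.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    policy.reverse()                 # computed backwards, so reverse to time order
    return policy, V
```

The optimal policy is non-stationary in general (it may differ across time steps), which is exactly why restricting to stationary policies, as in the abstract, makes the problem hard.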
Compositional Policy Priors
, 2013
Cited by 1 (0 self)
This paper describes a probabilistic framework for incorporating structured inductive biases into reinforcement learning. These inductive biases arise from policy priors, probability distributions over optimal policies. Borrowing recent ideas from computational linguistics and Bayesian nonparametrics, we define several families of policy priors that express compositional, abstract structure in a domain. Compositionality is expressed using probabilistic context-free grammars, enabling a compact representation of hierarchically organized subtasks. Useful sequences of subtasks can be cached and reused by extending the grammars nonparametrically using Fragment Grammars. We present Monte Carlo methods for performing inference, and show how structured policy priors lead to substantially faster learning in complex domains compared to methods without inductive biases.
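The compositional idea can be illustrated by sampling a flat action sequence from a toy probabilistic context-free grammar over subtasks. The grammar and names below are invented for illustration and are not from the paper:

```python
import random

def sample(symbol, rules):
    """Sample a flat primitive-action sequence from a toy PCFG.

    rules: nonterminal -> list of (probability, right-hand side) pairs;
    any symbol absent from `rules` is a terminal (a primitive action).
    """
    if symbol not in rules:
        return [symbol]                      # terminal: emit the action
    probs, rhss = zip(*rules[symbol])
    rhs = random.choices(rhss, weights=probs)[0]  # pick a production
    out = []
    for s in rhs:
        out.extend(sample(s, rules))         # expand recursively
    return out
```

A grammar like `{"TASK": [(0.5, ("GOTO", "pick")), (0.5, ("GOTO", "place"))], "GOTO": [(1.0, ("north", "east"))]}` compactly encodes that both high-level tasks share the same navigation subtask, which is the kind of hierarchical reuse the abstract describes.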
Approximate Newton Methods for Policy Search in Markov Decision Processes
, 2016
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton's method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analyse the structure of the Hessian of the total expected reward, which is a standard objective function for MDPs. We show that, like the gradient, the Hessian exhibits useful structure in the context of MDPs and we use this analysis to motivate two Gauss-Newton methods for MDPs. Like the Gauss-Newton method for nonlinear least squares, these methods drop certain terms in the Hessian. The approximate Hessians possess desirable properties, such as negative definiteness, and we demonstrate several important performance guarantees including guaranteed ascent directions, invariance to affine transformation of the parameter space and convergence guarantees. We finally provide a unifying perspective of key policy search algorithms, demonstrating that our second Gauss-Newton algorithm is closely related to both the EM algorithm and natural gradient ascent applied to MDPs, but performs significantly better in practice on a range of challenging domains.
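The dropped-terms idea referenced here comes from classical Gauss-Newton for nonlinear least squares, where the exact Hessian J^T J + Σ_i r_i ∇²r_i is approximated by J^T J alone. A minimal sketch of that classical method (not the paper's MDP variant):

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Gauss-Newton for min ||r(x)||^2.

    The curvature-of-residual terms sum_i r_i(x) * Hess(r_i) are dropped,
    leaving the always-positive-semidefinite approximation J^T J; the
    MDP methods in the abstract drop analogous terms in the Hessian of
    the expected reward.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        x = x - np.linalg.solve(J.T @ J, J.T @ r)  # Gauss-Newton step
    return x
```

Because J^T J is positive semidefinite by construction, every step is a descent direction for the least-squares objective; the abstract's "guaranteed ascent directions" for MDPs is the mirror-image property (negative definiteness, since reward is maximized).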
Probabilistic Inference Techniques for Scalable Multiagent Decision Making
, 2015
Decentralized POMDPs provide an expressive framework for multiagent sequential decision making. However, the complexity of these models (NEXP-Complete even for two agents) has limited their scalability. We present a promising new class of approximation algorithms by developing novel connections between multiagent planning and machine learning. We show how the multiagent planning problem can be reformulated as inference in a mixture of dynamic Bayesian networks (DBNs). This planning-as-inference approach paves the way for the application of efficient inference techniques in DBNs to multiagent decision making. To further improve scalability, we identify certain conditions that are sufficient to extend the approach to multiagent systems with dozens of agents. Specifically, we show that the necessary inference within the expectation-maximization framework can be decomposed into processes that often involve a small subset of agents, thereby facilitating scalability. We further show that a number of existing multiagent planning models satisfy these conditions. Experiments on large planning benchmarks confirm the benefits of our approach in terms of runtime and scalability with respect to existing techniques.