Results 21–30 of 92
Towards Feature Selection In Actor-Critic Algorithms
Abstract

Cited by 5 (2 self)
Choosing features for the critic in actor-critic algorithms with function approximation is known to be a challenge. Too few critic features can lead to degeneracy of the actor gradient, and too many features may lead to slower convergence of the learner. In this paper, we show that a well-studied class of actor policies satisfies the known requirements for convergence when the actor features are selected carefully. We demonstrate that two popular representations for value methods, the barycentric interpolators and the graph Laplacian proto-value functions, can be used to represent the actor in order to satisfy these conditions. A consequence of this work is a generalization of the proto-value function methods to the continuous-action actor-critic domain. Finally, we analyze the performance of this approach using a simulation of a torque-limited inverted pendulum.
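The link between actor features and critic features that this abstract alludes to can be made concrete with a small sketch. This is my own illustration, not the paper's pendulum setup: for a Gaussian policy pi(a|s) = N(theta^T phi(s), sigma^2), the "compatible" critic features are the score function of the policy, which ties the critic's representation to the actor's feature vector phi. All names and values below are assumed for illustration.

```python
import numpy as np

# Compatible critic features for a Gaussian policy with linear mean:
# psi(s, a) = grad_theta log pi(a|s) = phi(s) * (a - theta^T phi(s)) / sigma^2
def compatible_features(phi, theta, a, sigma=1.0):
    mean = float(phi @ theta)            # actor's mean action at this state
    return phi * (a - mean) / sigma**2   # score-function (compatible) features

phi = np.array([1.0, 0.5])       # assumed actor features for one state
theta = np.array([0.2, -0.1])    # assumed policy parameters
psi = compatible_features(phi, theta, a=0.3)
```

A critic that is linear in psi satisfies the classical compatibility condition for unbiased policy-gradient estimates, which is the kind of requirement the paper studies.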
Sparse Approximate Policy Evaluation using Graph-based Basis Functions
Abstract

Cited by 5 (3 self)
Proto-value functions and diffusion wavelets are graph-based basis functions that capture topological structure of the MDP state space. A subset of these basis functions must be selected when approximating value functions in order to maintain computational efficiency and prevent overfitting. We evaluated four basis selection algorithms for performing this task. This is an enhancement over the previously used heuristic of always selecting the most global, or smoothest, subset of basis functions regardless of the policy being evaluated. We analyzed two schemes, one direct and one indirect, for combining basis selection and approximate policy evaluation. The indirect scheme requires more computation than the direct scheme, but gains flexibility in the manner in which basis functions are selected. The coefficients applied to the basis functions were set using least-squares methods. We also described how least-squares methods can be altered to include regularization. Laplacian-based regularization provides a bias toward smoother approximate value functions, which can prevent overfitting and can be useful in stochastic domains. A thorough set of experiments was conducted on a simple chain MDP to understand how basis selection and the different least-squares policy evaluation algorithms impact one another. Although the experiments used graph-based basis functions, the algorithms described in this paper can be applied to any set of basis functions.
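The pipeline this abstract describes, graph-based basis construction followed by least-squares fitting, can be sketched on the kind of simple chain MDP mentioned in the experiments. The sizes, the choice of k, and the target function below are assumed for illustration, not taken from the paper.

```python
import numpy as np

# Proto-value functions on a 10-state chain: eigenvectors of the graph
# Laplacian, ordered from most global (smoothest) to most local.
n = 10
A = np.zeros((n, n))
for i in range(n - 1):               # chain graph: state i <-> state i+1
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A       # combinatorial graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L) # ascending eigenvalues: smoothest first
k = 4
Phi = eigvecs[:, :k]                 # the k most global proto-value functions
V = np.linspace(0.0, 1.0, n)         # an illustrative target value function
w, *_ = np.linalg.lstsq(Phi, V, rcond=None)  # least-squares coefficients
V_hat = Phi @ w                      # smooth approximation of V
```

Selecting which k columns to keep, rather than always taking the smoothest ones as above, is precisely the basis-selection problem the paper addresses.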
Convergence analysis of on-policy LSPI for multidimensional continuous state and action-space MDPs and extension with orthogonal polynomial approximation. Working paper
, 2010
Abstract

Cited by 5 (1 self)
We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to infinite horizon problems where states and controls are vector-valued and continuous. We do not assume special structure such as linear, additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the expectation inside the optimization problem. We provide a formal convergence analysis of the algorithm under the assumption that value functions are spanned by finitely many known basis functions. Furthermore, the convergence result extends to … Central to the solution of Markov decision processes is Bellman's equation, which is often written in the standard form (Puterman (1994))

V_t(x_t) = max_{u_t ∈ U} { C(x_t, u_t) + γ Σ_{x_{t+1}} P(x_{t+1} | x_t, u_t) V_{t+1}(x_{t+1}) }
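The standard-form Bellman equation quoted in this entry can be exercised with a minimal value-iteration sketch. The 2-state, 2-action MDP below is assumed for illustration and is not from the paper (which targets continuous spaces where this exact enumeration is impossible, hence its use of post-decision states and sampling).

```python
import numpy as np

# Value iteration on a toy finite MDP, following the quoted standard form.
gamma = 0.9
# P[u, x, x'] = transition probability under control u; C[x, u] = contribution.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
C = np.array([[1.0, 2.0],
              [0.5, 1.5]])
V = np.zeros(2)
for _ in range(500):
    # V(x) = max_u { C(x, u) + gamma * sum_x' P(x'|x,u) V(x') }
    Q = C + gamma * np.einsum('uxy,y->xu', P, V)
    V = Q.max(axis=1)
```

After enough iterations V is (numerically) a fixed point of the Bellman operator, which is the object LSPI approximates within a basis-function span.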
Multi-agent behaviour segmentation via spectral clustering
 in Proceedings of the AAAI-2007 PAIR Workshop
, 2007
Abstract

Cited by 4 (1 self)
We examine the application of spectral clustering for breaking up the behaviour of a multi-agent system in space and time into smaller, independent elements. We extend the clustering into the temporal domain and propose a novel similarity measure, which is shown to possess desirable temporal properties when clustering multi-agent behaviour. We also propose a technique for incorporating knowledge about multi-agent interaction events of differing importance. We apply spectral clustering with this measure to analysing behaviour in a strategic game.
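The spectral-clustering machinery this abstract builds on can be sketched generically. The Gaussian similarity below is a standard placeholder, not the paper's novel temporal measure, and the toy 2-D points are assumed for illustration.

```python
import numpy as np

# Generic spectral clustering: affinity matrix -> graph Laplacian ->
# sign of the second-smallest eigenvector splits the data in two.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (10, 2)),   # behaviour cluster A
                 rng.normal(3.0, 0.1, (10, 2))])  # behaviour cluster B
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / 0.5)                # Gaussian similarity (affinity) matrix
L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]                 # Fiedler vector separates the two groups
labels = (fiedler > 0).astype(int)   # cluster assignment by sign
```

The paper's contribution amounts to replacing the similarity matrix W with one that also respects temporal structure in the multi-agent trajectories.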
Basis Expansion in Natural Actor Critic Methods
Abstract

Cited by 4 (2 self)
In reinforcement learning, the aim of the agent is to find a policy that maximizes its expected return. Policy gradient methods try to accomplish this goal by directly approximating the policy using a parametric function approximator; the expected return of the current policy is estimated and its parameters are updated by steepest ascent in the direction of the gradient of the expected return with respect to the policy parameters. In general, the policy is defined in terms of a set of basis functions that capture important features of the problem. Since the quality of the resulting policies depends directly on the set of basis functions, and defining them gets harder as the complexity of the problem increases, it is important to be able to find them automatically. In this paper, we propose a new approach which uses the cascade-correlation learning architecture for automatically constructing a set of basis functions within the context of Natural Actor-Critic (NAC) algorithms. Such basis functions allow more complex policies to be represented, and consequently improve the performance of the resulting policies. We also demonstrate the effectiveness of the method empirically.
Basis Construction from Power Series Expansions of Value Functions
Abstract

Cited by 4 (1 self)
This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework for analyzing properties of existing bases, as well as insight into constructing more effective ones. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence and a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse, called the Drazin (or group) inverse, of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases, called Bellman average-reward bases (BARBs), is described, which provides some of the same benefits at lower computational cost.
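The Neumann series that Krylov and Bellman-error bases are built on can be checked numerically. The 3-state transition matrix and reward vector below are assumed for illustration; the point is that partial sums of the series live in the Krylov space span{R, PR, P²R, …}, and that their convergence slows as γ → 1, which is what motivates the paper's Laurent/Drazin alternative.

```python
import numpy as np

# Neumann series for the value function: V = sum_k gamma^k P^k R.
gamma = 0.95
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])     # assumed 3-state chain transition matrix
R = np.array([0.0, 0.0, 1.0])       # assumed reward vector
V_exact = np.linalg.solve(np.eye(3) - gamma * P, R)   # closed-form value
V, term = np.zeros(3), R.copy()
for _ in range(2000):               # accumulate gamma^k P^k R terms
    V = V + term
    term = gamma * (P @ term)
```

With γ = 0.95 the series needs many terms; with γ closer to 1 it needs far more, mirroring the slow convergence of Krylov/Bellman-error bases that the abstract describes.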
Representation Discovery in Sequential Decision Making
Abstract

Cited by 4 (0 self)
Automatically constructing novel representations of tasks from analysis of state spaces is a long-standing fundamental challenge in AI. I review recent progress on this problem for sequential decision making tasks modeled as Markov decision processes. Specifically, I discuss three classes of representation discovery problems: finding functional, state, and temporal abstractions. I describe solution techniques varying along several dimensions: diagonalization or dilation methods using approximate or exact transition models; reward-specific vs. reward-invariant methods; global vs. local representation construction methods; multiscale vs. flat discovery methods; and finally, orthogonal vs. redundant representation discovery methods. I conclude by describing a number of open problems for future work.
Metric learning for reinforcement learning agents
 In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS)
, 2011
Abstract

Cited by 3 (1 self)
A key component of any reinforcement learning algorithm is the underlying representation used by the agent. While reinforcement learning (RL) agents have typically relied on hand-coded state representations, there has been a growing interest in learning this representation. While inputs to an agent are typically fixed (i.e., state variables represent sensors on a robot), it is desirable to automatically determine the optimal relative scaling of such inputs, as well as to diminish the impact of irrelevant features. This work introduces HOLLER, a novel distance metric learning algorithm, and combines it with an existing instance-based RL algorithm to achieve precisely these goals. The algorithms' success is highlighted via empirical measurements on a set of six tasks within the mountain car domain.
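What "relative scaling of inputs" buys an instance-based learner can be shown with a toy diagonal metric. This is my own illustration, not HOLLER: the point is that per-feature weights rescale the state space so that a large but irrelevant sensor dimension stops dominating nearest-neighbour distances. The values and the "learned" weights below are assumed.

```python
import numpy as np

# Diagonal (per-feature) metric: distance after rescaling each dimension.
def scaled_dist(x, y, w):
    return float(np.sqrt(((w * (x - y)) ** 2).sum()))

x = np.array([0.1, 500.0])        # second feature: irrelevant, large-scale
y = np.array([0.2, 900.0])
w_uniform = np.array([1.0, 1.0])  # raw Euclidean distance
w_learned = np.array([1.0, 0.0])  # hypothetical learned weights: noise zeroed
```

Under the uniform weights the irrelevant dimension dominates; under the learned weights only the meaningful 0.1 gap in the first feature remains.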
Stochastic enforced hill-climbing
, 2008
Abstract

Cited by 3 (1 self)
Enforced hill-climbing is an effective deterministic hill-climbing technique that deals with local optima using breadth-first search (a process called "basin flooding"). We propose and evaluate a stochastic generalization of enforced hill-climbing for online use in goal-oriented probabilistic planning problems. We assume a provided heuristic function estimating expected cost to the goal, with flaws such as local optima and plateaus that thwart straightforward greedy action choice. While breadth-first search is effective in exploring basins around local optima in deterministic problems, for stochastic problems we dynamically build and solve a heuristic-based Markov decision process (MDP) model of the basin in order to find a good escape policy exiting the local optimum. We note that building this model involves integrating the heuristic into the MDP problem, because the local goal is to improve the heuristic. We evaluate our proposal on twenty-four recent probabilistic planning-competition benchmark domains and twelve probabilistically interesting problems from recent literature. We show that stochastic enforced hill-climbing (SEH) produces better policies than greedy heuristic following for value/cost functions derived in two very different ways: one type derived by using deterministic heuristics on a deterministic relaxation, and a second type derived by automatic learning of Bellman-error features from domain-specific experience. Using the first type of heuristic, SEH is shown to generally outperform all planners from the first three international probabilistic planning competitions.
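The deterministic baseline that this paper generalizes is simple enough to sketch: from the current state, run breadth-first search ("basin flooding") until some state scores strictly better than the current one on the heuristic, then jump there. The toy 1-D domain and its plateaued heuristic below are assumed for illustration.

```python
from collections import deque

# Deterministic enforced hill-climbing: greedy descent on h, with BFS used
# to escape plateaus and local optima of the heuristic.
def enforced_hill_climb(start, goal, neighbors, h):
    state = start
    while state != goal:
        best_h = h(state)
        frontier, seen, found = deque([state]), {state}, None
        while frontier and found is None:
            s = frontier.popleft()
            for nxt in neighbors(s):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if h(nxt) < best_h:     # escaped the plateau / local optimum
                    found = nxt
                    break
                frontier.append(nxt)
        if found is None:
            return None                 # flooding exhausted the space: failure
        state = found
    return state

# Assumed toy domain: states 0..5 on a line, heuristic with a plateau at h = 2.
h = lambda s: {0: 3, 1: 2, 2: 2, 3: 2, 4: 1, 5: 0}[s]
neighbors = lambda s: [t for t in (s - 1, s + 1) if 0 <= t <= 5]
```

Greedy choice alone would stall on the h = 2 plateau at states 1–3; the BFS pass walks across it to state 4. SEH replaces this BFS with a local MDP solved for an escape policy, since in stochastic domains a single escape path is not enough.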