Results 11–20 of 46
Dynamic Policy Programming
 Journal of Machine Learning Research
Abstract

Cited by 7 (1 self)
The following full text is a preprint version which may differ from the publisher's version.
Finite-sample analysis of Bellman residual minimization
 In Proceedings of the Second Asian Conference on Machine Learning
, 2010
Abstract

Cited by 7 (2 self)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Regularized Fitted Q-iteration: Application to Planning
Abstract

Cited by 6 (1 self)
We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of controlling model complexity. The algorithm is presented in detail for the case when the function space is a reproducing-kernel Hilbert space underlying a user-chosen kernel function. We derive bounds on the quality of the solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example is used to illustrate the benefits of using a penalized procedure.
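The fitted Q-iteration loop with a penalized least-squares regression step can be sketched on a toy problem. Everything below is a hypothetical illustration, not the paper's setup: a two-state, two-action deterministic MDP stands in for the generative model, one-hot features stand in for the reproducing-kernel Hilbert space, and the ridge penalty value is arbitrary.

```python
import numpy as np

# Hypothetical two-state, two-action MDP standing in for a generative model:
# action 0 always leads to state 0 (no reward); action 1 leads to state 1
# (reward 1 from state 0, reward 2 from state 1).
next_state = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma, lam = 0.9, 1e-3          # discount and ridge penalty (model-complexity control)
states, actions = [0, 1], [0, 1]

def phi(s, a):                  # one-hot (state, action) features, not an RKHS
    x = np.zeros(4)
    x[2 * s + a] = 1.0
    return x

X = np.array([phi(s, a) for s in states for a in actions])
theta = np.zeros(4)

for _ in range(200):            # fitted Q-iteration
    y = []
    for s in states:
        for a in actions:
            s2 = next_state[s, a]
            y.append(reward[s, a] + gamma * max(phi(s2, b) @ theta for b in actions))
    # penalized least-squares regression step: (X^T X + lam I) theta = X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ np.array(y))

Q = {(s, a): phi(s, a) @ theta for s in states for a in actions}
```

With this tiny penalty the iterates converge close to the unregularized optimal Q-values (around 19.8 for the best action in state 1); larger `lam` shrinks the estimates further, which is the model-complexity trade-off the abstract refers to.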
Automatic state abstraction from demonstration
 In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI)
, 2011
"... Learning from Demonstration (LfD) is a popular technique for building decisionmaking agents from human help. Traditional LfD methods use demonstrations as training examples for supervised learning, but complex tasks can require more examples than is practical to obtain. We present Abstraction from ..."
Abstract

Cited by 6 (0 self)
Learning from Demonstration (LfD) is a popular technique for building decision-making agents from human help. Traditional LfD methods use demonstrations as training examples for supervised learning, but complex tasks can require more examples than is practical to obtain. We present Abstraction from Demonstration (AfD), a novel form of LfD that uses demonstrations to infer state abstractions, and reinforcement learning (RL) methods in those abstract state spaces to build a policy. Empirical results show that AfD is more than an order of magnitude more sample efficient than just using demonstrations as training examples, and exponentially faster than RL alone.
A Brief Survey of Parametric Value Function Approximation
Abstract

Cited by 6 (2 self)
Reinforcement learning is a machine learning answer to the optimal control problem. It consists in learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important subtopic of reinforcement learning is to compute an approximation of this value function when the system is too large for an exact representation. This survey reviews state-of-the-art methods for (parametric) value function approximation by grouping them into three main categories: bootstrapping, residuals, and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive …
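A minimal instance of the bootstrapping category with stochastic-gradient minimization is semi-gradient TD(0) with a small parametric value function. The five-state chain, the quadratic feature family, and the step size below are illustrative assumptions, not taken from the survey:

```python
import numpy as np

# Hypothetical five-state chain under a fixed policy that always moves right;
# reward 1 on reaching the terminal state, discounted by gamma.
n_states, gamma, alpha = 5, 0.9, 0.1

def features(s):                # small parametric family: 1, x, x^2
    x = s / (n_states - 1)
    return np.array([1.0, x, x * x])

theta = np.zeros(3)
for _ in range(2000):           # episodes
    s = 0
    while s < n_states - 1:
        s2 = s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        v = features(s) @ theta
        # bootstrapped target; the terminal state has value 0
        target = r + (0.0 if s2 == n_states - 1 else gamma * (features(s2) @ theta))
        theta += alpha * (target - v) * features(s)   # stochastic (semi-)gradient step
        s = s2

V = [float(features(s) @ theta) for s in range(n_states - 1)]
```

The learned values approach the true discounted returns (0.9^3 ≈ 0.73 at the start of the chain, 1.0 one step before the goal), illustrating how the parametric approximation trades exact representation for a fixed number of parameters.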
Sparse Approximate Policy Evaluation using Graph-based Basis Functions
Abstract

Cited by 5 (3 self)
Proto-value functions and diffusion wavelets are graph-based basis functions that capture topological structure of the MDP state space. A subset of these basis functions must be selected when approximating value functions in order to maintain computational efficiency and prevent overfitting. We evaluated four basis selection algorithms for performing this task. This is an enhancement over the previously used heuristic of always selecting the most global, or smoothest, subset of basis functions regardless of the policy being evaluated. We analyzed two schemes, one direct and one indirect, for combining basis selection and approximate policy evaluation. The indirect scheme requires more computation than the direct scheme, but gains flexibility in the manner in which basis functions are selected. The coefficients applied to the basis functions were set using least-squares methods. We also described how least-squares methods can be altered to include regularization. Laplacian-based regularization provides a bias toward smoother approximate value functions which can prevent overfitting and can be useful in stochastic domains. A thorough set of experiments was conducted on a simple chain MDP to understand how basis selection and the different least-squares policy evaluation algorithms impact one another. Although the experiments used graph-based basis functions, the algorithms described in this paper can be applied to any set of basis functions.
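The regularized least-squares (LSTD-style) policy-evaluation solve that the selected basis functions feed into can be sketched as follows. The 10-state chain and the tabular one-hot basis are illustrative stand-ins (a selected subset of proto-value functions or diffusion wavelets would replace the basis), and the ridge term marks where a Laplacian-based penalty would enter:

```python
import numpy as np

# Hypothetical 10-state chain, fixed policy moving right; reward 1 on entering
# the last (terminal) state. Tabular one-hot basis for brevity -- a selected
# subset of proto-value functions or diffusion wavelets would go in Phi instead.
n, gamma, lam = 10, 0.95, 1e-8
S = np.arange(n)
nxt = np.minimum(S + 1, n - 1)
r = ((nxt == n - 1) & (S != n - 1)).astype(float)

Phi = np.eye(n)
Phi_next = Phi[nxt].copy()
Phi_next[n - 1] = 0.0           # the terminal state bootstraps to zero

# LSTD system A theta = b with a regularization term added to A; replacing
# lam * I by lam * (Phi^T L Phi), with L the graph Laplacian, gives the
# smoothness bias described in the abstract.
A = Phi.T @ (Phi - gamma * Phi_next) + lam * np.eye(n)
b = Phi.T @ r
theta = np.linalg.solve(A, b)
V = Phi @ theta
```

With the tabular basis and a negligible penalty this recovers the exact values (V ≈ 0.95^(8−s) along the chain); a restricted basis subset would make the regularization choice matter.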
Bellman Error Based Feature Generation Using Random Projections
Abstract

Cited by 5 (3 self)
The accuracy of parametrized policy evaluation depends on the quality of the features used for estimating the value function. Hence, feature generation/selection in reinforcement learning (RL) has received a lot of attention (Di Castro and Mannor, 2010). We focus on methods that aim to generate features in the direction of the Bellman error of the current value estimates (Bellman Error …
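A rough sketch of the general idea, generating one new feature aligned with the Bellman error of the current estimate through a random projection of the state representation. The chain MDP, the projection size, and the least-squares fits are illustrative assumptions rather than the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10-state chain, fixed policy moving right, reward 1 on entering
# the last state; the current value estimate uses only a constant feature.
n, gamma = 10, 0.9
S = np.arange(n)
nxt = np.minimum(S + 1, n - 1)
r = ((nxt == n - 1) & (S != n - 1)).astype(float)

Phi = np.ones((n, 1))
theta = np.linalg.lstsq(Phi - gamma * Phi[nxt], r, rcond=None)[0]
V = (Phi @ theta).ravel()
bellman_error = r + gamma * V[nxt] - V      # residual of the current estimate

# Random projection of a high-dimensional (here one-hot) state encoding,
# then fit the Bellman error in the projected space to obtain a new feature.
proj = rng.standard_normal((n, 3))
Z = np.eye(n) @ proj
w = np.linalg.lstsq(Z, bellman_error, rcond=None)[0]
new_feature = Z @ w
Phi_augmented = np.column_stack([Phi, new_feature])
```

The new column points in the direction of the current Bellman error, so refitting the value function on `Phi_augmented` can reduce the residual the constant feature could not capture.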
Regularized least squares temporal difference learning with nested ℓ2 and ℓ1 penalization
 In: European Workshop on Reinforcement Learning, 2011
Abstract

Cited by 4 (0 self)
The construction of a suitable set of features to approximate value functions is a central problem in reinforcement learning (RL). A popular approach to this problem is to use high-dimensional feature spaces together with least-squares temporal difference learning (LSTD). Although this combination allows for very accurate approximations, it often exhibits poor prediction performance because of overfitting when the number of samples is small compared to the number of features in the approximation space. In the linear regression setting, regularization is commonly used to overcome this problem. In this paper, we review some regularized approaches to policy evaluation and we introduce a novel scheme (L21) which uses an ℓ2 regularization in the projection operator and an ℓ1 penalty in the fixed-point step. We show that such a formulation reduces to a standard Lasso problem. As a result, any off-the-shelf solver can be used to compute its solution and standardization techniques can be applied to the data. We report experimental results showing that L21 is effective in avoiding overfitting and that it compares favorably to existing ℓ1 regularized methods.
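Since the key point above is that the formulation reduces to a standard Lasso problem solvable by any off-the-shelf solver, here is a generic Lasso solve via iterative soft-thresholding (ISTA) on synthetic data. The data, the penalty value, and the solver choice are illustrative assumptions, not the L21 derivation itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse regression standing in for the Lasso problem:
#   min_theta 0.5 * ||X theta - y||^2 + lam * ||theta||_1
n_samples, n_features = 50, 20
X = rng.standard_normal((n_samples, n_features))
true_theta = np.zeros(n_features)
true_theta[:3] = [2.0, -1.5, 1.0]           # only 3 relevant features
y = X @ true_theta + 0.01 * rng.standard_normal(n_samples)

lam = 0.5
step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1/L, L = largest eigenvalue of X^T X

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(n_features)
for _ in range(2000):                       # ISTA: gradient step + soft-threshold
    grad = X.T @ (X @ theta - y)
    theta = soft_threshold(theta - step * grad, step * lam)
```

The soft-thresholding step is what produces exact zeros on the irrelevant coordinates, which is the overfitting control the ℓ1 penalty buys over plain least squares.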
Automatic Task Decomposition and State Abstraction from Demonstration
Abstract

Cited by 3 (0 self)
Both Learning from Demonstration (LfD) and Reinforcement Learning (RL) are popular approaches for building decision-making agents. LfD applies supervised learning to a set of human demonstrations to infer and imitate the human policy, while RL uses only a reward signal and exploration to find an optimal policy. For complex tasks both of these techniques may be ineffective. LfD may require many more demonstrations than it is feasible to obtain, and RL can take an impractically long time to converge. We present Automatic Decomposition and Abstraction from demonstration (ADA), an algorithm that uses mutual information measures over a set of human demonstrations to decompose a sequential decision process into several subtasks, finding state abstractions for each one of these subtasks. ADA then projects the human demonstrations into the abstracted state space to build a policy. This policy can later be improved using RL algorithms to surpass the performance of the human teacher. We find empirically that ADA can find satisficing policies for problems that are too complex to be solved with traditional LfD and RL algorithms. In particular, we show that we can use mutual information across state features to leverage human demonstrations to reduce the effects of the curse of dimensionality by finding subtasks and abstractions in sequential decision processes.
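The mutual-information test at the core of such an abstraction step can be sketched in a few lines. The synthetic demonstrations below (one action-determining feature, one noise feature) and the 0.1-nat threshold are hypothetical choices for illustration, not ADA's actual procedure:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def mutual_information(x, y):
    # Plug-in estimate of I(X;Y) in nats for discrete samples
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# Hypothetical demonstrations: feature 0 fully determines the teacher's
# action, feature 1 is irrelevant noise.
f0 = rng.integers(0, 2, 500)
f1 = rng.integers(0, 4, 500)
demonstrated_actions = f0

mi = [mutual_information(f, demonstrated_actions) for f in (f0, f1)]
abstraction = [i for i, m in enumerate(mi) if m > 0.1]   # keep informative features
```

Features whose mutual information with the demonstrated actions stays near zero are dropped from the state representation, which is how this kind of abstraction shrinks the space an RL learner must explore.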