Results 1–10 of 23
Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Abstract

Cited by 16 (2 self)
We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert’s policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.
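The nonlinear reward representation described above can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm (which maximizes the likelihood of demonstrations rather than regressing on known reward targets): a Gaussian process with an automatic-relevance-determination (ARD) kernel models the reward as a nonlinear function of state features, and a large per-feature length-scale effectively switches an irrelevant feature off. All names and data below are made up for illustration.

```python
import numpy as np

def ard_kernel(X1, X2, lengthscales, variance=1.0):
    # Squared-exponential kernel with one length-scale per feature (ARD):
    # a very large length-scale makes the corresponding feature irrelevant.
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

def gp_posterior_mean(X_train, y_train, X_test, lengthscales, noise=1e-2):
    # Posterior mean of a zero-mean GP conditioned on noisy observations.
    K = ard_kernel(X_train, X_train, lengthscales) + noise * np.eye(len(X_train))
    Ks = ard_kernel(X_test, X_train, lengthscales)
    return Ks @ np.linalg.solve(K, y_train)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))   # toy state features
y = np.sin(3 * X[:, 0])                # toy "reward": depends on feature 0 only
ls = np.array([0.3, 100.0])            # huge length-scale turns feature 1 off
r_hat = gp_posterior_mean(X, y, X, ls)
```

In GPIRL the length-scales themselves are learned alongside the reward, which is what yields the per-feature relevance mentioned in the abstract.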
Inverse Reinforcement Learning through Structured Classification
Abstract

Cited by 13 (12 self)
This paper addresses the inverse reinforcement learning (IRL) problem, that is, inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator.
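The core trick in SCIRL, scoring actions with the expert's feature expectations, can be sketched as follows. This is a simplification under stated assumptions: the feature expectations mu(s, a) are given as a toy array (SCIRL estimates them from expert trajectories), and a plain multiclass perceptron stands in for the paper's classifier. The learned weight vector then doubles as the reward parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_features = 20, 3, 5
mu = rng.normal(size=(n_states, n_actions, n_features))  # toy mu(s, a)
theta_true = rng.normal(size=n_features)
expert_actions = np.argmax(mu @ theta_true, axis=1)      # toy expert policy

# Multiclass perceptron on scores theta . mu(s, a): whenever the predicted
# action disagrees with the expert's, move theta toward the expert's
# feature expectation and away from the predicted one.
theta = np.zeros(n_features)
for _ in range(500):
    for s in range(n_states):
        a_star = expert_actions[s]
        a_hat = int(np.argmax(mu[s] @ theta))
        if a_hat != a_star:
            theta += mu[s, a_star] - mu[s, a_hat]

predicted = np.argmax(mu @ theta, axis=1)
```

Because the score is linear in mu(s, a), a classifier that imitates the expert simultaneously yields reward weights theta, which is why no direct RL solve is needed.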
Guided policy search
, 2013
Abstract

Cited by 9 (6 self)
Direct policy search can effectively scale to high-dimensional systems, but complex policies with hundreds of parameters often present a challenge for such methods, requiring numerous samples and often falling into poor local optima. We present a guided policy search algorithm that uses trajectory optimization to direct policy learning and avoid poor local optima. We show how differential dynamic programming can be used to generate suitable guiding samples, and describe a regularized importance-sampled policy optimization that incorporates these samples into the policy search. We evaluate the method by learning neural network controllers for planar swimming, hopping, and walking, as well as simulated 3D humanoid running.
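The importance-sampling step that lets guiding samples inform the policy search can be sketched as a self-normalized estimator. This is a minimal sketch, not the paper's regularized objective: trajectories drawn from a guiding distribution q (e.g. a DDP solution) are reweighted by pi_theta(tau) / q(tau) to estimate the current policy's expected return; the log-probabilities and returns below are toy inputs.

```python
import numpy as np

def is_return_estimate(logp_pi, logp_q, returns):
    # Self-normalized importance-sampling estimate of E_pi[return],
    # computed in log space for numerical stability.
    logw = logp_pi - logp_q
    logw -= logw.max()              # stabilize before exponentiating
    w = np.exp(logw)
    return np.sum(w * returns) / np.sum(w)

# Sanity check: when pi == q every weight is 1, so the estimate
# reduces to the plain sample mean of the returns.
rng = np.random.default_rng(2)
logp = rng.normal(size=100)
R = rng.normal(size=100)
est = is_return_estimate(logp, logp, R)
```

The paper additionally regularizes this estimator so that the policy stays close to regions covered by the guiding samples.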
A cascaded supervised learning approach to inverse reinforcement learning
 In European Conference on Machine Learning (ECML)
, 2013
Abstract

Cited by 4 (3 self)
This paper considers the Inverse Reinforcement Learning (IRL) problem, that is, inferring a reward function for which a demonstrated expert policy is optimal. We propose to break the IRL problem down into two generic Supervised Learning steps: this is the Cascaded Supervised IRL (CSI) approach. A classification step that defines a score function is followed by a regression step providing a reward function. A theoretical analysis shows that the demonstrated expert policy is near-optimal for the computed reward function. Not needing to repeatedly solve a Markov Decision Process (MDP) and the ability to leverage existing techniques for classification and regression are two important advantages of the CSI approach. Empirically, it compares favorably to state-of-the-art approaches when using only transitions sampled according to the expert policy, up to the use of some heuristics. This is exemplified on two classical benchmarks (the mountain car problem and a highway driving simulator).
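The regression step, turning a classification score into a reward, can be sketched via the Bellman equation. This is an illustrative toy under stated assumptions: the score table q(s, a) is random rather than the output of a trained classifier, and the transition model is known. Reading q as an optimal Q-function, the implied reward is r(s, a) = q(s, a) - gamma * sum_s' P(s'|s, a) max_a' q(s', a').

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 6, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # toy transitions P[s, a, s']
q = rng.normal(size=(nS, nA))                  # toy score from the classifier

v = q.max(axis=1)          # greedy value implied by the score
r = q - gamma * (P @ v)    # reward for which q is the optimal Q-function
```

By construction q satisfies the Bellman optimality equation for r, so a policy greedy with respect to q (the classifier's decision rule) is optimal for the recovered reward, which is the intuition behind the near-optimality guarantee.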
Inverse Optimality Design for Biological Movement Systems
Abstract

Cited by 3 (0 self)
This paper proposes an inverse optimality design method for nonlinear deterministic systems and nonlinear stochastic systems with multiplicative and additive noise. The new method is developed based on the general Hamilton-Jacobi-Bellman (HJB) equation, and it constructs an estimated cost function using a linear function approximator (Gaussian radial basis functions), which can recover the original performance criterion for which the given control law is optimal. The performance of the algorithm is illustrated on scalar systems and a complex biomechanical control problem involving a stochastic model of the human arm.
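The function-approximation component mentioned above, representing a cost as a linear combination of Gaussian radial basis functions, can be sketched with a least-squares fit. This omits the HJB machinery entirely; the quadratic target cost, the centers, and the width below are illustrative choices, not values from the paper.

```python
import numpy as np

def rbf_features(x, centers, width):
    # Gaussian RBF features phi_i(x) = exp(-(x - c_i)^2 / (2 * width^2)).
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

centers = np.linspace(-2, 2, 15)       # evenly spaced basis centers
x = np.linspace(-2, 2, 200)            # scalar state grid
target = x ** 2                        # toy quadratic cost to recover
Phi = rbf_features(x, centers, width=0.4)
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # cost weights
fit = Phi @ w
```

In the paper, the weights are chosen so that the approximated cost makes the HJB residual of the given control law vanish, rather than by regressing on known cost values as here.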
The Principle of Maximum Causal Entropy for Estimating Interacting Processes
Abstract

Cited by 1 (0 self)
The principle of maximum entropy provides a powerful framework for estimating joint, conditional, and marginal probability distributions. However, there are many important distributions with elements of interaction and feedback where its applicability has not been established. This work presents the principle of maximum causal entropy, an approach based on directed information theory for estimating an unknown process based on its interactions with a known process. We demonstrate the breadth of the approach using two applications: a predictive solution for inverse optimal control in decision processes and computing equilibrium strategies in sequential games. Index Terms: maximum entropy, statistical estimation, causal entropy, directed information, inverse optimal control, inverse reinforcement learning, correlated equilibrium.
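For the inverse-optimal-control application, the maximum causal entropy solution takes the form of a "soft" Bellman recursion: the policy is pi(a|s) proportional to exp(Q(s, a)) with V(s) = log sum_a exp(Q(s, a)). A minimal sketch of that recursion, with toy rewards and dynamics, follows; the recursion itself is standard, but the specific sizes and values are made up.

```python
import numpy as np

def soft_value_iteration(r, P, gamma=0.9, iters=500):
    # Soft Bellman backup: V(s) = log sum_a exp(Q(s, a)), computed with
    # the usual max-shift for numerical stability.
    Q = np.zeros_like(r)
    for _ in range(iters):
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
        Q = r + gamma * (P @ V)
    m = Q.max(axis=1)
    V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
    policy = np.exp(Q - V[:, None])   # maximum causal entropy policy
    return Q, V, policy

rng = np.random.default_rng(4)
nS, nA = 5, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # toy transitions P[s, a, s']
r = rng.normal(size=(nS, nA))                  # toy reward table
Q, V, pi = soft_value_iteration(r, P)
```

Replacing the hard max of ordinary value iteration with this log-sum-exp is exactly what makes the resulting stochastic policy the causal-entropy-maximizing one consistent with the reward.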
Bayesian Theory of Mind: Modeling Human Reasoning about Beliefs, Desires, Goals, and Social Relations
, 2012
Abstract

Cited by 1 (0 self)
This thesis proposes a computational framework for understanding human Theory of Mind (ToM): our conception of others' mental states, how they relate to the world, and how they cause behavior. Humans use ToM to predict others' actions, given their mental states, but also to do the reverse: attribute mental states (beliefs, desires, intentions, knowledge, goals, preferences, emotions, and other thoughts) to explain others' behavior. The goal of this thesis is to provide a formal account of the knowledge and mechanisms that support these judgments. The thesis will argue for three central claims about human ToM. First, ToM is constructed around probabilistic, causal models of how agents' beliefs, desires, and goals interact with their situation and perspective (which can differ from our own) to produce behavior. Second, the core content of ToM can be formalized using context-specific models of approximately rational planning, such as Markov decision processes (MDPs), partially observable MDPs (POMDPs), and Markov games. ToM reasoning will be formalized as rational probabilistic inference over these models of intentional (inter)action, termed Bayesian Theory of Mind (BToM). Third, hypotheses about the structure and content of ToM can be tested through a combination ...
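The inference pattern BToM formalizes, inverting a model of approximately rational action to attribute a hidden goal, can be sketched with Bayes' rule: P(goal | actions) is proportional to P(actions | goal) P(goal). The softmax-rational likelihood and the tiny one-dimensional world below are illustrative stand-ins for the thesis's MDP/POMDP models.

```python
import numpy as np

goals = [0, 4]                  # candidate goal positions on a line
prior = np.array([0.5, 0.5])    # uniform prior over goals

def action_likelihood(pos, action, goal, beta=2.0):
    # Softmax-rational choice between stepping left (-1) and right (+1),
    # with utility = negative distance to the goal after the step.
    utils = np.array([-abs(pos - 1 - goal), -abs(pos + 1 - goal)])
    probs = np.exp(beta * utils) / np.exp(beta * utils).sum()
    return probs[0] if action == -1 else probs[1]

# Observe an agent at position 2 step right twice: the posterior
# belief shifts strongly toward the goal at position 4.
posterior = prior.copy()
pos = 2
for action in (+1, +1):
    lik = np.array([action_likelihood(pos, action, g) for g in goals])
    posterior = posterior * lik
    posterior /= posterior.sum()
    pos += action
```

The same inversion scales up in the thesis to beliefs and desires jointly, with POMDP planning replacing this toy distance-based utility.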
Factorized Decision Forecasting via Combining Value-based and Reward-based Estimation
Abstract
A powerful recent perspective for predicting sequential decisions learns the parameters of decision problems that produce observed behavior as (near) optimal solutions. Under this perspective, behavior is explained in terms of utilities, which can often be defined as functions of state and action features to enable generalization across decision tasks. Two approaches have been proposed from this perspective: estimate a feature-based reward function and recursively compute values from it, or directly estimate a feature-based value function. In this work, we investigate the combination of these two approaches into a single learning task using directed information theory and the principle of maximum entropy. This enables uncovering which type of estimate is most appropriate, in terms of predictive accuracy and/or computational benefit, for different portions of the decision space.
Inverse Reinforcement Learning with Locally Consistent Reward Functions
Abstract
Existing inverse reinforcement learning (IRL) algorithms have assumed each expert's demonstrated trajectory to be produced by only a single reward function. This paper presents a novel generalization of the IRL problem that allows each trajectory to be generated by multiple locally consistent reward functions, hence catering to more realistic and complex experts' behaviors. Solving our generalized IRL problem thus involves not only learning these reward functions but also the stochastic transitions between them at any state (including unvisited states). By representing our IRL problem with a probabilistic graphical model, an expectation-maximization (EM) algorithm can be devised to iteratively learn the different reward functions and the stochastic transitions between them in order to jointly improve the likelihood of the expert's demonstrated trajectories. As a result, the most likely partition of a trajectory into segments that are generated from different locally consistent reward functions selected by EM can be derived. Empirical evaluation on synthetic and real-world datasets shows that our IRL algorithm outperforms the state-of-the-art EM clustering with maximum likelihood IRL, which is, interestingly, a reduced variant of our approach.
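The E-step idea, attributing each step of a trajectory to one of several candidate reward functions, can be sketched as a mixture-model responsibility computation. This is a heavy simplification under stated assumptions: the K candidate reward tables are fixed and random rather than learned, each reward induces a softmax action distribution, and the paper's stochastic transitions between reward functions are ignored in favor of static mixing weights.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, K = 4, 2, 2
rewards = rng.normal(size=(K, nS, nA))    # toy candidate reward tables

def softmax_policy(r):
    # Each candidate reward induces a softmax action distribution.
    e = np.exp(r - r.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

policies = np.stack([softmax_policy(rewards[k]) for k in range(K)])

traj = [(0, 0), (1, 1), (2, 0), (3, 1)]   # trajectory of (state, action)
mix = np.full(K, 1.0 / K)                 # mixing weights over rewards
for _ in range(20):                       # simplified EM iterations
    lik = np.array([[policies[k, s, a] for k in range(K)]
                    for s, a in traj])    # (T, K) per-step likelihoods
    resp = mix * lik
    resp /= resp.sum(axis=1, keepdims=True)  # E-step: responsibilities
    mix = resp.mean(axis=0)                  # M-step: update mixing weights
```

The per-step responsibilities are what yield the trajectory partition described in the abstract; the full algorithm also re-estimates the rewards and the state-dependent switching probabilities in the M-step.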