Results 1 – 4 of 4
Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2014.
Abstract

Cited by 2 (2 self)
We consider the problem of reinforcement learning with an orientation toward contexts in which an agent must generalize from past experience and explore to reduce uncertainty. We propose an approach to exploration based on randomized value functions and an algorithm – randomized least-squares value iteration (RLSVI) – that embodies this approach. We explain why versions of least-squares value iteration that use Boltzmann or ε-greedy exploration can be highly inefficient and present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI. Our experiments focus on learning over episodes of a finite-horizon Markov decision process and use a version of RLSVI designed for that task, but we also propose a version of RLSVI that addresses continual learning in an infinite-horizon discounted Markov decision process.
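The core idea of RLSVI, as the abstract describes it, is to replace a point estimate of the value-function weights with a draw from a Gaussian centered on the regularized least-squares solution, so that exploration comes from the randomization itself. A minimal sketch of that sampling step, assuming linear value features and Bayesian linear regression (the function name and parameters here are illustrative, not from the paper's code):

```python
import numpy as np

def rlsvi_sample_theta(Phi, targets, sigma2=1.0, lam=1.0):
    """Sample value-function weights from a Gaussian centered at the
    regularized least-squares fit (one backup step of RLSVI-style
    randomized value iteration).

    Phi:     (n, d) feature matrix of visited state-action pairs
    targets: (n,)   regression targets (reward plus next-step value)
    """
    d = Phi.shape[1]
    # Posterior covariance and mean of the Bayesian linear regression
    cov = np.linalg.inv(Phi.T @ Phi / sigma2 + lam * np.eye(d))
    mean = cov @ Phi.T @ targets / sigma2
    # Randomization: draw weights instead of acting on the point estimate
    return np.random.multivariate_normal(mean, cov)
```

Acting greedily with respect to a freshly sampled weight vector each episode is what distinguishes this from Boltzmann or ε-greedy dithering.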
Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems 26, 2013.
Abstract

Cited by 2 (2 self)
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q∗ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE[Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q∗ does not lie in Q, for the special case where Q is the span of prespecified indicator functions over disjoint sets.
Bayesian Reinforcement Learning with Exploration
Abstract
Abstract. We consider a general reinforcement learning problem and show that carefully combining the Bayesian optimal policy and an exploring policy leads to minimax sample-complexity bounds in a very general class of (history-based) environments. We also prove lower bounds and show that the new algorithm displays adaptive behaviour when the environment is easier than worst-case.
Mobile Intelligent Autonomous Systems Group
Abstract
We present algorithms to effectively represent a set of Markov decision processes (MDPs), whose optimal policies have already been learned, by a smaller source subset for lifelong, policy-reuse-based transfer learning in reinforcement learning. This is necessary when the number of previous tasks is large and the cost of measuring similarity counteracts the benefit of transfer. The source subset forms an ε-net over the original set of MDPs, in the sense that for each previous MDP Mp, there is a source Ms whose optimal policy has < ε regret in Mp. Our contributions are as follows. We present EXP3Transfer, a principled policy-reuse algorithm that optimally reuses a given source policy set when learning for a new MDP. We present a framework to cluster the previous MDPs to extract a source subset. The framework consists of (i) a distance dV over MDPs to measure policy-based similarity between MDPs; (ii) a cost function g(·) that uses dV to measure how good a particular clustering is for generating useful source tasks for EXP3Transfer; and (iii) a provably convergent algorithm, MHAV, for finding the optimal clustering. We validate our algorithms through experiments in a surveillance domain.
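The policy-reuse algorithm named in this abstract builds on the EXP3 adversarial-bandit update, treating each candidate source policy as an arm. A minimal sketch of one such update, under the assumption that source-policy selection is driven by importance-weighted reward estimates (all names here are illustrative, not taken from the paper):

```python
import math

def exp3_update(weights, chosen, reward, gamma=0.1):
    """One EXP3 update over K arms (here: candidate source policies).

    weights: list of K positive arm weights (mutated and returned)
    chosen:  index of the arm that was played
    reward:  observed reward in [0, 1]
    gamma:   exploration rate mixing in the uniform distribution
    """
    K = len(weights)
    total = sum(weights)
    # Mixture of the exponential-weights distribution and uniform exploration
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    # Importance-weighted estimate keeps the update unbiased
    est = reward / probs[chosen]
    weights[chosen] *= math.exp(gamma * est / K)
    return weights, probs
```

Repeating this update as episodes of the new MDP arrive shifts probability mass toward the source policy that transfers best, which is the bandit view of policy reuse the abstract alludes to.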