Results 11-20 of 68
Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search
Cited by 11 (1 self)
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
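The core idea FSSS builds on can be sketched in a few lines: estimate action values by recursively sampling a fixed number of successors per action from a generative model. The sketch below is a minimal illustration of that sparse-sampling idea under assumed interfaces (the `sim` generative model and the two-state toy chain are invented for illustration); it is not the bound-guided FSSS algorithm itself.

```python
def sparse_sampling_q(sim, state, actions, depth, width, gamma=0.95):
    """Estimate Q(state, a) by recursively sampling `width` successors per
    action from a generative model sim(state, action) -> (next_state, reward).
    A sketch of the sparse-sampling idea behind FSSS, not FSSS itself
    (FSSS additionally maintains upper/lower bounds to prune the search tree)."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):
            s2, r = sim(state, a)
            v2 = max(sparse_sampling_q(sim, s2, actions,
                                       depth - 1, width, gamma).values())
            total += r + gamma * v2
        q[a] = total / width
    return q

# Hypothetical two-state chain: action 1 moves to state 1, which then pays 1.0
# per step; action 0 stays put for a small reward of 0.1.
def sim(s, a):
    if a == 1:
        return 1, (1.0 if s == 1 else 0.0)
    return s, 0.1

q = sparse_sampling_q(sim, 0, [0, 1], depth=3, width=4)
# With enough depth, moving toward the paying state dominates: q[1] > q[0].
```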
Arbitrarily modulated Markov decision processes
In Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, 2009
Cited by 11 (2 self)
... decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion. We propose an online Q-learning style algorithm and give a guarantee on its performance evaluated in retrospect against alternative policies. Unlike previous works, the guarantee depends critically on the variability of the uncertainty in the transition probabilities, but holds regardless of arbitrary changes in rewards and transition probabilities over time. Besides its intrinsic computational efficiency, this approach requires neither prior knowledge nor estimation of the transition probabilities.
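An "online Q-learning style algorithm" in the sense above keeps applying the standard tabular update to whatever reward the (possibly nonstationary) environment currently emits, without ever estimating transition probabilities. The sketch below shows that update shape only; the drifting-reward loop is a hypothetical illustration, not the paper's algorithm or its guarantee.

```python
def q_learning_step(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update on the observed transition (s, a, r, s2).
    No transition probabilities are estimated or assumed known."""
    best_next = max(Q.get((s2, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q

# The same update is applied even as the reward function drifts over time:
# a single-state toy whose reward flips sign halfway through.
Q = {}
for t in range(100):
    r = 1.0 if t < 50 else -1.0
    q_learning_step(Q, 0, 0, r, 0, [0])
# The estimate tracks the change: Q[(0, 0)] ends up negative.
```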
Automatic Feature Selection for Model-Based Reinforcement Learning in Factored MDPs
Cited by 9 (3 self)
Abstract—Feature selection is an important challenge in machine learning. Unfortunately, most methods for automating feature selection are designed for supervised learning tasks and are thus either inapplicable or impractical for reinforcement learning. This paper presents a new approach to feature selection specifically designed for the challenges of reinforcement learning. In our method, the agent learns a model, represented as a dynamic Bayesian network, of a factored Markov decision process, deduces a minimal feature set from this network, and efficiently computes a policy on this feature set using dynamic programming methods. Experiments in a stock-trading benchmark task demonstrate that this approach can reliably deduce minimal feature sets and that doing so can substantially improve performance and reduce the computational expense of planning. Keywords: Reinforcement learning; feature selection; factored MDPs.
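One plausible reading of "deduces a minimal feature set from this network" is a backward reachability computation over the learned DBN structure: keep exactly the features that can influence reward, directly or through the dynamics. The sketch below implements that closure under an assumed structure encoding; the dict layout and the stock-trading feature names are invented for illustration.

```python
def minimal_feature_set(parents, reward_parents):
    """Given DBN structure `parents[f]` = set of time-t features that
    influence feature f at time t+1, return the features needed to predict
    reward: the backward closure of the reward node's parents."""
    needed = set(reward_parents)
    frontier = list(reward_parents)
    while frontier:
        f = frontier.pop()
        for p in parents.get(f, ()):   # features that influence f
            if p not in needed:
                needed.add(p)
                frontier.append(p)
    return needed

# Hypothetical structure: reward depends on 'price', 'price' depends on
# 'trend', and 'noise' influences nothing relevant -- so it is dropped.
parents = {"price": {"trend"}, "trend": {"trend"}, "noise": {"noise"}}
features = minimal_feature_set(parents, {"price"})   # {'price', 'trend'}
```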
Efficient learning of relational models for sequential decision making
, 2010
Cited by 8 (1 self)
The exploration-exploitation tradeoff is crucial to reinforcement-learning (RL) agents, and a significant number of sample complexity results have been derived for agents in propositional domains. These results guarantee, with high probability, near-optimal behavior in all but a polynomial number of timesteps in the agent's lifetime. In this work, we prove similar results for certain relational representations, primarily a class we call "relational action schemas". These generalized models allow us to specify state transitions in a compact form, for instance describing the effect of picking up a generic block instead of picking up 10 different specific blocks. We present theoretical results on crucial subproblems in action-schema learning using the KWIK framework, which allows us to characterize the sample efficiency of an agent learning these models in a reinforcement-learning setting. These results are extended in an apprenticeship learning paradigm where an agent has access not only to its environment, but also to a teacher that can demonstrate traces of state/action/state sequences. We show that the class of action schemas that are efficiently learnable in this paradigm is strictly larger than those learnable in the online setting. We link ...
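The compactness argument above (one schema for picking up a generic block rather than ten specific blocks) can be made concrete with a minimal STRIPS-style sketch: a schema with variables, grounded by a binding and applied to a state of ground literals. The encoding below is an invented illustration of what an action schema represents, not the paper's learning algorithm.

```python
def apply_schema(schema, binding, state):
    """Ground a relational action schema with `binding` and apply it to
    `state`, a set of ground literals (tuples). Returns the successor state,
    or None if the grounded preconditions do not hold."""
    def ground(lits):
        return {tuple(binding.get(t, t) for t in lit) for lit in lits}
    pre, add, dele = (ground(schema[k]) for k in ("pre", "add", "del"))
    if not pre <= state:
        return None                      # preconditions unmet
    return (state - dele) | add

# One schema covers pickup(X) for every block X -- the compact form.
pickup = {"pre": {("clear", "X"), ("ontable", "X")},
          "add": {("holding", "X")},
          "del": {("clear", "X"), ("ontable", "X")}}

state = {("clear", "a"), ("ontable", "a"), ("clear", "b")}
result = apply_schema(pickup, {"X": "a"}, state)
# result == {('holding', 'a'), ('clear', 'b')}
```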
Online learning in Markov decision processes with arbitrarily changing rewards and transitions
In GameNets'09: Proceedings of the First ICST International Conference on Game Theory for Networks
Cited by 8 (0 self)
Abstract — We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion. We present algorithms that combine online learning and robust control, and establish guarantees on their performance evaluated in retrospect against alternative policies, i.e., their regret. These guarantees depend critically on the range of uncertainty in the transition probabilities, but hold regardless of the changes in rewards and transition probabilities over time. We present a version of the main algorithm in the setting where the decision-maker's observations are limited to its trajectory, and another version that allows a tradeoff between performance and computational complexity.
Cyclic Equilibria in Markov Games
2005
Cited by 7 (1 self)
Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we ...
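For context on what a "variant of value iteration" for a Markov game looks like, the sketch below shows the basic shape in the two-player zero-sum case, where such iteration does converge. It is simplified to pure strategies for brevity (proper solvers compute the mixed-strategy matrix-game value with a linear program at each state); the general-sum variants the paper analyzes replace this step with a Nash or correlated-equilibrium computation, and that is where the construction shows they fail.

```python
def game_value_iteration(P, R, gamma=0.9, iters=100):
    """Value iteration for a two-player zero-sum Markov game, restricted to
    pure strategies (a simplification for illustration).
    P[s][a][b] = deterministic next state, R[s][a][b] = row player's reward
    when the row player plays a and the column player plays b."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        V = [max(min(R[s][a][b] + gamma * V[P[s][a][b]]
                     for b in range(len(R[s][a])))
                 for a in range(len(R[s])))
             for s in range(n)]
    return V

# Invented one-state game where row action 0 strictly dominates:
# the value converges to 2 / (1 - 0.9) = 20.
P = [[[0, 0], [0, 0]]]
R = [[[2, 2], [1, 1]]]
V = game_value_iteration(P, R)
```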
Sample Complexity of Multi-task Reinforcement Learning
Cited by 6 (1 self)
Transferring knowledge across a sequence of reinforcement-learning tasks is challenging, and has a number of important applications. Though there is encouraging empirical evidence that transfer can improve performance in subsequent reinforcement-learning tasks, there has been very little theoretical analysis. In this paper, we introduce a new multi-task algorithm for a sequence of reinforcement-learning tasks when each task is sampled independently from (an unknown) distribution over a finite set of Markov decision processes whose parameters are initially unknown. For this setting, we prove under certain assumptions that the per-task sample complexity of exploration is reduced significantly due to transfer compared to standard single-task algorithms. Our multi-task algorithm also has the desired characteristic that it is guaranteed not to exhibit negative transfer: in the worst case its per-task sample complexity is comparable to the corresponding single-task algorithm.
PAC Reinforcement Learning Bounds for RTDP and Rand-RTDP
Cited by 5 (3 self)
Real-time Dynamic Programming (RTDP) is a popular algorithm for planning in a Markov Decision Process (MDP). It can also be viewed as a learning algorithm, where the agent improves the value function and policy while acting in an MDP. It has been empirically observed that an RTDP agent generally performs well when viewed this way, but past theoretical results have been limited to asymptotic convergence proofs. We show that, like the true learning algorithms E^3 and RMAX, a slightly modified version of RTDP satisfies a Probably Approximately Correct (PAC) condition (with better sample complexity bounds). In other words, we show that the number of timesteps in an infinite-length run in which the RTDP agent acts according to a non-ε-optimal policy from its current state is less than some polynomial (in the size of the MDP), with high probability. We also show that a randomized version of RTDP is PAC with asymptotically equal sample complexity bounds, but has much less per-step computational cost: O(ln(S) ln(SA) + ln(A)) rather than O(S + ln(A)), when we consider only the dependence on S, the number of states, and A, the number of actions of the MDP.
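Plain RTDP, the algorithm the paper starts from, is simple to sketch: follow the greedy policy from the current state and apply a full Bellman backup at each visited state. The sketch below is the unmodified algorithm with an optimistic initial value (the PAC-modified variant from the paper adds machinery beyond this); the `sim_all` interface, the `v_init` constant, and the toy chain are assumptions for illustration.

```python
import random

def rtdp_trial(V, s0, actions, sim_all, gamma=0.95, horizon=20, v_init=20.0):
    """One RTDP trial: follow the greedy policy from s0, backing up V along
    the way. `sim_all(s, a)` returns a list of (prob, next_state, reward)
    outcomes so a full Bellman backup is possible; the next state is then
    sampled. `v_init` is an optimistic default value for unvisited states."""
    s = s0
    for _ in range(horizon):
        q = {}
        for a in actions:
            q[a] = sum(p * (r + gamma * V.get(s2, v_init))
                       for p, s2, r in sim_all(s, a))
        a = max(q, key=q.get)            # greedy action
        V[s] = q[a]                      # Bellman backup at the visited state
        outcomes = sim_all(s, a)
        s = random.choices([o[1] for o in outcomes],
                           [o[0] for o in outcomes])[0]
    return V

# Toy 3-state chain: action 1 moves right, entering state 2 pays 1, and
# state 2 is absorbing with zero reward; action 0 stays put.
def sim_all(s, a):
    if a == 1 and s < 2:
        return [(1.0, s + 1, 1.0 if s == 1 else 0.0)]
    return [(1.0, s, 0.0)]

V = {}
for _ in range(500):
    rtdp_trial(V, 0, [0, 1], sim_all, horizon=10)
```

Repeated trials drive the optimistic estimates down toward the true values (V[2] ≈ 0, V[1] ≈ 1, V[0] ≈ 0.95), which is the asymptotic-convergence behavior the paper sharpens into a PAC bound.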
Robust Bayesian reinforcement learning through tight lower bounds
Cited by 5 (5 self)
Abstract. In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them were particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to the utility of the optimal stationary policy for the decision problem; this policy is generally different from both the Bayes-optimal policy and the policy that is optimal for the mean MDP. We then show how these bounds can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting.
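The lower-bound idea can be illustrated on a finite posterior: the expected utility of any fixed stationary policy, averaged over the MDPs the posterior considers possible, lower-bounds the Bayes-optimal utility, and maximizing over stationary policies tightens it. The sketch below does this by brute-force enumeration on an invented two-MDP posterior; the paper's contribution is computing such a bound efficiently, which this sketch does not attempt.

```python
import itertools

def policy_value(P, R, pi, gamma=0.9, iters=200):
    """Value of stationary deterministic policy pi in the MDP (P, R), by
    iterated policy evaluation. P[s][a][s2] = transition prob, R[s][a] = reward."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        V = [R[s][pi[s]] + gamma * sum(P[s][pi[s]][s2] * V[s2]
                                       for s2 in range(n))
             for s in range(n)]
    return V

def stationary_lower_bound(mdps, weights, n_states, n_actions, gamma=0.9, s0=0):
    """Utility of the best stationary policy under a finite posterior
    (a list of (P, R) pairs with weights): a lower bound on the Bayes-optimal
    utility. Enumerates all deterministic policies, so tiny problems only."""
    best = float("-inf")
    for pi in itertools.product(range(n_actions), repeat=n_states):
        u = sum(w * policy_value(P, R, pi, gamma)[s0]
                for w, (P, R) in zip(weights, mdps))
        best = max(best, u)
    return best

# Invented posterior: one state, two actions, two equally likely MDPs that
# disagree about which action pays. Action 1 has the better average: 10.
P = [[[1.0], [1.0]]]
mdps = [(P, [[1.0, 0.0]]), (P, [[0.0, 2.0]])]
lb = stationary_lower_bound(mdps, [0.5, 0.5], n_states=1, n_actions=2)
```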
Autonomous Qualitative Learning of Distinctions and Actions in a Developing Agent
Cited by 5 (2 self)
How can an agent bootstrap up from a pixel-level representation to autonomously learn high-level states and actions using only domain-general knowledge? This thesis attacks a piece of this problem: it assumes that an agent has a set of continuous variables describing the environment and a set of continuous motor primitives, and poses a solution to the problem of how an agent can learn a set of useful states and effective higher-level actions through autonomous experience with the environment. Methods exist for learning models of the environment, and methods exist for planning; however, for autonomous learning, these methods have been used almost exclusively in discrete environments. This thesis proposes attacking the problem of learning high-level states and actions in continuous environments by using a qualitative representation to bridge the gap between continuous and discrete representations. In this approach, the agent begins with a broad discretization and initially can only tell whether the value of each variable is increasing, decreasing, or remaining steady. The agent then simultaneously learns a qualitative representation (discretization) and a set of predictive models of the environment, converts these models into plans to form actions, and uses those learned actions to explore the environment. The method is evaluated using a simulated robot with realistic physics. The robot sits at a table that contains one or two blocks, as well as other distractor objects that are out of reach. The agent autonomously explores the environment without being given a task. After learning, the agent is given various tasks to determine whether it learned the necessary states and actions to complete them. The results show that the agent was able to use this method to autonomously learn to perform the tasks.
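The initial broad discretization described above (each variable is increasing, decreasing, or steady) is simple to sketch. The noise threshold `eps` below is a hypothetical detail, not something the abstract specifies; this shows only the starting abstraction, not the learned refinements.

```python
def qualitative_direction(prev, curr, eps=1e-3):
    """Broad initial discretization of one continuous variable: is it
    increasing ('+'), steady ('0'), or decreasing ('-')?
    (eps is a hypothetical noise threshold.)"""
    d = curr - prev
    if d > eps:
        return "+"
    if d < -eps:
        return "-"
    return "0"

def qualitative_trace(values, eps=1e-3):
    """Qualitative abstraction of a continuous trajectory: one symbol per
    consecutive pair of samples."""
    return [qualitative_direction(a, b, eps) for a, b in zip(values, values[1:])]

trace = qualitative_trace([0.0, 0.5, 0.5001, 0.2])
# trace == ['+', '0', '-']: rise, (sub-threshold change), fall.
```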