Transfer Learning for Reinforcement Learning Domains: A Survey
Cited by 27 (3 self)
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
A unifying framework for computational reinforcement learning theory
2009
Cited by 13 (6 self)
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but it has to acquire this knowledge itself through exploration, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and the sciences. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts, in the hope of finding a better game-playing strategy. It has been known since the early days of RL that a good tradeoff between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies
Transferring state abstractions between MDPs
In ICML Workshop on Structural Knowledge Transfer for Machine Learning, 2006
Cited by 10 (1 self)
Decision makers that employ state abstraction (or state aggregation) usually find solutions faster by treating groups of states as indistinguishable, ignoring irrelevant state information. Identifying irrelevant information is essential for the field of knowledge transfer, where learning takes place in a general setting for multiple environments. We provide a general treatment and algorithm for transferring state abstractions in MDPs.
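The idea of treating states as indistinguishable once irrelevant information is ignored can be sketched in a few lines. This is only an illustration, not the algorithm from the paper: the function name `aggregate_states`, the dictionary state encoding, and the toy `light_on` feature are assumptions, and the set of relevant features is simply given as input rather than identified automatically.

```python
from collections import defaultdict


def aggregate_states(states, relevant_features):
    """Group ground states that agree on every relevant feature.

    `states` is a list of dicts mapping feature name -> value (assumed encoding).
    Returns a dict: abstract state (tuple of relevant values) -> list of ground states.
    """
    groups = defaultdict(list)
    for state in states:
        # Project the state onto the relevant features; everything else is ignored.
        abstract_state = tuple(state[name] for name in relevant_features)
        groups[abstract_state].append(state)
    return dict(groups)


# Toy example (hypothetical): a 2x2 grid where `light_on` is assumed irrelevant.
example_states = [
    {"x": x, "y": y, "light_on": light}
    for x in range(2)
    for y in range(2)
    for light in (False, True)
]
abstract_states = aggregate_states(example_states, relevant_features=("x", "y"))
```

In this toy setting the eight ground states collapse into four abstract states, each containing the two ground states that differ only in `light_on`.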
Efficient skill learning using abstraction selection
In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009
Cited by 8 (6 self)
We present an algorithm for selecting an appropriate abstraction when learning a new skill. We show empirically that it can consistently select an appropriate abstraction using very little sample data, and that it significantly improves skill learning performance in a reasonably large real-valued reinforcement learning domain.
Thresholded Rewards: Acting Optimally in Timed, Zero-Sum Games
2007
Cited by 6 (3 self)
In timed, zero-sum games, the goal is to maximize the probability of winning, which is not necessarily the same as maximizing our expected reward. We consider cumulative intermediate reward to be the difference between our score and our opponent’s score; the “true” reward of a win, loss, or tie is determined at the end of a game by applying a threshold function to the cumulative intermediate reward. We introduce thresholded-rewards problems to capture this dependency of the final reward outcome on the cumulative intermediate reward. Thresholded-rewards problems reflect different real-world stochastic planning domains, especially zero-sum games, in which time and score need to be considered. We investigate the application of thresholded rewards to finite-horizon Markov Decision Processes (MDPs). In general, the optimal policy for a thresholded-rewards MDP will be nonstationary, depending on the number of time steps remaining and the cumulative intermediate reward. We introduce an efficient value iteration algorithm that solves thresholded-rewards MDPs exactly, but with running time quadratic in the number of states in the MDP and the length of the time horizon. We investigate a number of heuristic-based techniques that efficiently find approximate solutions for MDPs with large state spaces or long time horizons.
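The dependence of the optimal action on both the time remaining and the cumulative intermediate reward can be illustrated with a small backward-induction sketch. It is not the paper's value iteration algorithm: it is a plain memoized recursion over (state, steps left, score), and the function name `solve_thresholded`, the transition format, the integer-score assumption, the win/tie/loss values of +1/0/-1, and the one-state toy game are all assumptions made for illustration.

```python
from functools import lru_cache


def solve_thresholded(actions, transitions):
    """Return value(state, steps_left, score) -> (expected true reward, best action).

    transitions[state][action] is a list of (probability, next_state, reward)
    triples with integer intermediate rewards (assumed format).
    """

    def threshold(score):
        # True reward at the end of the game: +1 win, 0 tie, -1 loss.
        return (score > 0) - (score < 0)

    @lru_cache(maxsize=None)
    def value(state, steps_left, score):
        if steps_left == 0:
            return threshold(score), None
        best_value, best_action = float("-inf"), None
        for action in actions:
            expected = sum(
                prob * value(next_state, steps_left - 1, score + reward)[0]
                for prob, next_state, reward in transitions[state][action]
            )
            if expected > best_value:
                best_value, best_action = expected, action
        return best_value, best_action

    return value


# Hypothetical one-state game: "safe" never changes the score,
# "risky" scores +1 with probability 0.4 and concedes -1 with probability 0.6.
actions = ["safe", "risky"]
transitions = {
    "s": {
        "safe": [(1.0, "s", 0)],
        "risky": [(0.4, "s", 1), (0.6, "s", -1)],
    }
}
value = solve_thresholded(actions, transitions)
ahead = value("s", 1, 1)     # one step left, leading by one
behind = value("s", 1, -1)   # one step left, trailing by one
```

With one step left, the sketch chooses `safe` when ahead and `risky` when behind, even though `risky` has negative expected intermediate reward. This is the kind of score- and time-dependent behavior that a stationary policy cannot express.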
Bounding Performance Loss in Approximate MDP Homomorphisms
Cited by 5 (4 self)
We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), in which action similarity is taken into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
Sensorimotor abstraction selection for efficient, autonomous robot skill acquisition
In Proceedings of the 7th IEEE International Conference on Development and Learning, 2008
Cited by 3 (2 self)
To achieve truly autonomous robot skill acquisition, a robot can use neither a single large general state space (because learning is not feasible), nor a small problem-specific state space (because it is not general). We propose that instead a robot should have a set of sensorimotor abstractions that can be considered small candidate state spaces, and select one that is appropriate for learning a skill when it decides to do so. We introduce an incremental algorithm that selects a state space in which to learn a skill from among a set of potential spaces given a successful sample trajectory. The algorithm returns a policy fitting that trajectory in the new state space so that learning does not have to begin from scratch. We demonstrate that the algorithm selects an appropriate space for a sequence of demonstration skills on a physically realistic simulated mobile robot, and that the resulting initial policies closely match the sample trajectory.
Efficient learning of relational models for sequential decision making
2010
Cited by 2 (0 self)
The exploration-exploitation tradeoff is crucial to reinforcement-learning (RL) agents, and a significant number of sample complexity results have been derived for agents in propositional domains. These results guarantee, with high probability, near-optimal behavior in all but a polynomial number of timesteps in the agent’s lifetime. In this work, we prove similar results for certain relational representations, primarily a class we call “relational action schemas”. These generalized models allow us to specify state transitions in a compact form, for instance describing the effect of picking up a generic block instead of picking up 10 different specific blocks. We present theoretical results on crucial subproblems in action-schema learning using the KWIK framework, which allows us to characterize the sample efficiency of an agent learning these models in a reinforcement-learning setting. These results are extended in an apprenticeship learning paradigm where an agent has access not only to its environment, but also to a teacher that can demonstrate traces of state/action/state sequences. We show that the class of action schemas that are efficiently learnable in this paradigm is strictly larger than those learnable in the online setting. We link
Planning for Human-Robot Interaction Using Time-State Aggregated POMDPs
Cited by 2 (0 self)
In order to interact successfully in social situations, a robot must be able to observe others’ actions and base its own behavior on its beliefs about their intentions. Many interactions take place in dynamic environments, and the outcomes of people’s or the robot’s actions may be time-dependent. In this paper, such interactions are modeled as a POMDP with a time index as part of the state, resulting in a fully Markov model with a potentially very large state space. The complexity of finding even an approximate solution often limits the practical applicability of POMDPs for large problems. This difficulty is addressed through the development of an algorithm for aggregating states in POMDPs with a time-indexed state space. States that represent the same physical configuration of the environment at different times are chosen to be combined using reward-based metrics, preserving the structure of the original model while producing a smaller model that is faster to solve. We demonstrate that solving the aggregated model produces a policy with performance comparable to the policy from the original model. The example domains used are a simulated elevator-riding task and a simulated driving task based on data collected from human drivers.
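As a rough illustration of combining time-indexed copies of the same physical configuration, the sketch below merges consecutive time steps whose rewards stay close to each other. The abstract only says that "reward-based metrics" are used, so the specific criterion here (absolute reward difference from the first time step of the current interval, compared against a tolerance) is an assumption, as are the function name `aggregate_over_time`, the data layout, and the toy numbers. Transition and observation models are ignored entirely.

```python
def aggregate_over_time(rewards, tolerance):
    """Merge consecutive time indices of each physical configuration.

    rewards[config] is a non-empty list whose entry t is the reward of the
    time-indexed state (config, t) (assumed layout).
    Returns a dict: config -> list of inclusive (start_t, end_t) intervals,
    each interval standing for one aggregated state.
    """
    aggregated = {}
    for config, by_time in rewards.items():
        intervals = []
        start = 0
        for t in range(1, len(by_time)):
            # Start a new aggregated state once the reward drifts too far
            # from the reward at the beginning of the current interval.
            if abs(by_time[t] - by_time[start]) > tolerance:
                intervals.append((start, t - 1))
                start = t
        intervals.append((start, len(by_time) - 1))
        aggregated[config] = intervals
    return aggregated


# Hypothetical rewards for two physical configurations over five time steps.
example_rewards = {
    "at_door": [0.0, 0.0, 0.1, 1.0, 1.0],
    "in_elevator": [5.0, 5.0, 5.0, 5.0, 5.0],
}
example_intervals = aggregate_over_time(example_rewards, tolerance=0.2)
```

In this toy input, the ten time-indexed states reduce to three aggregated states: two intervals for `at_door` and a single interval for `in_elevator`.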
Automatic shaping and decomposition of reward functions
2007
Cited by 2 (0 self)
This paper investigates the problem of automatically learning how to restructure the reward function of a Markov decision process so as to speed up reinforcement learning. We begin by describing a method that learns a shaped reward function given a set of state and temporal abstractions. Next, we consider decomposition of the per-timestep reward in multieffector problems, in which the overall agent can be decomposed into multiple units that are concurrently carrying out various tasks. We show by example that to find a good reward decomposition, it is often necessary to first shape the rewards appropriately. We then give a function approximation algorithm for solving both problems together. Standard reinforcement learning algorithms can be augmented with our methods, and we show experimentally that in each case, significantly faster learning results.

