
## Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence

Citations: 3 (3 self)

### Citations

789 | Finite-time analysis of the multiarmed bandit problem - Auer, Cesa-Bianchi, et al.

378 | Online Q-learning using connectionist systems (Tech. rep.) - Rummery, Niranjan - 1994
Citation Context: ...indicating the utility of that transition. The goal of an RL agent operating in an MDP is to maximize the expected, discounted return of the reward function. Temporal-difference learners such as SARSA (Rummery and Niranjan 1994) attempt this by estimating the Q-function, which represents the utility of each state-action pair. MOMDPs (Roijers et al. 2013) extend this framework to multiple objectives, with the reward function...
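The SARSA estimate mentioned in this context can be illustrated with a minimal sketch; the tabular Q representation, learning rate, and discount factor below are illustrative assumptions, not details from the cited paper.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: move Q(s, a) toward the target r + gamma * Q(s', a')."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

# Usage: a single transition (s=0, a=1) -> reward 1.0 -> (s'=2, a'=0).
Q = defaultdict(float)  # unseen state-action pairs default to utility 0
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)  # Q[(0, 1)] becomes 0.1
```

Because the target uses the action actually taken next (a'), SARSA is on-policy, in contrast to Q-learning's max over actions.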

364 | Multi-Agent Systems: A Survey from a Machine Learning Perspective - Stone, Veloso - 2000

344 | On a measure of divergence between two statistical populations defined by probability distributions - Bhattacharyya - 1943
Citation Context: ...use a number of statistical tests to estimate confidence. Continuing with our examples, given mean and variance of two (assumed) normal distributions, we can calculate the Bhattacharyya coefficient (Bhattacharyya 1943), which indicates the percentage of overlap between given distributions. The less overlap between distributions, the better the agent can differentiate between the actions represented by those distri...
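Under the normality assumption stated in this context, the coefficient has a closed form via the Gaussian Bhattacharyya distance; the function name and parameterization below are an illustrative sketch, not code from the cited paper.

```python
import math

def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """Overlap of two univariate normals: 1.0 for identical, toward 0 for disjoint."""
    distance = (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
                + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))
    return math.exp(-distance)

# Identical distributions overlap completely; widely separated means barely overlap.
bhattacharyya_coefficient(0.0, 1.0, 0.0, 1.0)   # -> 1.0
bhattacharyya_coefficient(0.0, 1.0, 10.0, 1.0)  # near 0
```

A low coefficient then signals high confidence that the two action-value distributions are distinguishable.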

240 | Policy invariance under reward transformations: Theory and application to reward shaping - Ng, Harada, et al. - 1999

64 | A multiagent approach to autonomous intersection management - Dresner, Stone - 2008
Citation Context: ...e two objectives in (Brys, Pham, and Taylor 2014), we classify this problem as a CMOMDP, and follow the same experimental setup. The experiments were implemented in the real-time AIM micro simulator (Dresner and Stone 2008), set up with a four-intersection Manhattan grid. Each of the four lights is controlled by a separate SARSA(λ) agent, which has only local information, i.e. information about its own intersection. The...

58 | Reducing local optima in single-objective problems by multi-objectivization, in Evolutionary Multi-Criterion Optimization - Knowles, Watson, et al. - 2001

38 | On optimal cooperation of knowledge sources – an empirical investigation - Benda, Jagannathan, et al. - 1986 |

31 | Reinforcement learning: An introduction - Sutton, Barto - 1998
Citation Context: ...ack for the same basic single-objective problem, and intelligently combining such objectives may yield faster and better optimization. This paper deals with such reinforcement learning (RL) problems (Sutton and Barto 1998), formulated as Correlated Multi-Objective Markov Decision Processes (CMOMDP). (Single-objective) MDPs describe a system as a set of potential observations of that system's state S, a set of possible...

29 | Reinforcement learning in continuous action spaces - Hasselt, Wiering - 2007 |

28 | Brains, Behavior and Robotics - Albus - 1981
Citation Context: ...lems have very large and/or continuous state spaces, making basic tabular learning methods impossible to use. A very popular way to overcome this problem is to use tile-coding function approximation (Albus 1981), which overlays the state space with multiple axis-parallel tilings. This allows for a discretization of the state space, while the overlapping tilings guarantee a certain degree of generalization. ...
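The tile-coding scheme described in this context can be sketched minimally for a 2-D state in [0, 1)²; the tiling count, resolution, and offset scheme are illustrative assumptions rather than the cited paper's configuration.

```python
def active_tiles(x, y, n_tilings=4, tiles_per_dim=8):
    """Return the one active tile (per tiling) covering point (x, y) in [0, 1)^2."""
    width = 1.0 / tiles_per_dim
    tiles = []
    for t in range(n_tilings):
        offset = t * width / n_tilings  # shift each tiling by a fraction of a tile
        ix = min(int((x + offset) / width), tiles_per_dim)  # clamp to an edge tile
        iy = min(int((y + offset) / width), tiles_per_dim)
        tiles.append((t, ix, iy))
    return tiles

# Nearby points share most active tiles, which is what gives generalization.
a = active_tiles(0.50, 0.50)
b = active_tiles(0.52, 0.52)
shared = set(a) & set(b)
```

A learner then represents Q(s, a) as a sum of weights, one per active tile, so updating one point also adjusts estimates for its neighbors.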

20 | Theoretical considerations of potential-based reward shaping for multi-agent systems - Devlin, Kudenko - 2011
Citation Context: ...formance of single-objective learn... ¹Results of our experimental validation are omitted for space. ²It has been proven that potential-based shaping in multi-agent RL does not alter the Nash Equilibria (Devlin and Kudenko 2011). ³The code used to run experiments in the pursuit domain can be downloaded at http://ai.vub.ac.be/members/tim-brys ...

18 | An empirical study of potential-based reward shaping and advice in complex, multi-agent systems - Devlin, Kudenko, et al. |

17 | Helper-objectives: Using multi-objective evolutionary algorithms for single-objective optimisation - Jensen - 2005 |

14 | A survey of multi-objective sequential decision-making - Roijers, Vamplew, et al. - 2013
Citation Context: ...the reward function. Temporal-difference learners such as SARSA (Rummery and Niranjan 1994) attempt this by estimating the Q-function, which represents the utility of each state-action pair. MOMDPs (Roijers et al. 2013) extend this framework to multiple objectives, with the reward function returning a vector of rewards to be maximized, and the added difficulty of finding trade-off solutions (e.g., trading off econo...

5 | GIS and intelligent agents for multiobjective natural resource allocation: A reinforcement learning approach - Bone, Dragicevic - 2009 |

4 | Introduction to Intelligent Systems in Traffic and Transportation - Bazzan, Klügl - 2014 |

1 | Multi-objectivization of reinforcement learning problems by reward shaping - Brys, Harutyunyan, et al. - 2014 |

1 | Distributed learning and multi-objectivity in traffic light control. Connection Science 26(1):56–83 - Brys, Pham, et al. - 2014
Citation Context: ...Of course, this modification in itself cannot improve learning, but these copies of the basic reward signal can be diversified by adding a different potential-based reward shaping function to each (Brys et al. 2014). Since potential-based reward shaping is guaranteed not to alter the optimality of solutions (Ng, Harada, and Russell 1999), the problem remains a CMOMDP with a single Pareto optimal point, but each...
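The potential-based shaping referenced in this context (Ng, Harada, and Russell 1999) adds F(s, s') = γΦ(s') − Φ(s) to the base reward; the sketch below is a hedged illustration in which the potential function Φ is a hypothetical placeholder.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Base reward plus the potential-based shaping term F = gamma*Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)

# Illustrative potential: treat a higher-numbered state as more promising.
phi = lambda s: float(s)
shaped_reward(1.0, 0, 1, phi, gamma=1.0)    # -> 1.0 + 1.0 - 0.0 = 2.0
shaped_reward(1.0, 0, 1, lambda s: 0.0)     # -> 1.0 (zero potential changes nothing)
```

Because the shaping term telescopes along any trajectory, it redistributes reward without changing which policies are optimal, which is why each diversified copy of the signal still points at the same Pareto optimal point.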

1 | Increasing efficiency of evolutionary algorithms by choosing between auxiliary fitness functions with reinforcement learning - Buzdalova, Buzdalov |