
## Multiagent Planning and Learning Using Random Decompositions and Adaptive Representations (2015)

### Citations

5604 | Reinforcement learning: an introduction
- Sutton, Barto
- 1998
Citation Context ...f the state space is simply m^{n_a}, where m is the number of grid cells. Hence the problem size increases exponentially in the number of agents. Many existing MDP solvers [27] and transition model estimators [28] have polynomial complexity in the size of the state-action space, which corresponds to exponential complexity in the number of agents, rendering these algorithms computationally intractable fo... |
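To make the blow-up concrete, a minimal sketch (the grid size and agent counts below are illustrative assumptions, not values from the source): the joint state space of n_a agents, each on an m-cell grid, has m^{n_a} states, so any solver polynomial in this size is exponential in n_a.

```python
# Joint state-space size for n_a agents on an m-cell grid is m ** n_a,
# so a solver polynomial in this size is exponential in the agent count.
def joint_state_space_size(m: int, n_agents: int) -> int:
    return m ** n_agents

for n_agents in (1, 2, 5, 10):
    print(n_agents, joint_state_space_size(100, n_agents))
```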

5441 | Artificial Intelligence: A Modern Approach, 2nd edition
- Russell, Norvig
- 2003
Citation Context ...rk+l, and the loop continues. The TBPI algorithm (see Section 2.1.2) was used as the planner. The first two domains are classical RL problems: 1) gridworld and 2) block-building problems motivated by [105]. Both domain descriptions are extended to include state-correlated uncertainty. The third domain is a PST mission with state-correlated uncertainty in fuel dynamics. The structure of this domain is more ... |

3931 | On the Theory of Dynamic Programming
- Bellman
- 1952
Citation Context ...ed Multiagent Planning with MDPs Dynamic Programming (DP) [36] is one of the most popular methods for solving an MDP, which involves computing the value function as a solution to the Bellman equation [37]. Exact DP methods, such as value iteration and policy iteration [38], work by sweeping through the planning space performing Bellman backups, and they are guaranteed to converge to the optimal solut... |
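A minimal sketch of the Bellman backups these exact DP methods perform (the two-state MDP below is invented for illustration; T, R, and the discount factor follow the tuple notation used in the snippets):

```python
import numpy as np

# Toy MDP: 2 states, 2 actions. T[s, a, s2] are transition probabilities,
# R[s, a] rewards, gamma the discount factor.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):                 # value iteration: repeated Bellman backups
    Q = R + gamma * (T @ V)           # Q[s, a] = R[s, a] + gamma * sum_s2 T[s, a, s2] V[s2]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new
policy = Q.argmax(axis=1)             # greedy policy w.r.t. the converged values
```

Because the backup is a gamma-contraction, the loop converges geometrically to the unique fixed point of the Bellman equation.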

1897 | Markov decision processes: Discrete stochastic dynamic programming
- Puterman
- 1994
Citation Context ...ng problems of interest. A deeper technical discussion of these problems and tools for analyzing them will be provided in Chapter 2. 1.2.1 Multiagent Planning Problem Markov Decision Processes (MDPs) [21] are a common framework for analyzing stochastic decision-making problems. We will utilize this framework to study multiagent planning problems. An MDP is a tuple (S, A, T, R, γ), where S is the stat... |
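The tuple (S, A, T, R, γ) can be written down directly; a minimal container sketch (the class name and the toy two-state instance are assumptions for illustration):

```python
from typing import Callable, NamedTuple, Sequence

class MDP(NamedTuple):
    states: Sequence[int]                    # S
    actions: Sequence[int]                   # A
    T: Callable[[int, int, int], float]      # T(s, a, s2): transition probability
    R: Callable[[int, int], float]           # R(s, a): reward
    gamma: float                             # discount factor in [0, 1)

# Tiny two-state instance: each action moves to either state with equal probability.
mdp = MDP(
    states=[0, 1],
    actions=[0, 1],
    T=lambda s, a, s2: 0.5,
    R=lambda s, a: float(s == a),
    gamma=0.95,
)
```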

1310 | An Introduction to Multi-Agent Systems
- Wooldridge
- 2002
Citation Context ...laws e_i and the communication laws c_i such that the estimates T̂_{i,k} → T_i as k → ∞, preferably at an exponential rate. 1.3 Challenges Planning and learning with MMDPs involve a variety of challenges [25]. This thesis addresses three particular challenges that are critical to the design of multiagent planning and learning algorithms. 1. Scalability: The "curse of dimensionality" is a widely recognized... |

1027 | Evolutionary Game Theory
- Weibull
- 1997
Citation Context ...d communication have been developed [68, 69]. However, extensions to the heterogeneous learning setting have not been much investigated. Game Theory is a popular framework for studying multiagent systems [70]; the flexibility of the game-theoretic modeling framework has enabled a thorough analysis of both homogeneous and heterogeneous team learning in cooperative and competitive scenarios [71, 72]. However, studi... |

914 | Probabilistic graphical models: principles and techniques
- Koller, Friedman
- 2009
Citation Context ...he Bayesian Optimization Algorithm (BOA) [81] is a probabilistic framework for solving black-box optimization problems by guiding the search for the global optimum using probabilistic graphical models [85], built by performing inference on the samples obtained from the Algorithm 2: Bayesian Optimization (BOA) algorithm Input: The evaluation function f, Probabilistic Graphical Model p(x; θ), Initial ... |

867 | Exact matrix completion via convex optimization.
- Candes, Recht
- 2009
Citation Context ...umber of items as n_p. Let M_ij ∈ [0, 1] denote the rating of product i provided by user j. Usually |M| < n_p × n_d, and the missing entries are estimated by solving the following optimization problem [89]: min ||X||_* (2.13) s.t. X_ij = M_ij, where ||·||_* is the nuclear norm. This optimization problem is convex and can be solved in real time for reasonably sized matrices. We will refer to the matrix with ... |
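The exact nuclear-norm program above requires a convex solver; as a hedged stand-in, the soft-impute heuristic below approximates nuclear-norm-regularized completion by iterative singular-value soft-thresholding (the rank-1 "ratings" matrix, observation rate, and threshold `tau` are illustrative assumptions, not from the source):

```python
import numpy as np

def soft_impute(M, mask, tau=0.01, n_iter=300):
    """Iterative singular-value soft-thresholding (soft-impute heuristic):
    shrink singular values by tau, then re-impose the observed entries."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt   # singular-value shrinkage
        X = np.where(mask, M, X)                  # keep observed ratings fixed
    return X

rng = np.random.default_rng(0)
M = np.outer(rng.random(6), rng.random(5))        # low-rank "ratings" matrix
mask = rng.random(M.shape) < 0.7                  # ~70% of entries observed
X = soft_impute(M, mask)
```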

768 | Dynamic Bayesian Networks: representation, inference and learning
- Murphy
- 2002
Citation Context ...ves investigating the factorization of the state-action space and then exploiting the independence between these factors. Factored MDPs [49] represent the original MDP with a Dynamic Bayesian Network [50] (DBN), where the state transitions are represented in a more compact manner to avoid exponential blow-up of the planning space. Both policy iteration [51] and linear programming approaches [52] were ... |

751 | Dynamic Programming and Optimal Control. Athena Scientific
- Bertsekas
- 2005
Citation Context ... such as value iteration and policy iteration [38] work by sweeping through the planning space performing Bellman backups, and they are guaranteed to converge to the optimal solution asymptotically [39]. However, for large-scale planning spaces, such as those associated with multiagent missions, both memory and computational complexity increase exponentially in the number of dimensions, which rende... |

699 | Adaptive Control Processes: A Guided Tour
- Bellman
- 1961
Citation Context ... of the most popular methods for solving an MDP, which involves computing the value function as a solution to the Bellman equation [37]. Exact DP methods, such as value iteration and policy iteration [38], work by sweeping through the planning space performing Bellman backups, and they are guaranteed to converge to the optimal solution asymptotically [39]. However, for large-scale planning spaces, su... |

632 | Learning to act using real-time dynamic programming - Barto, Bradtke, et al. - 1995 |

601 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context ... multiagent systems [70]; the flexibility of the game-theoretic modeling framework has enabled a thorough analysis of both homogeneous and heterogeneous team learning in cooperative and competitive scenarios [71, 72]. However, studies have mainly focused on theoretical aspects, such as convergence to Nash equilibria, while practical aspects such as scalability and the ability to work under limited communication were... |

411 | The complexity of decentralized control of Markov decision processes
- Bernstein, Givan, et al.
Citation Context ... open problem. It should also be noted that we are interested in solving centralized multiagent planning problems in this thesis. There is a large body of literature on decentralized multiagent planning (see [23, 65, 66]). These problems come with additional challenges, such as partial observability of the agent states and enforcing policy consensus among agents. Although decentralized planning is a very interest... |

374 | Approximate Dynamic Programming: Solving the curses of dimensionality
- Powell
- 2007
Citation Context ... by its current location. The size of the state space is simply m^{n_a}, where m is the number of grid cells. Hence the problem size increases exponentially in the number of agents. Many existing MDP solvers [27] and transition model estimators [28] have polynomial complexity in the size of the state-action space, which corresponds to exponential complexity in the number of agents, rendering these algo... |

369 | Multiagent Systems: A Survey from a Machine Learning Perspective
- Stone, Veloso
- 2000
Citation Context ...Introduction The design and study of multiagent systems is a widely acknowledged interdisciplinary engineering problem, drawing the attention of researchers from distinct fields, such as computer science [1], electrical engineering [2], aerospace engineering [3] and operations research [4]. The engineering of multiagent systems for cooperative scenarios can be defined as the development of methodologies ... |

366 | Stochastic approximation methods for constrained and unconstrained systems.
- Kushner, Clark
- 1978
Citation Context ...feature vector correspondingly, so that we can write T̂_i = θ_i^⊤ φ_i in vector notation. The standard approach to updating θ_i online from noisy samples (s_{i,k}, t_{i,k}) is the stochastic gradient approach [86]: θ_{i,k+1} = θ_{i,k} + α_k Δ_k(s_{i,k}, t_{i,k}) φ_i(s_{i,k}), (2.8) where α_k ∈ [0,1] is the learning rate used for mitigating the noise at iteration k, and Δ_k(s_{i,k}, t_{i,k}) is the estimation error at step k, = t_{i,k} ... |
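A sketch of this stochastic gradient update for a linear-in-features model (the quadratic feature map, noise level, and step-size schedule below are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([2.0, -1.0, 0.5])        # unknown parameters to estimate

def phi(s):                                    # feature map phi(s)
    return np.array([1.0, s, s * s])

theta = np.zeros(3)
for k in range(20000):
    s = rng.uniform(-1.0, 1.0)                 # sampled state s_k
    t = theta_true @ phi(s) + 0.01 * rng.normal()   # noisy target t_k
    alpha = 0.1 / (1.0 + 0.001 * k)            # diminishing learning rate alpha_k
    delta = t - theta @ phi(s)                 # estimation error Delta_k
    theta = theta + alpha * delta * phi(s)     # theta_{k+1} = theta_k + alpha_k * Delta_k * phi(s_k)
```

With a diminishing step size and persistent sampling, the iterate drifts toward the true parameters despite the sample noise.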

329 | BOA: The Bayesian Optimization Algorithm. In
- Pelikan, Goldberg, et al.
- 1999
Citation Context ...cess is automated by the use of a probabilistic meta-optimization layer. The algorithm, named Randomized Coordination Discovery (RCD), utilizes the framework of the Bayesian Optimization Algorithm [81], which guides the decomposition search by generating samples from probabilistic graphical models defined on the planning space. This meta-optimization layer automatically discovers the coordination s... |

326 | Controlling cooperative problem solving in industrial multi-agent systems using joint intentions - Jennings - 1995 |

313 | An analysis of temporal-difference learning with function approximation.
- Tsitsiklis, Roy
- 1997
Citation Context ...g Theorem 3.4.2 with Lemma 3.4.3 shows that the representation used by RCD-TBPI will converge to a fixed linear value function approximation Q̂ = Σ_i θ_i φ_i asymptotically. The well-known result from [96] puts an upper bound on the approximation error introduced by a linear function approximation when the basis functions φ_i are fixed. Since the representation used by RCD-TBPI converges to a fixed rep... |

297 | R-max – a general polynomial time algorithm for near-optimal reinforcement learning.
- Brafman, Tennenholtz
- 2002
Citation Context ... as the exploration-exploitation dilemma [28]. Many different approaches were developed to handle this dilemma, such as randomized exploration [29] and the incorporation of knownness and memory functions [41]. However, handling exploration in large-scale planning spaces is still an open problem. Approximate Dynamic Programming (ADP) [29] methods were developed to address the issue of scalability by approx... |

232 | Packet routing in dynamically changing networks: A reinforcement learning approach
- Boyan, Littman
- 1994
Citation Context ...gents to learn different or agent-dependent models, such as the impact of external disturbances on the agent health [32] or modelling of the arrival rate of packages in a specific part of the network [33]. In such problems, the learning process might be accelerated substantially by sharing information/models among agents. Learning with heterogeneous teams is far more challenging, because it involves t... |

225 | The linear programming approach to approximate dynamic programming.
- Farias, Roy
- 2003
Citation Context ...twork [50] (DBN), where the state transitions are represented in a more compact manner to avoid exponential blow-up of the planning space. Both policy iteration [51] and linear programming approaches [52] were developed for solving factored MDPs efficiently. A highly scalable multiagent planning algorithm with factored MDPs was proposed by Guestrin and Parr [53], where each agent solves its individual... |

172 | Efficient solution algorithms for factored MDPs
- Guestrin, Koller, et al.
Citation Context ... enable faster computation time [48]. This process usually involves investigating the factorization of the state-action space and then exploiting the independence between these factors. Factored MDPs [49] represent the original MDP with a Dynamic Bayesian Network [50] (DBN), where the state transitions are represented in a more compact manner to avoid exponential blow-up of the planning space. Both po... |

166 | Modeling supply chain dynamics: A multiagent approach."
- Swaminathan, Smith
- 1998
Citation Context ...terdisciplinary engineering problem, drawing the attention of researchers from distinct fields, such as computer science [1], electrical engineering [2], aerospace engineering [3] and operations research [4]. The engineering of multiagent systems for cooperative scenarios can be defined as the development of methodologies for coordinating multiple entities towards achieving a common goal. An important su... |

166 | Sequential optimality and coordination in multiagent systems.
- Boutilier
- 1999
Citation Context ...γ ∈ [0,1) is a discount factor. T(s, a, s') denotes the probability of ending up at state s' ∈ S given that the current state and action are s ∈ S, a ∈ A. Multiagent Markov Decision Processes (MMDPs) [22] are specific types of MDPs that involve multiple agents, where the action space is factorized as A = A_1 × A_2 × ⋯ × A_{n_a}, and n_a is the number of agents. The objective of the planning problem is to find ... |

152 | Kernel-based reinforcement learning
- Ormoneit, Sen
Citation Context ...n process based on the observed/simulated data. Examples include adaptive tile coding [44], incremental feature dependency discovery [45], orthogonal matching pursuit [46] and kernel-based approximators [47]. Overall, these approximate techniques were shown to improve the scalability and convergence rate substantially for a wide variety of MDPs and relaxed the constraints on specifying a fixed set of bas... |

141 | Hierarchical Solution of Markov Decision Processes using Macroactions.
- Hauskrecht, Meuleau, et al.
- 1998
Citation Context ...e-scale MDP can be decomposed into a hierarchy of sub-MDPs, where sub-MDPs at the bottom of the hierarchy can be solved by neglecting states of MDPs at the top. Several works extended this formulation [60] and developed efficient algorithms that can solve hierarchical MDPs much faster compared to the original flat MDP. The options framework [61] is closely related to the hierarchical MDPs, where a c... |

121 | Model minimization in Markov decision processes
- Dean, Givan
- 1997
Citation Context ...ning spaces. Structural MDP decomposition algorithms take advantage of the structure of the problem formulation to lower the size and dimension of the planning space and enable faster computation [48]. This process usually involves investigating the factorization of the state-action space and then exploiting the independence between these factors. Factored MDPs [49] represent the original MDP with... |

113 | Coordinated Reinforcement Learning. In:
- Guestrin, Lagoudakis, et al.
- 2002
Citation Context ... [51] and linear programming approaches [52] were developed for solving factored MDPs efficiently. A highly scalable multiagent planning algorithm with factored MDPs was proposed by Guestrin and Parr [53], where each agent solves its individual MDP and the joint return optimization is achieved via the use of a coordination graph. The main drawback of the aforementioned approaches is the assumption that th... |

111 | The Design and Analysis of a Computational Model of Cooperative Coevolution.
- Potter
- 1997
Citation Context ...red. Homogeneous learning has been investigated by application of evolutionary computation techniques [67]; important results include applications to team-mate modeling [73], competitive learning [74] and credit assessment [75]. Overall, these algorithms demonstrated good scalability properties. However, heterogeneous learning and performance under limited communication have not been much investigated. ... |

110 | Transfer learning for reinforcement learning domains: a survey
- Taylor, Stone
- 2009
Citation Context ...arning in the new problem. Transfer learning is an active area of research and has found many applications, ranging from multi-task learning [78] and collaborative filtering [79] to reinforcement learning [80]. Although transfer learning algorithms are not developed specifically for solving heterogeneous multiagent learning problems, they can be used as a tool to study similarities between models learned b... |

101 | A comprehensive survey of multiagent reinforcement learning
- Busoniu, Babuska, et al.
- 2008
Citation Context ...iagent Reinforcement Learning (MARL) algorithms solve multiagent planning problems without knowledge of the model beforehand. A general survey on multiagent reinforcement learning can be found in [76]. MARL algorithms typically learn a joint model across the agents, hence the model heterogeneity challenge is not explicitly addressed. In addition, many MARL algorithms usually suffer from scalability... |

94 | Policy iteration for factored MDPs.
- Koller, Parr
- 2000
Citation Context ...original MDP with a Dynamic Bayesian Network [50] (DBN), where the state transitions are represented in a more compact manner to avoid exponential blow-up of the planning space. Both policy iteration [51] and linear programming approaches [52] were developed for solving factored MDPs efficiently. A highly scalable multiagent planning algorithm with factored MDPs was proposed by Guestrin and Parr [53],... |

92 | Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes.
- WU, Mahadevan, et al.
- 2007
Citation Context ... This assumption usually simplifies both the planning and the learning problem significantly. For instance, the function approximation methods [29] and dimensionality reduction methods [30] work much more efficiently for multiagent planning problems with homogeneous teams. Similarly, solving multiagent learning problems becomes significantly easier, since all the agents are trying to... |

92 | Bayesian learning in social networks.
- Gale, Kariv
- 2003
Citation Context ...sensus problem [31]. This problem has been investigated extensively in different settings, and efficient algorithms in the Bayesian framework for both perfect and limited communication have been developed [68, 69]. However, extensions to the heterogeneous learning setting have not been much investigated. Game Theory is a popular framework for studying multiagent systems [70]; the flexibility of the game-theoretic mode... |

91 | Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press
- Busoniu, Babuska, et al.
- 2010
Citation Context ...re greater than 6d^6 log d + 4d^5 log(n_f(n_agent + n_ext)), where d is the maximum neighborhood size in the graph. A2) The exploration schedule ε(k) of TBPI satisfies the infinite exploration conditions [29]. Under the assumptions above, the RCD algorithm converges to a distribution p(x; θ*). Proof The convergence of BOA was established in Lemma 3.4.1. In order to extend this result to the convergence of... |

77 | Introduction to Probability. Athena Scientific
- Bertsekas, Tsitsiklis
- 2002
Citation Context ... By using the definition of the variance [94], we can see that the numerator is E_k[f(x)^2] − E_k[f(x)]^2 = E_k[(f(x) − E_k[f(x)])^2], which is the variance of f(x) with respect to the probability distribution p(x; θ_k). Since the variance is always non-... |
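The identity invoked here, E_k[f(x)^2] − E_k[f(x)]^2 = E_k[(f(x) − E_k[f(x)])^2] ≥ 0, can be checked numerically (the distribution p and values f below are arbitrary illustrative choices):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])      # discrete distribution p(x; theta_k) over the support
f = np.array([1.0, 4.0, 9.0])      # values f(x) on the support

mean = np.sum(p * f)               # E_k[f(x)]
lhs = np.sum(p * f**2) - mean**2   # E_k[f(x)^2] - E_k[f(x)]^2
rhs = np.sum(p * (f - mean)**2)    # E_k[(f(x) - E_k[f(x)])^2], the variance
```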

71 | Knows What It Knows: A Framework for Self-Aware Learning. In
- Li, Littman, et al.
- 2008
Citation Context ...he model, we construct a policy that performs well on state transitions that are more likely to occur during the execution of the scenario. Although more advanced exploration algorithms, such as KWIK [84], can be used to handle the exploration-exploitation dilemma, we use the ε-greedy approach with optimistic initialization to handle this dilemma [28], saving memory and computation power. 2.2 Bayesian ... |

68 | Bayesian Reasoning and Machine Learning. - Barber - 2011 |

65 | Using collective intelligence to route Internet traffic
- Wolpert, Tumer, et al.
- 1999
Citation Context ...the missions may stem from many different sources, such as the impact of external disturbances on the dynamical model of the robot [17], the rate change and amplitude of package requests in a network [18], or the uncertainty in the target motion model in a cooperative target tracking scenario [19]. Hence planning algorithms require a model of the uncertainty in order to hedge against the stochastic dy... |

65 | Collaborative multiagent reinforcement learning by payoff propagation.
- Kok, Vlassis
- 2006
Citation Context ...scussed in Sections 2.1.2 and 1.4, many existing approximate MDP solvers fail to scale up with the number of agents in multiagent scenarios. The coordination structure decomposition based methods [53, 90] yield the best scalability but require the existence of a fixed coordination graph structure beforehand. The main premise of RCD is the discovery of useful coordination structures through a rando... |

60 | Efficient structure learning in factored-state MDPs,” - Strehl, Diuk, et al. - 2007 |

42 | On the convergence of a class of estimation of distribution algorithm.
- Zhang, Muhlenbein
- 2004
Citation Context ...for a family of cost functions with a specific structure. Pelikan proved finite-time convergence guarantees for these classes of problems. The following proof extends the work of Zhang and Muhlenbein [93] by considering weighted likelihood estimation for discrete probability distributions. Lemma 3.4.1 (Asymptotic Convergence of BOA) Let f : A → [0,1], A ⊂ Z^n, be a positive multivariable discrete ... |

34 | An analysis of temporal difference learning with function approximation
- Tsitsiklis, Roy
- 1997
Citation Context ...rsions of value and policy iteration [39] can be applied to obtain a sub-optimal solution. Linear function approximation [35] methods received particular attention due to their theoretical guarantees [42]. Nonlinear approximators such as radial basis functions [29] and neural networks [43] have also been studied because of their capabilities to approximate a large class of functions. Note however t... |

32 | Hierarchical POMDP controller optimization by likelihood maximization. In
- Toussaint, Charlin, et al.
- 2008
Citation Context ...ed to the hierarchical MDPs, where a collection of actions and state transitions are abstracted into entities that can be executed as macro actions. Both planning with options [62] and learning options [63, 64] have attracted many researchers in recent years; however, automating the discovery of such structures is still an open problem. It should also be noted that we are interested in solving centralized multia... |

29 | Utile coordination: Learning interdependencies among cooperative agents.
- Kok, Jan, et al.
- 2005
Citation Context ...ol to estimate factored MDP structures. Although not studied from a multiagent planning point of view, Doshi [57] applied nonparametric inference techniques to learn the structure of DBNs. Kok and Vlassis [58] proposed an algorithm for learning the structure of coordination graphs greedily based on statistical tests; however, no theoretical analysis was done and the algorithm was shown to be effective f... |

29 | Survey of Collaborative Filtering
- Su, Khoshgoftaar, et al.
- 2009
Citation Context ... estimating multiple models by fusing information across different input sets. One of the biggest applications of CF is the web recommendation systems used by companies such as Amazon and Netflix [88]. These systems frequently estimate which items are likely to be bought by which users based on an analysis of the similarity of item ratings provided by the users. Matrix completion is one of the popula... |

28 | Agents learning about agents: A framework and analysis
- Vidal, Durfee
- 1997
Citation Context ...unication were largely ignored. Homogeneous learning has been investigated by application of evolutionary computation techniques [67]; important results include applications to team-mate modeling [73], competitive learning [74] and credit assessment [75]. Overall, these algorithms demonstrated good scalability properties. However, heterogeneous learning and performance under limited communication have n... |

25 | Pursuit-evasion on trees by robot teams
- Kolling, Carpin
Citation Context ...rating in an uncertain environment with the objective of achieving a common goal or trying to maximize a joint reward. Examples include planetary science missions [5], cooperative pursuit evasion [6], persistent surveillance with multiple Unmanned Aerial Vehicles (UAVs) [7], reconnaissance missions [8], and supply chain event management [9]. Cooperation between agents is a recurring theme in thes... |

25 | Cooperative Multiagent Learning: The State of the Art. Autonomous Agents and Multiagent Systems 11(3): 387–434
- Panait, Luke
Citation Context ... model learning methods. Related Work on Multiagent Learning The fusion of multiagent systems and machine learning methods is an active area of research. A general survey on these methods can be found in [67]. The classical multiagent learning literature focuses on the problem of a group of agents trying to learn a single model, which can be formulated as a parameter/model consensus problem [31]. This problem... |

22 | Scalable learning in stochastic games
- Bowling, Veloso
- 2000
Citation Context ...g algorithms contain parameters that must be tuned to obtain good performance in a specific domain. These parameters can be as simple as a scalar learning rate in a reinforcement learning algorithm [34], and sometimes they can be as complex as a set of arbitrary basis functions for an approximation algorithm [35]. In many applications, these parameters are set by domain experts based on heuristic... |

20 | Conflict resolution for air traffic management: A study in multiagent hybrid systems
- Tomlin, Pappas, Sastry
- 1998
Citation Context ... is a widely acknowledged interdisciplinary engineering problem, drawing the attention of researchers from distinct fields, such as computer science [1], electrical engineering [2], aerospace engineering [3] and operations research [4]. The engineering of multiagent systems for cooperative scenarios can be defined as the development of methodologies for coordinating multiple entities towards achieving a ... |

19 | Adaptive Tile Coding for Value Function Approximation, AI
- Whiteson, Stone
- 2007
Citation Context ...relies on domain expertise. Recently, significant effort has been devoted to automating the basis function selection process based on the observed/simulated data. Examples include adaptive tile coding [44], incremental feature dependency discovery [45], orthogonal matching pursuit [46] and kernel-based approximators [47]. Overall, these approximate techniques were shown to improve the scalability and c... |

17 | Efficient skill learning using abstraction selection
- Konidaris, Barto
- 2009
Citation Context ...ed to the hierarchical MDPs, where a collection of actions and state transitions are abstracted into entities that can be executed as macro actions. Both planning with options [62] and learning options [63, 64] have attracted many researchers in recent years; however, automating the discovery of such structures is still an open problem. It should also be noted that we are interested in solving centralized multia... |

17 | Measure theory and probability
- Adams, M, et al.
- 1996
Citation Context ...t is seen that E_{k+1}[f(x)] − E_k[f(x)] ≥ 0. Hence the sequence E_k[f(x)] in Eq. 3.7 is monotonically increasing with respect to the step k. Since f is bounded on A, by the monotone convergence theorem [95], E_k[f(x)] converges to a limit, E_k[f(x)] → g*. Next, we need to show that g* = f* = f(x), x ∈ A*. We will proceed using proof by contradiction. Assume that g* ≠ f*. Then, since E_k[f(x)] → g* is a... |

14 | Decentralized control of partially observable Markov decision processes.
- Amato, Chowdhary, et al.
- 2013
Citation Context ...del T from the observed state transitions. In many multiagent applications of interest, the agents cannot directly change the states of the other agents; that is, the transition models are independent [23]. The coupling between agents usually enters the system through the reward model. We will assume that the transition dynamics are independent among the agents and the transition model can be factorize... |

14 | Online Discovery of Feature Dependencies.
- Geramifard, Doshi, et al.
- 2011
Citation Context ...nt effort has been applied to automating the basis function selection process based on the observed/simulated data. Examples include adaptive tile coding [44], incremental feature dependency discovery [45], orthogonal matching pursuit [46] and kernel-based approximators [47]. Overall, these approximate techniques were shown to improve the scalability and convergence rate substantially for a wide variet... |

13 | Multiagent reinforcement learning for urban traffic control using coordination graphs.
- Kuyer, Whiteson, et al.
- 2008
Citation Context ...ontrol of manufacturing processes [15]. Similar to the missions with mobile agents, cooperation between agents is critical in these applications. Consider a multiagent traffic flow regulation problem [16], in which each agent controls a traffic light at an intersection and the objective is to maximize the flow rate of the mobile vehicles while trying to minimize the traffic jams and the number of coll... |

12 | Convergence analysis of gradient descent stochastic algorithms
- Shapiro, Wardi
- 1996
Citation Context ...tation since it is fixed. We provide a theoretical proof showing that iFDD-SGD asymptotically converges to a solution with probability one, using existing results from stochastic gradient descent theory [102, 104]. Moreover, we show that if p can be captured through the representation class, iFDD-SGD converges to this point. Throughout the section we assume the standard diminishing step-size parameter for Eq. ... |

10 | Role-based autonomous multirobot exploration
- Hoog, Cameron, et al.
- 2009
Citation Context ...involve teams of mobile robots operating in an uncertain environment with the objective of achieving a common goal or trying to maximize a joint reward. Examples include planetary science missions [5], cooperative pursuit evasion [6], persistent surveillance with multiple Unmanned Aerial Vehicles (UAVs) [7], reconnaissance missions [8], and supply chain event management [9]. Cooperation between a... |

10 | Infinite dynamic Bayesian networks.
- Doshi, Wingate, et al.
- 2011
Citation Context ...structure of factored MDPs, and [56] showed that Dynamical Credal Networks can be used as a tool to estimate factored MDP structures. Although not studied from a multiagent planning point of view, Doshi [57] applied nonparametric inference techniques to learn the structure of DBNs. Kok and Vlassis [58] proposed an algorithm for learning the structure of coordination graphs greedily based on statistical ... |

9 | Greedy algorithms for sparse reinforcement learning
- Painter-wakefield, Parr
- 2012
(Show Context)
Citation Context ...tomating the basis function selection process based on the observed/simulated data. Examples include adaptive tile coding [44], incremental feature dependency discovery [45], orthogonal matching pursuit =-=[46]-=-, and kernel-based approximators [47]. Overall, these approximate techniques were shown to improve the scalability and convergence rate substantially for a wide variety of MDPs and relaxed the constrai...

8 | Transfer learning in collaborative filtering with uncertain ratings
- Pan, Xiang, et al.
- 2012
(Show Context)
Citation Context ...m in order to accelerate the learning in the new problem. Transfer learning is an active area of research and has found many applications, ranging from multi-task learning [78] and collaborative filtering =-=[79]-=- to reinforcement learning [80]. Although transfer learning algorithms are not developed specifically for solving heterogeneous multiagent learning problems, they can be used as a tool to study simila...

6 | An autonomous multiagent approach to supply chain event management
- LA, Salomone, et al.
- 2012
(Show Context)
Citation Context ...etary science missions [5], cooperative pursuit evasion [6], persistent surveillance with multiple Unmanned Aerial Vehicles (UAVs) [7], reconnaissance missions [8], and supply chain event management =-=[9]-=-. Cooperation between agents is a recurring theme in these missions. For instance, in a multi-robot exploration mission, individual agents should coordinate their actions based on the actions of other...

6 | Dynamic mission planning for communication control in multiple unmanned aircraft teams.
- Kopeikin, Ponda, et al.
- 2013
(Show Context)
Citation Context ...nt broadcasts their estimated models by using a communication function c_i (Eq. 1.3), where the output of c_i might be saturated due to the constraints on the throughput of the communication network =-=[24]-=-. For a variety of different multiagent communication architectures also see [23]. The objective of the learning problem is to design the estimation laws ei and the communication laws ci such that the...

6 | A hyperparameter consensus method for agreement under uncertainty
- Fraser, Bertuccelli, et al.
- 2012
(Show Context)
Citation Context ...sensus problem [31]. This problem has been investigated extensively in different settings, and efficient algorithms in the Bayesian framework for both perfect and limited communication have been developed =-=[68, 69]-=-. However, extensions to the heterogeneous learning setting have not been much investigated. Game Theory is a popular framework to study multiagent systems [70], flexibility of the game theoretic mode...

6 | Bayesian Nonparametric Inverse Reinforcement Learning
- Michini, How
- 2012
(Show Context)
Citation Context ...ter 1, in many scenarios these models are not available. This thesis assumes that the reward model is available or can be estimated via another algorithm (such as using inverse reinforcement learning =-=[101]-=-) and focuses on estimation/learning of transition models using multiple measurements/observations obtained by the agents. Since the exact representation of the transition model is intractable for lar... |

5 | Robust Decision-Making with Model Uncertainty in Aerospace Systems
- Bertuccelli
- 2008
(Show Context)
Citation Context ...r, in many real world applications, such models may not be available a priori and a mismatch between the planning model and the actual dynamics of the environment might lead to catastrophic performance =-=[20]-=-. Hence, models need to be learned/updated during the mission to mitigate the effects of the model mismatch. Thus in real world applications, the planning problem cannot be separated from the model lea...

5 | Dynamic programming and stochastic control processes - Bellman - 1958

4 | Bio-inspired multi-agent systems for reconfigurable manufacturing systems
- Leitão, Barbosa, et al.
(Show Context)
Citation Context ...ctricity distribution management [13] and meeting scheduling [14]. There are also applications in which the mobile and stationary agents must work together, such as control of manufacturing processes =-=[15]-=-. Similar to the missions with mobile agents, cooperation between agents is critical in these applications. Consider a multiagent traffic flow regulation problem [16], in which each agent controls a t... |

4 | Distributed Bayesian learning in multiagent systems: Improving our understanding of its capabilities and limitations
- Djuric, Wang
- 2012
(Show Context)
Citation Context ... to estimate the same parameters. The bulk of the literature in multiagent machine learning examines the problem of multiple agents with homogeneous transition dynamics trying to infer a single model =-=[31]-=-. However, many real world applications require the agents to learn different or agent-dependent models, such as the impact of external disturbances on the agent health [32] or modelling of the arriva... |

4 | Neuro-dynamic programming: an overview
- Bertsekas, Tsitsiklis
- 1995
(Show Context)
Citation Context .... Linear function approximation [35] methods received particular attention due to their theoretical guarantees [42]. Nonlinear approximators such as radial basis functions [29] and neural networks =-=[43]-=- have also been studied because of their capabilities to approximate a large class of functions. Note however that the ADP methods mentioned so far require the designer to hand-code the set of approximati...

4 | Scaling up approximate value iteration with options: Better policies with fewer iterations
- Mann, Mannor
- 2014
(Show Context)
Citation Context ...work [61] is closely related to the hierarchical MDPs, where a collection of actions and state transitions are abstracted to entities that can be executed as macro actions. Both planning with options =-=[62]-=- and learning options [63, 64] have attracted many researchers in recent years; however, automating the discovery of such structures is still an open problem. It should also be noted that we are interested...

4 | Adaptive Planning for Markov Decision Processes with Uncertain Transition Models via Incremental Feature Dependency Discovery
- Ure, Geramifard, et al.
- 2012
(Show Context)
Citation Context ...ith the noise at iteration k, and Δ(s_{i,k}, t_{i,k}) is the estimation error at step k, Δ(s_{i,k}, t_{i,k}) = t_{i,k} - θ_i^T φ(s_{i,k}) (Eq. 2.9). It can be shown that under mild assumptions on α_k and the richness of the obtained samples =-=[87]-=-, T̂_i → Π T_i as k → ∞, where Π is the projection operator that projects T_i onto the subspace spanned by φ_i. 2.3.3 Incremental Feature Dependency Discovery (iFDD) Stochastic Gradient Descent algorithms...

3 | Delivering the smart grid: Challenges for autonomous agents and multi-agent systems research,”
- Rogers, Ramchurn, et al.
- 2012
(Show Context)
Citation Context ...st execute actions that regulate the environment dynamics. Examples include distributed vehicle monitoring [10], air traffic control [11], network management [12], electricity distribution management =-=[13]-=- and meeting scheduling [14]. There are also applications in which the mobile and stationary agents must work together, such as control of manufacturing processes [15]. Similar to the missions with mo... |

3 | Practical Reinforcement Learning Using Representation Learning and Safe Exploration for Large Scale Markov Decision Processes
- Geramifard
- 2012
(Show Context)
Citation Context ...ameters can be as simple as a scalar learning rate in a reinforcement learning algorithm [34], and sometimes they can be as complex as a set of arbitrary basis functions for an approximation algorithm =-=[35]-=-. In many applications, these parameters are set by domain experts based on heuristics and domain knowledge obtained from past experiences. However, many algorithms contain a large number of param...

3 | Chi-square tests driven method for learning the structure of factored mdps. arXiv preprint arXiv:1206.6842
- Degris, Sigaud, et al.
- 2012
(Show Context)
Citation Context ...tructure of the DBN and the coordination graph is known to the designer. Recent research has focused on estimating the structure of factored MDPs, which is still an active area of research [54]. Ref. =-=[55]-=- demonstrated that statistical tests can be applied to learn the structure of factored MDPs and [56] showed that Dynamical Credal Networks can be used as a tool to estimate factored MDP structures. Althou...

3 | Quickest time herding and detection for optimal social learning. arXiv preprint arXiv:1003.4972
- Krishnamurthy
- 2010
(Show Context)
Citation Context ... multiagent systems [70], the flexibility of the game-theoretic modeling framework has enabled a thorough analysis of both homogeneous and heterogeneous team learning in cooperative and competitive scenarios =-=[71, 72]-=-. However, studies have mainly focused on theoretical aspects, such as convergence to Nash equilibria, and practical aspects such as scalability and the ability to work under limited communication were...

3 | Batch iFDD: A scalable matching pursuit algorithm for solving MDPs
- Geramifard, Walsh, et al.
- 2013
(Show Context)
Citation Context ...reads into a new location. The size of the planning space for this problem is approximately 10^42 state-action pairs. 3.5.2 Compared Approaches The following approaches were compared with RCD: * iFDD+ =-=[99]-=-: iFDD+ applies linear function approximation directly to the value function. The method is adaptive in the sense that it starts with a fixed number of binary basis functions and grows the representat...

2 | Autonomous decentralized surveillance system and continuous target tracking technology for air traffic control applications
- Koga, Lu
- 2013
(Show Context)
Citation Context ...ents operating in a highly dynamic environment, where the agents must execute actions that regulate the environment dynamics. Examples include distributed vehicle monitoring [10], air traffic control =-=[11]-=-, network management [12], electricity distribution management [13] and meeting scheduling [14]. There are also applications in which the mobile and stationary agents must work together, such as contr... |

2 | Agent and multi-agent applications to support distributed communities of practice: a short review
- Sato, Azevedo, et al.
- 2012
(Show Context)
Citation Context ...late the environment dynamics. Examples include distributed vehicle monitoring [10], air traffic control [11], network management [12], electricity distribution management [13] and meeting scheduling =-=[14]-=-. There are also applications in which the mobile and stationary agents must work together, such as control of manufacturing processes [15]. Similar to the missions with mobile agents, cooperation bet... |

2 | Planning under uncertainty using nonparametric Bayesian models
- Campbell, Ponda, et al.
- 2012
(Show Context)
Citation Context ... on the dynamical model of the robot [17], the rate change and amplitude of package requests in a network [18], or the uncertainty in the target motion model in a cooperative target tracking scenario =-=[19]-=-. Hence planning algorithms require a model of the uncertainty in order to hedge against the stochastic dynamics of the environment. However, in many real world applications, such models may not be av... |

2 | Efficient solutions to factored MDPs with imprecise transition probabilities
- Delgado, Sanner, de Barros
- 2011
(Show Context)
Citation Context ...used on estimating the structure of factored MDPs, which is still an active area of research [54]. Ref. [55] demonstrated that statistical tests can be applied to learn the structure of factored MDPs and =-=[56]-=- showed that Dynamical Credal Networks can be used as a tool to estimate factored MDP structures. Although not studied from a multiagent planning point of view, Doshi [57] applied nonparametric inferenc...

2 | Genetic programming produced competitive soccer softbot teams for RoboCup97
- Luke, et al.
- 1998
(Show Context)
Citation Context ...g have been investigated by application of evolutionary computation techniques [67]; important results include applications to team-mate modeling [73], competitive learning [74] and credit assessment =-=[75]-=-. Overall, these algorithms demonstrated good scalability properties. However, heterogeneous learning and performance under limited communication have not been much investigated. Multiagent Reinforcement le...

2 | A tutorial on linear function approximators for dynamic programming and reinforcement learning
- Geramifard, Walsh, et al.
- 2013
(Show Context)
Citation Context ...very state-action pair is visited infinitely often, which can be ensured theoretically by picking an appropriate exploration schedule ε(k) [29]. The iteration complexity of the algorithm is O(|S||A| n_traj n_mc) =-=[83]-=-. The main advantage of using TBPI is to avoid sweeping through all the state-action pairs. By applying a Monte-Carlo Bellman update to only the trajectories sampled from the model, we construct a pol...
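The advantage described in this snippet, replacing full state-action sweeps with Monte-Carlo Bellman updates along sampled trajectories, can be illustrated with a small sketch. The toy chain MDP, the exploration rate, and all function names below are illustrative assumptions, not the thesis's TBPI implementation:

```python
import random

def trajectory_q_estimates(step, policy, states, actions, gamma,
                           n_traj=200, horizon=20, eps=0.1, seed=0):
    """Every-visit Monte-Carlo Q estimates from sampled trajectories.

    Only (state, action) pairs that actually occur in a trajectory are
    updated; nothing sweeps over the full state-action space.
    """
    rng = random.Random(seed)
    totals, counts = {}, {}
    for _ in range(n_traj):
        s = rng.choice(states)
        episode = []
        for _ in range(horizon):
            # epsilon-greedy exploration around the current policy
            a = rng.choice(actions) if rng.random() < eps else policy[s]
            s2, r = step(s, a, rng)
            episode.append((s, a, r))
            s = s2
        g = 0.0
        for s, a, r in reversed(episode):  # backward discounted return
            g = r + gamma * g
            totals[(s, a)] = totals.get((s, a), 0.0) + g
            counts[(s, a)] = counts.get((s, a), 0) + 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

# Toy two-state chain: action "R" moves right and pays off in state 1.
states, actions = [0, 1], ["L", "R"]

def step(s, a, rng):  # rng unused: this toy chain is deterministic
    s2 = min(s + 1, 1) if a == "R" else max(s - 1, 0)
    return s2, (1.0 if s2 == 1 else 0.0)

q = trajectory_q_estimates(step, {0: "L", 1: "L"}, states, actions, gamma=0.9)
# Greedy improvement uses only the Q estimates the trajectories produced.
improved = {s: max(actions, key=lambda a: q.get((s, a), float("-inf")))
            for s in states}
```

Even with an initial all-"L" policy, the exploration term visits enough (state, action) pairs for the greedy improvement step to switch state 0 to "R".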

2 | Convergence of indirect adaptive asynchronous value iteration algorithms
- Kushner, Yin
- 2003
(Show Context)
Citation Context ... Δp_k(s) = p(s) - φ(s)^T θ_k. Then, the final form of the parameter update law is θ_{k+1} = θ_k + α_k φ(s_k) Δp_k(s_k) (Eq. 4.3). Eq. 4.3 is a variant of the well studied stochastic gradient descent (SGD) algorithm =-=[102]-=-. Since the structure of p is not known beforehand, the quality of the resulting approximation found by SGD depends strongly on the selected set of features. Our methodology can be extended to state-acti...
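The update law quoted here is the standard least-mean-squares form of SGD for a linear model p(s) ≈ φ(s)^T θ. A minimal sketch, assuming a hypothetical three-feature map and synthetic targets; the diminishing step size 1/(k+1) matches the usual Robbins-Monro assumption mentioned in the convergence analysis:

```python
# SGD sketch for learning the weights theta of a linear model
# p(s) ~ phi(s)^T theta, following the quoted update law (Eq. 4.3).
# The feature map and targets below are illustrative placeholders.

def phi(s):
    """Binary feature vector for state s (hypothetical 3-feature map)."""
    return [1.0, float(s > 0), float(s % 2 == 0)]

def sgd_update(theta, s, p_sample, alpha):
    """One SGD step: theta <- theta + alpha * delta * phi(s)."""
    feats = phi(s)
    pred = sum(t * f for t, f in zip(theta, feats))
    delta = p_sample - pred            # estimation error Delta_k(s)
    return [t + alpha * delta * f for t, f in zip(theta, feats)]

theta = [0.0, 0.0, 0.0]
# Diminishing step size alpha_k = 1/(k+1) (Robbins-Monro conditions).
samples = [(2, 1.0), (1, 0.5), (4, 1.0)] * 100
for k, (s, p_sample) in enumerate(samples):
    theta = sgd_update(theta, s, p_sample, 1.0 / (k + 1))
```

After the 300 sweeps above, the linear model reproduces the (consistent) synthetic targets to within a small residual.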

2 | Constructive function approximation, in Feature Extraction, Construction, and Selection: A Data Mining Perspective
- Utgoff, Precup
- 1998
(Show Context)
Citation Context ...sampled estimation error Δp(s) over active candidate features exceeds some pre-determined threshold, these conjunctions are added to the set of features. The evaluation function learner (ELF) algorithm =-=[103]-=- expands the representation in a manner akin to the iFDD approach we use, yet its candidate features are selected from a limited set of heuristically chosen features. 4.1.4 Convergence Analysis This Section investigat...
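The expansion rule this snippet describes, accumulating the sampled error attributed to each candidate conjunction of active features and promoting a conjunction once its total crosses a threshold, can be sketched as follows. The class name, the pairwise-only candidates, and the threshold value are illustrative assumptions, not the thesis's iFDD code:

```python
# Sketch of iFDD-style feature discovery: accumulate the absolute
# estimation error observed for each candidate conjunction of active
# base features, and promote a conjunction to a real feature once its
# accumulated error exceeds a threshold xi.
from itertools import combinations

class FeatureDiscoverer:
    def __init__(self, xi):
        self.xi = xi              # pre-determined threshold
        self.relevance = {}       # candidate pair -> accumulated |error|
        self.features = set()     # discovered conjunctions

    def observe(self, active, error):
        """active: set of active base-feature indices for one sample."""
        for pair in combinations(sorted(active), 2):
            if pair in self.features:
                continue          # already promoted; stop tracking
            self.relevance[pair] = self.relevance.get(pair, 0.0) + abs(error)
            if self.relevance[pair] > self.xi:
                self.features.add(pair)

disc = FeatureDiscoverer(xi=1.0)
for _ in range(3):
    disc.observe({0, 2}, error=0.4)   # recurring error: promoted on 3rd hit
disc.observe({1, 3}, error=0.4)       # seen once: remains a candidate
```

Only pairwise conjunctions are tracked here for brevity; the same bookkeeping extends to higher-order conjunctions of already-discovered features.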

1 | Experimental demonstration of efficient multi-agent learning and planning for persistent missions in uncertain environments
- Ure, Chowdhary, et al.
- 2012
(Show Context)
Citation Context ... goal or trying to maximize a joint reward. Examples include planetary science missions [5], cooperative pursuit evasion [6], persistent surveillance with multiple Unmanned Aerial Vehicles (UAVs) =-=[7]-=-, reconnaissance missions [8], and supply chain event management [9]. Cooperation between agents is a recurring theme in these missions. For instance, in a multi-robot exploration mission, individual ...

1 | Human-robot collaborative teleoperation system for semi-autonomous reconnaissance robot - Tang, Cao, et al. - 2009

1 | Distributed and adaptive traffic signal control within a realistic traffic simulation
- McKenney, White
(Show Context)
Citation Context ...ions involve stationary agents operating in a highly dynamic environment, where the agents must execute actions that regulate the environment dynamics. Examples include distributed vehicle monitoring =-=[10]-=-, air traffic control [11], network management [12], electricity distribution management [13] and meeting scheduling [14]. There are also applications in which the mobile and stationary agents must wo... |

1 | Development of an autonomous distributed control system for optical packet and circuit integrated networks
- Miyazawa, Furukawa, Harai
- 2012
(Show Context)
Citation Context ...y dynamic environment, where the agents must execute actions that regulate the environment dynamics. Examples include distributed vehicle monitoring [10], air traffic control [11], network management =-=[12]-=-, electricity distribution management [13] and meeting scheduling [14]. There are also applications in which the mobile and stationary agents must work together, such as control of manufacturing proce... |

1 | Health-aware decentralized planning and learning for large-scale multiagent missions
- Ure, Chowdhary, et al.
- 2013
(Show Context)
Citation Context ...ng to infer a single model [31]. However, many real world applications require the agents to learn different or agent-dependent models, such as the impact of external disturbances on the agent health =-=[32]-=- or modelling of the arrival rate of packages in a specific part of the network [33]. In such problems, the learning process might be accelerated substantially by sharing information/models among agen... |

1 | The MAXQ method for hierarchical reinforcement learning
- Dietterich
- 1998
(Show Context)
Citation Context ...s was done and the algorithm was shown to be effective for only a small number of moderately sized problems. Another popular MDP structure decomposition technique is inducing hierarchies. The work in =-=[59]-=- shows how a large-scale MDP can be decomposed into a hierarchy of sub-MDPs, where sub-MDPs at the bottom of the hierarchy can be solved by neglecting states of MDPs at the top. Several works extended ...

1 | Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
- Sutton, Precup, Singh
- 1999
(Show Context)
Citation Context ...Ps at the top. Several works extended this formulation [60] and developed efficient algorithms that can solve hierarchical MDPs much faster than the original flat MDP. The options framework =-=[61]-=- is closely related to the hierarchical MDPs, where a collection of actions and state transitions are abstracted to entities that can be executed as macro actions. Both planning with options [62] and ...

1 | Optimal and approximate Q-value functions for decentralized POMDPs
- Oliehoek, Spaan, Vlassis
- 2008
(Show Context)
Citation Context ... open problem. It should also be noted that we are interested in solving centralized multiagent planning problems in this thesis. There is a large body of literature on decentralized multiagent planning (see =-=[23, 65, 66]-=-). These problems come with additional challenges, such as partial observability of the agent states and enforcing policy consensus among agents. Although decentralized planning is a very interest...

1 | High-dimensional graphical model selection using l1-regularized logistic regression
- Wainwright, Ravikumar, Lafferty
- 2006
(Show Context)
Citation Context ...ished algorithms for learning the structure of Bernoulli graphical models, like the models displayed in Fig. 3-3. We use the l1-regularized logistic regression algorithm developed by Wainwright et al. =-=[92]-=- to learn the structure of the Markov network P(e_ij; θ). This algorithm converts the structure learning problem to a convex optimization problem, and the regularization parameter can be tuned to obtain...
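Wainwright et al.'s approach recovers each node's neighborhood by fitting an l1-penalized logistic regression of that node on the remaining ones; nonzero coefficients indicate edges. The sketch below is a toy ISTA-style proximal-gradient solver on synthetic ±1 data, not the paper's actual optimizer; the data, step size, and regularization weight are illustrative assumptions:

```python
import math

def l1_logistic(X, y, lam, lr=0.1, iters=2000):
    """Fit w for P(y=1|x) = sigmoid(w.x) with an l1 penalty of weight
    lam, using proximal (soft-thresholded) gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        for j in range(d):
            wj = w[j] - lr * grad[j]
            # soft-threshold: drives irrelevant coefficients exactly to 0
            w[j] = math.copysign(max(abs(wj) - lr * lam, 0.0), wj)
    return w

# The target node copies feature 0 and ignores feature 1, so the l1
# penalty should leave only the first coefficient nonzero (one edge).
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]] * 5
y = [1 if x[0] > 0 else 0 for x in X]
w = l1_logistic(X, y, lam=0.05)
neighbors = [j for j, wj in enumerate(w) if abs(wj) > 1e-6]
```

Running one such regression per node and taking the union (or intersection) of the recovered neighborhoods yields the full graph estimate, with the regularization weight controlling sparsity.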

1 | Heterogeneous Multiagent Learning with Applications to Forest Fire Management
- Ure, Omidshafiei, et al.
(Show Context)
Citation Context ...ehicle spends in the network. Hence traffic jams lead to diminished rewards. The size of the planning space for this problem is approximately 10^22 state-action pairs. * Forest Fire Management =-=[97]-=-: This mission involves a group of UAVs managing a forest fire. The stochastic forest fire spread model is from Boychuk [98]. The fire spread dynamics are affected by a number of factors, such as the ...

1 | A stochastic forest fire growth model
- Boychuk, Kulperger, et al.
- 2009
(Show Context)
Citation Context ...oblem is approximately 1022 state-action pairs. 57 * Forest Fire Management [97]: This mission involves a group of UAVs managing a forest fire. The stochastic forest fire spread model is from Boychuk =-=[98]-=-. The fire spread dynamics are affected by a number of factors, such as the wind direction, fuel left in the location and the vegetation. A total of 16 UAVs are present in the mission. Actions are tra... |

1 | Multi-robot decision making using coordination graphs
- Kok, Spaan, Vlassis
- 2003
(Show Context)
Citation Context ...he basis functions from the initial set. This algorithm was included in the comparisons because it is the state-of-the-art adaptive function approximation technique in approximate DP/RL. * Fixed CG =-=[100]-=-: In order to emphasize the value of automating the coordination graph search, an approach that involves a fixed Coordination Graph is also included in the results. We use intuition/domain knowledge t...