Abounadi, J., Bertsekas, D., & Borkar, V. (1998). Learning Algorithms For Markov Decision Processes with average cost. Technical Report, LIDS-P-2434, MIT, MA.
....concentrate on concepts. Reinforcement Learning (RL) has emerged in the last decade as a unifying discipline for learning and adaptive control. Comprehensive overviews may be found in [2, 7]. RL for average reward Markov Decision Processes (MDPs) was suggested in [13, 10] and later analyzed in [1]. Several methods exist for average reward RL, including Q-learning [1], the E^3 algorithm [8], actor-critic schemes [2], and more. The paper is organized as follows: In Section 2 we describe the stochastic game setup, recall approachability theory, and mention a key theorem that allows to ....
....the last decade as a unifying discipline for learning and adaptive control. Comprehensive overviews may be found in [2, 7]. RL for average reward Markov Decision Processes (MDPs) was suggested in [13, 10] and later analyzed in [1]. Several methods exist for average reward RL, including Q-learning [1], the E^3 algorithm [8], actor-critic schemes [2], and more. The paper is organized as follows: In Section 2 we describe the stochastic game setup, recall approachability theory, and mention a key theorem that allows to consider only a finite number of directions for approaching a set. Section 3 ....
[Article contains additional citation context not shown here]
J. Abounadi, D. Bertsekas, and V. Borkar. Learning algorithms for Markov decision processes with average cost. LIDS-P-2434, Lab. for Info. and Decision Systems, MIT, October 1998.
....concentrate on concepts. Reinforcement Learning (RL) has emerged in the last decade as a unifying discipline for learning and adaptive control. Comprehensive overviews may be found in [2, 7]. RL for average reward Markov Decision Processes (MDPs) was suggested in [13, 10] and later analyzed in [1]. Several methods exist for average reward RL, including Q-learning [1], the E^3 algorithm [8], actor-critic schemes [2], and more. The paper is organized as follows: In Section 2 we describe the stochastic game setup, recall approachability theory, and mention a key theorem that allows to ....
....the last decade as a unifying discipline for learning and adaptive control. Comprehensive overviews may be found in [2, 7]. RL for average reward Markov Decision Processes (MDPs) was suggested in [13, 10] and later analyzed in [1]. Several methods exist for average reward RL, including Q-learning [1], the E^3 algorithm [8], actor-critic schemes [2], and more. The paper is organized as follows: In Section 2 we describe the stochastic game setup, recall approachability theory, and mention a key theorem that allows to consider only a finite number of directions for approaching a set. Section 3 ....
[Article contains additional citation context not shown here]
J. Abounadi, D. Bertsekas and V. S. Borkar, "Learning Algorithms for Markov Decision Processes with Average Cost," SIAM J. Control and Optim., to appear.
.... known if one can compute V ( This can be done by standard methods such as value iteration, policy iteration or linear programming described in [14]. In case the transition probabilities are not known, one can employ simulation-based approximate methods based on reinforcement learning, see, e.g., [1], [11]. This is an important situation, because transition probabilities depend not only on the arrival process and the service rates, but also on the disutility functions, which may not be known even approximately. In the next section we present an on-line algorithm for solving the dynamic ....
J. Abounadi, D. Bertsekas and V. S. Borkar, "Learning Algorithms for Markov Decision Processes with Average Cost," SIAM J. Control and Optim., to appear.
.... if one can compute V ( This can be done by standard methods such as value iteration, policy iteration or linear programming described in [14]. In case the transition probabilities are not known, one can employ simulation-based approximate methods based on reinforcement learning, see, e.g., [1], [11]. This is an important situation, because transition probabilities depend not only on the arrival process and the service rates, but also on the disutility functions, which may not be known even approximately. In the next section we present an on-line algorithm for solving the dynamic ....
Abounadi J., Bertsekas D., Borkar V. S., Learning algorithms for Markov decision processes with average cost, SIAM J. Control and Optim., to appear.
....assumption for stochastic shortest path Q-learning, where the cost per stage may be negative. The methodology developed in this paper provides also an essential foundation for a convergence analysis of Q-learning algorithms for average cost dynamic programming problems, given in a companion paper (Abounadi, Bertsekas and Borkar [1998]). The general framework that we propose applies to synchronous and asynchronous variants of algorithms of the form $x_{k+1} = x_k + \gamma(k)\,(F(x_k, \xi_k) - x_k)$. (1) Here $x_k$ is a sequence in $\mathbb{R}^n$, $\xi_k$ is a stochastic noise sequence, F is, for each fixed $\xi$, non-expansive ....
....s = $\max_{i=1,\dots,n} x_i - \min_{i=1,\dots,n} x_i$, where $x_1, \dots, x_n$ are the components of x. In this case, however, a weaker boundedness result is obtained, which is the subject of the following lemma. This lemma is used crucially in our companion paper on Q-learning in average cost control (Abounadi, Bertsekas, and Borkar [1998]). Lemma 2.2: Let B be an open and bounded subset of $\mathbb{R}^n$ containing the origin, and let C be a subset of $\mathbb{R}^n$ that contains B. Consider the algorithm $x_{k+1} = G_k(x_k, \xi_k)$ (6) where we assume the following: 1. $\xi_k$ is a random process defined over a probability space ....
[Article contains additional citation context not shown here]
J. Abounadi, D. P. Bertsekas, V. S. Borkar, 1998. "Learning Algorithms for Markov Decision Processes with Average Cost," preprint, submitted for publication.
....assumption for stochastic shortest path Q-learning, where the cost per stage may be negative. The methodology developed in this paper provides also an essential foundation for a convergence analysis of Q-learning algorithms for average cost dynamic programming problems, given in a companion paper (Abounadi, Bertsekas and Borkar [1998]). The general framework that we propose applies to synchronous and asynchronous variants of algorithms of the form $x_{k+1} = x_k + \gamma(k)\,(F(x_k, \xi_k) - x_k)$. (1) Here $x_k$ is a sequence in $\mathbb{R}^n$, $\xi_k$ is a stochastic noise sequence, F is, for each fixed $\xi$, non-expansive ....
....= $\max_{i=1,\dots,n} x_i - \min_{i=1,\dots,n} x_i$, where $x_1, \dots, x_n$ are the components of x. In this case, however, a weaker boundedness result is obtained, which is the subject of the following lemma. This lemma is used crucially in our companion paper on Q-learning in average cost control (Abounadi, Bertsekas, and Borkar [1998]). Lemma 2.2: Let B be an open and bounded subset of $\mathbb{R}^n$ containing the origin, and let C be a subset of $\mathbb{R}^n$ that contains B. Consider the algorithm $x_{k+1} = G_k(x_k, \xi_k)$ (6) where we assume the following: 1. $\xi_k$ is a random process defined over a probability space ....
[Article contains additional citation context not shown here]
J. Abounadi, D. P. Bertsekas, V. S. Borkar, 1998. "Learning Algorithms for Markov Decision Processes with Average Cost," preprint, submitted for publication.
.... relative value iteration where a common scalar offset is subtracted from all components of the iterates at each iteration (likewise for the Q-value iteration). The choice of this offset term is not unique. We shall be considering one particular choice, though others can be handled similarly (see [1]). 3.2 Q-learning. If the matrix Q defined in (15) can be computed via value iteration or some other scheme, then the optimal control is found through a simple minimization. If transition probabilities are unknown so that value iteration is not directly applicable, one may apply a stochastic ....
....via value iteration or some other scheme, then the optimal control is found through a simple minimization. If transition probabilities are unknown so that value iteration is not directly applicable, one may apply a stochastic approximation variant known as the Q-learning algorithm of Watkins [1, 20, 21]. This is defined through the recursion $Q_{n+1}(i,a) = Q_n(i,a) + a(n)\Big[\beta \min_b Q_n(\Psi_{n+1}(i,a), b) + c(i,a) - Q_n(i,a)\Big], \quad i \in S,\ a \in A,$ where $\Psi_{n+1}(i,a)$ is an independently simulated S-valued random variable with law $p(i, \cdot, a)$. Making the appropriate ....
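The recursion quoted in this context can be illustrated with a minimal Python sketch. This is not code from the cited or citing papers. It assumes a small tabular model in which `P[i][a]` lists the law p(i, ., a), `c[i][a]` is the cost, `beta` is the discount factor, the step size is a(n) = 1/(n+1), and every pair (i, a) is updated at each iteration. The names `q_learning`, `q_value_iteration`, `P_EXAMPLE`, and `C_EXAMPLE` are hypothetical, and the two-state example data is invented purely for illustration.

```python
import random

def q_learning(P, c, beta, n_iters, seed=0):
    """Synchronous tabular Q-learning with simulated next states (illustrative only)."""
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    states = list(range(n_states))
    Q = [[0.0] * n_actions for _ in states]
    for n in range(n_iters):
        step = 1.0 / (n + 1)  # assumed step size a(n)
        Q_next = [row[:] for row in Q]
        for i in states:
            for a in range(n_actions):
                # Psi_{n+1}(i, a): next state sampled independently from p(i, ., a)
                psi = rng.choices(states, weights=P[i][a])[0]
                target = beta * min(Q[psi]) + c[i][a]
                Q_next[i][a] = Q[i][a] + step * (target - Q[i][a])
        Q = Q_next
    return Q

def q_value_iteration(P, c, beta, n_iters=200):
    """Model-based Q-value iteration, usable only when P is known."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        V = [min(row) for row in Q]  # minimization because c is a cost
        Q = [[c[i][a] + beta * sum(p * v for p, v in zip(P[i][a], V))
              for a in range(n_actions)]
             for i in range(n_states)]
    return Q

# Invented two-state, two-action example (not taken from any cited paper).
P_EXAMPLE = [[[0.9, 0.1], [0.2, 0.8]],
             [[0.5, 0.5], [0.1, 0.9]]]
C_EXAMPLE = [[1.0, 2.0],
             [0.5, 1.5]]
```

Comparing `q_learning(P_EXAMPLE, C_EXAMPLE, 0.5, 20000)` with `q_value_iteration(P_EXAMPLE, C_EXAMPLE, 0.5)` gives a rough check that the simulation-based iterates approach the model-based fixed point. The average cost variants discussed in the cited report modify this recursion and are not shown here.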
[Article contains additional citation context not shown here]
ABOUNADI, J., BERTSEKAS, D., BORKAR, V.S., Learning algorithms for Markov decision processes with average cost, Lab. for Info. and Decision Systems, M.I.T., 1996, (Draft report).
No context found.
J. Abounadi, D. Bertsekas, and V. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM J. Control Optim., 40:681-698, 2001.