Indexability of Restless Bandit Problems and Optimality . . .
Cited by 63 (15 self)
We consider a class of restless multi-armed bandit problems (RMBP) that arises in dynamic multichannel access, user/server scheduling, and optimal activation in multi-agent systems. For this class of RMBP, we establish indexability and obtain Whittle’s index in closed form for both discounted and average reward criteria. These results lead to a direct implementation of Whittle’s index policy with remarkably low complexity. When arms are stochastically identical, we show that Whittle’s index policy is optimal under certain conditions. Furthermore, it has a semi-universal structure that obviates the need to know the Markov transition probabilities. The optimality and the semi-universal structure result from the equivalence between Whittle’s index policy and the myopic policy established in this work. For nonidentical arms, we develop efficient algorithms for computing a performance upper bound given by Lagrangian relaxation. The tightness of the upper bound and the near-optimal performance of Whittle’s index policy are illustrated with simulation examples.
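The myopic policy mentioned in this abstract can be illustrated with a minimal sketch. It assumes a two-state (good/bad) Markov channel model with known transition probabilities, which the abstract does not spell out; the names `p01` (bad to good), `p11` (good to good), `propagate`, `myopic_step`, and `observe` are hypothetical. The sketch tracks, for each arm, the belief that it is currently good and plays the arm with the largest belief. It does not show the semi-universal structure, which avoids needing the transition probabilities.

```python
def propagate(belief, p01, p11):
    """One-step belief update for an arm that was NOT observed.

    belief: probability that the arm is currently in the good state.
    p01: assumed probability of moving from bad to good.
    p11: assumed probability of staying in the good state.
    """
    return belief * p11 + (1.0 - belief) * p01


def myopic_step(beliefs, observe, p01, p11):
    """Play the arm with the highest belief, then update all beliefs.

    observe: hypothetical callback returning True if the chosen arm is good.
    Returns the index of the chosen arm and the updated belief list.
    """
    chosen = max(range(len(beliefs)), key=lambda i: beliefs[i])
    is_good = observe(chosen)
    # Unobserved arms evolve according to the Markov chain.
    updated = [propagate(b, p01, p11) for b in beliefs]
    # The observed arm's state is known exactly, so its next-slot belief
    # is simply the corresponding transition probability.
    updated[chosen] = p11 if is_good else p01
    return chosen, updated
```

For example, calling `myopic_step([0.2, 0.7, 0.5], lambda i: True, 0.1, 0.8)` plays arm 1, since it has the largest belief.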
Cognitive Medium Access: Exploration, Exploitation and Competition
, 2007
Cited by 58 (5 self)
This paper establishes the equivalence between cognitive medium access and the competitive multi-armed bandit problem. First, the scenario in which a single cognitive user wishes to opportunistically exploit the availability of empty frequency bands in the spectrum with multiple bands is considered. In this scenario, the availability probability of each channel is unknown to the cognitive user a priori. Hence, efficient medium access strategies must strike a balance between exploring the availability of other free channels and exploiting the opportunities identified thus far. By adopting a Bayesian approach for this classical bandit problem, the optimal medium access strategy is derived and its underlying recursive structure is illustrated via examples. To avoid the prohibitive computational complexity of the optimal strategy, a low-complexity, asymptotically optimal strategy is developed. The proposed strategy does not require any prior statistical knowledge about the traffic pattern on the different channels. Next, the multi-cognitive-user scenario is considered and low-complexity medium access protocols, which strike the optimal balance between exploration and exploitation in such competitive environments, are developed. Finally, this formalism is extended to the case in which each cognitive user is capable of sensing and using multiple channels simultaneously.
Learning Multiuser Channel Allocations in Cognitive Radio Networks: A Combinatorial Multi-Armed Bandit Formulation
Cited by 37 (7 self)
We consider the following fundamental problem in the context of channelized dynamic spectrum access. There are M secondary users and N ≥ M orthogonal channels. Each secondary user requires a single channel for operation that does not conflict with the channels assigned to the other users. Due to geographic dispersion, each secondary user can potentially see different primary user occupancy behavior on each channel. Time is divided into discrete decision rounds. The throughput obtainable from spectrum opportunities on each user-channel combination over a decision period is modeled as an arbitrarily distributed random variable with bounded support but unknown mean, i.i.d. over time. The objective is to search for an allocation of channels for all users that maximizes the expected sum throughput. We formulate this problem as a combinatorial multi-armed bandit (MAB), in which each arm corresponds to a matching of the users to channels. Unlike most prior work on multi-armed bandits, this combinatorial formulation results in dependent arms. Moreover, the number of arms grows super-exponentially as the permutation P(N, M). We present a novel matching-learning algorithm with polynomial storage and polynomial computation per decision period for this problem, and prove that it results in a regret (the gap between the expected sum throughput obtained by a genie-aided perfect allocation and that obtained by this algorithm) that is uniformly upper-bounded for all time n by a function that grows as O(M^4 N log n), i.e., polynomial in the number of unknown parameters and logarithmic in time. We also discuss how our results provide a nontrivial generalization of known theoretical results on multi-armed bandits.
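To make the size of the arm set concrete, the following is a minimal sketch that counts the P(N, M) matchings and finds the best one by exhaustive search when the mean throughputs are assumed known. This brute-force search is only an illustration of the genie-aided benchmark, not the polynomial-time learning algorithm the abstract describes; the function names and the `mean_throughput` matrix are hypothetical.

```python
import itertools
import math


def count_matchings(num_channels, num_users):
    """Number of ways to assign distinct channels to users, P(N, M)."""
    return math.perm(num_channels, num_users)


def best_matching(mean_throughput):
    """Exhaustively find the user-to-channel matching with the largest sum.

    mean_throughput[u][c] is the (assumed known) mean throughput of user u
    on channel c. Returns (assignment, total), where assignment[u] is the
    channel given to user u.
    """
    num_users = len(mean_throughput)
    num_channels = len(mean_throughput[0])
    best_assignment, best_total = None, float("-inf")
    # Each permutation of M distinct channels is one arm of the combinatorial bandit.
    for assignment in itertools.permutations(range(num_channels), num_users):
        total = sum(mean_throughput[u][c] for u, c in enumerate(assignment))
        if total > best_total:
            best_assignment, best_total = assignment, total
    return best_assignment, best_total
```

Even for modest sizes the enumeration becomes impractical, which is why a method with polynomial storage and computation matters: `count_matchings(10, 10)` is already 3,628,800.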
Online learning in opportunistic spectrum access: A restless bandit approach
 IEEE INFOCOM, April
, 1998
Cited by 37 (6 self)
We consider an opportunistic spectrum access (OSA) problem where the time-varying condition of each channel (e.g., as a result of random fading or certain primary users’ activities) is modeled as an arbitrary finite-state Markov chain. At each instant of time, a (secondary) user probes a channel and collects a certain reward as a function of the state of the channel (e.g., good channel condition results in higher data rate for the user). Each channel has potentially different state space and statistics, both unknown to the user, who tries to learn which one is the best as it goes and maximizes its usage of the best channel. The objective is to construct a good online learning algorithm so as to minimize the difference between the user’s performance in total rewards and that of using the best channel (on average) had it known which one is the best from a priori knowledge of the channel statistics (also known as the regret). This is a classic exploration and exploitation problem, and results abound when the reward processes are assumed to be i.i.d. Compared to prior work, the biggest difference is that in our case the reward process is assumed to be Markovian, of which i.i.d. is a special case. In addition, the reward processes are restless in that the channel conditions will continue to evolve independent of the user’s actions. This leads to a restless bandit problem, for which, to the best of our knowledge, there are few results on either algorithms or performance bounds in this learning context. In this paper we introduce an algorithm that utilizes regenerative cycles of a Markov chain and computes a sample-mean-based index policy, and show that under mild conditions on the state transition probabilities of the Markov chains this algorithm achieves logarithmic regret uniformly over time, and that this regret bound is also optimal. We numerically examine the performance of this algorithm along with a few other learning algorithms in the case of an OSA problem with Gilbert-Elliot channel models, and discuss how this algorithm may be further improved (in terms of its constant) and how this result may lead to similar bounds for other algorithms.
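The regret notion used in this abstract, and the general shape of a sample-mean-based index, can be sketched as follows. This is a generic illustration under stated assumptions: `ucb_style_index` is the standard sample-mean-plus-exploration-bonus form, not the regenerative-cycle algorithm of the paper, and the `exploration_constant` parameter and both function names are hypothetical.

```python
import math


def empirical_regret(best_mean_reward, rewards):
    """Regret after len(rewards) plays.

    best_mean_reward: average per-play reward of the best channel
    (known only to a genie). rewards: rewards the learner collected.
    """
    return best_mean_reward * len(rewards) - sum(rewards)


def ucb_style_index(sample_mean, plays, t, exploration_constant=2.0):
    """Generic sample-mean index with an exploration bonus.

    sample_mean: average reward observed on this channel so far.
    plays: number of observations behind sample_mean (must be > 0).
    t: current time step (must be >= 1).
    The bonus shrinks as a channel is sampled more often.
    """
    if plays <= 0:
        raise ValueError("plays must be positive")
    return sample_mean + math.sqrt(exploration_constant * math.log(t) / plays)
```

A learner built on such an index would, at each time step, probe the channel whose index is currently largest; logarithmic regret means `empirical_regret` grows no faster than a constant times log t.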
Algorithms for dynamic spectrum access with learning for cognitive radio
 IEEE Transactions on Signal Processing
, 2010
Cited by 32 (2 self)
We study the problem of dynamic spectrum sensing and access in cognitive radio systems as a partially observed Markov decision process (POMDP). A group of cognitive users cooperatively tries to exploit vacancies in some primary (licensed) channels whose occupancies follow a Markovian evolution. We first consider the scenario where the cognitive users have perfect knowledge of the distribution of the signals they receive from the primary users. For this problem, we obtain a greedy channel selection and access policy that maximizes the instantaneous reward, while satisfying a constraint on the probability of interfering with licensed transmissions. We also derive an analytical universal upper bound on the performance of the optimal policy. Through simulation, we show that our scheme achieves good performance relative to the upper bound and substantial improvement relative to an existing scheme. We then consider the more practical scenario where the exact distribution of the signal from the primary is unknown. We assume a parametric model for the distribution and develop an algorithm that can learn the true distribution, still guaranteeing the constraint on the interference probability. We show
Multichannel opportunistic access: a case of restless bandits with multiple plays
 ALLERTON CONFERENCE, OCTOBER 2009, ALLERTON, IL
, 2009
Cited by 29 (1 self)
This paper considers the following stochastic control problem that arises in opportunistic spectrum access: a system consists of n channels where the states (“good” or “bad”) of the channels evolve as independent and identically distributed Markov processes. A user can select exactly k channels to sense and access (based on the sensing result) in each time slot. A reward is obtained whenever the user senses and accesses a “good” channel. The objective is to design a channel selection policy that maximizes the expected discounted total reward accrued over a finite or infinite horizon. In our previous work we established the optimality of a greedy policy for the special case of k = 1 (i.e., single channel access) under the condition that the channel state transitions are positively correlated over time. In this paper we show that, under the same condition, the greedy policy is optimal for the general case of k ≥ 1; the methodology introduced here is thus more general. This problem may be viewed as a special case of the restless bandit problem with multiple plays. We discuss connections between the current problem and existing literature on this class of problems.
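The greedy selection of k channels described here can be sketched in a few lines. The sketch assumes the user maintains, for each channel, a belief that it is currently “good” and earns one unit of reward per good channel accessed; the `beliefs` list and both function names are hypothetical, and the belief update between slots is not shown.

```python
import heapq


def greedy_select(beliefs, k):
    """Return the indices of the k channels with the largest beliefs.

    beliefs[i] is the assumed probability that channel i is good in the
    current slot. Indices are returned in order of decreasing belief.
    """
    return heapq.nlargest(k, range(len(beliefs)), key=lambda i: beliefs[i])


def expected_immediate_reward(beliefs, selected):
    """Expected number of good channels among the selected ones."""
    return sum(beliefs[i] for i in selected)
```

With `beliefs = [0.3, 0.9, 0.5, 0.7]` and `k = 2`, the greedy choice is channels 1 and 3. Setting `k = 1` recovers the single-channel case covered by the earlier result the abstract refers to.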
Zhao, “The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Regret”
 Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP)
, 2011
Cited by 29 (20 self)
In the classic Bayesian restless multi-armed bandit (RMAB) problem, there are N arms, with rewards on all arms evolving at each time as Markov chains with known parameters. A player seeks to activate K ≥ 1 arms at each time in order to maximize the expected total reward obtained over multiple plays. RMAB is a challenging problem that is known to be PSPACE-hard in general. We consider in this work the even harder non-Bayesian RMAB, in which the parameters of the Markov chain are assumed to be unknown a priori. We develop an original approach to this problem that is applicable when the corresponding Bayesian problem has the structure that, depending on the known parameter values, the optimal solution is one of a prescribed finite set of policies. In such settings, we propose to learn the optimal policy for the non-Bayesian RMAB by employing a suitable meta-policy which treats each policy from this finite set as an arm in a different non-Bayesian multi-armed bandit problem for which a single-arm selection policy is optimal. We demonstrate this approach by developing a novel sensing policy for opportunistic spectrum access over unknown dynamic channels. We prove that our policy achieves near-logarithmic regret (the difference in expected reward compared to a model-aware genie), which leads to the same average reward that can be achieved by the optimal policy under a known model. This is the first such result in the literature for a non-Bayesian RMAB. Index Terms: restless bandit, regret, opportunistic spectrum access, learning, non-Bayesian.
Optimal Cognitive Access of Markovian Channels under Tight Collision Constraints
Cited by 21 (7 self)
The problem of cognitive access of channels of primary users by a secondary user is considered. The transmissions of primary users are modeled as independent continuous-time Markovian on-off processes. A secondary cognitive user employs a slotted transmission format, and it senses one of the possible channels before transmission. The objective of the cognitive user is to maximize its throughput subject to collision constraints imposed by the primary users. The optimal access strategy is in general a solution of a constrained partially observable Markov decision process, which involves a constrained optimization in an infinite-dimensional functional space. It is shown in this paper that, when the collision constraints are tight, the optimal access strategy can be implemented by a simple memoryless access policy with periodic channel sensing. Analytical expressions are given for the thresholds on collision probabilities for which memoryless access performs optimally. Extensions to multiple secondary users are also presented. Numerical and theoretical results are presented to validate and extend the analysis for different practical scenarios. Index Terms: cognitive radio, dynamic spectrum allocation, cognitive medium access, Markov decision processes.
Dynamic multichannel access with imperfect channel state detection
 IEEE Trans. Signal Process
, 2010
Cited by 18 (5 self)
A restless multi-armed bandit problem that arises in multichannel opportunistic communications is considered, where channels are modeled as independent and identical Gilbert–Elliot channels and channel state detection is subject to errors. A simple structure of the myopic policy is established under a certain condition on the false alarm probability of the channel state detector. It is shown that myopic actions can be obtained by maintaining a simple channel ordering without knowing the underlying Markovian model. The optimality of the myopic policy is proved for the case of two channels and conjectured for general cases. Lower and upper bounds on the performance of the myopic policy are obtained in closed form, which characterize the scaling behavior of the achievable throughput of the multichannel opportunistic system. The approximation factor of the myopic policy is also analyzed to bound its worst-case performance loss with respect to the optimal performance. Index Terms: cognitive radio, dynamic multichannel access, myopic policy, restless multi-armed bandit.
Learning in a Changing World: Restless Multi-Armed Bandit with Unknown Dynamics
Cited by 15 (4 self)
We consider the restless multi-armed bandit (RMAB) problem with unknown dynamics in which a player chooses one out of N arms to play at each time. The reward state of each arm transitions according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm. We construct a policy with an interleaving exploration and exploitation epoch structure that achieves a regret with logarithmic order when arbitrary (but nontrivial) bounds on certain system parameters are known. When no knowledge about the system is available, we show that the proposed policy achieves a regret arbitrarily close to the logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment.