Results 1–10 of 22
Online regret bounds for Markov decision processes with deterministic transitions
 Proc. of the 19th International Conference on Algorithmic Learning Theory (ALT 2008), volume 5254 of Lecture Notes in Computer Science
, 2008
Cited by 10 (1 self)
Abstract. We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an ε-optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, multi-armed bandits with switching cost are considered.
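The upper-confidence-bound principle this abstract builds on can be illustrated with the generic UCB1 rule for the plain stochastic bandit. This is a sketch of the general idea only, not the paper's deterministic-MDP algorithm; the reward samplers and horizon are illustrative:

```python
import math

def ucb1(arms, horizon):
    """Generic UCB1: try each arm once, then repeatedly play the arm
    maximizing empirical mean + sqrt(2 ln t / n_i).  `arms` is a list
    of zero-argument reward samplers with values in [0, 1]."""
    k = len(arms)
    counts = [0] * k      # pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    history = []
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1     # initialization rounds: each arm once
        else:
            i = max(range(k),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        history.append(i)
    return counts, history
```

With one clearly better arm, the pull counts concentrate on it while the confidence bonus keeps forcing occasional exploration of the other arm.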
Restless bandits with switching costs: Linear programming relaxations, performance bounds and limited lookahead policies
 in American Control Conference
, 2006
Cited by 5 (3 self)
Abstract—The multi-armed bandit problem and one of its most interesting extensions, the restless bandits problem, are frequently encountered in various stochastic control problems. We present a linear programming relaxation for the restless bandits problem with discounted rewards, where only one project can be activated at each period but with additional costs penalizing switching between projects. The relaxation can be efficiently computed and provides a bound on the achievable performance. We describe several heuristic policies; in particular, we show that a policy adapted from the primal-dual heuristic of Bertsimas and Niño-Mora [1] for the classical restless bandits problem is in fact equivalent to a one-step lookahead policy; thus, the linear programming relaxation provides a means to compute an approximation of the cost-to-go. Moreover, the approximate cost-to-go is decomposable by project, and this allows the one-step lookahead policy to take the form of an index policy, which can be computed online very efficiently. We present numerical experiments, for which we assess the quality of the heuristics using the performance bound.
Modeling Human Decision-making in Generalized Gaussian Multi-armed Bandits
, 2014
Cited by 4 (4 self)
We present a formal model of human decision-making in explore-exploit tasks using the context of multi-armed bandit problems, where the decision-maker must choose among multiple options with uncertain rewards. We address the standard multi-armed bandit problem, the multi-armed bandit problem with transition costs, and the multi-armed bandit problem on graphs. We focus on the case of Gaussian rewards in a setting where the decision-maker uses Bayesian inference to estimate the reward values. We model the decision-maker's prior knowledge with the Bayesian prior on the mean reward. We develop the upper credible limit (UCL) algorithm for the standard multi-armed bandit problem and show that this deterministic algorithm achieves logarithmic cumulative expected regret, which is optimal performance for uninformative priors. We show how good priors and good assumptions on the correlation structure among arms can greatly enhance decision-making performance, even over short time horizons. We extend to the stochastic UCL algorithm and draw several connections to human decision-making behavior. We present empirical data from human experiments and show that human performance is efficiently captured by the stochastic UCL algorithm with appropriate parameters. For the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs, we generalize the UCL algorithm to the block UCL algorithm and the graphical block UCL algorithm, respectively. We show that these algorithms also achieve logarithmic cumulative expected regret and require a sublogarithmic expected number of transitions among arms. We further illustrate the performance of these algorithms with numerical examples.
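The UCL rule described here selects the arm with the largest upper credible limit under the Gaussian posterior. A minimal sketch, assuming Gaussian rewards with known noise variance and a (1 - 1/(Kt)) credibility level as described in the abstract; the function names and signatures are illustrative:

```python
from statistics import NormalDist

def ucl_indices(mu, sigma, t, K=None):
    """Upper credible limit per arm: posterior mean plus the width of
    the one-sided (1 - 1/(K t)) credible interval.  Gaussian posteriors
    with means `mu` and standard deviations `sigma` are assumed."""
    K = K or len(mu)
    z = NormalDist().inv_cdf(1.0 - 1.0 / (K * t))  # standard-normal quantile
    return [m + s * z for m, s in zip(mu, sigma)]

def gaussian_update(mu0, var0, reward, noise_var):
    """Conjugate Gaussian posterior update for one arm after one pull."""
    var1 = 1.0 / (1.0 / var0 + 1.0 / noise_var)
    mu1 = var1 * (mu0 / var0 + reward / noise_var)
    return mu1, var1
```

An informative prior enters through `mu0` and `var0`: a confident, accurate prior shrinks the credible interval and cuts down exploration, which is the mechanism behind the abstract's claim about good priors.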
INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS
 APPLIED PROBABILITY TRUST (4 FEBRUARY 2008)
, 2008
Cited by 3 (0 self)
The multi-armed bandit problem is studied when the arms are not always available. The arms are first assumed to be intermittently available with some state/action-dependent probabilities. It is proven that no index policy can attain the maximum expected total discounted reward in every instance of that problem. The Whittle index policy is derived, and its properties are studied. Then it is assumed that arms may break down, but repair is an option at some cost, and the new Whittle index policy is derived. Both problems are indexable. The proposed index policies cannot be dominated by any other index policy over all multi-armed bandit problems considered here. Whittle indices are evaluated for Bernoulli arms with unknown success probabilities.
Sequential Learning for Multi-channel Wireless Network Monitoring with Channel Switching Costs
Cited by 2 (0 self)
Abstract—We consider the problem of optimally assigning p sniffers to K channels to monitor the transmission activities in a multi-channel wireless network with switching costs. The activity of users is initially unknown to the sniffers and is to be learned along with channel assignment decisions to maximize the benefits of this assignment, resulting in the fundamental trade-off between exploration and exploitation. Switching costs are incurred when sniffers change their channel assignments; as a result, frequent changes are undesirable. We formulate the sniffer-channel assignment with switching costs as a linear partial monitoring problem, a superclass of multi-armed bandits. As the number of arms (sniffer-channel assignments) is exponential, novel techniques are called for to allow efficient learning. We use the linear bandit model to capture the dependency amongst the arms and develop a policy that takes advantage of this dependency. We prove that the proposed Upper Confidence Bound (UCB) based policy enjoys a regret bound logarithmic in time t that depends sublinearly on the number of arms, while its total switching cost grows on the order of O(log log(t)). Index Terms—Local area networks, network monitoring, sequential learning.
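The O(log log(t)) switching-cost bound above comes from the paper's own policy, but the basic device of limiting switches by committing to an arm for geometrically growing blocks (which already caps each arm at O(log T) switch-ins) can be sketched as follows; `select_arm` is a placeholder for any bandit rule:

```python
def doubling_blocks(select_arm, arms, horizon):
    """Generic switch-limiting wrapper (not the paper's policy): once
    `select_arm` picks arm i for the b-th time, commit to i for a block
    of 2**(b-1) rounds, so arm i is switched to at most O(log T) times."""
    blocks_started = {}   # how many blocks each arm has begun
    t = 0
    switches = 0
    prev = None
    total_reward = 0.0
    while t < horizon:
        i = select_arm(t)               # placeholder bandit rule
        if i != prev:
            switches += 1
            prev = i
        blocks_started[i] = blocks_started.get(i, 0) + 1
        block = 2 ** (blocks_started[i] - 1)
        for _ in range(min(block, horizon - t)):  # play i for the block
            total_reward += arms[i]()
            t += 1
    return switches, total_reward
```

Even a rule that tries to switch every call ends up switching only logarithmically often, since the commitment blocks double in length.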
Endogenous Learning with Bounded Memory
, 2012
Cited by 1 (0 self)
I analyze the effects of memory limitations on the endogenous learning behavior of an agent in a standard two-armed bandit problem. An infinitely lived agent chooses each period between two alternatives with unknown types, to maximize discounted payoffs. The agent can experiment with each alternative and receive payoffs that are partially informative about its type. The agent does not recall past actions or payoffs. Instead, the agent has a finite number of memory states as in Wilson (2004): he can condition his actions only on the memory state he is currently in, and he can update his memory state depending on the payoff received. I find that the inclination to choose the currently better alternative does not constrain learning in the limit as discounting vanishes. Even though uncertainties are independent, the agent optimally holds correlated beliefs across memory states. Optimally, memory states reflect the magnitude of the relative ranking of alternatives. After a high payoff from one of the alternatives, the agent optimally moves to a memory state with more pessimistic beliefs on the other, even though no information about
Two-stage index computation for bandits with switching penalties I: switching costs
 Working Paper 0741, Statistics and Econometrics Series 09, http://halweb.uc3m.es/jnino/eng/public2.html, Univ. Carlos III de
Cited by 1 (1 self)
This paper addresses the multi-armed bandit problem with switching penalties including both costs and delays, extending results of the companion paper [J. Niño-Mora. "Two-Stage Index Computation for Bandits with Switching Penalties I: Switching Costs." Conditionally accepted at INFORMS J. Comp.], which addressed the case without switching delays. Asawa and Teneketzis (1996) introduced an index for bandits with delays that partly characterizes optimal policies, attaching to each bandit state a "continuation index" (its Gittins index) and a "switching index," yet gave no algorithm for it. This paper presents an efficient, decoupled computation method, which in a first stage computes the continuation index and then, in a second stage, computes the switching index an order of magnitude faster for an n-state bandit. The paper exploits the fact that the Asawa and Teneketzis index is the Whittle, or marginal productivity, index of a classic bandit with switching penalties in its semi-Markov restless reformulation, by deploying work-reward analysis and LP-indexability methods introduced by the author. A computational study demonstrates the dramatic runtime savings achieved by the new algorithm, the near-optimality of the index policy, and its substantial gains
Keeping Your Options Open
, 2010
Cited by 1 (0 self)
In standard models of experimentation, the costs of project development consist of (i) the direct cost of running trials as well as (ii) the implicit opportunity cost of leaving alternative projects idle. Another natural type of experimentation cost, the cost of holding on to the option of developing a currently inactive project, has not been studied. In a (multi-armed bandit) model of experimentation in which inactive projects have explicit maintenance costs and can be irreversibly discarded, I fully characterise the optimal experimentation policy and show that the decision-maker's incentive to actively manage its options has important implications for the order of project development. In the model, an experimenter searches for a success among a number of projects by choosing both those to develop now and those to maintain for (potential) future development. In the absence of maintenance costs, the optimal experimentation policy has a 'stay-with-the-winner' property: the projects that are more likely to succeed are developed first. Maintenance costs provide incentives to bring the option value of less promising projects forward, and under the optimal experimentation policy, projects that are less likely to succeed are sometimes developed first. A project development strategy of 'going-with-the-loser' strikes a balance between the cost of discarding possibly valuable options and the cost of leaving them open.
Multi-armed Bandit Problem with Lock-up Periods
Cited by 1 (0 self)
We investigate a stochastic multi-armed bandit problem in which the forecaster's choice is restricted. In this problem, rounds are divided into lock-up periods and the forecaster must select the same arm throughout a period. While there has been much work on finding optimal algorithms for the stochastic multi-armed bandit problem, their use under restricted conditions is not obvious. We extend the application ranges of these algorithms by proposing their natural conversion from ones for the stochastic bandit problem (index-based algorithms and greedy algorithms) to ones for the multi-armed bandit problem with lock-up periods. We prove that the regret of the converted algorithms is O(log T + Lmax), where T is the total number of rounds and Lmax is the maximum size of the lock-up periods. This regret is favorable except when the maximum size of the lock-up periods is large. For that case, we propose a meta-algorithm that achieves smaller regret by using an empirical best arm for large periods. We empirically compare and discuss these algorithms.
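The natural conversion of an index-based algorithm can be sketched as follows: the index rule (UCB1 here, as an illustrative choice) is consulted only at the start of each lock-up period, the chosen arm is held for the whole period, and the statistics are still updated every round. This is a sketch of the conversion idea only, not the paper's exact converted algorithms or meta-algorithm:

```python
import math

def ucb_index(counts, sums, t):
    """UCB1 indices; untried arms get +inf so they are sampled first."""
    return [float('inf') if n == 0
            else s / n + math.sqrt(2.0 * math.log(t) / n)
            for n, s in zip(counts, sums)]

def play_with_lockups(arms, period_lengths, index_fn=ucb_index):
    """Choose an arm by the index rule at each period start, then hold
    it for the whole lock-up period while updating stats per round."""
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k
    t = 0
    choices = []
    for length in period_lengths:
        idx = index_fn(counts, sums, max(t, 1))
        i = max(range(k), key=lambda j: idx[j])   # decision point
        choices.append(i)
        for _ in range(length):   # locked to arm i for `length` rounds
            sums[i] += arms[i]()
            counts[i] += 1
            t += 1
    return choices, counts
```

Each suboptimal decision now costs up to a full period of regret, which is where the additive Lmax term in the converted algorithms' O(log T + Lmax) bound comes from.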
Optimal Hiring and Retention Policies for Heterogeneous Workers who Learn
Cited by 1 (0 self)
We study the hiring and retention of heterogeneous workers who learn over time. We show that the problem can be analyzed as an infinite-armed bandit with switching costs and apply results from Bergemann and Välimäki (2001) to characterize the optimal hiring and retention policy. For problems with Gaussian data, we develop approximations that allow the efficient implementation of the optimal policy and the evaluation of its performance. Our numerical examples demonstrate that the value of active monitoring and screening of employees can be substantial.