### Two-Armed Restless Bandits with Imperfect Information: Stochastic Control and Indexability∗

, 2013

"... We present a two-armed bandit model of decision making under uncertainty where the expected return to investing in the “risky arm ” increases when choosing that arm and decreases when choosing the “safe ” arm. These dynamics are natural in applications such as human capital development, job search, ..."

Abstract

We present a two-armed bandit model of decision making under uncertainty where the expected return to investing in the “risky arm” increases when choosing that arm and decreases when choosing the “safe” arm. These dynamics are natural in applications such as human capital development, job search, and occupational choice. Using new insights from stochastic control, along with a monotonicity condition on the payoff dynamics, we show that optimal strategies in our model are stopping rules that can be characterized by an index which formally coincides with Gittins’ index. Our result implies the indexability of a new class of “restless” bandit models.
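The index characterization above can be illustrated with a small computation. The sketch below approximates a Gittins-type index for a Bernoulli arm by the standard calibration argument: bisect on the known safe reward at which an optimal agent is indifferent between the two arms. The Beta prior, discount factor, and horizon truncation are illustrative assumptions, not details taken from the paper.

```python
# Sketch: a Gittins-type index for a Bernoulli "risky arm" via calibration
# against a safe arm with known reward. Prior Beta(a, b), discount BETA,
# and horizon truncation T are all illustrative assumptions.
from functools import lru_cache

BETA = 0.9  # discount factor (assumption)
T = 50      # horizon truncation for the value recursion (assumption)

@lru_cache(maxsize=None)
def value(a, b, lam, t=0):
    """Value of a Beta(a, b) risky arm vs. retiring to a safe reward lam."""
    if t == T:
        return 0.0
    p = a / (a + b)  # posterior mean success probability
    risky = p * (1 + BETA * value(a + 1, b, lam, t + 1)) \
            + (1 - p) * BETA * value(a, b + 1, lam, t + 1)
    return max(risky, lam / (1 - BETA))  # continue risky, or retire forever

def gittins_index(a, b, iters=30):
    """Bisect on the safe reward at which the agent is indifferent."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if value(a, b, mid) > mid / (1 - BETA):
            lo = mid  # risky strictly preferred: the index is above mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The index exceeds the myopic posterior mean because continuing the risky arm also buys information, which is exactly the exploration premium the stopping-rule characterization captures.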

### AUTOMATING THE RUNTIME PERFORMANCE EVALUATION OF SIMULATION ALGORITHMS

"... Simulation algorithm implementations are usually evaluated by experimental performance analysis. To conduct such studies is a challenging and time-consuming task, as various impact factors have to be controlled and the resulting algorithm performance needs to be analyzed. This problem is aggravated ..."

Abstract

Simulation algorithm implementations are usually evaluated by experimental performance analysis. Conducting such studies is a challenging and time-consuming task, as various impact factors have to be controlled and the resulting algorithm performance needs to be analyzed. This problem is aggravated when it comes to comparing many alternative implementations for a multitude of benchmark model setups. We present an architecture that supports the automated execution of performance evaluation experiments on several levels. Desirable benchmark model properties are motivated, and the quasi-steady-state property of such models is exploited for simulation end time calibration, a simple technique to save computational effort in simulator performance comparisons. The overall mechanism is quite flexible and can be easily adapted to the various requirements that different kinds of performance studies impose. It is able to speed up performance experiments significantly, which is shown by a simple performance study.
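The end time calibration idea lends itself to a short sketch: once a benchmark model's observed event rate stops changing between successive windows, it has reached a quasi-steady state and later simulation time adds little information to a performance comparison. The window size and tolerance below are illustrative assumptions.

```python
# Sketch of simulation end time calibration: detect when the windowed mean
# event rate of a benchmark model stabilizes, and cut off all competing
# simulator runs there. Window size and tolerance are assumptions.

def calibrate_end_time(event_counts, window=10, tol=0.05):
    """Return the step after which the windowed mean event rate changes
    by less than tol relative to the preceding window."""
    for t in range(2 * window, len(event_counts) + 1):
        prev = sum(event_counts[t - 2 * window : t - window]) / window
        curr = sum(event_counts[t - window : t]) / window
        if prev > 0 and abs(curr - prev) / prev < tol:
            return t  # calibrated simulation end time
    return len(event_counts)  # never stabilized: use the full run
```

The calibrated end time would then be reused across all algorithm implementations under comparison, so each run pays only for the informative transient plus a short steady-state window.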

### To Migrate or to Wait: Bandwidth-Latency Tradeoff In Opportunistic Scheduling of Parallel Tasks

"... Abstract-We consider the problem of scheduling low-priority tasks onto resources already assigned to high-priority tasks. Due to burstiness of the high-priority workloads, the resources can be temporarily underutilized and made available to the low-priority tasks. The increased level of utilization ..."

Abstract

We consider the problem of scheduling low-priority tasks onto resources already assigned to high-priority tasks. Due to burstiness of the high-priority workloads, the resources can be temporarily underutilized and made available to the low-priority tasks. The increased level of utilization comes at a cost to the low-priority tasks due to intermittent resource availability. Focusing on two major costs, the bandwidth cost associated with migrating tasks and the latency cost associated with suspending tasks, we aim at developing online scheduling policies achieving the optimal bandwidth-latency tradeoff for parallel low-priority tasks with synchronization requirements. Under Markovian resource availability models, we formulate the problem as a Markov Decision Process (MDP) whose solution gives the optimal scheduling policy. Furthermore, we discover structures of the problem in the special case of homogeneous availability patterns that enable a simple threshold-based policy that is provably optimal. We validate the efficacy of the proposed policies by trace-driven simulations.
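The migrate-or-wait tradeoff can be sketched in a toy form. Under a memoryless (geometric) outage model, an assumption made here purely for illustration, the expected cost of waiting out an outage does not depend on how long one has already waited, so the threshold rule degenerates to a single comparison between the expected latency cost and the one-off migration bandwidth cost; the paper's MDP treatment covers much richer availability patterns.

```python
# Toy sketch of the bandwidth-latency tradeoff: wait (latency cost per
# slot) vs. migrate (one-off bandwidth cost) during a resource outage.
# The geometric outage model and the cost values are illustrative
# assumptions, not the paper's model.

def migrate_or_wait(latency_cost, bandwidth_cost, p_recover):
    """Return a per-slot decision rule for a geometric outage.

    With recovery probability p_recover per slot, the expected cost of
    waiting out the outage is latency_cost / p_recover regardless of how
    long we have already waited, so the rule is time-invariant.
    """
    expected_wait_cost = latency_cost / p_recover
    def policy(slots_waited):
        return "wait" if expected_wait_cost <= bandwidth_cost else "migrate"
    return policy
```

With heterogeneous or history-dependent availability, the expected remaining outage does change over time, which is what gives the paper's policy its genuine threshold structure.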

### BATCHED BANDIT PROBLEMS

Submitted to the Annals of Statistics

"... Abstract Motivated by practical applications, chiefly clinical tri-als, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy that operates under this contraint and show that a ..."

Abstract

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy that operates under this constraint and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits. 1. Introduction. All clinical trials are run in batches: groups of patients are treated simultaneously, with the data from each batch influencing the design of the next. Despite the fact that this structure is codified into law in the case of drug approval, it has received scant attention from statisticians. What can be achieved given the small number of batches that is
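A minimal instance of a batched policy is explore-then-commit with two batches: spend the first batch pulling both arms equally, then commit the entire second batch to the empirical leader. The batch split below is an illustrative assumption; the paper's policies choose batch sizes to achieve near-minimax regret.

```python
# Sketch of a two-batch explore-then-commit policy for a two-armed
# Bernoulli bandit. Batch sizes and arm probabilities are illustrative
# assumptions, not the paper's construction.
import random

def two_batch_policy(arm_probs, horizon, explore_per_arm):
    """Total reward from uniform exploration then commitment."""
    rewards = 0
    succ = [0, 0]
    # Batch 1: pull each arm explore_per_arm times.
    for arm in (0, 1):
        for _ in range(explore_per_arm):
            r = 1 if random.random() < arm_probs[arm] else 0
            succ[arm] += r
            rewards += r
    # Batch 2: commit the remaining pulls to the empirical leader
    # (equal pull counts, so comparing success counts suffices).
    best = 0 if succ[0] >= succ[1] else 1
    for _ in range(horizon - 2 * explore_per_arm):
        rewards += 1 if random.random() < arm_probs[best] else 0
    return rewards
```

Note the policy switches arms at most twice, which is why batched policies double as low-switching-cost policies, as the abstract points out.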

### Essays in Problems of Optimal Sequential Decisions

"... In this dissertation, we study several Markovian problems of optimal sequential decisions by focusing on research questions that are driven by probabilistic and operations-management considerations. Our probabilistic interest is in understanding the distribution of the total reward that one obtains ..."

Abstract

In this dissertation, we study several Markovian problems of optimal sequential decisions by focusing on research questions that are driven by probabilistic and operations-management considerations. Our probabilistic interest is in understanding the distribution of the total reward that one obtains when implementing a policy that maximizes its expected value. In this respect, we study the sequential selection of unimodal and alternating subsequences from a random sample, and we prove accurate bounds for the expected values and exact asymptotics. In the unimodal problem, we also note that the variance of the optimal total reward can be bounded in terms of its expected value. This fact then motivates a much broader analysis that characterizes a class of Markov decision problems that share this important property. In the alternating subsequence problem, we also outline how one could prove a Central Limit Theorem for the number of alternating selections in a finite random sample, as the size of the sample grows to infinity. Our operations-management interest is in studying the interaction of on-the-job learning and learning-by-doing in a workforce-related problem. Specifically, we study the sequential hiring and retention of heterogeneous workers who learn over time. We model the hiring and retention problem as a Bayesian infinite-armed bandit, and we characterize the optimal policy in detail. Through an extensive set of numerical examples, we gain
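The alternating-selection problem admits a simple online baseline: accept a value whenever it continues the up-down pattern of the values accepted so far. This greedy rule is only an illustration of the problem, not the dissertation's optimal policy, which uses value-dependent acceptance thresholds.

```python
# Naive online baseline for sequential selection of an alternating
# subsequence: accept any value that continues the alternation. This is
# an illustrative baseline, not the dissertation's optimal policy.

def greedy_alternating(sample):
    """Return the values selected online by the greedy alternating rule."""
    selected = []
    direction = None  # +1: next must be higher, -1: next must be lower
    for x in sample:
        if not selected:
            selected.append(x)
        elif direction is None:
            if x != selected[-1]:
                direction = -1 if x > selected[-1] else 1
                selected.append(x)
        elif (direction == 1 and x > selected[-1]) or \
             (direction == -1 and x < selected[-1]):
            selected.append(x)
            direction = -direction
        # otherwise skip x and wait for a valid continuation
    return selected
```

An optimal policy instead rejects some admissible values when they are too extreme, preserving room for future alternations; that refinement is what the threshold analysis in the dissertation quantifies.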

### ON LEARNING AND INFORMATION ACQUISITION WITH RESPECT TO FUTURE AVAILABILITY OF ALTERNATIVES

, 2008

"... Most bandit frameworks applied to economic problems such as market learning and job matching are based on the unrealistic assumption that decision makers are fully confident about the future availability of alternatives. In this paper, we study two generalizations of the classical bandit problem in ..."

Abstract

Most bandit frameworks applied to economic problems such as market learning and job matching are based on the unrealistic assumption that decision makers are fully confident about the future availability of alternatives. In this paper, we study two generalizations of the classical bandit problem in which arms may become unavailable temporarily or permanently, and in which arms may break down and the decision maker has the option to fix them. It is shown that an optimal index policy does not exist for either problem. Nevertheless, there exists a near-optimal index policy in the class of Whittle index policies that cannot be dominated uniformly by any other index policy over all instances of either problem. The index strikes the balance between exploration and exploitation with respect to the availability of alternatives: it converges to the Gittins index as the probability of availability approaches one and to the immediate one-time reward as it approaches zero. Whittle indices are evaluated for Bernoulli arms with unknown success probabilities.
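The limiting behavior of the index can be mimicked crudely (this is not the paper's Whittle index): shrinking the discount factor by the availability probability q makes future information less valuable, so the resulting index moves from a Gittins-style value at q = 1 toward the bare posterior mean, i.e. the immediate one-time reward, as q approaches 0. The prior, horizon, and the q·β shrinkage are illustrative assumptions.

```python
# Crude illustration (NOT the paper's Whittle index): an index for a
# Bernoulli arm whose availability probability q shrinks the effective
# discount factor. Prior, horizon, and the q*beta shrinkage are
# illustrative assumptions.
from functools import lru_cache

def availability_index(a, b, q, beta=0.9, horizon=40):
    eff = q * beta  # effective discount under intermittent availability

    @lru_cache(maxsize=None)
    def value(aa, bb, lam, t):
        if t == horizon:
            return 0.0
        p = aa / (aa + bb)  # posterior mean success probability
        risky = p * (1 + eff * value(aa + 1, bb, lam, t + 1)) \
                + (1 - p) * eff * value(aa, bb + 1, lam, t + 1)
        return max(risky, lam / (1 - eff))

    lo, hi = 0.0, 1.0  # bisect on the indifference safe reward
    for _ in range(25):
        mid = (lo + hi) / 2
        if value(a, b, mid, 0) > mid / (1 - eff):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

At q = 0 the recursion collapses to the myopic comparison max(p, lam), recovering the immediate-reward limit the abstract describes.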

### A Linear Programming Relaxation and a Heuristic for the Restless Bandits Problem with General Switching Costs

, 2008

"... We extend a relaxation technique due to Bertsimas and Niño-Mora for the restless bandit problem to the case where arbitrary costs penalize switching between the bandits. We also construct a one-step lookahead policy using the solution of the relaxation. Computational experiments and a bound for appr ..."

Abstract

We extend a relaxation technique due to Bertsimas and Niño-Mora for the restless bandit problem to the case where arbitrary costs penalize switching between the bandits. We also construct a one-step lookahead policy using the solution of the relaxation. Computational experiments and a bound for approximate dynamic programming provide some empirical support for the heuristic.
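A one-step lookahead policy of the kind described can be sketched generically: score every arm by its immediate reward, minus the switching cost if it differs from the currently engaged arm, plus the discounted approximate continuation value supplied by the relaxation. The reward, value, and cost functions below are hypothetical stand-ins for the LP solution, not the paper's formulation.

```python
# Generic one-step lookahead with switching costs. The reward, next_value,
# and switch_cost callables are hypothetical stand-ins for quantities a
# relaxation (e.g. an LP) would supply.

def one_step_lookahead(states, current_arm, reward, next_value, switch_cost,
                       beta=0.95):
    """Pick the arm maximizing reward - switching cost + discounted
    approximate continuation value."""
    best_arm, best_score = None, float("-inf")
    for arm, s in enumerate(states):
        score = reward(arm, s) \
                - (switch_cost(current_arm, arm) if arm != current_arm else 0) \
                + beta * next_value(arm, s)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm
```

The switching penalty makes the rule sticky: a rival arm must beat the incumbent by more than the cost of moving to it, which is the behavior the heuristic is built to balance.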

### Endogenous Learning with Bounded Memory (Job Market Paper)

, 2010

"... I analyze the e¤ects of memory limitations on the endogenous learning behavior of an agent in a standard two-armed bandit problem. An in
nitely lived agent chooses each period between two alternatives with unknown types, to maximize discounted payo¤s. The agent can experiment with each alternative a ..."

Abstract

I analyze the effects of memory limitations on the endogenous learning behavior of an agent in a standard two-armed bandit problem. An infinitely lived agent chooses each period between two alternatives with unknown types, to maximize discounted payoffs. The agent can experiment with each alternative and receive payoffs that are partially informative about its type. The agent does not recall past actions or payoffs. Instead, the agent has a finite number of memory states as in Wilson (2004): he can condition his actions only on the memory state he is currently in, and he can update his memory state depending on the payoff received. I find that the inclination to choose the currently better alternative does not constrain learning in the limit as discounting vanishes. Even though uncertainties are independent, the agent optimally holds correlated beliefs across memory states. Optimally, memory states reflect the magnitude of the relative ranking of alternatives. After a high payoff from one of the alternatives, the agent optimally moves to a memory state with more pessimistic beliefs on the other, even though no information about
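The bounded-memory setup can be sketched as a finite automaton: an action rule maps each memory state to an arm, and a transition rule maps (state, payoff) to the next state, so the agent never conditions on raw history. The two-state "win-stay, lose-shift" rules below are an illustrative assumption, not the paper's optimal automaton.

```python
# Sketch of a bounded-memory bandit agent in the style of Wilson (2004):
# play is driven entirely by a finite memory state. The specific
# two-state rules below are illustrative, not the paper's optimum.
import random

def run_automaton(arm_probs, action_of, transition, periods, state=0,
                  rng=random):
    """Total payoff of a finite-memory agent over the given periods."""
    total = 0
    for _ in range(periods):
        arm = action_of[state]
        payoff = 1 if rng.random() < arm_probs[arm] else 0
        total += payoff
        state = transition[(state, payoff)]  # no other history is kept
    return total

# Two memory states: state 0 plays arm 0, state 1 plays arm 1; a success
# keeps the current state, a failure switches ("win-stay, lose-shift").
action_of = {0: 0, 1: 1}
transition = {(0, 1): 0, (0, 0): 1, (1, 1): 1, (1, 0): 0}
```

Even this crude automaton spends most periods on the better arm, but it forgets everything between transitions, which is the constraint whose optimal handling the paper characterizes.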
