Results 1–10 of 22
Online regret bounds for Markov decision processes with deterministic transitions
 Proc. of the 19th International Conference on Algorithmic Learning Theory (ALT 2008), volume 5254 of Lecture Notes in Computer Science
, 2008
Cited by 10 (1 self)
Abstract. We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an ε-optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, multi-armed bandits with switching cost are considered.
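The entry above builds on upper-confidence-bound indices. As a minimal, hedged illustration — this is the standard UCB1 rule for stochastic bandits, not the paper's deterministic-MDP algorithm, and the Bernoulli arms are an assumption for the demo:

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """Standard UCB1: play the arm maximizing mean + sqrt(2 ln t / n).

    `pull(arm)` returns a reward in [0, 1].  A sketch of the index the
    cited paper builds on, not its MDP algorithm.
    """
    random.seed(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                      # play each arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts

# demo: two Bernoulli arms with (assumed) means 0.9 and 0.1
means = [0.9, 0.1]
counts = ucb1(lambda a: 1.0 if random.random() < means[a] else 0.0,
              n_arms=2, horizon=2000)
```

Over 2000 rounds the better arm accumulates almost all of the pulls, which is the source of the logarithmic regret bounds discussed in the abstract.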
Restless bandits with switching costs: Linear programming relaxations, performance bounds and limited lookahead policies
 in American Control Conference
, 2006
Cited by 5 (3 self)
Abstract—The multi-armed bandit problem and one of its most interesting extensions, the restless bandits problem, are frequently encountered in various stochastic control problems. We present a linear programming relaxation for the restless bandits problem with discounted rewards, where only one project can be activated at each period but with additional costs penalizing switching between projects. The relaxation can be efficiently computed and provides a bound on the achievable performance. We describe several heuristic policies; in particular, we show that a policy adapted from the primal-dual heuristic of Bertsimas and Niño-Mora [1] for the classical restless bandits problem is in fact equivalent to a one-step lookahead policy; thus, the linear programming relaxation provides a means to compute an approximation of the cost-to-go. Moreover, the approximate cost-to-go is decomposable by project, and this allows the one-step lookahead policy to take the form of an index policy, which can be computed online very efficiently. We present numerical experiments, for which we assess the quality of the heuristics using the performance bound.
Modeling Human Decision-making in Generalized Gaussian Multi-armed Bandits
, 2014
Cited by 4 (4 self)
We present a formal model of human decision-making in explore-exploit tasks using the context of multi-armed bandit problems, where the decision-maker must choose among multiple options with uncertain rewards. We address the standard multi-armed bandit problem, the multi-armed bandit problem with transition costs, and the multi-armed bandit problem on graphs. We focus on the case of Gaussian rewards in a setting where the decision-maker uses Bayesian inference to estimate the reward values. We model the decision-maker's prior knowledge with the Bayesian prior on the mean reward. We develop the upper credible limit (UCL) algorithm for the standard multi-armed bandit problem and show that this deterministic algorithm achieves logarithmic cumulative expected regret, which is optimal performance for uninformative priors. We show how good priors and good assumptions on the correlation structure among arms can greatly enhance decision-making performance, even over short time horizons. We extend to the stochastic UCL algorithm and draw several connections to human decision-making behavior. We present empirical data from human experiments and show that human performance is efficiently captured by the stochastic UCL algorithm with appropriate parameters. For the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs, we generalize the UCL algorithm to the block UCL algorithm and the graphical block UCL algorithm, respectively. We show that these algorithms also achieve logarithmic cumulative expected regret and require a sublogarithmic expected number of transitions among arms. We further illustrate the performance of these algorithms with numerical examples.
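The upper-credible-limit idea — posterior mean plus a quantile of the posterior — can be sketched for Gaussian arms with a conjugate normal prior and known noise variance. The credibility schedule 1 − 1/(Kt) below is illustrative, not necessarily the paper's exact choice:

```python
import random
from statistics import NormalDist

def ucl_pick(t, mu, s2, K):
    """Pick the arm with the largest upper credible limit: posterior
    mean plus the (1 - 1/(K t)) Gaussian quantile times posterior std."""
    q = NormalDist().inv_cdf(1.0 - 1.0 / (K * t))
    return max(range(len(mu)), key=lambda a: mu[a] + q * (s2[a] ** 0.5))

def gauss_update(mu, s2, arm, reward, noise_var):
    """Conjugate normal update of the chosen arm's posterior
    (known noise variance)."""
    prec = 1.0 / s2[arm] + 1.0 / noise_var
    mu[arm] = (mu[arm] / s2[arm] + reward / noise_var) / prec
    s2[arm] = 1.0 / prec

random.seed(1)
true_means, noise_var = [1.0, 0.0], 0.25   # assumed demo values
mu, s2 = [0.0, 0.0], [4.0, 4.0]            # prior: N(0, 4) on each mean
pulls = [0, 0]
for t in range(1, 501):
    a = ucl_pick(t, mu, s2, K=2)
    r = random.gauss(true_means[a], noise_var ** 0.5)
    gauss_update(mu, s2, a, r, noise_var)
    pulls[a] += 1
```

An informative prior (a lower `mu`/`s2` on a known-bad arm) would steer early pulls away from it, which is the "good priors enhance performance" effect the abstract describes.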
Index policies for discounted bandit problems with availability constraints
, 2008
Cited by 3 (0 self)
Abstract. In the classical bandit problem, the arms of a slot machine are always available. This paper studies the case where the arms are not always available. We first consider the problem where the arms are intermittently available with some time-dependent probabilities. We prove the non-existence of an optimal index policy and propose the so-called Whittle index policy after reformulating the problem as a restless bandit. The index strikes a balance between exploration and exploitation: it converges to the Gittins index as the probability of availability approaches one, and to the immediate one-time reward as it approaches zero. We then consider the problem where the arms may break down and repair is an option at some cost, and we derive the corresponding Whittle index policy. We show that both problems are indexable and that the proposed index policies cannot be dominated uniformly by any other index policy over the entire class of bandit problems considered here. We illustrate how to evaluate one of the indices on a numerical example in which rewards are Bernoulli random variables with unknown success probabilities.
Sequential Learning for Multichannel Wireless Network Monitoring with Channel Switching Costs
Cited by 2 (0 self)
Abstract—We consider the problem of optimally assigning p sniffers to K channels to monitor the transmission activities in a multichannel wireless network with switching costs. The activity of users is initially unknown to the sniffers and is to be learned along with channel assignment decisions to maximize the benefits of this assignment, resulting in the fundamental tradeoff between exploration and exploitation. Switching costs are incurred when sniffers change their channel assignments; as a result, frequent changes are undesirable. We formulate the sniffer-channel assignment with switching costs as a linear partial monitoring problem, a superclass of multi-armed bandits. As the number of arms (sniffer-channel assignments) is exponential, novel techniques are called for to allow efficient learning. We use the linear bandit model to capture the dependency among the arms and develop a policy that takes advantage of this dependency. We prove that the proposed Upper Confidence Bound (UCB)-based policy enjoys a regret bound logarithmic in time t that depends sublinearly on the number of arms, while its total switching cost grows in the order of O(log log(t)). Index Terms—Local area networks, network monitoring, sequential learning.
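The low-switching idea can be sketched with a simple epoch trick: commit to the index-maximal arm for geometrically growing blocks, so switches grow like O(log t) rather than O(t). This is a generic doubling-epoch sketch, not the cited paper's policy, which achieves the even slower O(log log t) switching rate with a finer schedule:

```python
import math
import random

def epoch_ucb(pull, n_arms, horizon, seed=0):
    """Low-switching bandit sketch: hold the UCB1-maximal arm for a whole
    epoch, with epoch lengths doubling, so the switch count is O(log t)."""
    random.seed(seed)
    counts, sums = [0] * n_arms, [0.0] * n_arms
    t, block, switches, prev = 0, 1, 0, None
    while t < horizon:
        if 0 in counts:
            arm = counts.index(0)            # try each arm at least once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t + 1) / counts[a]))
        if arm != prev:
            switches += 1
        prev = arm
        for _ in range(min(block, horizon - t)):  # locked for the epoch
            r = pull(arm)
            counts[arm] += 1
            sums[arm] += r
            t += 1
        block *= 2
    return counts, switches

means = [0.9, 0.1]                           # assumed Bernoulli arms
counts, switches = epoch_ucb(
    lambda a: 1.0 if random.random() < means[a] else 0.0, 2, 1000)
```

With horizon 1000 and doubling blocks, at most ten epochs occur, so the policy switches at most ten times regardless of the reward realizations.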
Endogenous Learning with Bounded Memory
, 2012
Cited by 1 (0 self)
I analyze the effects of memory limitations on the endogenous learning behavior of an agent in a standard two-armed bandit problem. An infinitely lived agent chooses each period between two alternatives with unknown types, to maximize discounted payoffs. The agent can experiment with each alternative and receive payoffs that are partially informative about its type. The agent does not recall past actions or payoffs. Instead, the agent has a finite number of memory states as in Wilson (2004): he can condition his actions only on the memory state he is currently in, and he can update his memory state depending on the payoff received. I find that the inclination to choose the currently better alternative does not constrain learning in the limit as discounting vanishes. Even though uncertainties are independent, the agent optimally holds correlated beliefs across memory states. Optimally, memory states reflect the magnitude of the relative ranking of alternatives. After a high payoff from one of the alternatives, the agent optimally moves to a memory state with more pessimistic beliefs on the other, even though no information about
Multi-armed Bandit Problem with Lock-up Periods
Cited by 1 (0 self)
We investigate a stochastic multi-armed bandit problem in which the forecaster's choice is restricted. In this problem, rounds are divided into lock-up periods and the forecaster must select the same arm throughout a period. While there has been much work on finding optimal algorithms for the stochastic multi-armed bandit problem, their use under restricted conditions is not obvious. We extend the application ranges of these algorithms by proposing a natural conversion from algorithms for the stochastic bandit problem (index-based algorithms and greedy algorithms) to algorithms for the multi-armed bandit problem with lock-up periods. We prove that the regret of the converted algorithms is O(log T + L_max), where T is the total number of rounds and L_max is the maximum size of the lock-up periods. The regret is preferable, except for the case when the maximum size of the lock-up periods is large. For these cases, we propose a meta-algorithm that achieves a smaller regret by using an empirical best arm for large periods. We empirically compare and discuss these algorithms.
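The "natural conversion" of an index policy to lock-up periods can be sketched as: evaluate the index at the start of each period and hold the chosen arm for the whole period. A sketch using UCB1 as the underlying index, not the paper's exact conversion; the Bernoulli arms and equal period sizes are assumptions for the demo:

```python
import math
import random

def ucb_with_lockups(pull, n_arms, period_lengths, seed=0):
    """Pick the UCB1-maximal arm at the start of each lock-up period and
    play it for the whole period.  Index uses the period count as `t`."""
    random.seed(seed)
    counts, sums, t = [0] * n_arms, [0.0] * n_arms, 0
    for length in period_lengths:
        t += 1
        if 0 in counts:
            arm = counts.index(0)            # try each arm at least once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        for _ in range(length):              # locked to `arm` all period
            r = pull(arm)
            counts[arm] += 1
            sums[arm] += r
    return counts

means = [0.8, 0.2]                           # assumed Bernoulli arms
counts = ucb_with_lockups(
    lambda a: 1.0 if random.random() < means[a] else 0.0, 2, [5] * 100)
```

The extra regret from a mistaken period is bounded by its length, which is where the additive L_max term in the converted bound comes from.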
Optimal Hiring and Retention Policies for Heterogeneous Workers who Learn
Cited by 1 (0 self)
We study the hiring and retention of heterogeneous workers who learn over time. We show that the problem can be analyzed as an infinite-armed bandit with switching costs and apply results from

Employee turnover can similarly affect organizational performance. Workers who turn over (quit) or are terminated may be replaced by new hires who differ in both ability and experience. Different policies for hiring, monitoring, and retaining employees will influence the long-run performance of a firm. Often there can be uncertainty regarding employee capabilities. Significant random variations in task times or quality driven by task-by-task variability can make it difficult for an employer to infer a given employee's efficiency or quality, particularly for new employees who have little or no previous track record. Uncertainty, together with these many sources of variation across employee capabilities, across tasks, and over time, makes decisions regarding the retention of workers complex. The longer a worker is retained, the better the inference an employer can make regarding his or her capabilities. On-the-job learning, which can lead to quality improvements in incumbent employees, also favors employee retention. Yet the opportunity cost of retaining a poor performer can be great, particularly if there is wide variation in quality across the population of potential hires. In this paper we develop and analyze a model that integrates all of these factors. In our model, an employer (referred to as "she") seeks to hire and retain a fixed number of employees from an infinite, heterogeneous population of potential hires. Each employee (referred to as "he") repeatedly performs the same task, whose cost the employer wishes to minimize or, equivalently, whose quality is to be maximized.
Each hire moves down a learning curve, but elements of the curve's parameters are unknown to the employer. The employer takes a Bayesian view of employees' types. By repeatedly observing the task performance of a given worker, she can make increasingly better judgments concerning his quality. After each such task, the employee decides whether he wants to continue working or not. Given that the worker decides to stay, the employer can decide whether to retain him or to replace him with a new hire. Each of these decisions has a cost for the employer. A quitting cost is incurred when a worker quits, a switching cost is incurred when a worker is terminated, and a training cost is incurred for each newly hired employee. We formulate this problem as an infinite-horizon, discounted problem in which, at any time, the employer uses a single worker, and we show that the problem can be modeled as a multi-armed bandit problem with switching costs and an infinite number of arms. We then apply well-known results developed by Gittins. These Gittins-index results extend to more complex settings, including contexts with multiple employees and environments with multiple, heterogeneous pools of potential employees. For specific common forms of the learning-curve function we delineate a simple stopping boundary and then use the boundary to develop approximations to the Gittins index that are straightforward to calculate and implement. These approximations are then the basis of numerical examples. Our numerical results provide insights into the nature and performance of the optimal policy. They show how the stopping boundary reflects a tradeoff between two types of learning: the performance improvement that is linked to an employee's on-the-job experience, and the statistical learning that allows the employer to make better judgments concerning a worker's ability. They demonstrate that the value of active monitoring and screening of employees can be substantial.
They reveal that the early stages of workers' tenures are the most important for the effectiveness of the optimal policy and, in turn, suggest simpler hiring policies that have the potential to perform well, within a few percent of optimality. Sensitivity analysis with respect to model parameters provides further insights. In addition to direct gains that accrue from steeper learning curves, investments in employee learning can provide an important secondary benefit: the optimality of lower termination rates. Reductions in the variability of task performance can improve the sensitivity of screening procedures and similarly reduce optimal termination rates. The ability to terminate employees should motivate managers to consider a broader spectrum of potential hires.

Literature review

There is a vast empirical literature on learning-curve phenomena. The literature that explicitly addresses both worker heterogeneity and learning is much smaller. Most closely related to our work is Nagypál (2007), which models both learning-about-match-quality (between workers and a firm) and learning-by-doing. That paper's aims and results differ significantly from ours. Its model and analysis enable the use of statistical methods to discriminate between the two forms of learning in empirical employment records. We focus on model-based and normative insights into the nature of effective retention/termination decisions. A few recent papers in operations-related fields also address dimensions of heterogeneity in learning and employee retention. None of these papers considers uncertainty regarding learning curves across individuals or groups, however. Neither do they address employee turnover or employee retention decisions. There also exists a rich literature that addresses labor quality and selection.
The literature on secretary problems develops a normative approach to the initial screening and hiring of employees who come from a heterogeneous pool. In our context this work can be reinterpreted as addressing firms choosing employees, and we use results concerning infinite-armed bandits with switching costs (but no learning) to characterize optimal hiring and retention policies. The managerial implications of learning have received less attention (e.g., Pinker and Shumsky).

The Hiring and Retention Problem with One Employee

In this section, we define the problem of an employer who requires the services of a single worker and who, at each discrete period of time, decides whether to retain the current employee or to terminate him and hire someone else from an infinite pool of workers. The assumption that there exists an infinite pool of potential hires is appropriate in so-called "employers' markets," in which the potential workforce is sufficiently large that workers who quit need not be considered again. Section 4 explores the employment of multiple hires, as well as the presence of several, heterogeneous pools of workers. At each time t = 0, 1, 2, . . . the employer requires the service of a single employee, i, drawn from an infinite pool of potential workers, S_t; S_0 represents the initial pool from which the employer can draw. If employee i quits at time t then he is removed from the pool of potential hires and S_{t+1} = S_t \ {i}. We let π(t) = i ∈ S_t denote the employer's choice of employee i at time t and define π = {π(0), π(1), . . .} to be a hiring and retention policy that specifies which workers the employer engages over time. The performance of potential workers is uncertain and evolving over time.
If worker i ∈ S_t is employed at time t, then his performance is defined by the relation Z_{i,t} = g(θ_i, n_{i,t}, ε_{i,t}), where θ_i ∈ Ω is a vector of parameters that reflects worker i's ability, n_{i,t} = 0, 1, 2, . . . reflects his experience to date, ε_{i,t} is a noise term with support E, and g(·) is a deterministic function of its arguments. We denote the realization of Z_{i,t} by z_{i,t}. Here, a_i is a parameter that determines a base level of performance and b_i < 0 describes the rate of learning. If Z_{i,t} were task time, then a_i and b_i would be scaled in the logarithm of the time unit. The structural results concerning optimal policies, in Section 3, require only the general functional form (1), together with some technical assumptions. Furthermore, the function g(·) is quite general and, in addition to learning, might reflect the effect of other factors such as fatigue. While our analysis does hinge on a single measure of performance, the representation of an outcome, Z_{i,t}, can be generalized to explicitly represent multiple dimensions (such as revenue, cost, quality) that are aggregated into a single score by using a functional. Section 5, in which we develop methods for explicitly calculating the stopping boundaries necessary to implement optimal policies, assumes a more specific form of Z_{i,t}, such as that given by (2). At the end of a given period, after his performance, the current employee notifies the employer of his intention to continue working or to leave. So, we associate with each worker a sequence of Bernoulli leaving indicators. We denote the realizations of L_i and L_{i,n_{i,t}} by ℓ_i and ℓ_{i,n_{i,t}} respectively. For any hiring policy π and for each worker i ∈ S_0 we let Λ_i(π) be i's working lifetime: the number of periods he is employed. In turn, we define worker i's quitting probability, q_{i,n}, and call 1 − q_{i,n} worker i's continuation probability. We let H_{i,t} denote worker i's employment history up to time t.
The quitting probability of an employee with experience n_{i,t}, q_{i,n_{i,t}}, may depend on H_{i,t} and on his ability θ_i, but it is assumed to be independent of the employer's hiring policy, π. This independence assumption is restrictive, and it is not difficult to imagine how employee turnover decisions may be influenced by the employer's retention (and compensation) policies. For example, by paying better performers more, the employer could provide an incentive for employee turnover patterns to change in a manner that is favorable to her. The inclusion of these types of incentives and responses extends the analysis of the employer's hiring and retention problem from the realm of single-decision-maker optimization problems to that of stochastic games and is beyond the focus of our current work. Nevertheless, the strategic interaction of employer and employees is both interesting and important, and we will briefly return to this issue in the numerical results of Section 6. The employer does not know each employee's θ_i or L_i in advance. Rather, she believes that there exists a random vector, Θ, that reflects the distribution of abilities in the population of potential workers, and a random set of leaving decisions, L. The distributions for Θ and L can be estimated using historical data and statistical techniques. Each time the employer hires a new worker, she views that worker's Θ_i and L_i as iid samples from the population distributions. At time t = 0 all potential workers, i ∈ S_0, have the same history, H_{i,0} = ∅, the same prior distribution for Θ_i, ν_{i,0} ≡ ν, and no prior experience, so that n_{i,t} ≡ 0. Thus, at time t = 0, the employer is indifferent among her choices. At any time t > 0, each worker, i, has cumulative experience n_{i,t}, and the employer uses i's employment history, H_{i,t}, to update her beliefs concerning the distribution of the parameter Θ_i.
We denote the posterior distribution that describes the employer's uncertainty concerning Θ_i at time t by ν_{i,t}(X) = P(Θ_i ∈ X | H_{i,t}), where X ⊆ Ω is any Borel set. For Θ_i ∼ ν_{i,t} we let Z(ν_{i,t}, n_{i,t}) denote the corresponding performance, and for {Θ_i = θ_i} we assume that worker i's performance {Z(ν_{i,t}, n_{i,t}) | θ_i} has density ξ_{n_{i,t}}(z | θ_i). If worker i is employed at time t, then his experience, n_{i,t}, increases deterministically by one, and n_{i,t+1} = n_{i,t} + 1. Moreover, the employer updates her belief concerning i's ability distribution according to Bayes' rule. If P(Ω) is the set of all probability measures, ν, on Ω, then the Bayes operator is defined for each Borel subset X ⊆ Ω. Thus for any given observation, z, the Bayes operator maps the prior distribution, ν_{i,t}, to its posterior distribution, ν_{i,t+1}. Within each period, t, the employer incurs a task-related cost that is driven by the selected employee's performance, c(z_{i,t}). We assume that c(z) is continuous and nondecreasing in z, which reflects an efficiency-based measure of employee performance. Because the employer does not know employees' true abilities, in each period she uses her belief concerning the distribution of the current employee's ability, ν_{i,t}, to estimate his expected task-related cost. The employer also incurs costs that are specific to the hiring and retention policy she is implementing. If, at the start of a period, the employer hires a new employee, she incurs an initial hiring (or training) cost, c_h. If, at the end of a period, the employee quits, the employer bears a quitting cost, c_q, that includes potential separation costs and the cost of recruiting a replacement. If the employee does not quit, then the employer may decide to terminate him and switch to a different worker, in which case she bears a switching cost, c_s. Training, switching and quitting costs are assumed to be nonnegative.
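The Bayes operator above is stated for general priors and densities. A minimal concrete instance, assuming a Gaussian ability parameter and Gaussian task noise (a special case the paper does not require), is the conjugate normal-normal update:

```python
import random

def bayes_operator(prior_mean, prior_var, z, noise_var):
    """Normal-normal instance of the Bayes operator: maps the prior on a
    worker's ability to the posterior after observing one performance z."""
    prec = 1.0 / prior_var + 1.0 / noise_var    # precisions add
    post_var = 1.0 / prec
    post_mean = post_var * (prior_mean / prior_var + z / noise_var)
    return post_mean, post_var

random.seed(2)
theta, noise_var = 3.0, 1.0                     # assumed true ability, noise
mean, var = 0.0, 10.0                           # diffuse prior on ability
for _ in range(200):
    z = random.gauss(theta, noise_var ** 0.5)
    mean, var = bayes_operator(mean, var, z, noise_var)
```

After repeated observations the posterior mean concentrates near the true ability and the posterior variance shrinks, which is exactly the "longer a worker is retained, the better the inference" effect described earlier.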
To properly account for switching and quitting costs, we introduce for each worker i and each time t a switching indicator, u_{i,t}: if policy π employs worker i over several disjoint time periods, then the indicator u_{i,t} switches between 0 and 1, and it equals one at every time t such that worker i was not employed at t − 1. Formally, we set u_{i,0} = 1 for all i ∈ S_0 and, for t ≥ 1, u_{i,t} = 1 if worker i was not employed at time t − 1 and u_{i,t} = 0 otherwise. When {i ∈ S_0 : u_{i,t} = 0} ≠ {i ∈ S_0 : u_{i,t+1} = 0}, the workers employed at time t − 1 and at time t differ, and the employer needs to incur the switching or quitting cost for the worker that was employed at time t − 1. For any time τ ≥ 0 and any state of prior distributions, experiences and switching indicators, (ν, n, u) = {(ν_{i,τ}, n_{i,τ}, u_{i,τ}) : i ∈ S_0}, the infinite-horizon total expected discounted cost of any hiring and retention policy, π, from time τ onwards is the expected sum of discounted per-period costs, where the discount factor is γ ∈ [0, 1). We note that in each period, t, the employer bears four possible sources of cost. The first, c_h 1(n_{π(t),t} = 0), is the hiring and training cost for a new worker, and it is incurred only once, at the beginning of employee π(t)'s tenure. The second, c(Z(ν_{π(t),t}, n_{π(t),t})), reflects employee π(t)'s task-related cost. The third is the cost of switching to a different worker at time t, should the previous employee be terminated. The fourth source of cost, c_q u_{π(t),t} 1(π(t − 1) ∈ S_t ∩ t > 0), reflects the cost of switching to a different worker at time t, should the previous employee quit. When t = 0, no switching or quitting costs should be incurred, and we account for this by including the requirement t > 0 in the indicator functions. In the rewritten formulation, the switching cost, c_s, is incurred any time the worker employed at time t is different from that employed at time t − 1.
The difference, c_q − c_s, then adjusts the value of the switching cost if the worker employed at t − 1 has quit. The quantity −c_s 1(τ = 0), outside the expectation, compensates for the switching cost incurred for the first worker ever employed, because u_{i,0} = 1 for all i ∈ S_0. We let Π denote the set of nonanticipating hiring policies, and we assume that the employer seeks a policy π* ∈ Π that minimizes the expected discounted value of future employment costs. For the problem to be analytically tractable we assume that the parameter space Ω is a Borel subset of R^d, and we require that the single-period, task-related costs are uniformly bounded.

Structure of the Optimal Policy

The hiring and retention problem can be formulated as a Bayesian bandit problem with an infinite number of arms. Two elements of the problem complicate the analysis, however. First, when an employee quits, the arm associated with him becomes unavailable. Second, when the employer switches from one employee to another, she incurs the switching cost, c_s, which cannot be attributed to a single employee. In characterizing the optimal hiring and retention policy, we must address both of these difficulties.

Transformation to Problem with No Quitting

The fact that employees quit can be compensated for by transforming the problem with quitting into one in which workers are always available. Rather than quitting, they become unproductive, and their cost exceeds that of any productive worker. To do so we assume that each employee, i ∈ S_0, becomes unproductive at time t, after his (n_{i,t} + 1)st performance, with probability equal to q_{i,n_{i,t}} in (4). When employee i becomes unproductive at time t, his ability distribution changes from ν_{i,t} to ν_{i,t+1} = 1_K, where K ∈ (K_sup + c_h + max{c_q, c_s}, ∞) and c(Z(1_K, n)) = K for every n.
Once employee i has become unproductive, he can never return to the productive state. The choice K_sup + c_h + max{c_q, c_s} < K implies that the cost of an unproductive worker exceeds the cost of any possible realization of any productive worker, plus the largest cost of hiring a new worker. We then define the stopping time (10) as the time at which employee i becomes unproductive. Because an unproductive worker cannot go back to the productive state, we set q_{i,k} = 0 for all k > n when Λ_i = n, and we modify the Bayes operator (5) accordingly, as in (11). Call the original problem in (9), in which employees quit, Problem 1, and call the modified problem, in which they become unproductive, Problem 2. The following lemma confirms that the problem with workers who become unproductive is analogous to that with workers who quit.

LEMMA 1. (i) In Problem 2, any policy that employs unproductive workers is never optimal. (ii) A policy is optimal for Problem 1 if and only if it is optimal for Problem 2.

Proofs of these claims and of the others below are found in the Appendix. Lemma 1 tells us that, for each policy π ∈ Π, employee i's working lifetime Λ_i(π) in (3) and the time (10) at which employee i becomes unproductive are closely related. In fact, if employee i quits before he is

Transformation to Problem with Retirement Option

We derive the optimal policy for Problem 2 by solving a family of stopping problems in which, at each period, n, the employer chooses between employing a single worker, i ∈ S_0, or terminating all employment and paying a so-called "retirement" cost, m. Given that we are considering an optimal stopping problem for a single employee, we drop the employee index, i, and the time index, t, from subscripts.
This approach, called the retirement-option problem, was introduced by Whittle (1980) for bandit problems with a finite number of arms and later extended. The resulting model is a Markov Decision Process with uniformly bounded costs, a fact that implies that there exists an optimal hiring and retention policy that is stationary and deterministic (Bertsekas and Shreve 1978, Prop. 9.8).[1] The optimal value function for the retirement-option approach satisfies the Bellman equation (12). In words, at any decision time, the employer has the choice of retiring at cost m, or continuing the employment of the worker currently on trial. The expected discounted cost of continuing, HV(ν, n, u, m), can be interpreted by looking at whether the employee is productive (ν ≠ 1_K) or not (ν = 1_K). If the employee is productive, then with probability 1 − q_n he remains productive and his ability distribution is updated by the Bayes operator β(ν, Z(ν, n)). With probability q_n, he becomes unproductive and his ability distribution changes to 1_K. If the employee is already unproductive at n, then q_n = 0, and the modified definition of the Bayes operator (11) applies. Here, we restrict our attention to values of m such that m ≤ K/(1 − γ), so that retiring is attractive when ν = 1_K. If ν ≠ 1_K and the employee is productive at n, the last addend in (13) represents the cost difference paid for an employee who has quit, c_q − c_s, plus the retirement cost for the employer, m. The quantity HV(ν, n, u, m) hence represents the cost of employing a worker with ability distribution, ν, experience, n, and switching indicator, u, for at least one period, followed by an optimal termination decision that depends on the retirement payment, m.

[1] A policy is stationary if, at any time t, the action it prescribes depends only on the current state. A policy is deterministic if the action it prescribes is never randomized.
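Whittle's retirement-option calibration can be made concrete for the simplest Bayesian arm. A sketch for a reward-maximizing Bernoulli arm with a Beta posterior — the paper's cost-minimizing setting with quitting and switching costs is richer, and the truncated dynamic program below is only approximate:

```python
from functools import lru_cache

def gittins_bernoulli(a, b, gamma=0.9, depth=60, tol=1e-4):
    """Gittins index of a Beta(a, b) Bernoulli arm via the
    retirement-option calibration: binary-search the per-period
    retirement rate at which continuing and retiring break even."""
    def value(M):
        @lru_cache(maxsize=None)
        def V(s, f, d):
            p = s / (s + f)                  # posterior mean success rate
            if d == depth:                   # tail: act as if p were known
                return max(M, p / (1 - gamma))
            cont = p * (1 + gamma * V(s + 1, f, d + 1)) \
                 + (1 - p) * gamma * V(s, f + 1, d + 1)
            return max(M, cont)              # retire for lump sum M, or play
        return V(a, b, 0)

    lo, hi = 0.0, 1.0                        # index lies in [0, 1]
    while hi - lo > tol:
        lam = (lo + hi) / 2
        M = lam / (1 - gamma)                # lump-sum retirement reward
        if value(M) > M + 1e-12:             # continuing beats retiring
            lo = lam
        else:
            hi = lam
    return (lo + hi) / 2

idx = gittins_bernoulli(1, 1)  # uniform prior; index exceeds the mean 0.5
```

The exploration premium is visible directly: the index of a Beta(1, 1) arm is strictly above its posterior mean, because continuing also buys information.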
Arlotto, Chick and Gans: Optimal Hiring and Retention Policies for Heterogeneous Workers who Learn

The stopping time is the time at which the employer chooses to retire, and {ν_r}_{r≥1} and {u_r}_{r≥1} represent the evolution of the ability distribution and the switching indicator after period n. For r = 0, we set ν_0 ≡ ν and u_0 ≡ u. Let Q_n = {ω : Λ > n, Λ(ν, n, u, m) = Λ − n} be the set of sample paths for which a productive worker with ability distribution, ν, experience, n, and switching indicator, u, quits before he is terminated. Notice that if a worker is already unproductive at n and ν = 1_K, then Λ ≤ n and therefore Q_n = ∅. Then we can write the expected discounted cost of continuing in an alternative form; this last representation and its properties are crucial in the proofs of many of our results. Given the value function (12), we are interested in the value of m at which the employer is indifferent between continuing to employ the current hire and retiring at cost m. We denote that value by the index M(ν, n, u), defined in (17). This index is well defined because the value function (12) is concave and nondecreasing in m, a fact that is stated and proved in the Appendix. It is a direct analogue of the Gittins index proposed by Whittle (1980) for problems without learning or switching costs.

Optimal Policy

When the employer switches from one employee to another, she incurs a switching cost, a fact that can make the characterization of optimal policies difficult. In particular, when the set of available hires is finite, an employer who switches away from and later returns to an employee, i, pays a switching cost that she would not have incurred had she continued to employ i over contiguous periods. A number of researchers have sought to characterize optimal policies for such bandit problems with switching costs.
For problems with a finite number of arms, Asawa and Teneketzis (1996) define two indices, a traditional Gittins index analogous to (17) along with a corresponding "switching-cost index," and they show that these indices can be used to describe necessary, though not sufficient, conditions under which an optimal policy will switch arms. Niño-Mora (2008) shows how to efficiently calculate Asawa and Teneketzis's indices. As part of their analysis, Bergemann and Välimäki (2001) use a "forward induction" formulation.

PROPOSITION 1. (i) A policy π* that, at each time t = 0, 1, 2, . . ., employs a worker with minimal Gittins index is optimal. (ii) At any time, t, at most one worker, i, has Gittins index M_i(ν_i,t, n_i,t, u_i,t) < m. (iii) Let t_i = inf{t : π*(t) = i} be the first time worker i is employed. Under the optimal policy π* in (i), it is never optimal to employ worker i again once the policy has switched away from him.

Given the structure of the optimal policy in Part (i) of Proposition 1, we can justifiably call (17) a Gittins index. Moreover, when the optimal policy is implemented, Part (ii) implies that there is often just one Gittins-index-minimal employee. Part (iii) shows that it is never optimal to employ a worker who was previously replaced. For an employer seeking to retain a single employee, the hiring and retention problem therefore decomposes into a sequence of iid optimal stopping problems: hire an employee from the pool and retain him until he turns over or his Gittins index rises above m, whichever comes first. In turn, the optimal policy yields a discounted renewal reward process, with expected value described in Part (iv). Part (iv) of Proposition 1 directly links the expected total discounted cost of the optimal policy to the Gittins index, a result that does not generally hold in bandit problems with finite numbers of arms. In Section 5, we use the result to estimate the expected discounted value of a Gittins-index policy.
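The discounted renewal structure behind Part (iv) can be illustrated with a small Monte Carlo sketch. Because worker episodes (hire, retain, replace) are iid, the total expected discounted cost satisfies C = E[c] / (1 − E[γ^τ]), where c is one episode's discounted cost and τ its length; the function below and its interface are our own illustration, not the paper's code.

```python
import random

def total_discounted_cost(simulate_episode, n_sims=10000, seed=1):
    """Estimate C = E[c] / (1 - E[gamma**tau]) for an iid sequence of
    worker episodes.  simulate_episode(rng) must return a pair
    (discounted cost accrued within the episode, gamma**tau at replacement);
    across episodes these pairs are independent, so expectations factor."""
    rng = random.Random(seed)
    cost_sum = disc_sum = 0.0
    for _ in range(n_sims):
        c, d = simulate_episode(rng)
        cost_sum += c
        disc_sum += d
    return (cost_sum / n_sims) / (1.0 - disc_sum / n_sims)
```

For instance, a degenerate episode with cost 1 and γ^τ = 0.5 yields a total discounted cost of 1 / (1 − 0.5) = 2.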
Extensions: Multiple Parallel Workers and Different Pools

Sections 2 and 3 considered the problem of employing a single worker. We now consider two extensions. Section 4.1 considers the problem in which distinct (infinite) pools of heterogeneous workers are available. Section 4.2 considers an employer who wishes to retain multiple employees who work in parallel. In both cases, the optimality of an index rule is retained.

Heterogeneous Populations

When the employer faces a finite number of heterogeneous populations, her optimal hiring and retention policy is the same as the one proposed in Proposition 1, Part (i). For example, consider two infinite pools, S_ν,0 and S_η,0, for which the untried workers have common prior distributions ν and η, with ν ≠ η. Let M(ν, 0, 1) and M(η, 0, 1) be the indices of the untried workers in each pool. If M(ν, 0, 1) ≠ M(η, 0, 1), then workers belonging to the pool with the larger index are never employed by an optimal policy. Otherwise, if M(ν, 0, 1) = M(η, 0, 1), then the employer is indifferent between the two populations.

Hiring and Retention of Multiple Workers

Assume now that ν_i,0 = ν, n_i,0 = 0, and u_i,0 = 1 for all i ∈ S_0, and consider the hiring and retention problem in which the employer wishes to retain a fixed number, D, of people working in parallel. One can partition the infinite pool of potential employees, S_0, into D separate, countably infinite pools, S_1,0, . . . , S_D,0, of identical workers with common prior distribution, ν, no experience, and common switching indicator equal to 1. When employee i in pool d quits at time t, he is removed from that pool so that S_d,t+1 = S_d,t \ {i}.
Then the infinite-horizon total expected discounted cost is given by (18), where π_d(t) ∈ S_d,t identifies the index of the worker who is employed from pool d at time t, ν_π_d(t),t his ability distribution, n_π_d(t),t his experience, and u_π_d(t),t the value of his switching indicator. By interchanging the sums in (18), one obtains a decomposition in which each term, C_π_d(ν, n, u), is the dth position's expected discounted cost. At any time, t, at which the employer seeks to hire a new worker for any of the D positions, she can employ any untried worker who belongs to the pool of potential employees, S_t. This result, due to Bergemann and Välimäki (2001), crucially depends on the assumption that all workers have the same experience and ability distribution at time t = 0, so that the artificial splitting of potential hires into D pools is possible. We note that our analysis of multiple employees also hinges on the independence of the outcomes of various employees' tasks; in many settings, however, task outcomes may be correlated across workers.

Implementing the Optimal Policy

This section shows how analytic properties of the hiring and retention problem can be combined with dynamic programming to compute the relevant Gittins indices when performance has certain structural properties. As shown in the Appendix, for any given (ν, n, u), the value function, V(ν, n, u, m), is concave and nondecreasing in m. Therefore, given (ν, n, u), a simple search scheme, such as bisection, can be used to find the largest fixed point, M(ν, n, u), that defines the Gittins index. Because our set of iid stopping problems allows us to focus on a single employee, we drop the indices i and t as subscripts and let Z_n = g(θ, n, ε_n). To calculate solution values, we explicitly define the functional form of the (n + 1)st performance for a worker, Z_n.
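Because the value function is concave and nondecreasing in m, the continuing cost crosses the retiring cost m exactly once, and bisection finds the indifference point. The sketch below assumes a generic continuation-cost function hv(m) with hv(m_lo) > m_lo and hv(m_hi) < m_hi; hv is a hypothetical stand-in for HV(ν, n, u, ·), not the paper's implementation.

```python
def gittins_index(hv, m_lo, m_hi, tol=1e-9):
    """Bisection for the indifference value m* with hv(m*) = m*.
    Retiring at cost m is optimal when m < hv(m); continuing is optimal
    when hv(m) < m; the Gittins index is the crossing point."""
    while m_hi - m_lo > tol:
        m = 0.5 * (m_lo + m_hi)
        if hv(m) > m:
            m_lo = m  # retiring is still cheaper: index lies above m
        else:
            m_hi = m  # continuing is cheaper: index lies below m
    return 0.5 * (m_lo + m_hi)
```

For the toy concave, nondecreasing continuation cost hv(m) = min(m, 5) + 2, the crossing point is m* = 7, which the bisection recovers.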
We assume that g(·) is invertible and that g⁻¹(Z_n) = A + h(n) + ε_n is a linear model, where A determines an unknown base level that may vary across workers, h(n) is a known learning function, and ε_n is normally distributed noise with mean 0 and known variance σ_ε². Because A is unknown, the mean of the noise can be assumed to be zero without loss of generality. We assume that the potential hire's base level of performance, A, has initial prior distribution, ν, that is normal with mean μ̃ and variance σ̃², N(μ̃, σ̃²). The form in (19) implies another structural property that will be useful for computing the Gittins indices of workers. Conditional on A, the random variables g⁻¹(Z_n) − h(n) are normally distributed with mean A and variance σ_ε². By standard Bayesian analysis, the posterior distribution of A after observing n tasks, zⁿ = (z_0, z_1, . . . , z_{n−1}), is normal. Define p̃ = σ_ε²/σ̃², and let p = p̃ + n, where n is the number of samples observed for the single worker. These assumptions are sufficient to guarantee that both the Bellman equation and the Gittins index can be expressed in terms of the posterior mean, w_p, and the effective number of samples, p. The monotonicity of the Gittins index with respect to w_p allows us to concisely describe the optimal policy: for each p = p̃ + n, there is a simple "stopping" boundary, b(p), such that it is optimal to retain the employee (continue) if w_p < b(p) and to terminate the employee (stop) if w_p > b(p). The posterior distribution of A is N(w_p, σ_ε²/p), so that Proposition 2 applies. In summary, we use the common technique of approximating the evolution of the posterior distribution as samples are observed, a Gaussian process, by the evolution of the posterior distribution of a related trinomial process on a grid.
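The normal-normal update can be written in the effective-sample-size form used above. The code is a minimal sketch: `mu0` and `var0` stand for the prior mean and variance of A, `var_eps` for the sampling variance, and `ys` for the adjusted observations g⁻¹(z_k) − h(k); the names are ours.

```python
def posterior(mu0, var0, var_eps, ys):
    """Conjugate normal update: with prior A ~ N(mu0, var0) and iid
    observations y_k ~ N(A, var_eps), the posterior of A is
    N(w_p, var_eps / p), where p = p_tilde + n and p_tilde = var_eps / var0,
    so the prior counts as p_tilde pseudo-samples."""
    p_tilde = var_eps / var0
    n = len(ys)
    p = p_tilde + n
    w_p = (p_tilde * mu0 + sum(ys)) / p   # precision-weighted mean
    return w_p, var_eps / p
```

For example, with mu0 = 0, var0 = var_eps = 1 and a single observation y = 2, we get p = 2, posterior mean w_p = 1, and posterior variance 0.5.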
To compute the index, we (i) construct the necessary grid of points in the (w, p) coordinate system; (ii) estimate the terminal conditions (at the period at which the dynamic-programming backward recursion starts, typically a large number of periods in the future) using Monte Carlo simulation; (iii) perform a backward recursion using a trinomial-tree approximation on the grid of points to approximate both V and the optimal stopping boundary for a given value of m; and (iv) search for the value of m that identifies the Gittins index. This process also identifies the optimal stopping boundary that determines the optimal solution to the hiring and retention problem. The numerical results in Section 6 correspond to a learning function that sets g(z) = e^z and h(n) = b ln(n + 1). This corresponds to (2) with a common learning parameter b_i = b and with ε_n ∼ N(0, σ_ε²), so that Z_n = e^A (n + 1)^b e^{ε_n}.

Numerical Examples and the Value of Screening

In this section, we use the methods described in Section 5 to calculate Gittins indices, as well as associated optimal stopping boundaries, for several examples. We also use discrete-event simulation to estimate rates of termination and voluntary turnover. We compare the performance of the optimal Gittins-index policy with that of other easily implementable policies and demonstrate that an active hiring and retention policy reduces costs and improves the pool of workers who are employed. We perform a sensitivity analysis with respect to the key parameters of our model and conclude that increases in employee learning rates reduce costs, improve the pool of employed workers, and lower termination rates. Moreover, we observe that managers favor pools of potential workers with a broader set of abilities.

Balancing Uncertainty and Learning Effects

The first example is loosely motivated by a call center. Each Z_n represents the average duration (in minutes) of the calls an agent handles. The right panel shows the stopping boundary with respect to E[Z_n].
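A stripped-down version of step (iii), the backward recursion, can be sketched as follows. For readability, the moment-matched trinomial probabilities of the paper are replaced by an illustrative symmetric trinomial (1/4, 1/2, 1/4), the per-period cost is a hypothetical function of w alone, and the terminal condition is simply "retire at cost m" rather than a Monte Carlo estimate; all of these are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def stopping_boundary(m, w_grid, horizon, gamma=0.9, cost=lambda w: w):
    """Backward recursion for V(w, m) = min(m, cost(w) + gamma * E[V(w', m)])
    on a grid of posterior means w.  Returns the value function at the first
    period and, for each period, the largest w at which continuing is optimal."""
    V = np.full(len(w_grid), float(m))        # terminal condition: retire everywhere
    boundary = []
    for _ in range(horizon):
        # illustrative symmetric trinomial step of the posterior mean
        EV = 0.25 * np.roll(V, 1) + 0.5 * V + 0.25 * np.roll(V, -1)
        EV[0], EV[-1] = V[0], V[-1]           # hold values fixed at the grid edges
        cont = cost(np.asarray(w_grid)) + gamma * EV
        V = np.minimum(m, cont)
        keep = np.where(cont < m)[0]          # retain iff continuing is strictly cheaper
        boundary.append(w_grid[keep[-1]] if keep.size else None)
    boundary.reverse()                        # boundary[0] is the earliest period
    return V, boundary
```

One period before the terminal date, continuing costs w + γm, so the boundary there is the largest grid point with w < m(1 − γ); this sanity check makes the recursion easy to validate before layering in the paper's moment-matched probabilities.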
Here, the stopping boundary is unimodal, with a peak on day 1 due to the elimination of the day-zero training cost, followed by a monotone decrease that is initially steep and later flattens out. Unlike the left panel, the right panel does not explicitly display a "dip" that reflects the problem's two conflicting forces: the employer's statistical learning and the employees' learning by doing. Instead, after day 1, we find a monotonically decreasing stopping boundary that requires a worker's expected performance to keep improving over time. The dashed line in both panels plots the prior mean, μ̃ (left), and the expected call times, E[Z_n] (right), for an "average" employee with base-level service time A = μ̃. The vertical distance between the two curves is a measure of how much better or worse a "marginally retained" employee is in comparison to an "average" employee. The presence of training costs induces managers to retain workers who are worse than average. The simulation results show that the policy terminates 39.82% of the employees: 1.96% of workers are terminated on day 1, 28.30% during periods 2 through 10, and 9.57% thereafter. Hence, much of the termination occurs early on. Of course, termination rates vary significantly with training costs; Section 6.3 presents a sensitivity analysis that addresses this relationship.

How the Optimal Policy Compares with Simpler Policies

This section compares the optimal policy with four families of alternative hiring policies. In the first family, workers are never terminated, and they serve until they naturally turn over. In the second, workers are monitored for a limited screening period, during which they can be terminated after each day of performance; if retained at the end of the screening period, they are never terminated. In the third, the employer follows a Gittins-index policy but screens workers only every several days. (Note that the optimal policy described in this paper is a Gittins-index policy in which screening takes place each day.)
Finally, the fourth family considers policies with a trial period of a given length (1, 5, 10, or 20 days) within which workers are not terminated. At the end of the trial period, the employer decides whether to retain or terminate the worker; if he is retained, he is not terminated until he turns over. In all cases, we use optimal retain/terminate thresholds, given the details of the particular policy. We also report analogous simulation results for the optimal policy and note that, because it is estimated via simulation rather than backward recursion, the Gittins index for this example varies slightly (within one standard error) from that reported in Section 6.1. Interestingly, the results in the second column show that the Gittins-index policy that screens workers every 5 days also performs close to optimally. Thus, screening need not occur every period for a policy to be effective. The results for "one-shot" policies at 5 and 10 periods also suggest that simple, one-shot retention decisions have the potential to perform well, with average discounted costs within a few percent of the optimal Gittins-index policy. For any hiring policy, π, its long-run average service rate is the long-run average number of calls that an agent handles per minute each day; its numerical values are reported in the fourth column. Comparing the Gittins-index policy with the "never screen" policy, we see that the former requires, on average, 16.41% fewer workers to maintain the same level of capacity. To see this more clearly, consider the hypothetical scenario in which a call center has an average load of 53.64 calls per minute. With the optimal policy, a "fully loaded" system requires employing 53.64 / 0.6417 = 83.59 workers on long-run average. With the "never screen" policy, the same fully loaded system requires 53.64 / 0.5364 = 100 workers, so the optimal policy employs 16.41% fewer workers.
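The back-of-the-envelope capacity comparison above is easy to reproduce; the load and service rates are taken from the text, while the function name is ours.

```python
def workers_needed(load_calls_per_min, service_rate_per_worker):
    """Long-run average head count needed to serve a given call load."""
    return load_calls_per_min / service_rate_per_worker

load = 53.64                               # average load, calls per minute
n_optimal = workers_needed(load, 0.6417)   # optimal Gittins-index policy
n_never = workers_needed(load, 0.5364)     # "never screen" policy
reduction = 100.0 * (1.0 - n_optimal / n_never)
```

This gives n_optimal ≈ 83.59, n_never = 100, and a reduction of ≈ 16.41%, matching the figures in the text.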
Sensitivity Analysis

This section examines how the optimal policy depends on key parameters: employees' learning rates; switching and quitting costs; employer uncertainty regarding employee performance; task-by-task variability; and training costs. The Gittins indices and the turnover and termination rates reported in this section are computed as in Section 6.1.

Learning rates. Section 6.1 studied a pool of workers whose performance improves by 50% with experience. With a faster learning rate, the dip in the stopping boundary disappears: the contribution of this experience-based learning is so high that the screening policy retains workers with a broader set of posterior means. Moreover, with a faster learning rate, every employee is faster at each level of experience, so one expects the stopping boundary with respect to E[Z_n] to decline. This is indeed the case, as the right panel shows.

REMARK 1. Empirical evidence in the learning literature shows that slower learners can produce higher value in the long run (see, e.g., March 1991).

Switching and quitting costs. One would expect that changes in switching and quitting costs would similarly affect the optimal policy. However, the theorem below shows that, when the quitting probabilities are constant, so that q_i,n = q for all n and for all i ∈ S_0, this is not the case. To state the theorem, we need to keep track of how the training, quitting, and switching costs affect the Gittins index. To that end, we modify our notation to account for these differences, letting M(ν, n, u, c_h, c_s, c_q) be the Gittins index as a function of these costs. Thus, if the hazard rate for quitting is constant for all employees at all times, then changes in switching and quitting costs do not affect the relative ordering of workers' Gittins indices. Of course, the values of the Gittins indices change, as do the (analogous) expected discounted costs of the problem.
But because the relative orderings do not change, changes in the switching and quitting costs do not affect the optimal policy, and we therefore do not report a sensitivity analysis with respect to c_s or c_q. When the quitting probabilities are not constant, the specifics of the optimal policy can change with c_s and c_q. Nevertheless, the overall structure of the optimal policy does not change: Proposition 1 holds for any quitting behavior q_i,n as in (4).

Variance of the base-level performance, variance of samples, and training costs. A sensitivity analysis for the prior distribution of abilities, the sampling variance, and training costs yields intuitive results, of which we provide a brief overview:

• The Gittins index reflects an option value inherent in the ability to change arms and favors arms with more diffuse prior distributions. A sensitivity analysis for the variance of the prior distribution of abilities agrees with this general idea: for a given μ̃, an increase in the variance, σ̃², of ability across workers allows the employer to screen more strictly, thereby increasing termination rates, retaining relatively more capable employees, and lowering total costs.

• A sensitivity analysis with respect to the sampling variance, σ_ε², indicates that lower values of σ_ε result in a smaller fraction of employees being terminated. Thus, reductions in within-period variability improve the selectivity and effectiveness of screening procedures, allowing the employer to reduce the termination rates obtained under the optimal policy.

• When training costs are absent, the screening process is very selective, terminating more than half of employees on day 1 and more than 85% of employees overall. When training costs are present, termination rates decrease as training costs increase.

For additional details concerning these results, please contact the authors.
Conclusions

This paper studies how statistical and on-the-job learning together determine the nature of optimal hiring and retention decisions. Statistical learning arises when workers are heterogeneous and the employer does not know their true quality. On-the-job learning occurs as experience affects workers' performance. The literature related to this problem spans several areas, including labor economics, statistical decision theory, learning-curve theory, and service operations. Our analysis integrates aspects of all of these streams to incorporate training, switching, and quitting dynamics, and it applies results from infinite-armed Bayesian bandit problems to characterize the optimal hiring and retention policies. Our numerical results show that active screening of employees can significantly improve expected costs and long-run average employee performance. Because most termination takes place early in employees' tenures, relatively simple finite-horizon and one-shot policies also have the potential to perform well. Our sensitivity analysis shows that, as is common in bandit problems, the ability to terminate employees should motivate managers to consider a broader spectrum of potential hires. Moreover, both reductions in within-task variability and improvements in employee learning provide the additional benefit of lowering termination rates.
Keeping Your Options Open
, 2010
"... In standard models of experimentation, the costs of project development consist of (i) the direct cost of running trials as well as (ii) the implicit opportunity cost of leaving alternative projects idle. Another natural type of experimentation cost, the cost of holding on to the option of developin ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
In standard models of experimentation, the costs of project development consist of (i) the direct cost of running trials as well as (ii) the implicit opportunity cost of leaving alternative projects idle. Another natural type of experimentation cost, the cost of holding on to the option of developing a currently inactive project, has not been studied. In a (multiarmed bandit) model of experimentation in which inactive projects have explicit maintenance costs and can be irreversibly discarded, I fully characterise the optimal experimentation policy and show that the decisionmaker’s incentive to actively manage its options has important implications for the order of project development. In the model, an experimenter searches for a success among a number of projects by choosing both those to develop now and those to maintain for (potential) future development. In the absence of maintenance costs, the optimal experimentation policy has a ‘staywiththewinner’ property: the projects that are more likely to succeed are developed first. Maintenance costs provide incentives to bring the option value of less promising projects forward, and under the optimal experimentation policy, projects that are less likely to succeed are sometimes developed first. A project development strategy of ‘goingwiththeloser’ strikes a balance between the cost of discarding possibly valuable options and the cost of leaving them open.