
## Reinforcement learning design for cancer clinical trials (2009)

Venue: Statist. Med.

Citations: 12 (5 self-citations)

### Citations

13195 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...3.1. Support vector regression: The ideas underlying SVR [26] are similar to but slightly different from SVM [27] within the margin-based classification scheme. The data xi are mapped into a feature space by a nonlinear transformation, which guarantees that any data set becomes arbitrarily separable as the data...

3688 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...Similar to SVM, which calculates a hyperplane, the solution of an SVR function depends only on the support vectors [29]. Usually the support vectors represent only a small fraction of the sample; therefore, the evaluation of the decision function is computationally efficient. This attractive property is especially useful wh...

863 | A tutorial on support vector regression
- Smola, Schölkopf
- 2004
Citation Context: ...regression problems. There are several examples where SVR is successfully used in practice, and it generally performs better than other regression methods. See Chen et al. [30] and Smola and Schölkopf [31]. For a detailed exposition with a more computational discussion of SVR, refer to LIBSVM [32], which is a library for SVM. 3.2. Extremely randomized trees: The complex and unclear structure of the Q...

740 | Dynamic Programming and Markov processes
- Howard
- 1960
Citation Context: ...range of problems. Minsky [13] first described the connection between dynamic programming and reinforcement learning. In classical dynamic programming methods, policy evaluation and policy improvement [12, 14] refer to the computation of the value function and the improved policy, respectively. The computation in both methods requires an interactive process. Combining these two methods together, we obtain...

290 | Support vector method for function approximation, regression estimation, and signal processing
- Vapnik, Golowich, et al.
- 1997
Citation Context: ...3.1. Support vector regression: The ideas underlying SVR [26] are similar to but slightly different from SVM [27] within the margin-based classification scheme. The data xi are mapped into a feature space by a nonlinear transformation, which guarantees that any da...

267 | Extremely randomized trees.
- Geurts, Ernst, et al.
- 2006
Citation Context: ...trees: The complex and unclear structure of the Q-function has also partly motivated the vast literature on nonparametric statistical methods and machine learning. Ernst et al. [33] and Geurts et al. [34] proposed an extremely randomized trees (ERT) method, called the Extra-Trees algorithm, for batch-mode reinforcement learning. Unlike classical classification and regression trees such as...
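The defining feature of the Extra-Trees method mentioned in this context — cut points drawn at random rather than searched for — can be sketched with a toy ensemble of randomized regression stumps. Everything below (data, ensemble size, the stump form) is illustrative, not the authors' implementation; scikit-learn's `ExtraTreesRegressor` provides a full version.

```python
import random
import statistics

def random_split(xs, ys):
    """Extremely randomized split: one threshold drawn uniformly between
    the min and max of the inputs (no search over cut points, unlike CART)."""
    t = random.uniform(min(xs), max(xs))
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    return t, left, right

def fit_stump(xs, ys):
    t, left, right = random_split(xs, ys)
    m = statistics.fmean(ys)  # fallback if one side is empty
    lmean = statistics.fmean(left) if left else m
    rmean = statistics.fmean(right) if right else m
    return lambda x: lmean if x <= t else rmean

def fit_ensemble(xs, ys, n_trees=200):
    """Average many fully randomized stumps; averaging shrinks the extra
    variance that each random split introduces."""
    stumps = [fit_stump(xs, ys) for _ in range(n_trees)]
    return lambda x: statistics.fmean(s(x) for s in stumps)

random.seed(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [0.0 if x < 0 else 1.0 for x in xs]   # step-function target
predict = fit_ensemble(xs, ys)
```

Each individual stump is a poor fit, but the average over many random thresholds recovers the step: predictions well below 0.5 on the left of the jump and well above it on the right.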

255 | On the convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1994
Citation Context: ...| St = st, At = at]. (1) Under some appropriate and rigorous assumptions, Q̂t has been shown to converge to Q* with probability 1 [20]. More general convergence results were proved by Jaakkola et al. [21] and Tsitsiklis [22]. In learning a non-stationary non-Markovian policy with one set of finite-horizon trajectories (also called a training data set) {S0, A0, R0, S1, A1, R1, ..., AT, RT, ST+1}, we...
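The update whose convergence these results concern is the standard tabular Q-learning rule. A minimal sketch on an invented two-state chain (the environment, gamma, and alpha are illustrative choices, not from the paper):

```python
import random

# Toy deterministic 2-state chain: taking action 1 in state 0 moves to
# state 1 and pays reward 1; every other (state, action) pays 0 and
# returns to state 0.
def step(s, a):
    if s == 0 and a == 1:
        return 1, 1.0   # (next state, reward)
    return 0, 0.0

random.seed(0)
gamma, alpha = 0.9, 0.1
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

s = 0
for _ in range(5000):
    a = random.choice((0, 1))          # exploratory behavior policy
    s_next, r = step(s, a)
    # Watkins' off-policy update: move Q(s,a) toward the bootstrapped
    # target r + gamma * max_a' Q(s', a').
    target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s_next
```

For this chain the fixed point satisfies Q(0,1) = 1 + gamma^2 * Q(0,1), i.e. Q(0,1) = 1/(1 - 0.81) ≈ 5.26, and the learned values settle near it.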

217 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context: ...onal issues that arise when learning from interaction with an environment in order to achieve long-term goals. A detailed account of the history of reinforcement learning is given in Sutton and Barto [11]. The basic process of reinforcement learning involves trying a sequence of actions, recording the consequences of those actions, statistically estimating the relationship between actions and conseque...

177 | Feature-based methods for large scale dynamic programming.
- Tsitsiklis, Roy
- 1996
Citation Context: ...action variable be continuous. In order to obtain the estimator of interest, many authors have considered different approaches in recent years. Murphy [24], Blatt et al. [23] and Tsitsiklis and Van Roy [25] showed that Q-learning estimation can be viewed as approximate least-squares value iteration. The parameters θt for the t-th Q-function satisfy θt ∈ argmin En[Rt + max_{at+1} Q̂t+1(St+1, at+1; θt...
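The least-squares value-iteration view described here can be sketched as backward fitted-Q iteration with a linear working model: at each stage t, regress the target Rt + max_a Q̂t+1(St+1, a) on features of (St, At). The data-generating process, feature map, and horizon below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 2, 400
S = rng.normal(size=(n, T + 1))            # states S_0, S_1, S_2
A = rng.integers(0, 2, size=(n, T))        # binary actions
# Reward favors a=1 exactly when the current state is positive.
R = S[:, :T] * A + rng.normal(scale=0.1, size=(n, T))

def features(s, a):
    # Linear working model with a state-action interaction term.
    return np.column_stack([np.ones_like(s), s, a, s * a])

theta = [None] * T
V_next = np.zeros(n)                       # Qhat_{T} beyond horizon is 0
for t in reversed(range(T)):
    y = R[:, t] + V_next                   # bootstrapped regression target
    X = features(S[:, t], A[:, t])
    theta[t], *_ = np.linalg.lstsq(X, y, rcond=None)
    # max over the two actions under the fitted Q at stage t,
    # which becomes the target for stage t-1.
    q0 = features(S[:, t], np.zeros(n)) @ theta[t]
    q1 = features(S[:, t], np.ones(n)) @ theta[t]
    V_next = np.maximum(q0, q1)
```

With this construction the interaction coefficient at each stage should recover the true value 1, so the estimated rule "choose a=1 when s > 0" emerges from the fitted Q-functions.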

51 | Paclitaxel-Carboplatin alone or with bevacizumab for non-small-cell lung cancer
- Sandler, Gray, et al.
Citation Context: ...hase III studies have examined the efficacy of various targeted therapies. Bevacizumab plus paclitaxel + carboplatin in the treatment of selected patients with NSCLC showed a significant survival benefit [40], however, with the risk of increased treatment-related deaths. Cetuximab plus cisplatin + vinorelbine demonstrated superior survival in patients with advanced EGFR-detectable NSCLC [41]. The strategi...

42 | Dynamic Programming
- Bellman
- 1957
Citation Context: ...reinforcement learning algorithms. Since a fundamental property of value functions used throughout reinforcement learning is that they satisfy particular recursive relationships such as the Bellman equation [12], it is clear that the optimal policy π* must satisfy π*t(st) ∈ argmax_{at} E[Rt + γV*t+1(St+1) | St = st, At = at]. Modern techniques in mathematical and computational areas have stimulated the...
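The recursive relationship quoted from [12] is the basis of value iteration, which simply applies the Bellman optimality backup until it converges. A minimal sketch on an invented two-state MDP:

```python
# Tiny 2-state MDP (invented for illustration).
# P[s][a] = list of (probability, next_state); Rw[s][a] = expected reward.
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
Rw = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
gamma = 0.9

V = {0: 0.0, 1: 0.0}
for _ in range(200):
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * E V(s') ]
    V = {s: max(Rw[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in (0, 1))
         for s in (0, 1)}

# Greedy policy with respect to the converged values.
policy = {s: max((0, 1), key=lambda a: Rw[s][a] +
                 gamma * sum(p * V[s2] for p, s2 in P[s][a]))
          for s in (0, 1)}
```

Here the fixed point can be checked by hand: V(1) = 1/(1 - 0.9) = 10, and V(0) solves V(0) = 0.9(0.8*10 + 0.2*V(0)), giving 7.2/0.82 ≈ 8.78, with action 1 optimal in both states.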

39 | Dynamic multidrug therapies for HIV: Optimal and STI approaches.
- Adams, Banks, et al.
- 2004
Citation Context: ...der discounted instantaneous costs (which is a continuous function directly associated with actions) as their reward function; the rationale behind this comes from a validated and identified HIV model [37]. In this paper, we observed that with sample size N = 1000 for a clinical reinforcement trial, using SVR or ERT leads to a reasonably low bias for estimating optimal regimens. The evidence for this i...

38 | Erlotinib in previously treated non-small-cell lung cancer
- Shepherd, et al.
- 2005
Citation Context: ...earning to discover individualized optimal regimens while restricting attention to first-line and second-line only, since there is only one approved agent (Erlotinib) indicated for third-line treatment [38]. First-line treatment primarily consists of platinum-based doublets that include cisplatin...

38 | Prospective randomized trial of docetaxel versus supportive care in patients with non-small-cell lung cancer previously treated with platinum-based chemotherapy
- Shepherd, Dancey, et al.
- 2000
Citation Context: ...The choice of agent also depends on many factors, including the patient's number of prior regimens, response to prior chemotherapy, the risk for neutropenia, EGFR expression, and patient preference [42, 43, 44, 45]. Due to the complexity of the biomarkers and unclear toxicities, the trial described here was motivated by the desire to compare those agents in a randomized fashion, and the belief that different com...

30 | Reinforcement learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context: ...ires less memory for estimates and less computation. Almost any TD-learning belongs to the "eligibility traces" problem. For more details on this issue, see Sutton and Barto [11] and Kaelbling et al. [18]. One of the most important off-policy TD-learning methods is Watkins' Q-learning [19, 20]. Q-learning no longer requires estimating the value function: it estimates a Q-function instead. Q-learning ha...

27 | Adaptive treatment of epilepsy via batchmode reinforcement learning.
- Guez, Vincent, et al.
- 2008
Citation Context: ...ecrease variance while at the same time decreasing bias, and it is very robust to outliers. ERT has recently been demonstrated in a simulation of HIV infection [35] and adaptive treatment of epilepsy [36]. While this algorithm reveals itself to be very effective at extracting a well-fitted Q from the data set, it has one drawback: the computational efficiency is relatively low, especially with increasing numb...

27 | Manegold C, Serwatowski P, Gatzemeier U, Digumarti R, Zukin M
- Scagliotti, Parikh, et al.
Citation Context: ...s have compared these various platinum doublets; the great majority of these trials have concluded that all such regimens are comparable in their clinical efficacy. As an example, see Scagliotti et al. [39]. Their study represents the largest number of patients entered into a single phase III study using either a cisplatin + gemcitabine or cisplatin + pemetrexed regimen. Noninferiority was demonstrated be...

22 | Innovation or stagnation: challenge and opportunity on the critical path to new medical products. White Paper
- US Food and Drug Administration
- 2004
Citation Context: ...oblem is that very few candidate treatments make it to human clinical trials, and only about 10% of treatments making it to human clinical trials demonstrate enough efficacy to be approved for marketing [1, 2]. Typical regimens for patients with certain advanced cancers (such as breast cancer, lung cancer, and ovarian cancer) utilize a single agent in combination with some platinum-based compound, and cons...

18 | An experimental design for the development of adaptive treatment strategies
- Murphy
Citation Context: ...al benefit. For finding new treatment regimens with this motivation, one of the most promising approaches has been referred to variously as "dynamic treatment regimes" or "adaptive treatment strategies" [6]. In contrast with classic adaptive designs, dynamic treatment regimes can allow dosage level and type to vary with time for subject-specific needs. As a consequence, the optimal strategy is able to pr...

14 | Learning From Delayed Rewards
- Watkins
- 1989
Citation Context: ...o the "eligibility traces" problem. For more details on this issue, see Sutton and Barto [11] and Kaelbling et al. [18]. One of the most important off-policy TD-learning methods is Watkins' Q-learning [19, 20]. Q-learning no longer requires estimating the value function: it estimates a Q-function instead. Q-learning handles discounted infinite-horizon Markov decision processes (MDPs). It requires no prior kn...

11 | Constructing evidence-based treatment strategies using methods from computer science
- Pineau, Bellemare, et al.
- 2007
Citation Context: ...ds for use in the clinical research arena. Reinforcement learning has been applied to treating behavioral disorders, where each patient typically has multiple opportunities to try different treatments [8]. Murphy et al. [9] suggest Q-learning, which is one of the most important breakthroughs in reinforcement learning, for constructing decision rules for chronic psychiatric disorders, since these chron...

9 | Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition
- Cover
- 1965
Citation Context: ...n-based classification scheme. The data xi are mapped into a feature space by a nonlinear transformation, which guarantees that any data set becomes arbitrarily separable as the data dimension grows [28]; then a hyperplane f(x) is fitted to the mapped data. One of the popular loss functions involved in SVR is known as the ε-insensitive loss function, which is defined as L(f(xi), yi) = (|f(xi) − yi| − ε)...
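The ε-insensitive loss defined in this context is simple to state directly (ε = 0.5 below is an arbitrary illustrative value):

```python
def eps_insensitive_loss(pred, y, eps=0.5):
    """SVR's epsilon-insensitive loss: residuals inside the epsilon tube
    cost nothing; outside the tube the penalty grows linearly."""
    return max(abs(pred - y) - eps, 0.0)
```

Compared with squared error, the flat region of this loss is what makes the fitted SVR function depend only on the support vectors — the points lying on or outside the ε tube.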

8 | Considerations for second-line therapy of non-small cell lung cancer. The Oncologist 2008; 13(suppl 1):28-36
- Stinchcombe, Socinski
Citation Context: ...ients in recent first-line trials received second-line treatment. Some patients who maintain a good performance status and tolerate therapy without significant toxicities will receive third-line therapy [3]. A widely used approach is to give a maximum dosage of chemotherapy drug for some period of time, followed by a period of recuperation in which no drug is given. Although this therapeutic regimen can...

7 | Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall: Englewood Cliffs
- Bertsekas
- 1987
Citation Context: ...ed policy, respectively. The computation in both methods requires an interactive process. Combining these two methods together, we obtain two other methods called policy iteration and value iteration [15, 16]. Although dynamic programming can be applied to many types of problems, it is restricted to solving reinforcement learning problems under the Markov assumption. If this assumption is violated, dynami...

5 | Evaluating multiple treatment courses in clinical trials
- Thall, Millikan, Sung
- 2000
Citation Context: ...ive designs to do this is the play-the-winner-and-drop-the-loser design, which is to repeat a treatment that is successful in a given course and otherwise switch to a different treatment. Thall et al. [4] provided a statistical framework for multi-course clinical trials involving some modifications of the play-the-winner-and-drop-the-loser strategy. In their proposed design, all treatments after the first...

3 | Developing adaptive treatment strategies in substance abuse research. Drug and Alcohol Dependence 2007; 88S:S24-S30
- Murphy, Lynch, McKay, TenHave
Citation Context: ...nt and long-term management of chronic disease, and they have been utilized in some trials such as sequential multiple assignment randomized trials (SMART) [6] and drug and alcohol dependency studies [7]. However, to date, there are no clinical trial methodologies for discovering new treatment regimens for life-threatening diseases. Thus, for diseases like cancer, the use of clinical trials for evalu...

3 | A generalization error for Q-learning
- Murphy
Citation Context: ...dimension of the action variable A, or having the action variable be continuous. In order to obtain the estimator of interest, many authors have considered different approaches in recent years. Murphy [24], Blatt et al. [23] and Tsitsiklis and Van Roy [25] showed that Q-learning estimation can be viewed as approximate least-squares value iteration. The parameters θt for the t-th Q-function satisfy θt...

3 | Phase III study of immediate versus delayed docetaxel after induction therapy with gemcitabine plus carboplatin in advanced non-small-cell lung cancer: Updated report with survival. ASCO 2007; 25:June 20 suppl
- Fidias, Dakhil, et al.
Citation Context: ...The choice of agent also depends on many factors, including the patient's number of prior regimens, response to prior chemotherapy, the risk for neutropenia, EGFR expression, and patient preference [42, 43, 44, 45]. Due to the complexity of the biomarkers and unclear toxicities, the trial described here was motivated by the desire to compare those agents in a randomized fashion, and the belief that different com...

2 | Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Statistics in Medicine 2007
- Thall, Wooten, Logothetis, Millikan, Tannir
Citation Context: ...winner-and-drop-the-loser strategy. In their proposed design, all treatments after the first course are assigned adaptively, thus increasing the amount of information available per patient. Thall et al. [5] presented a Bayesian adaptive design for a trial comparing two-course strategies for treating metastatic renal cancer. Each patient is fairly randomized between two treatments at enrollment, and if a...

2 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context: ..., it is usually not possible to directly compute an optimal policy by just solving the Bellman optimality equation, even if we have a complete and accurate model of the environment's dynamics. Sutton [17] claims that temporal-difference (TD) learning is an alternative method for solving for optimal policies without any knowledge of the dynamic model. One fundamental expression of TD-learning is the increm...
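The incremental expression referred to here is typified by the TD(0) value update: after observing a transition (s, r, s'), nudge V(s) toward the one-step bootstrapped target r + γV(s'). A minimal sketch on an invented three-state chain (step size, dynamics, and rewards are illustrative):

```python
import random

random.seed(0)
gamma, alpha = 1.0, 0.05
V = [0.0, 0.0, 0.0]       # V[2] is terminal and stays 0

for _ in range(3000):     # many episodes, each starting in state 0
    s = 0
    while s != 2:
        # Drift one state to the right half the time, otherwise stay.
        s_next = s + 1 if random.random() < 0.5 else s
        r = 1.0 if s_next == 2 else 0.0          # reward only on reaching the end
        # TD(0) update toward the bootstrapped one-step target.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

With γ = 1 every episode eventually collects the terminal reward of 1, so the true value of both non-terminal states is 1, and the learned estimates settle there without ever using a model of the dynamics.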

2 | Load forecasting using support vector machines: a study on EUNITE competition 2001
- Chen, Chang, et al.
- 2004
Citation Context: ...al and flexible approach for regression problems. There are several examples where SVR is successfully used in practice, and it generally performs better than other regression methods. See Chen et al. [30] and Smola and Schölkopf [31]. For a detailed exposition with a more computational discussion of SVR, refer to LIBSVM [32], which is a library for SVM. 3.2. Extremely randomized trees: The complex a...

2 | Tree-based batch mode reinforcement learning
- Ernst, Geurts, Wehenkel
Citation Context: ...2. Extremely randomized trees: The complex and unclear structure of the Q-function has also partly motivated the vast literature on nonparametric statistical methods and machine learning. Ernst et al. [33] and Geurts et al. [34] proposed an extremely randomized trees (ERT) method, called the Extra-Trees algorithm, for batch-mode reinforcement learning. Unlike classical classification and re...

1 | Widening bottlenecks in drug discovery: Glimpses from Drug Discovery Technology Europe. Drug Discovery Today 2005
- Hogberg
Citation Context: ...oblem is that very few candidate treatments make it to human clinical trials, and only about 10% of treatments making it to human clinical trials demonstrate enough efficacy to be approved for marketing [1, 2]. Typical regimens for patients with certain advanced cancers (such as breast cancer, lung cancer, and ovarian cancer) utilize a single agent in combination with some platinum-based compound, and cons...

1 | Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology 2007
- Murphy, Oslin, et al.
Citation Context: ...linical research arena. Reinforcement learning has been applied to treating behavioral disorders, where each patient typically has multiple opportunities to try different treatments [8]. Murphy et al. [9] suggest Q-learning, which is one of the most important breakthroughs in reinforcement learning, for constructing decision rules for chronic psychiatric disorders, since these chronic conditions often...

1 | Application of reinforcement learning for segmentation of transrectal ultrasound images
- Sahba, Tizhoosh, Salama
Citation Context: ...of knowledge obtained from the previous input image, the reinforcement learning algorithm is potentially capable of finding the appropriate local value for sub-images and extracting the prostate image [10]. However, reinforcement learning has not yet been applied to life-threatening diseases like cancer, where individual patients do not have the luxury to try many different treatments. Our main aim is to...

1 | Steps toward artificial intelligence
- Minsky
- 1961
Citation Context: ...two classes: dynamic programming or temporal-difference learning [11]. Bellman [12] first provided the "dynamic programming" term to show how these methods are useful for a wide range of problems. Minsky [13] first described the connection between dynamic programming and reinforcement learning. In classical dynamic programming methods, policy evaluation and policy improvement [12, 14] refer to the computati...

1 | Modified policy iteration algorithms for discounted Markov decision problems
- Puterman, Shin
- 1978
Citation Context: ...ed policy, respectively. The computation in both methods requires an interactive process. Combining these two methods together, we obtain two other methods called policy iteration and value iteration [15, 16]. Although dynamic programming can be applied to many types of problems, it is restricted to solving reinforcement learning problems under the Markov assumption. If this assumption is violated, dynami...

1 | Asynchronous stochastic approximation and Q-learning
- Tsitsiklis
- 1994
Citation Context: ...(1) Under some appropriate and rigorous assumptions, Q̂t has been shown to converge to Q* with probability 1 [20]. More general convergence results were proved by Jaakkola et al. [21] and Tsitsiklis [22]. In learning a non-stationary non-Markovian policy with one set of finite-horizon trajectories (also called a training data set) {S0, A0, R0, S1, A1, R1, ..., AT, RT, ST+1}, we denote the estimato...

1 | A-learning for approximate planning. Unpublished Manuscript
- Blatt, Murphy, Zhu
- 2004
Citation Context: ...se these optimal policies to test or predict for a new data set. There are many other promising learning methods based on modifications or extensions of Q-learning; for example, Blatt, Murphy, and Zhu [23] proposed A-learning. However, some properties of these methods have not yet been carefully investigated. Due to the simple equations and minimal amount of computation, we restrict our attention in th...

1 | FLEX: A randomized, multicenter, phase III study of cetuximab in combination with cisplatin/vinorelbine (CV) versus CV alone in the treatment of patients with advanced non-small cell lung cancer (NSCLC)
- Pirker, Szczesna, Gatzemeier, et al.
Citation Context: ...rvival benefit [40], however, with the risk of increased treatment-related deaths. Cetuximab plus cisplatin + vinorelbine demonstrated superior survival in patients with advanced EGFR-detectable NSCLC [41]. The strategies of first-line therapy are essentially based on these four targeted combination therapies; the choice depends on a number of factors, including the patient's histology type, toxicity pro...

1 | de Marinis F, Von Pawel J, Gatzemeier U, Chang T
- Hanna, Shepherd, et al.
Citation Context: ...The choice of agent also depends on many factors, including the patient's number of prior regimens, response to prior chemotherapy, the risk for neutropenia, EGFR expression, and patient preference [42, 43, 44, 45]. Due to the complexity of the biomarkers and unclear toxicities, the trial described here was motivated by the desire to compare those agents in a randomized fashion, and the belief that different com...