Boosting algorithms: Regularization, prediction and model fitting
 Statistical Science
, 2007
Cited by 99 (12 self)
Abstract. We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions. Key words and phrases: Generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.
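To make the link between boosting, regularization and variable selection concrete, here is a minimal sketch of componentwise least-squares (L2) boosting in plain Python. It is not the mboost API and is not taken from the paper; the function name `l2_boost`, the step size `nu` and the assumption that every covariate column is already centered are illustrative choices.

```python
def l2_boost(X, y, steps=100, nu=0.1):
    """Componentwise L2 boosting with simple linear base learners.

    X is a list of rows; each column is assumed to be centered (hypothetical
    setup for this sketch). Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n              # start from the mean of the response
    coef = [0.0] * p
    fit = [intercept] * n
    cols = [[row[j] for row in X] for j in range(p)]
    for _ in range(steps):
        resid = [yi - fi for yi, fi in zip(y, fit)]
        best_j, best_beta, best_rss = None, 0.0, float("inf")
        for j, col in enumerate(cols):
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue                # skip constant (all-zero) columns
            # least-squares slope of the residuals on this single covariate
            beta = sum(v * r for v, r in zip(col, resid)) / sxx
            rss = sum((r - beta * v) ** 2 for v, r in zip(col, resid))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        if best_j is None:
            break
        # update only the best-fitting covariate, shrunken by nu
        coef[best_j] += nu * best_beta
        fit = [fi + nu * best_beta * v for fi, v in zip(fit, cols[best_j])]
    return intercept, coef
```

In this sketch, stopping after few steps leaves the coefficients shrunken toward zero, and covariates that are never selected keep a coefficient of exactly zero, which is how early stopping acts as both regularization and variable selection.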
RANDOM SURVIVAL FORESTS
, 2008
Cited by 29 (7 self)
We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A conservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R software package, randomSurvivalForest.
Unbiased split selection for classification trees based on the Gini Index
, 2006
Cited by 26 (10 self)
The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
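For readers unfamiliar with the criterion, the following is a minimal sketch of the Gini gain and its maximum over candidate cutpoints for a binary response and one continuous predictor. It only illustrates the quantity being maximized; it does not implement the exact distribution or the p-value criterion proposed in the paper. The function names are hypothetical, and the sketch assumes no missing values and at least two distinct predictor values.

```python
def gini(labels):
    """Gini impurity 2p(1-p) for a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_gain(x, y, cutpoint):
    """Impurity reduction achieved by the binary split x <= cutpoint."""
    left = [yi for xi, yi in zip(x, y) if xi <= cutpoint]
    right = [yi for xi, yi in zip(x, y) if xi > cutpoint]
    n = len(y)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(y) - weighted


def max_gini_gain(x, y):
    """Return (gain, cutpoint) maximizing the Gini gain over all midpoints
    between consecutive distinct values of x."""
    values = sorted(set(x))
    cutpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max((gini_gain(x, y, c), c) for c in cutpoints)
```

Because the maximum is taken over every available cutpoint, a predictor's chance of producing a large gain depends on how many cutpoints and observations it offers, which is the kind of selection bias the abstract refers to.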
Survival prediction using gene expression data: a review and comparison. submitted
, 2007
Cited by 13 (1 self)
Background: Knowledge of the transcription of the human genome might greatly enhance our understanding of cancer. In particular, gene expression may be used to predict the survival of cancer patients. A microarray measures the expression of thousands of genes simultaneously. The high dimensionality of the data poses the following problem: the number of covariates (∼10000) greatly exceeds the number of samples (∼200). Results: Here we give an inventory of methods that have been used to model survival using gene expression. These methods are critically reviewed and compared in a qualitative way. Finally, the methods are applied to artificial and real-life datasets for a quantitative comparison. Conclusions: The choice of the evaluation measure of predictive performance is crucial for the selection of the best method. Depending on the evaluation measure, either the L2-penalized Cox regression or the random forest ensemble method yields the best survival time prediction using gene expression for the data sets used. Consensus on which evaluation measure of predictive performance is best used is much needed.
Models, forests and trees of York English: was/were variation as a case study for statistical practice. Language Variation and Change
, 2012
Cited by 12 (4 self)
What is the explanation for vigorous variation between was and were in plural existential constructions and what is the optimal tool for analyzing it? The standard variationist tool, the variable rule program, is a generalized linear model; however, recent developments in statistics have introduced new tools, including mixed-effects models, random forests and conditional inference trees. In a step-by-step demonstration, we show how this well-known variable benefits from these complementary techniques. Mixed-effects models provide a principled way of assessing the importance of random-effect factors such as the individuals in the sample. Random forests provide information about the importance of predictors, whether factorial or continuous, and do so also for unbalanced designs with high multicollinearity, cases for which the family of linear models is less appropriate. Conditional inference trees straightforwardly visualize how multiple predictors operate in tandem. Taken together the results confirm that polarity, distance from verb to plural element and the nature of the DP are significant predictors. Ongoing linguistic change and social reallocation via morphologization are operational. Furthermore, the results make predictions that can be tested in future research. We conclude that variationist research can be substantially enriched by an expanded tool kit.
Variable Importance Assessment in Regression: Linear Regression versus Random Forest
Cited by 11 (0 self)
Relative importance of regressor variables is an old topic that still awaits a satisfactory solution. When interest is in attributing importance in linear regression, averaging-over-orderings methods for decomposing R² are among the state-of-the-art methods, although the mechanism behind their behavior is not (yet) completely understood. Random forests, a machine-learning tool for classification and regression proposed a few years ago, have an inherent procedure of producing variable importances. This article compares the two approaches (linear model on the one hand and two versions of random forests on the other hand) and finds both striking similarities and differences, some of which can be explained whereas others remain a challenge. The investigation improves understanding of the nature of variable importance in random forests. This article has supplementary material online.
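As a small worked illustration of averaging over orderings, the sketch below decomposes R² for the special case of two regressors, using only the three pairwise correlations. It is a hedged sketch under that two-regressor assumption, not the article's procedure or any package's API; the function names and the inputs `r1`, `r2` (correlations of each regressor with the response) and `r12` (correlation between the regressors) are hypothetical.

```python
def r2_two_regressors(r1, r2, r12):
    """R^2 of the linear model with both regressors, from correlations.

    Assumes |r12| < 1 so that the regressors are not perfectly collinear.
    """
    return (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)


def averaged_shares(r1, r2, r12):
    """Average each regressor's R^2 contribution over both entry orders."""
    full = r2_two_regressors(r1, r2, r12)
    # x1 entered first contributes r1^2; entered second it adds full - r2^2
    share1 = 0.5 * (r1 ** 2 + (full - r2 ** 2))
    share2 = 0.5 * (r2 ** 2 + (full - r1 ** 2))
    return share1, share2
```

By construction the two shares add up to the full-model R², and with uncorrelated regressors each share reduces to the squared correlation with the response.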
Big data: new tricks for econometrics
 Journal of Economic Perspectives
, 2014
Cited by 11 (0 self)
Nowadays computers are in the middle of most economic transactions. These “computer-mediated transactions” generate huge amounts of data, and new tools can be used to manipulate and analyze this data. This essay offers a brief introduction to some of these tools and methods. Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple
Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value
 Bioinformatics
, 2008
Cited by 9 (1 self)
Motivation: In the context of clinical bioinformatics, methods are needed for assessing the additional predictive value of microarray data compared to simple clinical parameters alone. Such methods should also provide an optimal prediction rule making use of all potentialities of both types of data: they should ideally be able to catch subtypes which are not identified by clinical parameters alone. Moreover, they should address the question of the additional predictive value of microarray data in a fair framework. Results: We propose a novel but simple two-step approach based on random forests and PLS dimension reduction, embedding the idea of pre-validation suggested by Tibshirani and colleagues, which is based on an internal cross-validation for avoiding overfitting. Our approach is fast, flexible and can be used both for assessing the overall additional significance of the microarray data and for building optimal hybrid classification rules. Its efficiency is demonstrated through simulations and an application to breast cancer and colorectal cancer data. Availability: Our method is implemented in the freely available R package 'MAclinical' which can be downloaded from
IMPROVING THE PRECISION OF CLASSIFICATION TREES
, 2009
Cited by 8 (3 self)
Besides serving as prediction models, classification trees are useful for finding important predictor variables and identifying interesting subgroups in the data. These functions can be compromised by weak split selection algorithms that have variable selection biases or that fail to search beyond local main effects at each node of the tree. The resulting models may include many irrelevant variables or select too few of the important ones. Either eventuality can lead to erroneous conclusions. Four techniques to improve the precision of the models are proposed and their effectiveness compared with that of other algorithms, including tree ensembles, on real and simulated data sets.
ESTIMATING TREATMENT EFFECT HETEROGENEITY IN RANDOMIZED PROGRAM EVALUATION
, 2013
Cited by 7 (2 self)
When evaluating the efficacy of social programs and medical treatments using randomized experiments, the estimated overall average causal effect alone is often of limited value and the researchers must investigate when the treatments do and do not work. Indeed, the estimation of treatment effect heterogeneity plays an essential role in (1) selecting the most effective treatment from a large number of available treatments, (2) ascertaining subpopulations for which a treatment is effective or harmful, (3) designing individualized optimal treatment regimes, (4) testing for the existence or lack of heterogeneous treatment effects, and (5) generalizing causal effect estimates obtained from an experimental sample to a target population. In this paper, we formulate the estimation of heterogeneous treatment effects as a variable selection problem. We propose a method that adapts the Support Vector Machine classifier by placing separate sparsity constraints over the pretreatment parameters and causal heterogeneity parameters of interest. The proposed method is motivated by and applied to two well-known randomized evaluation studies in the social sciences. Our method selects the most effective voter mobilization strategies from a large number of alternative strategies, and it also identifies the characteristics of workers who greatly benefit from (or are negatively affected by) a job training program. In our simulation studies, we find that the proposed method often outperforms some commonly used alternatives.