R. Tibshirani. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto, 1996.
....1998) Ho emphasises the importance of disagreement in ensemble members but does not directly evaluate its impact on the overall ensemble performance. We will show in the evaluation section of this paper that a better measure of agreement (or ambiguity) for ensembles of classifiers is entropy. Tibshirani (1996) also suggests that entropy is a good measure of dispersion in bootstrap estimation in classification. So for a test set containing M cases in a classification problem where there are K categories a measure of ambiguity is: $A = -\frac{1}{M} \sum_{x=1}^{M} \sum_{k=1}^{K} P_k^x \log(P_k^x)$ (6) where ....
Tibshirani, R., (1996) Bias, variance and prediction error for classification rules, University of Toronto, Department of Statistics Technical Report, November 1996 (also available at www-stat.stanford.edu/~tibs).
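The entropy measure of ambiguity quoted in the excerpt above can be illustrated with a minimal sketch. The excerpt cuts off before defining its probability term, so this sketch assumes it is the fraction of ensemble members that vote for category k on test case x; it also assumes the usual negative-sum form of entropy and the natural logarithm. The function name `entropy_ambiguity` and its arguments are hypothetical, not taken from the cited paper.

```python
import math
from collections import Counter


def entropy_ambiguity(votes, num_classes):
    """Average vote entropy over a test set.

    votes: list of M lists; votes[x] holds the class labels (0..K-1)
           predicted for test case x by each ensemble member.
    num_classes: K, the number of categories.
    Assumption: the per-case class probability is the vote fraction.
    """
    total = 0.0
    for case_votes in votes:
        counts = Counter(case_votes)
        n_members = len(case_votes)
        for k in range(num_classes):
            p = counts[k] / n_members  # Counter returns 0 for unseen labels
            if p > 0.0:  # treat 0 * log(0) as 0
                total -= p * math.log(p)
    return total / len(votes)


# One unanimous case and one evenly split case with two categories.
example_votes = [[0, 0, 0, 0], [0, 1, 0, 1]]
ambiguity = entropy_ambiguity(example_votes, num_classes=2)
```

Under these assumptions a unanimous ensemble contributes zero ambiguity and an evenly split one contributes the maximum, log K, so the example above averages to half of log 2.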
....the noise component of misclassification error. Breiman's (1996b) decomposition is undefined for any given example (it is only defined for the instance space as a whole) and allows the variance to be zero or undefined even when the learner's predictions fluctuate in response to the training set. Tibshirani (1996) defines bias and variance, but decomposes loss into bias and the aggregation effect, a quantity unrelated to his definition of variance. James and Hastie (1997) extend this approach by defining bias and variance but decomposing loss in terms of two quantities they call the systematic effect and ....
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules (Technical Report).
....of web pages that truly come from a given population given that they were classified as coming from that population. By using Bayes' theorem, group coverage can be easily written in terms of group conditional error rates. Many authors have studied ways to estimate the overall error rate, e.g. Tibshirani (1996) and Efron and Tibshirani (1995) use bootstrapping. Others have found decompositions of this overall rate in terms of systematic and random components, e.g. see Dietterich and Kong (1995), Kohavi and Wolpert (1996), Breiman (1996) and Tibshirani (1996). None of this literature focuses on the ....
....to estimate the overall error rate, e.g. Tibshirani (1996) and Efron and Tibshirani (1995) use bootstrapping. Others have found decompositions of this overall rate in terms of systematic and random components, e.g. see Dietterich and Kong (1995), Kohavi and Wolpert (1996), Breiman (1996) and Tibshirani (1996). None of this literature focuses on the group conditional error rates. For the particular case of the naive Bayes approach, Friedman (1997) gave a heuristic explanation of its good behavior. He also found an approximation for the overall unconditional error rate, which is equal to P (Y (x) ....
Tibshirani, R. J. (1996), Bias, variance and prediction errors for classification rules, Technical report, Department of Statistics, University of Toronto.
....evaluation criterion, the bias-variance insight was borrowed from the field of regression, where squared loss is the main criterion. As a result, several authors have proposed bias-variance decompositions related to zero-one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996; Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings. In particular, none has a clear relationship to the original decomposition for squared loss. One source of difficulty has been that the decomposition for squared loss is purely additive (i.e. loss = bias ....
....learner in one artificial, noise-free domain, our results show that it is in fact a well-founded and useful decomposition, even if incomplete. Breiman (1996b) proposed a decomposition for the average zero-one loss over all examples, leaving bias and variance for a specific example x undefined. As Tibshirani (1996) points out, Breiman's definitions of bias and variance have some undesirable properties, seeming artificially constructed to produce a purely additive decomposition. Tibshirani's (1996) definitions do not suffer from these problems; on the other hand, he makes no use of the variance, instead ....
Tibshirani, R. 1996. Bias, variance and prediction error for classification rules. Technical report, Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto, Toronto, Canada.
....We further show that if the error is measured as conditional entropy of predictions on a test set then the ambiguity even better predicts the improvement in accuracy due to the ensemble. This is not surprising since these metrics are used in a similar way by statisticians in bootstrapping. Tibshirani (1996) suggests that entropy is a good measure of dispersion in bootstrap estimation in classification. For convenience we will present these metrics again here. For a test set containing M cases in a classification problem where there are K categories the entropy measure of ambiguity is: M ....
Tibshirani, R., (1996) Bias, variance and prediction error for classification rules, University of Toronto, Department of Statistics Technical Report, November 1996 (also available at www-stat.stanford.edu/~tibs).
....often help in understanding the relative behavior of estimation algorithms: those with greater representational power, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, using normal approximations to the class probabilities, that the bias-variance interaction now takes a very different form. Zero-one loss can ....
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules (technical report). Department of Preventive Medicine and Biostatistics, University of Toronto, Toronto, Ontario. http://utstat.toronto.edu/- reports/tibs/biasvar.ps.
....11 Discussion and Related Work. A number of non-Bayesian explanations for the success of multiple model methods have been proposed. The hypothesis described in Section 10 is compatible with them, and sheds further light. Several authors (Kong & Dietterich, 1995; Breiman, 1996b; Friedman, 1996; Tibshirani, 1996) have related the error reductions obtained by multiple model methods to the notions of bias and variance of a learner. Several alternative definitions of bias and variance for classification learners have been proposed (see previous references, and also (Kohavi & Wolpert, 1996)). Loosely, bias ....
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules. Technical report, Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto, Toronto, Canada. http://utstat.toronto.edu/- reports/tibs/biasvar.ps.
....decision trees. 5 Relation to bias-variance theory. One of the main explanations for the improvements achieved by voting classifiers is based on separating the expected error of a classifier into a bias term and a variance term. While the details of these definitions differ from author to author [5, 13, 14, 19], they are all attempts to capture the following quantities: The bias term measures the persistent error of the learning algorithm, in other words, the error that would remain even if we had an infinite number of independently trained hypotheses. The variance term measures the error that is due ....
....This simple observation suggests that it may be inherently impossible ever to find a bias-variance decomposition for classification as natural and satisfying as in the quadratic regression case. This difficulty is reflected in the myriad definitions that have been proposed for bias and variance [5, 13, 14, 19]. Rather than addressing each one separately, for the remainder of this section, we will follow the definitions given by Kong and Dietterich [14] and referred to as Definition 0 by Breiman [5]. Bagging and variance reduction. As mentioned in the introduction, the notion of variance does seem ....
Robert Tibshirani. Bias, variance and prediction error for classification rules. Technical report, University of Toronto, November 1996.
....of the two classifiers. Iterating the training-test splits or bootstrapping reduces the sampling variance, as is evident from Tables 5a and 5b; but, apparently, not sufficiently to result in the desired strong correlation between an estimate and the true error. There is currently much research [7, 8, 15, 38, 51, 62, 63] in machine learning and statistics on more robust procedures for selecting a classifier (cross-validating, bootstrapping, stacking, or bagging the entire inference procedure). The methods illustrated in Figure 6 were used to explore the variation of bias with sample size and true error. ....
R. Tibshirani. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto, 1996.
....in squared error for regression. For classification, 0-1 loss (misclassification rate) is commonly used, but this does not have a straightforward or unique decomposition. Recently, many authors have proposed similar decompositions (Kong and Dietterich, 1996; Breiman, 1996b; James and Hastie, 1997; Tibshirani, 1996; Kohavi and Wolpert, 1996). We used Kong and Dietterich's (1996) definitions. They define bias to be the error of the ideal voted hypothesis, which is the result we would get from combining an infinite number of classifiers, each trained on an independent set of examples. Variance is the ....
R. Tibshirani. (1996). Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto.
....and variance. The bias-variance decomposition of error originated in squared error for regression. For classification, 0-1 loss (misclassification rate) is commonly used, but this does not have a straightforward or unique decomposition. Recently, many authors have proposed similar decompositions [9, 28, 30, 31, 47]. We used Kong and Dietterich's definitions [31]. They define bias to be the error of the ideal voted hypothesis, which is the result we would get from combining an infinite number of classifiers, each trained on an independent set of examples. Variance is the difference between the expected ....
R. Tibshirani. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto, 1996.
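The Kong and Dietterich definitions described in the excerpt above can be approximated empirically. The following is a rough sketch only: it replaces the "infinite number of classifiers" with a finite collection, and, because the excerpt is truncated mid-definition, it assumes that variance is the expected error of a single classifier minus the bias. The function name `voted_bias_variance` and its arguments are hypothetical.

```python
from collections import Counter


def voted_bias_variance(predictions, targets):
    """Finite-sample estimate of bias and variance under 0-1 loss.

    predictions: list of L lists; predictions[j][i] is the label that
                 classifier j (trained on its own sample) gives case i.
    targets: list of the M true labels.
    Assumption: variance = mean single-classifier error - bias.
    """
    n_models = len(predictions)
    n_cases = len(targets)

    # Bias: error rate of the majority vote over all classifiers.
    # Ties are broken by whichever label was counted first.
    voted_errors = 0
    for i in range(n_cases):
        votes = Counter(model[i] for model in predictions)
        majority_label = votes.most_common(1)[0][0]
        if majority_label != targets[i]:
            voted_errors += 1
    bias = voted_errors / n_cases

    # Mean error rate of the individual classifiers.
    single_errors = sum(
        model[i] != targets[i]
        for model in predictions
        for i in range(n_cases)
    )
    mean_error = single_errors / (n_models * n_cases)

    return bias, mean_error - bias


# Three classifiers, three test cases.
bias, variance = voted_bias_variance(
    [[0, 1, 1], [0, 0, 1], [1, 1, 1]],
    [0, 1, 0],
)
```

In this toy example the vote is wrong on one of three cases while the individual classifiers err on five of nine predictions, so the sketch reports a bias of 1/3 and a variance of 2/9.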
.... to boosting (Schapire, 1990). Only rather than the standard schemes used in boosting, Breiman combines the estimators using out-of-sample techniques, as in his work on arcing (Breiman, 1996b), as well as previous work on stacking, and on estimating the error of bagging (Wolpert & Macready, 1997; Tibshirani, 1996; Breiman, 1996c). Bauer and Kohavi's article provides a large-scale empirical comparison of a number of voting-based algorithms for combining classifiers. Using fourteen data sets, they investigated variants of bagging (Breiman, 1996a) and boosting (Schapire, 1990) with decision tree (three ....
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules.
.... mean squared error is well known and easily derived [see e.g. (Geman, Bienenstock & Doursat 1992)]. Recently, several suggestions have been made for other loss functions such as zero-one loss [see (Breiman 1996, Dietterich & Bakiri 1995, Friedman 1996, James & Hastie 1997, Kohavi & Wolpert 1996, Tibshirani 1996, Wolpert 1997) and references therein]. The generalization of the decomposition for mean squared error to a decomposition for zero-one loss depends on one's definition of desirable properties for the bias and the variance term. In this note, we will follow the requirements and definitions stated ....
....independent of the target t and arrive at (7) the bias. The exact decomposition seems to be somewhat arbitrary, since in practice one is only interested in changes in the bias and variance terms rather than in their absolute values. Our definition of variance is equivalent to those given in (Tibshirani 1996, James & Hastie 1997). Discussion. We slightly reformulate what in (James & Hastie 1997) are called obvious requirements for a bias-variance decomposition. These requirements are similar in spirit to the desiderata stated in (Wolpert 1997). 1. The decomposition for the mean squared error is a ....
Tibshirani, R. (1996), Bias, variance and prediction error for classification rules, Technical report, University of Toronto.
....decision trees. 5 RELATION TO BIAS-VARIANCE THEORY. One of the main explanations for the improvements achieved by voting classifiers is based on separating the expected error of a classifier into a bias term and a variance term. While the details of these definitions differ from author to author [5, 13, 14, 18], they are all attempts to capture the following quantities: The bias term measures the persistent error of the learning algorithm, in other words, the error that would remain even if we had an infinite number of independently trained hypotheses. The variance term measures the error that is due to ....
....suggests that it may be inherently more difficult or even impossible to find a bias-variance decomposition for classification as natural and satisfying as in the quadratic regression case. This difficulty is reflected in the myriad definitions that have been proposed for bias and variance [5, 13, 14, 18]. Rather than discussing each one separately, for the remainder of this section, except where noted, we follow the definitions given by Kong and Dietterich [14] and referred to as Definition 0 by Breiman [5]. Bagging and variance reduction. The notion of variance certainly seems to be helpful ....
Robert Tibshirani. Bias, variance and prediction error for classification rules. Technical report, University of Toronto, November 1996.
....decision trees. 5 Relation to Bias-variance Theory. One of the main explanations for the improvements achieved by voting classifiers is based on separating the expected error of a classifier into a bias term and a variance term. While the details of these definitions differ from author to author [8, 25, 26, 40], they are all attempts to capture the following quantities: The bias term measures the persistent error of the learning algorithm, in other words, the error that would remain even if we had an infinite number of independently trained classifiers. The variance term measures the error that is due ....
....suggests that it may be inherently more difficult or even impossible to find a bias-variance decomposition for classification as natural and satisfying as in the quadratic regression case. This difficulty is reflected in the myriad definitions that have been proposed for bias and variance [8, 25, 26, 40]. Rather than discussing each one separately, for the remainder of this section, except where noted, we follow the definitions given by Kong and Dietterich [26] and referred to as Definition 0 by Breiman [8]. These definitions are given in Appendix C. 5.2 Bagging and variance reduction. The ....
Robert Tibshirani. Bias, variance and prediction error for classification rules. Technical report, University of Toronto, November 1996.
.... that the tradeoff is sometimes counter-intuitive relative to conventional squared error tradeoff, and that certain highly biased methods, e.g. nearest neighbor, are nonetheless often highly competitive (see, for instance, Holte [20]). A line of research being pursued by Breiman [8, 9] (see also [49, 57]) aggregates classifiers rather than averaging. In the approach known as bagging (bootstrap aggregation), a single instance is held out as in leave-one-out and the remaining instances are bootstrapped (sampled with replacement to provide a training set) many times. The resulting classifiers ....
R. Tibshirani. Bias, variance and prediction error for classification rules. Technical report, Dept. of Statistics, University of Toronto, 1996.
....by conventional crossvalidation. The details of this comparison, where one tries different kinds of crossvalidation besides leave-one-out, including in particular bootstrap-based variants, are the subject of future work. Simultaneously with our work, Tibshirani conducted a similar study [Tib96]. In his study, he investigated the (V 2 ; E 2 ) estimator, for classification problems (as opposed to the regression problems studied in this paper). The results in [Tib96] are quite encouraging; they suggest that the basic idea of the (V 2 ; E 2 ) estimator also works well on classification ....
....variants are the subject of future work. Simultaneously with our work, Tibshirani conducted a similar study [Tib96]. In his study, he investigated the (V 2 ; E 2 ) estimator, for classification problems (as opposed to the regression problems studied in this paper). The results in [Tib96] are quite encouraging; they suggest that the basic idea of the (V 2 ; E 2 ) estimator also works well on classification problems. Combined with our results this suggests that this estimation method may be broadly applicable. Our work complements Tibshirani's study in a number of ways. We have ....
R. Tibshirani. Bias, variance and prediction error for classification rules. University of Toronto Statistics Department Technical Report, 1996.
....analysis for boosting (more details are given in our paper with Bartlett and Lee [13]). 1. The bias-variance decomposition originates in the analysis of quadratic regression. Its application to classification problems is problematic, as reflected in the large number of suggested decompositions [8, 9, 14], in addition to the one given by Breiman in this paper. One unavoidable problem is that voting over several independently generated rules can sometimes increase, rather than decrease, the expected error. 2. Even in those cases where voting several independent classifiers is guaranteed to decrease ....
Robert Tibshirani. Bias, variance and prediction error for classification rules. Technical report, University of Toronto, November 1996.
....use of these samples, along with the ith data item X_i = (t_i, y_i), useful methods for both model selection and model improvement emerge. The out-of-bootstrap approach is implicit in some bootstrap estimates of prediction error (Efron (1983), Efron & Tibshirani (1997)) and was used by Tibshirani (1996) and Breiman (1996c). There are interesting connections to the model mix procedure (Stone (1974)), bagging (Breiman (1996b)), stacking (Wolpert (1992)) and to Bayesian inference. Here is the well-known connection between the bootstrap and Bayesian inference (Efron (1979), Rubin (1981)). Assume ....
Tibshirani, R. (1996), Bias, variance and prediction error for classification rules, Technical report, University of Toronto.
R. Tibshirani. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto, 1996.
R. Tibshirani, Bias, Variance and Prediction Error for Classification Rules, Technical Report, University of Toronto, Canada, 1996.
Tibshirani, R. (1996a). Bias, variance and prediction error for classification rules. University of Toronto, Canada.
Tibshirani, R. (1996a). Bias, variance and prediction error for classification rules.