J.H. Friedman. On bias, variance, 0/1 - loss, and the curse-of-dimensionality. In Technical Report. Stanford University, 1996.
....inputs and a moderately large number of class labels that can be assigned to any input. Two popular simplifications have been considered for such problems: (i) feature extraction, where the input space is projected into a smaller feature space, thereby addressing the curse of dimensionality issue [1,2]; and (ii) modular learning, where instead of using a single classifier, a number of classifiers, each focusing on a specific aspect of the problem, are developed. Several methods for feature extraction and modular learning have been proposed in the pattern recognition and computational ....
Friedman JH. On bias, variance, loss, and the curse of dimensionality. Technical report, Department of Statistics, Stanford University, 1996
....the value of $K_0$, exhibits very good performance consistent with both the AIC and the MDL criteria. It should be noted, however, that these are not the only plausible approaches to the problem of order selection; other approaches such as cross validation techniques may also be quite useful [60] [64]. 3.3.3 Segmentation. Image segmentation is a technique for partitioning the image into meaningful regions corresponding to different objects. It may be considered to be a clustering process where the pixels are classified into attributed tissue types according to their gray level values ....
J. H. Friedman, "On bias, variance, 0/1 - loss, and the curse-of-dimensionality," Technical Report, Stanford University, 1996.
....examples and weights the predictions of the different classifiers based on their accuracy for the training set. But Boosting can also overfit in the presence of noise (as we empirically show in Section 3). 2.3 The Bias plus Variance Decomposition. Recently, several authors (Breiman, 1996b; Friedman, 1996; Kohavi & Wolpert, 1996; Kong & Dietterich, 1995) have proposed theories for the effectiveness of Bagging and Boosting based on Geman et al.'s (1992) bias plus variance decomposition of classification error. In this decomposition we can view the expected error of a learning algorithm on a ....
Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Journal of Data Mining and Knowledge Discovery, 1.
....er architecture: $\binom{C}{2}$ pairwise classifiers with respective feature selectors. Figure 2: Some examples of letters in LETTER I dataset [21]. 2.3 Combining the pairwise classifiers. The outputs of the $\binom{C}{2}$ classifiers can be combined to obtain the final output in two ways: (i) by simple voting [22], or (ii) by using the MAP rule on an estimate of the overall a posteriori probabilities obtained from the outputs of the pairwise classifiers [23]. In the voting combination scheme, a count c( k jx) of the number of $\binom{C}{2}$ classifiers that labeled x into class k , c( k jx) X i k I( ik ....
Jerome H. Friedman. On bias, variance, loss, and the curse of dimensionality. Technical report, Department of Statistics, Stanford University, 1996.
....does not permit a complete and rigorous answer to this question, but some elements can be gleaned from the results in this article, and from the literature. It is well known that squared error loss can be decomposed into three additive components (Friedman, 1996): the intrinsic error due to noise in the sample, the statistical bias (systematic component of the approximation error, or error for an infinite sample) and the variance (component of the error due to the approximation's sensitivity to the sample, or error due to the sample's finite size). A ....
....relative behavior of estimation algorithms: those with greater representational power, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, using normal approximations to the class probabilities, that the bias-variance interaction now takes a very different form. Zero-one loss can be highly insensitive to ....
Friedman, J. H. (1996). On bias, variance, 0/1 - loss, and the curse-of-dimensionality (technical report). Department of Statistics, Stanford University, Stanford, CA. ftp://playfair.stanford.edu/pub/friedman/kdd.ps.Z.
....allowing negative variance. Several subsequent papers [Kohavi and Wolpert 1996, Wolpert and Kohavi 1996, Tibshirani 1996, Breiman 1996] have offered alternative decompositions, with different strengths and weaknesses, as discussed in [Kohavi and Wolpert 1996, Wolpert and Kohavi 1996]. Recently Friedman contributed another zero-one loss decomposition [1996] to the discussion. Friedman's decomposition only applies to learning algorithms that perform their classification by first predicting the probabilities $h_y$ of all the possible output classes and then picking the class $\arg\max_i [h_i]$. In other words, he considers cases where h is single valued, ....
....a little from the formal definition of variance Friedman advocates. Similarly, view the average of $h_1$ as giving a bias. Then we have the peculiar result that increasing variance while keeping bias fixed can reduce overall expected generalization error. I will refer to such behavior as the Friedman effect. See also [Breiman 1996], in particular the discussion in the first appendix. Variability can be identified with the width of the distribution over $h_1$, and in that sense can indeed be taken to be a variance. The question is whether it makes sense to view it as a variance in the restricted desiderata-based sense ....
Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse of dimensionality. Unpublished manuscript.
....probability estimation, the estimates produced by the classifier do not necessarily constitute the best noise model for the purposes of applying BMA. Classification learners often achieve low misclassification rates while producing very poor class probability estimates (Domingos & Pazzani, 1996; Friedman, 1996). In the empirical study described below, BMA indeed achieved better results with the uniform class noise model than with Equation 3. In BMA, an unseen example x is assigned to the class that maximizes: $\Pr(c \mid x, \vec{x}, \vec{c}, H) = \sum_{h \in H} \Pr(c \mid x, h)\, \Pr(h \mid \vec{x}, \vec{c})$ (4). If a pure classification model ....
....overfit the data, or more precisely, that the models that have lower error on the training data will in fact have higher error on test data. Since learners that overfit are also necessarily unstable learners, or learners with high variance, and these are the learners for which Breiman (1996a) and Friedman (1996) found bagging will work, it is plausible that bagging incorporates a prior that is appropriate to those learners. Whether this assumption that bagging incorporates an error favoring prior is correct for the databases and learner used can be tested by checking the sign and magnitude of the ....
Friedman, J. H. (1996). On bias, variance, 0/1 - loss, and the curse-of-dimensionality.
....Pazzani, 1997). This function defines the error as the number of incorrect predictions. Unlike other loss functions, such as the squared error, it has the key property that it does not penalize inaccurate probability estimates so long as the greatest probability is assigned to the correct class (Friedman, 1997). There is mounting evidence that this is why naive Bayes classification performance remains high, despite the fact that inter-attribute dependencies often cause it to produce incorrect probability estimates (Domingos & Pazzani, 1997). This raises the question of whether it can be successfully ....
Friedman, J. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality.
....appears to be more competitive with existing methods, than is the regression version. Moreover, the best results often came from using neighborhoods that are shortened, not lengthened, in the axial direction. Lessons from the regression setting often fail to transfer to binary classification. Friedman (1996) provides a clear explanation of why this is so. The reason is that a good regression method must be able to control the variance of $f(X_i)$ whereas a binary classification method can tolerate considerable variance in, for example, $\log(P(X_i) / (1 - P(X_i)))$ because $P(X_i)$ only ....
....$C - 1$ binary comparisons to that group. For example with logistic regression, if $\widehat{\Pr}(Y_i = c) / \widehat{\Pr}(Y_i = 0) < 1$ for all $c \neq 0$ then $Y = 0$. Otherwise the maximizer $c$ of $\widehat{\Pr}(Y_i = c) / \widehat{\Pr}(Y_i = 0)$ is chosen. The classification might depend on which class was taken to be the reference. Friedman (1996) fixes this problem by considering all $C(C - 1)/2$ pairwise comparisons, and selecting the class that wins the most often. Hastie & Tibshirani (1996) refine the method of pooling pairwise results. Dietterich & Bakiri (1995) use coding theory to group the C classes into a number of dichotomies, ....
Friedman, J. H. (1996), On bias, variance, 0/1-loss, and the curse of dimensionality, Technical report, Department of Statistics, Stanford University.
....is an active area of research. The decomposition for mean squared error is well known and easily derived [see e.g. (Geman, Bienenstock & Doursat 1992)]. Recently, several suggestions have been made for other loss functions such as zero-one loss [see (Breiman 1996, Dietterich & Bakiri 1995, Friedman 1996, James & Hastie 1997, Kohavi & Wolpert 1996, Tibshirani 1996, Wolpert 1997) and references therein]. The generalization of the decomposition for mean squared error to a decomposition for zero-one loss depends on one's definition of desirable properties for the bias and the variance term. In this ....
....obeys the first and second requirement, but the limiting operation destroys the third requirement: the bias is no longer just a function of the average model. None of the bias-variance decompositions for zero-one loss suggested in the literature (see (Breiman 1996, Dietterich & Bakiri 1995, Friedman 1996, Kohavi & Wolpert 1996, Tibshirani 1996, Wolpert 1997) and (James & Hastie 1997) for a discussion of most of them) satisfies all three requirements. Most of them either define the bias and take for granted that the variance depends on the distribution of targets (the approach sketched in the ....
Friedman, J. (1996), On bias, variance, 0/1-loss, and the curse of dimensionality, Technical report, Department of Statistics, Stanford University.
....and complexity issues. For any given training-set size, learning performance may be terrible both if the model is too complex (leading to overfitting) and also, on the other hand, if it is too simple and has insufficient expressive power. This is sometimes formalized as a bias-variance trade-off [10, 9, 17]. So the question is this: Whenever the application of a hierarchical constraint in training appears to help, is it because of the prior semantic knowledge encoded, or is it just that (in a rather indirect fashion) we have reduced the model complexity to a more appropriate level? Several of our ....
J. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Technical report, Stanford University, Statistics Department, 1996.
....the subsample size decreases. Hence, the tradeoff between variance and bias. The conventional thinking on bias-variance tradeoff, based on analogy to model selection in regression analysis, has been centered on a bias$^2$ + variance, or squared error loss, formulation. A recent paper by Friedman [16] casts this tradeoff in terms of a 0-1 loss function, i.e. a classifier's prediction is either right or wrong, rather than smoothly varying. In this formulation, Friedman shows that the tradeoff is sometimes counter-intuitive relative to conventional squared error tradeoff, and that certain ....
J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Technical report, Dept. of Statistics, Stanford University, 1996.
.... with distribution P(x; g). As we shall see this distribution plays a central role in the analysis of functional activation datasets [18]. In the following we discuss the so-called curse of dimensionality that results from the extremely ill-posed nature of typical functional activation datasets [6,23]. The problem is discussed in terms of probability density estimation and we briefly mention ways to remedy the inevitable over-parameterization that otherwise occurs in modeling procedures based on such datasets [12]. The main point we hope to convey is how model generalization as studied ....
J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Journal of Knowledge Discovery and Data Mining, 1996. In press.
....in the field does not permit a complete and rigorous answer to this question, but some elements can be gleaned from the results in this article, and from the literature. It is well known that squared error loss can be decomposed into three additive components (Friedman, 1996): the intrinsic error due to noise in the sample, the statistical bias (systematic component of the approximation error, or error for an infinite sample) and the variance (component of the error due to the approximation's sensitivity to the sample, or error due to the sample's finite size). A ....
....relative behavior of estimation algorithms: those with greater representational power, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, using normal approximations to the class probabilities, that the bias-variance interaction now takes a very different form. Zero-one loss can be highly insensitive to ....
Friedman, J. H. (1996). On bias, variance, 0/1 - loss, and the curse-of-dimensionality (technical report). Department of Statistics, Stanford University, Stanford, CA. ftp://playfair.stanford.edu/pub/friedman/kdd.ps.Z.
....5 we discuss ways in which variance can be reduced and our approach can be applied to tasks with few features. 2 The Bias and Variance Decomposition. This section reviews the bias-variance decomposition of the error of a classifier, following the definitions given in [Breiman, 1996a] and [Friedman, 1996]. In a classification problem one assumes that there exist two random variables, X and Y, where X describes the input parameters (i.e. instances) and Y is a discrete variable with a finite number of values, $Y \in \{1, \dots, k\}$, called classes. A classification problem is completely described ....
.... $f_i(x) = P(Y = i \mid X = x)$. The goal is to produce a classifier $\hat{Y} \in \{1, \dots, k\}$ that minimizes the misclassification error (risk) $E[r(X)]$, where $r(x) = \sum_{i=1}^{k} f_i(x)\, 1(\hat{Y}(x) \neq i)$ (1) and where $1(\cdot)$ is a function that takes the value 1 if the argument is true and 0 otherwise [Friedman, 1996]. The minimum misclassification rate is obtained using the Bayes optimal classifier: $Y_B(x) = \arg\max_i f_i(x)$ (2) with misclassification rate $E\,r(Y_B) = E[r_B(X)] = 1 - \int \max_i f_i(X)\, P(dX)$ (3). Given a finite training set $T = \{(x_i, y_i),\ i = 1, \dots, m\}$ the classifier ....
J. H. Friedman. On bias, variance, 0/1 - loss, and the curse-of-dimensionality. Technical report, Stanford University, August 1996.
No context found.
J.H. Friedman. On bias, variance, 0/1 - loss, and the curse-of-dimensionality. In Technical Report. Stanford University, 1996.
No context found.
J. H. Friedman, "On bias, variance, loss, and the curse of dimensionality," tech. rep., Department of Statistics, Stanford University, 1996.
No context found.
Friedman, J. H. (1996). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Available at ftp://playfair.stanford.edu/pub/friedman/kdd.ps.Z
CiteSeer.IST - Copyright Penn State and NEC