
## An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants (1999)

### Download Links

- [www.lri.fr]
- [www-connex.lip6.fr]
- [robotics.stanford.edu]
- [www.cs.utsa.edu]
- [ai.stanford.edu]
- [nlp.cs.swarthmore.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 707 (2 self)

### Citations

6602 | C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...to Y. A deterministic inducer is a mapping from a sample S, referred to as the training set and containing m labeled instances, to a classifier. 3. The base inducers We used four base inducers for our experiments; these came from two families of algorithms: decision trees and Naive-Bayes. 3.1. The decision tree inducers The basic decision tree inducer we used, called MC4 (MLC++ C4.5), is a Top-Down Decision Tree (TDDT) induction algorithm implemented in MLC++ (Kohavi, Sommerfield, & Dougherty, 1997). The algorithm is similar to C4.5 (Quinlan, 1993) with the exception that unknowns are regarded as a separate value. The algorithm grows the decision tree following the standard methodology of choosing the best attribute according to the evaluation criterion (gain-ratio). After the tree is grown, a pruning phase replaces subtrees with leaves using the same pruning algorithm that C4.5 uses. The main reason for choosing this algorithm over C4.5 is our familiarity with it, our ability to modify it for experiments, and its tight integration with multiple model mechanisms within MLC++. MC4 is available off the web in source form as part of MLC++ (K...

5185 | An introduction to the bootstrap.
- Efron, Tibshirani
- 1993
Citation Context ...nt attributes (Langley & Sage, 1997). 4. The voting algorithms The different voting algorithms used are described below. Each algorithm takes an inducer and a training set as input and runs the inducer multiple times by changing the distribution of training set instances. The generated classifiers are then combined to create a final classifier that is used to classify the test set. 4.1. The Bagging algorithm The Bagging algorithm (Bootstrap aggregating) by Breiman (1996b) votes classifiers generated by different bootstrap samples (replicates). Figure 1 shows the algorithm. A Bootstrap sample (Efron & Tibshirani, 1993) is generated by uniformly sampling m instances from the training set with replacement. T bootstrap samples B1, B2, . . . , BT are generated and a classifier Ci is built from each bootstrap sample Bi. A final classifier C∗ is built from C1, C2, . . . , CT whose output is the class predicted most often by its sub-classifiers, with ties broken arbitrarily. For a given bootstrap sample, an instance in the training set has probability 1 − (1 − 1/m)^m of being selected at least once in the m times instances are r...
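The Bagging procedure quoted in this context can be sketched as follows. This is our own minimal illustration of the described algorithm, not the paper's MLC++ implementation; `inducer` and `majority_inducer` are placeholder names we introduce here.

```python
import random
from collections import Counter

def bootstrap_sample(train, rng):
    """Uniformly sample len(train) instances with replacement (a bootstrap replicate)."""
    m = len(train)
    return [train[rng.randrange(m)] for _ in range(m)]

def bagging(inducer, train, T, seed=0):
    """Bagging (Breiman, 1996b), as described above: build T classifiers on
    bootstrap replicates and vote them. `inducer` is any callable mapping a
    training set to a classifier (itself a callable instance -> label)."""
    rng = random.Random(seed)
    classifiers = [inducer(bootstrap_sample(train, rng)) for _ in range(T)]

    def vote(x):
        # Predict the class chosen most often; ties are broken arbitrarily,
        # matching the paper's description.
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]
    return vote

# A trivial "predict the majority class" inducer, just to exercise the sketch.
def majority_inducer(train):
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

train = [(i, i % 2) for i in range(10)]   # toy (instance, label) pairs
clf = bagging(majority_inducer, train, T=11)
```

With a real base inducer such as a decision tree, each sub-classifier would be trained on its own replicate; the voting logic is unchanged.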

4843 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...sholds. MC4(1)-disc is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previous work (Kohavi & Sahami, 1996). Both MC4(1) and MC4(1)-disc build very weak classifiers, but MC4(1)-disc is the more powerful of the two. Specifically for multi-class problems with continuous attributes, MC4(1) is usually unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. The Naive-Bayes Inducer The Naive-Bayes Inducer (Good, 1965; Duda & Hart, 1973; Langley, Iba, & Thompson, 1992), sometimes called Simple-Bayes (Domingos & Pazzani, 1997), builds a simple conditional independence classifier. Formally, the probability of a class label value y for an unlabeled instance x containing n attributes 〈A1, . . . , An〉 is given by P(y | x) = P(x | y) · P(y) / P(x) by Bayes rule, ∝ P(A1, . . . , An | y) · P(y) since P(x) is the same for all label values, = ∏_{j=1}^n P(Aj | y) · P(y) by the conditional independence assumption. The above probability is computed for each class and the prediction is made for the class with the largest posterior pro...
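The posterior computation in the quoted passage can be sketched as below. The frequency-count fitting and all names here are our own illustration of the conditional-independence classifier, with no smoothing and no handling of unseen values.

```python
from collections import Counter, defaultdict

def train_naive_bayes(data):
    """Fit P(y) and P(A_j | y) by simple frequency counts over
    (attribute-tuple, label) pairs; a bare sketch, not the MLC++ inducer."""
    class_counts = Counter(y for _, y in data)
    cond = defaultdict(Counter)          # (j, y) -> counts of attribute values
    for x, y in data:
        for j, v in enumerate(x):
            cond[(j, y)][v] += 1
    n = len(data)

    def predict(x):
        best, best_p = None, -1.0
        for y, cy in class_counts.items():
            p = cy / n                          # prior P(y)
            for j, v in enumerate(x):
                p *= cond[(j, y)][v] / cy       # likelihood P(A_j = v | y)
            if p > best_p:                      # largest posterior wins
                best, best_p = y, p
        return best
    return predict

data = [((1, 0), "a"), ((1, 1), "a"), ((0, 0), "b"), ((0, 1), "b")]
nb = train_naive_bayes(data)
```

Since P(x) is the same for every class, comparing the unnormalized products P(y) · ∏ P(Aj | y) suffices for prediction.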

3648 | Bagging predictors
- Breiman
- 1996
Citation Context ...hods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classi...

3497 | A decision-theoretic generalization of on-line learning and an application to boosting
- Freund, Schapire
- 1997
Citation Context ...provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo & Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire 1995) and Arc-x4 (Breiman 1996a). Drucker & Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization erro...

3469 | UCI repository of machine learning databases
- Blake, Merz
- 1998

2213 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context ...cation algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting met...

1507 | Bayesian Theory.
- Bernardo, Smith
- 1994
Citation Context ...y stable algorithm, and Bagging is mostly a variance reduction technique. Specifically, the average variance for Naive-Bayes is 2.8%, which Bagging with probability estimates decreased to 2.5%. The average bias, however, is 10.8%, and Bagging reduces that to only 10.6%. The mean-squared errors generated by p-Bagging were significantly smaller than the non-Bagged variants for MC4, MC4(1), and MC4(1)-disc. We are not aware of anyone who reported any mean-squared errors results for voting algorithms in the past. Good probability estimates are crucial for applications when loss matrices are used (Bernardo & Smith, 1993), and the significant differences indicate that p-Bagging is a very promising approach. 8. Boosting algorithms: AdaBoost and Arc-x4 We now discuss boosting algorithms. First, we explore practical considerations for boosting algorithm implementation, specifically numerical instabilities and underflows. We then show a detailed example of a boosting run and emphasize underflow problems we experienced. Finally, we show results from experiments using AdaBoost and Arc-x4 and describe our conclusions. 8.1. Numerical instabilities and a detailed boosting example Before we detail the results of the exp...
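The p-Bagging variant discussed in this context combines probability estimates rather than hard votes. A minimal sketch of that combination step, assuming sub-classifiers that return per-class probability dictionaries (the names and numbers below are our own illustration):

```python
def p_bagging_predict(prob_classifiers, x):
    """Average each sub-classifier's class-probability estimates for x,
    instead of taking a plain majority vote over hard predictions."""
    totals = {}
    for clf in prob_classifiers:
        for label, p in clf(x).items():
            totals[label] = totals.get(label, 0.0) + p
    T = len(prob_classifiers)
    return {label: s / T for label, s in totals.items()}

# Two hypothetical sub-classifiers returning probability estimates.
clfs = [lambda x: {"pos": 0.9, "neg": 0.1},
        lambda x: {"pos": 0.4, "neg": 0.6}]
avg = p_bagging_predict(clfs, x=None)   # {"pos": 0.65, "neg": 0.35}
```

Note the design point this exposes: a hard vote here would be a 1–1 tie, while averaging the estimates yields a confident "pos" and, as the context argues, better-calibrated probabilities for loss-matrix applications.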

1282 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995
Citation Context ...rd deviations of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata ...

1059 | C4.5: Programs for Machine Learning
- Quinlan
- 1994
Citation Context ...ion tree inducer we used, called MC4 (MLC++ C4.5), is a Top-Down Decision Tree (TDDT) induction algorithm implemented in MLC++ (Kohavi, Sommerfield & Dougherty 1997). The algorithm is similar to C4.5 (Quinlan 1993) with the exception that unknowns are regarded as a separate value. The algorithm grows the decision tree following the standard methodology of choosing the best attribute according to the evaluation...

896 | Boosting the margin: A new explanation for the effectiveness of voting methods
- Schapire, Freund, et al.
- 1998
Citation Context ...limit of 25 such samples at a given trial; such a limit was never reached in our experiments if the first trial succeeded with one of the 25 samples. Some implementations of AdaBoost use boosting by resampling because the inducers used were unable to support weighted instances (e.g., Freund & Schapire, 1996). Our implementations of MC4, MC4(1), MC4(1)-disc, and Naive-Bayes support weighted instances, so we have implemented boosting by reweighting, which is a more direct implementation of the theory. Some evidence exists that reweighting works better in practice (Quinlan, 1996). Recent work by Schapire et al. (1997) suggests one explanation for the success of boosting and for the fact that test set error does not increase when many classifiers are combined as the theoretical model implies. Specifically, these successes are linked to the distribution of the “margins” of the training examples with respect to the generated voting classification rule, where the “margin” of an example is the difference between the number of correct votes it received and the maximum number of votes received by any incorrect label. Breiman (1997) claims that the framew...
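The margin definition quoted above is direct to compute. A small sketch under that definition (unweighted votes; the sample data is ours):

```python
from collections import Counter

def margin(votes, correct_label):
    """Margin of an example (Schapire et al., 1997): votes for the correct
    label minus the maximum votes received by any incorrect label."""
    counts = Counter(votes)
    correct = counts.get(correct_label, 0)
    wrong = max((c for lbl, c in counts.items() if lbl != correct_label),
                default=0)
    return correct - wrong

# Five hypothetical sub-classifier predictions for one training example.
votes = ["a", "a", "b", "a", "c"]
```

A positive margin means the ensemble classifies the example correctly with room to spare; the margin-distribution argument links larger margins on training examples to better generalization.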

870 | The Strength of Weak Learnability
- Schapire
- 1990
Citation Context ...s about 1 − 1/e ≈ 63.2%, which means that each bootstrap sample contains only about 63.2% unique instances from the training set. This perturbation causes different classifiers to be built if the inducer is unstable (e.g., neural networks, decision trees) (Breiman, 1994) and the performance can improve if the induced classifiers are good and not correlated; however, Bagging may slightly degrade the performance of stable algorithms (e.g., k-nearest neighbor) because effectively smaller training sets are used for training each classifier (Breiman, 1996b). 4.2. Boosting Boosting was introduced by Schapire (1990) as a method for boosting the performance of a weak learning algorithm. After improvements by Freund (1990), recently expanded in Freund (1996), AdaBoost (Adaptive Boosting) was introduced by Freund & Schapire (1995). In our work below, we concentrate on AdaBoost, sometimes called AdaBoost.M1 (e.g., Freund & Schapire, 1996). Like Bagging, the AdaBoost algorithm generates a set of classifiers and votes them. Beyond this, the two algorithms differ substantially. The AdaBoost algorithm, shown in figure 2, generates the classifiers sequentially, while Bagging can generate them in parallel. AdaBoos...
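The 63.2% figure in this context follows from the selection probability 1 − (1 − 1/m)^m stated in the Bagging description, which tends to 1 − 1/e as m grows. A quick check:

```python
import math

def unique_fraction(m):
    """P(a given instance appears at least once in a bootstrap sample of size m),
    i.e. the expected fraction of unique training instances in a replicate."""
    return 1.0 - (1.0 - 1.0 / m) ** m

# Approaches 1 - 1/e ≈ 0.632 as m grows, so each bootstrap replicate
# contains only about 63.2% unique instances from the training set.
limit = 1.0 - 1.0 / math.e
```

For example, `unique_fraction(100)` is already within a fraction of a percent of the limit, which is why the 63.2% figure applies across the dataset sizes used in the study.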

832 | Multi-interval discretization of continuous-valued attributes for classification learning
- Fayyad, Irani
- 1993
Citation Context ... part of MLC++ (Kohavi, Sommerfield, & Dougherty, 1997). Along with the original algorithm, two variants of MC4 were explored: MC4(1) and MC4(1)-disc. MC4(1) limits the tree to a single root split; such a shallow tree is sometimes called a decision stump (Iba & Langley, 1992). If the root attribute is nominal, a multiway split is created with one branch for unknowns. If the root attribute is continuous, a three-way split is created: less than a threshold, greater than a threshold, and unknown. MC4(1)-disc first discretizes all the attributes using entropy discretization (Kohavi & Sahami, 1996; Fayyad & Irani, 1993), thus effectively allowing a root split with multiple thresholds. MC4(1)-disc is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previous work (Kohavi & Sahami, 1996). Both MC4(1) and MC4(1)-disc build very weak classifiers, but MC4(1)-disc is the more powerful of the two. Specifically for multi-class problems with continuous attributes, MC4(1) is usually unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. Th...

731 | Stacked generalization.
- Wolpert
- 1992
Citation Context ...sifiers (as in boosting methods) and those that do not (as in Bagging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine, 1992a, 1992b; Kohavi & Kunz, 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand, 1995); voting trees using different splitting criteria and human intervention (Kwok & Carter, 1990); and error-correcting output codes (Dietterich & Bakiri, 1991; Kong & Dietterich, 1995). Wolpert (1992) discusses “stacking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization error does not degrade as more classifiers ...

729 | Neural Networks and the Bias/Variance Dilemma
- Geman, Bienenstock, et al.
- 1992

722 | Approximate statistical test for comparing supervised classification learning algorithms
- Dietterich
- 1998
Citation Context ...of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata for comparisons, s...

547 | Very simple classification rules perform well on most commonly used datasets
- Holte
- 1993
Citation Context ...(1)-disc. MC4(1) limits the tree to a single root split; such a shallow tree is sometimes called a decision stump (Iba & Langley, 1992). If the root attribute is nominal, a multiway split is created with one branch for unknowns. If the root attribute is continuous, a three-way split is created: less than a threshold, greater than a threshold, and unknown. MC4(1)-disc first discretizes all the attributes using entropy discretization (Kohavi & Sahami, 1996; Fayyad & Irani, 1993), thus effectively allowing a root split with multiple thresholds. MC4(1)-disc is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previous work (Kohavi & Sahami, 1996). Both MC4(1) and MC4(1)-disc build very weak classifiers, but MC4(1)-disc is the more powerful of the two. Specifically for multi-class problems with continuous attributes, MC4(1) is usually unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. The Naive-Bayes Inducer The Naive-Bayes Inducer (Good, 1965; Duda & Hart, 1973; Langley, Iba, & Thompson, 1992), sometimes called Sim...

516 | Boosting a Weak Learning Algorithm by Majority
- Freund
- 1995
Citation Context ...m the training set. This perturbation causes different classifiers to be built if the inducer is unstable (e.g., neural networks, decision trees) (Breiman, 1994) and the performance can improve if the induced classifiers are good and not correlated; however, Bagging may slightly degrade the performance of stable algorithms (e.g., k-nearest neighbor) because effectively smaller training sets are used for training each classifier (Breiman, 1996b). 4.2. Boosting Boosting was introduced by Schapire (1990) as a method for boosting the performance of a weak learning algorithm. After improvements by Freund (1990), recently expanded in Freund (1996), AdaBoost (Adaptive Boosting) was introduced by Freund & Schapire (1995). In our work below, we concentrate on AdaBoost, sometimes called AdaBoost.M1 (e.g., Freund & Schapire, 1996). Like Bagging, the AdaBoost algorithm generates a set of classifiers and votes them. Beyond this, the two algorithms differ substantially. The AdaBoost algorithm, shown in figure 2, generates the classifiers sequentially, while Bagging can generate them in parallel. AdaBoost also changes the weights of the training instances provided as input to each inducer based on classifiers...
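The adaptive reweighting that distinguishes AdaBoost from Bagging can be sketched as a single AdaBoost.M1-style update, transcribed from the standard description (Freund & Schapire, 1995): correctly classified instances are down-weighted by β = err / (1 − err) and the distribution is renormalized. This is our illustration and assumes 0 < err < 1/2.

```python
def adaboost_reweight(weights, correct):
    """One AdaBoost.M1-style weight update. `weights` is the current
    distribution over training instances; `correct` flags which instances the
    latest classifier got right. Assumes weighted error is in (0, 1/2)."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    beta = err / (1.0 - err)
    # Down-weight correct instances, leave misclassified ones, renormalize.
    new = [w * beta if ok else w for w, ok in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new]

w0 = [0.25] * 4                                        # uniform start
w1 = adaboost_reweight(w0, correct=[True, True, True, False])
```

After the update the misclassified instance carries half of the total weight, so the next inducer concentrates on it; this sequential dependence is why, as the context notes, AdaBoost cannot be parallelized the way Bagging can.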

439 | An analysis of Bayesian classifier
- Langley, Iba, et al.
- 1992
Citation Context ... is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previous work (Kohavi & Sahami, 1996). Both MC4(1) and MC4(1)-disc build very weak classifiers, but MC4(1)-disc is the more powerful of the two. Specifically for multi-class problems with continuous attributes, MC4(1) is usually unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. The Naive-Bayes Inducer The Naive-Bayes Inducer (Good, 1965; Duda & Hart, 1973; Langley, Iba, & Thompson, 1992), sometimes called Simple-Bayes (Domingos & Pazzani, 1997), builds a simple conditional independence classifier. Formally, the probability of a class label value y for an unlabeled instance x containing n attributes 〈A1, . . . , An〉 is given by P(y | x) = P(x | y) · P(y) / P(x) by Bayes rule, ∝ P(A1, . . . , An | y) · P(y) since P(x) is the same for all label values, = ∏_{j=1}^n P(Aj | y) · P(y) by the conditional independence assumption. The above probability is computed for each class and the prediction is made for the class with the largest posterior probability. The probabilities in t...

361 | Beyond independence: Conditions for the optimality of the simple bayesian classifier.
- Domingos, Pazzani
- 1996
Citation Context ...oost and Arc-x4 and describe our conclusions. 8.1. Numerical instabilities and a detailed boosting example Before we detail the results of the experiments, we would like to step through a detailed example of an AdaBoost run for two reasons: first, to get a better understanding of the process, and second, to highlight the important issue of numerical instabilities and underflows that is rarely discussed yet common in boosting algorithms. We believe that many authors have either faced these problems and corrected them or do not even know that they exist, as the following example shows. Example. Domingos and Pazzani (1997) reported very poor accuracy of 24.1% (error of 75.9%) on the Sonar dataset with the Naive-Bayes induction algorithm, which otherwise performed very well. Since this is a two-class problem, predicting majority would have done much better. Kohavi, Becker, and Sommerfield (1997) reported an accuracy of 74.5% (error of 25.5%) on the same problem with a very similar algorithm. Further investigation of the discrepancy by Domingos and Kohavi revealed that Domingos’ Naive-Bayes algorithm did not normalize the probabilities after every attrib...
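The underflow in this example comes from multiplying many small per-attribute probabilities in fixed-precision floats. The paper's remedy is to normalize after each attribute; an equivalent standard fix, sketched here with our own numbers (the Sonar class names are used purely as labels), is to accumulate log-probabilities and normalize with the log-sum-exp trick:

```python
import math

def normalize_log_scores(log_scores):
    """Turn per-class unnormalized log-probabilities into posteriors without
    ever forming the underflowing raw products."""
    m = max(log_scores.values())
    exps = {y: math.exp(s - m) for y, s in log_scores.items()}   # safe: s - m <= 0
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# 60 attributes each contributing log(1e-6): the raw products would be on the
# order of 1e-360 and underflow double precision, but the log-space posteriors
# are computed exactly.
scores = {"mine": 60 * math.log(1e-6),
          "rock": 60 * math.log(1e-6) + math.log(2.0)}
post = normalize_log_scores(scores)
```

Either remedy yields the same posteriors; without one of them, every class score collapses to 0.0 and the classifier's output becomes arbitrary, which is consistent with the anomalous Sonar result described above.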

345 | Arcing classifiers.
- Breiman
- 1996
Citation Context ...hods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classi...

326 | Bagging, boosting, and C4.5.
- Quinlan
- 1996
Citation Context ...ms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and thos...

212 | Bias plus variance decomposition for zero-one loss functions. In
- Kohavi, Wolpert
- 1996
Citation Context ...he Bayes-optimal classifier. Squared “bias” (bias²). This quantity measures how closely the learning algorithm’s average guess (over all possible training sets of the given training set size) matches the target. “Variance” (variance). This quantity measures how much the learning algorithm’s guess fluctuates for the different training sets of the given size. For classification, the quadratic loss function is inappropriate because class labels are not numeric. Several proposals for decomposing classification error into bias and variance have been suggested, including Kong and Dietterich (1995), Kohavi and Wolpert (1996), and Breiman (1996a). We believe that the decomposition proposed by Kong and Dietterich (1995) is inferior to the others because it allows for negative variance values. Of the remaining two, we chose to use the decomposition by Kohavi and Wolpert (1996) because its code was available from previous work and because it mirrors the quadratic decomposition best. Let YH be the random variable representing the label of an instance in the hypothesis space and YF be the random variable representing the label of an instance in the target function. It can be shown that the error can be decomposed into ...
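The Kohavi–Wolpert decomposition used in the paper can be sketched for a single test instance as below. This is our transcription of the standard form of that decomposition (error = σ² + bias² + variance under zero-one loss), with hypothetical distributions; consult Kohavi and Wolpert (1996) for the derivation.

```python
def kw_decomposition(p_target, p_hyp):
    """Kohavi-Wolpert bias/variance decomposition for one test instance.
    p_target[y] = P(Y_F = y): the target label distribution.
    p_hyp[y]    = P(Y_H = y): distribution of the learner's guesses over
                  training sets of the given size."""
    labels = set(p_target) | set(p_hyp)
    sigma2 = 0.5 * (1.0 - sum(p_target.get(y, 0.0) ** 2 for y in labels))
    bias2 = 0.5 * sum((p_target.get(y, 0.0) - p_hyp.get(y, 0.0)) ** 2
                      for y in labels)
    variance = 0.5 * (1.0 - sum(p_hyp.get(y, 0.0) ** 2 for y in labels))
    return sigma2, bias2, variance

# A deterministic target and a learner that guesses the right label 80% of
# the time across training sets (hypothetical numbers): expected error 0.2.
s2, b2, var = kw_decomposition({"a": 1.0}, {"a": 0.8, "b": 0.2})
```

For this example the three terms (0, 0.04, 0.16) sum to the expected zero-one error of 0.2, and the variance term is always non-negative, which is the property the paper cites in preferring this decomposition over Kong and Dietterich's.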

199 | The Estimation of Probabilities: An Essay on Modern Bayesian Methods.
- Good
- 1965
Citation Context ...ly unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. The Naive-Bayes Inducer The Naive-Bayes Inducer (Good 1965, Duda & Hart 1973, Langley, Iba & Thompson 1992), sometimes called Simple-Bayes (Domingos & Pazzani 1997), builds a simple conditional independence classifier. Formally, the probability of a class la...

194 | Estimating probabilities: A crucial task in machine learning
- Cestnik
- 1990

171 | Data Mining using MLC++, a Machine Learning Library in C++.
- Kohavi, Sommerfield, et al.
- 1996
Citation Context ...uild a structured model that has the same effect as Bagging. Ridgeway, Madigan & Richardson (1998) convert a boosted Naive-Bayes to a regular Naive-Bayes, which then allows for visualizations (Becker, Kohavi & Sommerfield 1997). Are there ways to make boosting comprehensible for general models? Craven & Shavlik (1993) built a single decision tree that attempts to make the same classifications as a neural network. Quinlan (...

170 | Error-correcting output coding corrects bias and variance, in: ICML,
- Kong, Dietterich
- 1995
Citation Context ...t is the expected error of the Bayes-optimal classifier. Squared “bias” (bias²). This quantity measures how closely the learning algorithm’s average guess (over all possible training sets of the given training set size) matches the target. “Variance” (variance). This quantity measures how much the learning algorithm’s guess fluctuates for the different training sets of the given size. For classification, the quadratic loss function is inappropriate because class labels are not numeric. Several proposals for decomposing classification error into bias and variance have been suggested, including Kong and Dietterich (1995), Kohavi and Wolpert (1996), and Breiman (1996a). We believe that the decomposition proposed by Kong and Dietterich (1995) is inferior to the others because it allows for negative variance values. Of the remaining two, we chose to use the decomposition by Kohavi and Wolpert (1996) because its code was available from previous work and because it mirrors the quadratic decomposition best. Let YH be the random variable representing the label of an instance in the hypothesis space and YF be the random variable representing the label of an instance in the target function. It can be shown that the er...

169 |
A conservation law for generalization performance.
- Schaffer
- 1994
Citation Context ...the segment dataset with MC4(1), the error increased as the training set size grew. While in theory such behavior must happen for every induction algorithm (Wolpert 1994, Schaffer 1994), this is the first time we have seen it in a real dataset. Further investigation revealed that in this problem all seven classes are equiprobable, i.e., the dataset was stratified. A strong majority...

144 | Learning Classification Trees,
- Buntine
- 1992
Citation Context ...ging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine 1992b, Buntine 1992a, Kohavi & Kunz 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand 1995); voting trees using different splitting criteria and hu...

126 | Reducing misclassification costs.
- Pazzani, Merz, et al.
- 1994

125 | Wrappers for Performance Enhancement and Oblivious Decision Graphs.
- Kohavi
- 1995
Citation Context ...rd deviations of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata ...

120 | Error-based and entropy-based discretization of continuous features.
- Kohavi, Sahami
- 1996
Citation Context ...e web in source form as part of MLC++ (Kohavi, Sommerfield, & Dougherty, 1997). Along with the original algorithm, two variants of MC4 were explored: MC4(1) and MC4(1)-disc. MC4(1) limits the tree to a single root split; such a shallow tree is sometimes called a decision stump (Iba & Langley, 1992). If the root attribute is nominal, a multiway split is created with one branch for unknowns. If the root attribute is continuous, a three-way split is created: less than a threshold, greater than a threshold, and unknown. MC4(1)-disc first discretizes all the attributes using entropy discretization (Kohavi & Sahami, 1996; Fayyad & Irani, 1993), thus effectively allowing a root split with multiple thresholds. MC4(1)-disc is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previous work (Kohavi & Sahami, 1996). Both MC4(1) and MC4(1)-disc build very weak classifiers, but MC4(1)-disc is the more powerful of the two. Specifically for multi-class problems with continuous attributes, MC4(1) is usually unable to build a good classifier because the tree consists of a single binary root split with leav...

104 | Error-correcting output codes: a general method for improving multiclass inductive learning programs,”
- Dietterich, Bakiri
- 1994
Citation Context ...raining set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in Bagging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine, 1992a, 1992b; Kohavi & Kunz, 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand, 1995); voting trees using different splitting criteria and human intervention (Kwok & Carter, 1990); and error-correcting output codes (Dietterich & Bakiri, 1991; Kong & Dietterich, 1995). Wolpert (1992) discusses “stacking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization e...

99 | Boosting decision trees.
- Drucker, Cortes
- 1996
Citation Context ...& Hand, 1995); voting trees using different splitting criteria and human intervention (Kwok & Carter, 1990); and error-correcting output codes (Dietterich & Bakiri, 1991; Kong & Dietterich, 1995). Wolpert (1992) discusses “stacking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization error does not degrade as more classifiers are combined. Elkan (1997) applied boosting to a simple Naive-Bayesian inducer that performs uniform discretization and achieved excellent results on two real-world datasets and one artificial dataset, but failed to achieve significant improvements on two other artificial datasets. We review several voting algorithms, including Bagging, AdaBoost, and Arc-x4, and describe a large empirical study whose purpose wa...

95 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods
- Schapire, Freund, et al.
- 1997

89 | A Theory of Learning C’lassification Rules.
- Buntine
- 1990
Citation Context ...ging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine 1992b, Buntine 1992a, Kohavi & Kunz 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand 1995); voting trees using different splitting criteria and hu...

81 | Naive Bayesian Learning
- Elkan
- 1997

77 |
Induction of one-level decision trees.
- Iba, Langley
- 1992
Citation Context ...is grown, a pruning phase replaces subtrees with leaves using the same pruning algorithm that C4.5 uses. The main reason for choosing this algorithm over C4.5 is our familiarity with it, our ability to modify it for experiments, and its tight integration with multiple model mechanisms within MLC++. MC4 is available off the web in source form as part of MLC++ (Kohavi, Sommerfield, & Dougherty, 1997). Along with the original algorithm, two variants of MC4 were explored: MC4(1) and MC4(1)-disc. MC4(1) limits the tree to a single root split; such a shallow tree is sometimes called a decision stump (Iba & Langley, 1992). If the root attribute is nominal, a multiway split is created with one branch for unknowns. If the root attribute is continuous, a three-way split is created: less than a threshold, greater than a threshold, and unknown. MC4(1)-disc first discretizes all the attributes using entropy discretization (Kohavi & Sahami, 1996; Fayyad & Irani, 1993), thus effectively allowing a root split with multiple thresholds. MC4(1)-disc is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previ...

76 | Stacked generalization, Neural Networks 5 - Wolpert - 1992 |

74 | The effects of training set size on decision tree complexity.
- Oates, Jensen
- 1997
(Show Context)
Citation Context ... are pure or until a split cannot be found where two children each contain at least two instances. The unpruned trees for MC4 had an average size of 667 and the unpruned trees for Bagged MC4 trees had an average size of 496—25% smaller. Moreover, the average size for trees generated by MC4 on the bootstrap samples for a given dataset was always smaller than the corresponding size of the trees generated by MC4 alone. We postulate that this effect is due to the smaller effective size of training sets under bagging, which contain only about 63.2% unique instances from the original training set. Oates and Jensen (1997) have shown that there is a close correlation between the training set size and the tree complexity for the reduced error pruning algorithm used in C4.5 and MC4. The trees generated from the bootstrap samples were initially grown to be smaller than the corresponding MC4 trees, yet they were larger after pruning was invoked. The experiment confirms our hypothesis that the structure of the bootstrap replicates inhibits reduced-error pruning. We believe the reason for this inhibition is that instances are duplicated in the bootstrap sample, reinforcing patterns that mig... |

64 |
Multiple decision trees. In
- Kwok, Carter
- 1990
(Show Context)
Citation Context ...s: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in Bagging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine, 1992a, 1992b; Kohavi & Kunz, 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand, 1995); voting trees using different splitting criteria and human intervention (Kwok & Carter, 1990); and error-correcting output codes (Dietterich & Bakiri, 1991; Kong & Dietterich, 1995). Wolpert (1992) discusses “stacking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both th... |

61 | Arcing the Edge,
- Breiman
- 1997
(Show Context)
Citation Context ...Recent work by Schapire et al. (1997) suggests one explanation for the success of boosting and for the fact that test set error does not increase when many classifiers are combined as the theoretical model implies. Specifically, these successes are linked to the distribution of the “margins” of the training examples with respect to the generated voting classification rule, where the “margin” of an example is the difference between the number of correct votes it received and the maximum number of votes received by any incorrect label. Breiman (1997) claims that the framework he proposed “gives results which are the opposite of what we would expect given Schapire et al. explanation of why arcing works.” 4.3. Arc-x4 The term Arcing (Adaptively resample and combine) was coined by Breiman (1996a) to describe the family of algorithms that adaptively resample and combine; AdaBoost, which he calls arc-fs, is the primary example of an arcing algorithm. Breiman contrasts arcing with the P&C family (Perturb and Combine), of which Bagging is the primary example. Breiman (1996a) wrote: After testing arc-fs I suspected that its success lay not in its... |
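The "margin" defined in the excerpt above (votes for the correct label minus the most votes received by any single incorrect label) is straightforward to compute. A minimal sketch, with hypothetical labels; the function name is ours:

```python
from collections import Counter

def margin(votes, true_label):
    """Votes for the correct label minus the maximum number of
    votes received by any single incorrect label."""
    counts = Counter(votes)
    correct = counts.get(true_label, 0)
    best_wrong = max((n for lbl, n in counts.items() if lbl != true_label), default=0)
    return correct - best_wrong

print(margin(["a", "a", "a", "b", "c"], "a"))  # 3 - 1 = 2
print(margin(["b", "b", "a"], "a"))            # 1 - 2 = -1
```

A positive margin means the ensemble classifies the example correctly; larger margins indicate a more confident vote.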

56 | Comparing Connectionist and Symbolic Learning Methods.
- Quinlan
- 1994
(Show Context)
Citation Context ...utions. 10. Voting techniques usually result in incomprehensible classifiers that cannot easily be shown to users. One solution proposed by Kohavi and Kunz (1997) attempts to build a structured model that has the same effect as Bagging. Ridgeway, Madigan, and Richardson (1998) convert a boosted Naive-Bayes to a regular Naive-Bayes, which then allows for visualizations (Becker, Kohavi, & Sommerfield, 1997). Are there ways to make boosting comprehensible for general models? Craven and Shavlik (1993) built a single decision tree that attempts to make the same classifications as a neural network. Quinlan (1994) notes that there are parallel problems that require testing all attributes. A single tree for such problems must be large. 11. In parallel environments, Bagging has a strong advantage because the sub-classifiers can be built in parallel. Boosting methods, on the other hand, require the estimated training set error on trial T to generate the distribution for trial T + 1. This makes coarse-grain parallelization very hard. Can some efficient parallel implementations be devised? 10. Conclusions We provided a brief review of two families of voting algorithms: perturb and... |

55 |
On bias, variance, 0/1-loss, and the curse of dimensionality.
- Friedman
- 1997
(Show Context)
Citation Context ... independence assumption is not true in many cases, causing a single factor to affect several attributes whose probabilities are multiplied assuming they are conditionally independent given the label (Friedman 1997). To summarize, we have seen error reductions for the family of decision-tree algorithms when probabilistic estimates were used. The error reductions were larger for the one level decision trees. Thi... |

51 |
The heuristics of instability in model selection.
- Breiman
- 1996
(Show Context)
Citation Context ... sample contains only about 63.2% unique instances from the training set. This perturbation causes different classifiers to be built if the inducer is unstable (e.g., neural networks, decision trees) (Breiman 1994) and the performance can improve if the induced classifiers are good and not correlated; however, Bagging may slightly degrade the performance of stable algorithms (e.g., k-nearest neighbor) because ... |

50 | Learning symbolic rules using artificial neural networks.
- Craven, Shavlik
- 1993
(Show Context)
Citation Context ...inlan (1996) used it as the voting strength, but this ignores the fact that the classifiers were built using skewed distributions. 10. Voting techniques usually result in incomprehensible classifiers that cannot easily be shown to users. One solution proposed by Kohavi and Kunz (1997) attempts to build a structured model that has the same effect as Bagging. Ridgeway, Madigan, and Richardson (1998) convert a boosted Naive-Bayes to a regular Naive-Bayes, which then allows for visualizations (Becker, Kohavi, & Sommerfield, 1997). Are there ways to make boosting comprehensible for general models? Craven and Shavlik (1993) built a single decision tree that attempts to make the same classifications as a neural network. Quinlan (1994) notes that there are parallel problems that require testing all attributes. A single tree for such problems must be large. 11. In parallel environments, Bagging has a strong advantage because the sub-classifiers can be built in parallel. Boosting methods, on the other hand, require the estimated training set error on trial T to generate the distribution for trial T + 1. This makes coarse-grain parallel implemen... |

48 | Visualizing the Simple Bayesian Classifier,
- Becker, Kohavi, et al.
- 1997
(Show Context)
Citation Context ...hbors, not to classify itself. 9. Could probabilistic predictions made by the sub-classifiers be used? Quinlan (1996) used it as the voting strength, but this ignores the fact that the classifiers were built using skewed distributions. 10. Voting techniques usually result in incomprehensible classifiers that cannot easily be shown to users. One solution proposed by Kohavi and Kunz (1997) attempts to build a structured model that has the same effect as Bagging. Ridgeway, Madigan, and Richardson (1998) convert a boosted Naive-Bayes to a regular Naive-Bayes, which then allows for visualizations (Becker, Kohavi, & Sommerfield, 1997). Are there ways to make boosting comprehensible for general models? Craven and Shavlik (1993) built a single decision tree that attempts to make the same classifications as a neural network. Quinlan (1994) notes that there are parallel problems that require testing all attributes. A single tree for such problems must be large. 11. In parallel environments, Bagging has a strong advantage because the sub-classifiers can be built in parallel. Boosting methods, on the other hand, require the estimated training set error on trial T to generate the distribution for trial... |

47 | Option decision trees with majority votes - Kohavi, Kunz - 1997 |

47 | The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework.
- Wolpert
- 1994
(Show Context)
Citation Context ...t (German) 1,000 300 7 13 2 Image segmentation (segment) 2,310 500 19 0 7 Hypothyroid 3,163 1,000 7 18 2 Sick-euthyroid 3,163 800 7 18 2 DNA 3,186 500 0 60 3 Chess 3,196 500 0 36 2 LED-24 3,200 500 0 24 10 Waveform-40 5,000 1,000 40 0 3 Satellite image (satimage) 6,435 1,500 36 0 7 Mushroom 8,124 1,000 0 22 2 Nursery 12,960 3,000 0 8 5 Letter 20,000 5,000 16 0 26 Adult 48,842 11,000 6 8 2 Shuttle 58,000 5,000 9 0 7 In one surprising case, the segment dataset with MC4(1), the error increased as the training set size grew. While in theory such behavior must happen for every induction algorithm (Wolpert, 1994; Schaffer, 1994), this is the first time we have seen it in a real dataset. Further investigation revealed that in this problem all seven classes are equiprobable, i.e., the dataset was stratified. A strong majority in the training set implies a non-majority in the test set, resulting in poor performance. A stratified holdout might be more appropriate in such cases, mimicking the original sampling methodology (Kohavi, 1995b). For our experiments, only relative performance mattered, so we did not specifically stratify the holdout samples. 3. The voting algorithms should combine relatively few ... |

43 | On Pruning and Averaging Decision Trees,
- Oliver, Hand
- 1995
(Show Context)
Citation Context ... 1996b; Freund & Schapire, 1996; Quinlan, 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in Bagging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine, 1992a, 1992b; Kohavi & Kunz, 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand, 1995); voting trees using different splitting criteria and human intervention (Kwok & Carter, 1990); and error-correcting output codes (Dietterich & Bakiri, 1991; Kong & Dietterich, 1995). Wolpert (1992) discusses “stacking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and C... |

42 |
Feature subset selection using the wrapper model: Overfitting and dynamic search space topology
- Kohavi, Sommerfield
- 1995
(Show Context)
Citation Context ...rd deviations of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata ... |

40 | Why does bagging work? A Bayesian account and its implications.
- Domingos
- 1997
(Show Context)
Citation Context ...ively the single classifier. Are there better methods for handling this extreme situation? 4. In the case of the shuttle dataset, a single decision tree was built that was significantly better than the original MC4 tree. The decision tree had zero error on the training set and thus became the only voter. Are there more situations when this is true, i.e., where one of the classifiers that was learned from a sample with a skewed distribution performs well by itself on the unskewed test set? 5. Boosting and Bagging both create very complex classifiers, yet they do not seem to “overfit” the data. Domingos (1997) claims that the multiple trees do not simply implement a Bayesian approach, but actually shift the learner’s bias (machine learning bias, not statistical bias) away from the commonly used simplicity bias. Can this bias be made more explicit? 6. We found that Bagging works well without pruning. Pruning in decision trees is a method for reducing the variance by introducing bias. Since Bagging reduces the variance, disabling pruning indirectly reduces the bias. How does the error rate change as pruning is increased? Specifically, are there cases where pruning should still happen within Bagging? ... |

28 | Interpretable boosted naive bayes classification. - Ridgeway, Madigan, et al. - 1998 |

25 |
Learning probabilistic relational concept descriptions.
- Ali
- 1996
(Show Context)
Citation Context ... decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine, 1992a, 1992b; Kohavi & Kunz, 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand, 1995); voting trees using different splitting criteria and human intervention (Kwok & Carter, 1990); and error-correcting output codes (Dietterich & Bakiri, 1991; Kong & Dietterich, 1995). Wolpert (1992) discusses “stacking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization error does not degrade as more classifiers are combined. Elkan (1997) applied boosting to a simple Naive-Bayesian inducer that performs uniform discretization and achieved excellent r... |

14 | A conservation law for generalization performance - Schaffer - 1994 |

7 | Integrating multiple learned models for improving and scaling machine learning algorithms. - Chan, Stolfo, et al. - 1996 |

7 |
Scaling to domains with many irrelevant features. In
- Langley, Sage
- 1997
(Show Context)
Citation Context ...ich is part of MLC++ (Kohavi, Sommerfield, & Dougherty, 1997), continuous attributes are discretized using entropy discretization (Kohavi & Sahami, 1996; Fayyad & Irani, 1993). Probabilities are estimated using frequency counts with an m-estimate Laplace correction (Cestnik, 1990) as described in (Kohavi, Becker, & Sommerfield, 1997). The Naive-Bayes classifier is relatively simple but very robust to violations of its independence assumptions. It performs well for many real-world datasets (Domingos & Pazzani, 1997; Kohavi & Sommerfield, 1995) and is excellent at handling irrelevant attributes (Langley & Sage, 1997). 4. The voting algorithms The different voting algorithms used are described below. Each algorithm takes an inducer and a training set as input and runs the inducer multiple times by changing the distribution of training set instances. The generated classifiers are then combined to create a final classifier that is used to classify the test set. 4.1. The Bagging algorithm The Bagging algorithm (Bootstrap aggregating) by Breiman (1996b) votes classifiers generated by different bootstrap samples (replicates). Figure 1 shows the algorithm. A Bootstrap sample (Efron & Tibshirani, 1993) is generat... |
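The Bagging procedure quoted above (T bootstrap replicates, one classifier per replicate, plain majority vote) can be sketched as follows. This is a minimal sketch, not MLC++ code; the toy inducer and data are hypothetical stand-ins for MC4 or Naive-Bayes:

```python
import random
from collections import Counter

def bagging(inducer, training_set, T, seed=0):
    """Train T classifiers on bootstrap replicates of the training set
    and return a majority-vote classifier (ties broken arbitrarily)."""
    rng = random.Random(seed)
    m = len(training_set)
    classifiers = []
    for _ in range(T):
        # Bootstrap replicate: m instances drawn uniformly with replacement.
        replicate = [training_set[rng.randrange(m)] for _ in range(m)]
        classifiers.append(inducer(replicate))
    def vote(x):
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]
    return vote

# Toy inducer: ignores x and predicts the majority label of its replicate.
def majority_inducer(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

clf = bagging(majority_inducer, [(i, "a") for i in range(10)], T=11)
print(clf(0))  # "a"
```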

7 |
Boosting and naive bayesian learning (Technical Report)
- Elkan
- 1997
(Show Context)
Citation Context ...acking” classifiers into a more complex classifier instead of using the simple uniform weighting scheme of Bagging. Ali (1996) provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo, and Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire, 1995) and Arc-x4 (Breiman, 1996a). Drucker and Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization error does not degrade as more classifiers are combined. Elkan (1997) applied boosting to a simple Naive-Bayesian inducer that performs uniform discretization and achieved excellent results on two real-world datasets and one artificial dataset, but failed to achieve significant improvements on two other artificial datasets. We review several voting algorithms, including Bagging, AdaBoost, and Arc-x4, and describe a large empirical study whose purpose was to improve our understanding of why and when these algorithms affect classification error. To ensure the study was reliable, we used over a dozen datasets, none of which had fewer than 1000 instances and four o... |

6 |
Arcing classifiers (Technical Report).
- Breiman
- 1996
(Show Context)
Citation Context ...an-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only “hard” areas but also outliers and noise. Keywords: classification, boosting, Bagging, decision trees, Naive-Bayes, mean-squared error 1. Introduction Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman, 1996b; Freund & Schapire, 1996; Quinlan, 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in Bagging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine, 1992a, 1992b; Kohavi & Kunz, 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Olive... |

5 | Data mining using MLC++: A machine learning library in C++, in Tools with Artificial Intelligence - Kohavi, Sommerfield, et al. - 1996 |

1 |
Heuristics of instability in model selection (Technical Report).
- Breiman
- 1994
(Show Context)
Citation Context ...C1, C2, . . . , CT whose output is the class predicted most often by its sub-classifiers, with ties broken arbitrarily. For a given bootstrap sample, an instance in the training set has probability 1−(1−1/m)^m of being selected at least once in the m times instances are randomly selected from the training set. For large m, this is about 1 − 1/e ≈ 63.2%, which means that each bootstrap sample contains only about 63.2% unique instances from the training set. This perturbation causes different classifiers to be built if the inducer is unstable (e.g., neural networks, decision trees) (Breiman, 1994) and the performance can improve if the induced classifiers are good and not correlated; however, Bagging may slightly degrade the performance of stable algorithms (e.g., k-nearest neighbor) because effectively smaller training sets are used for training each classifier (Breiman, 1996b). 4.2. Boosting Boosting was introduced by Schapire (1990) as a method for boosting the performance of a weak learning algorithm. After improvements by Freund (1990), recently expanded in Freund (1996), AdaBoost (Adaptive Boosting) was introduced by Freund & Schapire (1995). In our work below, we concentrate on ... |
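The 63.2% figure in the excerpt follows from 1 − (1 − 1/m)^m approaching 1 − 1/e as m grows. A quick check, both analytic and empirical (the function names are ours):

```python
import math
import random

def expected_unique_fraction(m):
    # P(an instance appears at least once in a bootstrap sample of size m
    # drawn with replacement): 1 - (1 - 1/m)^m.
    return 1.0 - (1.0 - 1.0 / m) ** m

def empirical_unique_fraction(m, seed=0):
    # Draw one bootstrap sample of indices and count distinct instances.
    rng = random.Random(seed)
    sample = [rng.randrange(m) for _ in range(m)]
    return len(set(sample)) / m

m = 10_000
print(round(expected_unique_fraction(m), 4))   # approaches 1 - 1/e ≈ 0.6321
print(round(empirical_unique_fraction(m), 4))  # close to the same value
```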

1 | Option decision trees with majority votes - Kohavi, Kunz - 1997 |