Why does bagging work? A Bayesian account and its implications (1997) [26 citations, 6 self]
Abstract:
The error rate of decision-tree and other classification learners can often be much reduced by bagging: learning multiple models from bootstrap samples of the database, and combining them by uniform voting. In this paper we empirically test two alternative explanations for this, both based on Bayesian learning theory: (1) bagging works because it is an approximation to the optimal procedure of Bayesian model averaging, with an appropriate implicit prior; (2) bagging works because it effectively shifts the prior to a more appropriate region of model space. All the experimental evidence contradicts the first hypothesis, and confirms the second.

Bagging

Bagging (Breiman 1996a) is a simple and effective way to reduce the error rate of many classification learning algorithms. For example, in the empirical study described below, it reduces the error of a decision-tree learner in 19 of 26 databases, by 4% on average.

In the bagging procedure, given a training set of size s, a "bootstrap" replicate of it is constructed by taking s samples with replacement from the training set. Thus a new training set of the same size is produced, where each of the original examples may appear once, more than once, or not at all. On average, about 63% of the original examples will appear in the bootstrap sample. The learning algorithm is then applied to this training set. This procedure is repeated m times, and the resulting m models are aggregated by uniform voting.

Bagging is one of several "multiple model" approaches that have recently received much attention (see, for example, (Chan, Stolfo, & Wolpert 1996)). Other procedures of this type include boosting (Freund & Schapire 1996) and stacking (Wolpert 1992). Two related explanations have been proposed for bagging's success, both in a classical statistical framework.
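
The bagging procedure described above (bootstrap replicates of size s, m learned models, uniform voting) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names and the `learn_stump` base learner are hypothetical stand-ins chosen for brevity, whereas the study itself uses a decision-tree learner. The helper `expected_fraction_present` evaluates 1 - (1 - 1/s)^s, the expected fraction of distinct original examples in a replicate, which approaches 1 - 1/e (about 0.632) as s grows and is the source of the 63% figure.

```python
import random
from collections import Counter


def bootstrap_replicate(training_set, rng):
    """Draw s examples with replacement from a training set of size s."""
    s = len(training_set)
    return [training_set[rng.randrange(s)] for _ in range(s)]


def expected_fraction_present(s):
    """Expected fraction of the original examples that appear at least once
    in a bootstrap replicate of size s: 1 - (1 - 1/s)^s."""
    return 1.0 - (1.0 - 1.0 / s) ** s


def learn_stump(examples):
    """Hypothetical stand-in base learner (an assumption, not from the paper):
    the single-threshold rule on a 1-D feature with the fewest training errors.
    Each example is an (x, label) pair with label 0 or 1."""
    best = None
    for threshold in sorted({x for x, _ in examples}):
        for above in (0, 1):  # class predicted when x >= threshold
            errors = sum(
                (above if x >= threshold else 1 - above) != y
                for x, y in examples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above


def bag(learn, training_set, m, seed=0):
    """Learn m models, each from its own bootstrap replicate."""
    rng = random.Random(seed)
    return [learn(bootstrap_replicate(training_set, rng)) for _ in range(m)]


def vote(models, x):
    """Aggregate the models' predictions for x by uniform (unweighted) voting."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data, made up for illustration: label is 1 when x >= 5.
    data = [(x, int(x >= 5)) for x in range(10)]
    models = bag(learn_stump, data, m=11)
    print(vote(models, 9), vote(models, 0))
    print(round(expected_fraction_present(1000), 3))
```

An odd m is used in the toy run so that two-class votes cannot tie; with an even m, `Counter.most_common` would break ties by first-seen order, which is an arbitrary choice of this sketch.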
Citations
| 3307 | C4.5: Programs for machine learning – Quinlan - 1993 |
| 2195 | UCI Repository of Machine Learning Databases – Blake, Merz - 1998 |
| 1504 | Bagging Predictors – Breiman - 1996 |
| 1031 | Experiments with a new boosting algorithm – Freund, Schapire - 1996 |
| 594 | Bayesian Theory – Bernardo, Smith - 1994 |
| 156 | On bias, variance, 0/1-loss, and the curse-of-dimensionality – Friedman - 1997 |
| 140 | Constructing optimal binary decision trees is NP-complete – Hyafil, Rivest - 1976 |
| 138 | Bias plus variance decomposition for zero-one loss functions – Kohavi - 1996 |
| 127 | Error-correcting output coding corrects bias and variance – Kong, Dietterich - 1995 |
| 89 | Bias, variance and arcing classifiers – Breiman - 1996 |
| 82 | A Theory of Learning Classification Rules – Buntine - 1990 |
| 46 | Knowledge acquisition from examples via multiple models – Domingos - 1997 |
| 38 | On finding the most probable model – Cheeseman - 1990 |
| 30 | Bayesian model averaging – Madigan, Raftery, et al. - 1996 |

