Very simple classification rules perform well on most commonly used datasets (1993)

by R C Holte
Results 1 - 10 of 547

Experiments with a New Boosting Algorithm

by Yoav Freund, Robert E. Schapire, 1996
"... In an earlier paper, we introduced a new “boosting” algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the relate ..."
Abstract - Cited by 2213 (20 self) - Add to MetaCart
In an earlier paper, we introduced a new “boosting” algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a “pseudo-loss”, which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems. We performed two sets of experiments. The first set compared boosting to Breiman’s “bagging” method when used to aggregate various classifiers (including decision trees and single attribute-value tests). We compared the performance of the two methods on a collection of machine-learning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearest-neighbor classifier on an OCR problem.
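The reweight-and-vote loop that this abstract refers to can be summarized in a short, generic sketch. The code below is not the authors' experimental code: it assumes binary labels in {-1, +1} and uses a single-attribute threshold stump as the weak learner (echoing the simple attribute-value tests mentioned above); the function names and the exhaustive stump search are illustrative choices.

import numpy as np

def train_adaboost(X, y, n_rounds=50):
    """Minimal AdaBoost sketch: y must be in {-1, +1}; the weak learner is a
    one-attribute threshold stump chosen by exhaustive search."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                  # start with uniform example weights
    ensemble = []                            # list of (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best, best_err = None, np.inf
        for j in range(d):                   # search all single-attribute stumps
            for thr in np.unique(X[:, j]):
                for pol in (+1, -1):
                    pred = np.where(X[:, j] <= thr, pol, -pol)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, thr, pol)
        err = max(best_err, 1e-12)
        if err >= 0.5:                       # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        j, thr, pol = best
        pred = np.where(X[:, j] <= thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)       # up-weight the misclassified examples
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def predict_adaboost(ensemble, X):
    """Weighted majority vote of the stumps."""
    score = np.zeros(X.shape[0])
    for j, thr, pol, alpha in ensemble:
        score += alpha * np.where(X[:, j] <= thr, pol, -pol)
    return np.sign(score)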

Citation Context

... varying levels of sophistication. These include: (1) an algorithm that searches for very simple prediction rules which test on a single attribute (similar to Holte's very simple classification rules [14]); (2) an algorithm that searches for a single good decision rule that tests on a conjunction of attribute tests (similar in flavor to the rule-formation part of Cohen's RIPPER algorithm [3] and Furnk...

Additive Logistic Regression: a Statistical View of Boosting

by Jerome Friedman, Trevor Hastie, Robert Tibshirani - Annals of Statistics, 1998
"... Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input dat ..."
Abstract - Cited by 1750 (25 self) - Add to MetaCart
Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most...
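A compact way to state the correspondence described here (a sketch of the paper's framing rather than a verbatim excerpt): boosting builds an additive model by stagewise minimization of an exponential criterion, whose population minimizer is half the log-odds, so the fitted function estimates class probabilities on the logistic scale.

% Boosting as stagewise additive modeling (two-class case, y in {-1,+1})
F(x) = \sum_{m=1}^{M} f_m(x), \qquad J(F) = \mathrm{E}\left[ e^{-yF(x)} \right].

% The population minimizer of J is half the log-odds, which links F to
% class-membership probabilities through the logistic (binomial) model:
F^{*}(x) = \tfrac{1}{2} \log \frac{P(y = +1 \mid x)}{P(y = -1 \mid x)},
\qquad
P(y = +1 \mid x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.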

Citation Context

...elicit performance differences among the methods being tested. Such complicated boundaries are not likely to often occur in practice. Many practical problems involve comparatively simple boundaries (Holte 1993); in such cases performance differences will still be situation dependent, but correspondingly less pronounced. 6 Some experiments with data In this section we show the results of running the four fi...

Irrelevant Features and the Subset Selection Problem

by George H. John, Ron Kohavi, Karl Pfleger - MACHINE LEARNING: PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE, 1994
"... We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features ..."
Abstract - Cited by 757 (26 self) - Add to MetaCart
We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets.
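A minimal sketch of the wrapper idea described above, with a greedy forward search and a scikit-learn decision tree standing in for ID3/C4.5; the estimator choice, search strategy, and function name are illustrative assumptions rather than the authors' exact setup.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_forward_selection(X, y, cv=5):
    """Greedy forward search; every candidate subset is scored by the
    cross-validated accuracy of the induction algorithm itself."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    improved = True
    while improved and remaining:
        improved = False
        scores = {}
        for f in remaining:
            cols = selected + [f]
            tree = DecisionTreeClassifier(random_state=0)   # stand-in for ID3/C4.5
            scores[f] = cross_val_score(tree, X[:, cols], y, cv=cv).mean()
        f_best = max(scores, key=scores.get)
        if scores[f_best] > best_score:       # keep adding features while accuracy improves
            best_score = scores[f_best]
            selected.append(f_best)
            remaining.remove(f_best)
            improved = True
    return selected, best_score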

Citation Context

... improvement of prediction accuracy over C4.5 is that C4.5 does quite well on most of the datasets tested here, leaving little room for improvement. This seems to be in line with Holte's claims (Holte 1993). Harder datasets might show more significant improvement. Indeed the wrapper model produced the most significant improvement for the two datasets (parity5+5 and CorrAL) on which C4.5 performed the w...

An empirical comparison of voting classification algorithms: Bagging, boosting, and variants.

by Eric Bauer, Ron Kohavi - Machine Learning, 1999
"... Abstract. Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several vari ..."
Abstract - Cited by 707 (2 self) - Add to MetaCart
Abstract. Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that Bagging reduced variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable methods but increased the variance for Naive-Bayes, which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if reweighting is used instead of resampling, indicating a fundamental difference. Voting variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when the data was backfit. We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to large and significant reductions in the mean-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only "hard" areas but also outliers and noise.
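For contrast with the boosting sketch earlier in this list, the Bagging procedure compared in this study is just bootstrap resampling plus an unweighted vote. The sketch below uses a scikit-learn decision tree as the base inducer and integer class labels; both are assumptions made for illustration (the paper uses its own MC4 decision tree and a Naive-Bayes inducer).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Bagging sketch: each base tree is trained on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Unweighted majority vote; assumes labels are small non-negative integers."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # (n_estimators, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])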

Citation Context

...(1)-disc. MC4(1) limits the tree to a single root split; such a shallow tree is sometimes called a decision stump (Iba & Langley, 1992). If the root attribute is nominal, a multiway split is created with one branch for unknowns. If the root attribute is continuous, a three-way split is created: less than a threshold, greater than a threshold, and unknown. MC4(1)-disc first discretizes all the attributes using entropy discretization (Kohavi & Sahami, 1996; Fayyad & Irani, 1993), thus effectively allowing a root split with multiple thresholds. MC4(1)-disc is very similar to the 1R classifier of Holte (1993), except that the discretization step is based on entropy, which compared favorably with his 1R discretization in our previous work (Kohavi & Sahami, 1996). Both MC4(1) and MC4(1)-disc build very weak classifiers, but MC4(1)-disc is the more powerful of the two. Specifically for multi-class problems with continuous attributes, MC4(1) is usually unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. The Naive-Bayes Inducer The Naive-Bayes Inducer (Good, 1965; Duda & Hart, 1973; Langley, Iba, & Thompson, 1992), sometimes called Sim...

Selection of relevant features and examples in machine learning

by Avrim L. Blum, Pat Langley - ARTIFICIAL INTELLIGENCE, 1997
"... In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been mad ..."
Abstract - Cited by 606 (2 self) - Add to MetaCart
In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.

Supervised and unsupervised discretization of continuous features

by James Dougherty, Ron Kohavi, Mehran Sahami - in A. Prieditis & S. Russell, eds, Machine Learning: Proceedings of the Twelfth International Conference, 1995
"... Many supervised machine learning algorithms require a discrete feature space. In this paper, we review previous work on continuous feature discretization, identify de n-ing characteristics of the methods, and conduct an empirical evaluation of several methods. We compare binning, an unsupervised dis ..."
Abstract - Cited by 540 (11 self) - Add to MetaCart
Many supervised machine learning algorithms require a discrete feature space. In this paper, we review previous work on continuous feature discretization, identify defining characteristics of the methods, and conduct an empirical evaluation of several methods. We compare binning, an unsupervised discretization method, to entropy-based and purity-based methods, which are supervised algorithms. We found that the performance of the Naive-Bayes algorithm significantly improved when features were discretized using an entropy-based method. In fact, over the 16 tested datasets, the discretized version of Naive-Bayes slightly outperformed C4.5 on average. We also show that in some cases, the performance of the C4.5 induction algorithm significantly improved if features were discretized in advance; in our experiments, the performance never significantly degraded, an interesting phenomenon considering the fact that C4.5 is capable of locally discretizing features.
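The contrast drawn here, unsupervised binning versus supervised entropy-based cuts, can be seen in a short sketch. The recursive split below follows the general Fayyad & Irani style but substitutes a simple depth limit for their MDL stopping criterion; that simplification and the function names are my assumptions.

import numpy as np

def equal_width_bins(x, k=10):
    """Unsupervised binning: k equal-width intervals over the observed range."""
    edges = np.linspace(x.min(), x.max(), k + 1)[1:-1]
    return np.digitize(x, edges)

def class_entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def entropy_cuts(x, y, depth=2):
    """Supervised discretization: recursively pick the cut point that minimizes
    the weighted class entropy of the two resulting intervals (a depth limit
    replaces the MDL stopping rule of Fayyad & Irani)."""
    if depth == 0 or len(np.unique(y)) < 2:
        return []
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut, best_h = None, class_entropy(y)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        h = (i * class_entropy(y[:i]) + (len(y) - i) * class_entropy(y[i:])) / len(y)
        if h < best_h:
            best_h, best_cut = h, (x[i - 1] + x[i]) / 2
    if best_cut is None:
        return []
    left = x <= best_cut
    return sorted(entropy_cuts(x[left], y[left], depth - 1)
                  + [best_cut]
                  + entropy_cuts(x[~left], y[~left], depth - 1))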

Citation Context

...imum number of intervals to produce in discretizing a feature. Static methods, such as binning, entropy-based partitioning (Catlett 1991b, Fayyad & Irani 1993, Pfahringer 1995), and the 1R algorithm (Holte 1993), perform one discretization pass of the data for each feature and determine the value of k for each feature independent of the other features. Dynamic methods conduct a search through the space of p...

Simple Heuristics That Make Us Smart

by Gerd Gigerenzer, Peter M. Todd, 2008
"... To survive in a world where knowledge is limited, time is pressing, and deep thought is often an unattainable luxury, decision-makers must use bounded rationality. In this precis of Simple heuristics that make us smart, we explore fast and frugal heuristics—simple rules for making decisions with re ..."
Abstract - Cited by 456 (15 self) - Add to MetaCart
To survive in a world where knowledge is limited, time is pressing, and deep thought is often an unattainable luxury, decision-makers must use bounded rationality. In this precis of Simple heuristics that make us smart, we explore fast and frugal heuristics—simple rules for making decisions with realistic mental resources. These heuristics enable smart choices to be made quickly and with a minimum of information by exploiting the way that information is structured in particular environments. Despite limiting information search and processing, simple heuristics perform comparably to more complex algorithms, particularly when generalizing to new data—simplicity leads to robustness.

Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier

by Pedro Domingos, Michael Pazzani
"... The simple Bayesian classifier (SBC) is commonly thought to assume that attributes are independent given the class, but this is apparently contradicted by the surprisingly good performance it exhibits in many domains that contain clear attribute dependences. No explanation for this has been proposed ..."
Abstract - Cited by 361 (8 self) - Add to MetaCart
The simple Bayesian classifier (SBC) is commonly thought to assume that attributes are independent given the class, but this is apparently contradicted by the surprisingly good performance it exhibits in many domains that contain clear attribute dependences. No explanation for this has been proposed so far. In this paper we show that the SBC does not in fact assume attribute independence, and can be optimal even when this assumption is violated by a wide margin. The key to this finding lies in the distinction between classification and probability estimation: correct classification can be achieved even when the probability estimates used contain large errors. We show that the previously-assumed region of optimality of the SBC is a second-order infinitesimal fraction of the actual one. This is followed by the derivation of several necessary and several sufficient conditions for the optimality of the SBC. For example, the SBC is optimal for learning arbitrary conjunctions and disjunctions, even though they violate the independence assumption. The paper also reports empirical evidence of the SBC's competitive performance in domains containing substantial degrees of attribute dependence.
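The distinction drawn here between classification and probability estimation is easy to see in a minimal categorical Naive Bayes: the predicted class is the argmax of a prior times a product of per-attribute conditionals, so the prediction can be correct even when that product is a poor probability estimate. The sketch below is a generic SBC, not the authors' implementation; Laplace smoothing and log-space arithmetic are my own conventions.

import numpy as np

def nb_train(X, y, alpha=1.0):
    """Categorical Naive Bayes with Laplace smoothing; X holds discrete attribute values."""
    classes = np.unique(y)
    priors = {c: np.log(np.mean(y == c)) for c in classes}
    cond = {}                 # (class, attribute index) -> {value: log P(value | class)}
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            counts = {v: alpha for v in np.unique(X[:, j])}   # smooth over all observed values
            for v in Xc[:, j]:
                counts[v] += 1
            total = sum(counts.values())
            cond[(c, j)] = {v: np.log(n / total) for v, n in counts.items()}
    return classes, priors, cond

def nb_predict(model, x):
    """Argmax of log prior plus summed log conditionals (the independence assumption)."""
    classes, priors, cond = model
    scores = {c: priors[c] + sum(cond[(c, j)].get(v, np.log(1e-9)) for j, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get)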

Citation Context

...titive with the other approaches. This is a remarkably good result for such a simple and apparently limited classifier. However, it can be due to the datasets themselves representing "easy" concepts (Holte, 1993), and does not by itself disprove the notion that the SBC relies on the assumption of attribute independence. To investigate this, we need to measure the degree of attribute dependence in the data i...

Correlation-based feature selection for machine learning

by Mark A. Hall, 1998
"... A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation based approach. The central hypothesis is that ..."
Abstract - Cited by 318 (3 self) - Add to MetaCart
A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy. CFS was evaluated by experiments on artificial and natural datasets. Three machine learning algorithms were used: C4.5 (a decision tree learner), IB1 (an instance based learner), and naive Bayes. Experiments on artificial datasets showed that CFS quickly identifies and screens irrelevant, redundant, and noisy features, and identifies relevant features as long as their relevance does not strongly depend on other features. On natural domains, CFS typically eliminated well over half the features. In most cases, classification accuracy using the reduced feature set equaled or bettered accuracy using the complete feature set.
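The "correlated with the class, yet uncorrelated with each other" hypothesis is operationalized in CFS by a merit score borrowed from test theory. A sketch of that score follows; Pearson correlation is used purely for illustration (the thesis pairs the formula with symmetric, information-based correlation measures for discrete features), and the function name is hypothetical.

import numpy as np

def cfs_merit(X, y, subset):
    """CFS heuristic: k * r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the mean
    feature-class correlation and r_ff the mean feature-feature correlation.
    Pearson correlation is used here only for illustration."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)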

Citation Context

...values that are strongly associated with different classes in the same interval. The next section discusses methods for supervised discretization which overcome this problem. Supervised Methods Holte [Hol93] presents a simple supervised discretization method that is incorporated in his one-level decision tree algorithm (1R). The method first sorts the values of a feature, and then attempts to find int...
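The 1R procedure referred to here is short enough to sketch in full for nominal (already discretized) attributes: each attribute induces a one-level rule mapping every value to its majority class, and 1R keeps the attribute whose rule makes the fewest training errors. The interval-finding step for continuous features described above, including its minimum-interval-size constraint, is omitted from this sketch.

import numpy as np
from collections import Counter

def one_r(X, y):
    """1R sketch: build one rule per attribute (value -> majority class),
    return the attribute and rule with the lowest training error."""
    best = None
    for j in range(X.shape[1]):
        rule = {v: Counter(y[X[:, j] == v]).most_common(1)[0][0]
                for v in np.unique(X[:, j])}
        errors = np.mean([rule[v] != c for v, c in zip(X[:, j], y)])
        if best is None or errors < best[2]:
            best = (j, rule, errors)
    return best   # (attribute index, value -> class rule, training error rate)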

Feature Selection for Classification

by M. Dash, H. Liu - Intelligent Data Analysis, 1997
"... Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a com ..."
Abstract - Cited by 299 (9 self) - Add to MetaCart
Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970's to the present. It identifies four steps of a typical feature selection method, and categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteris...