A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms

by T Lim, W Loh, Y Shih
Venue: Machine Learning

Results 1 - 10 of 86

Tree Induction for Probability-based Ranking

by Foster Provost, Pedro Domingos, 2002
Cited by 97 (4 self)
Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability-based rankings, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straightforward methods for improving probability-based rankings. We show that using a simple, common smoothing method--the Laplace correction--uniformly improves probability-based rankings. In addition, bagging substantially improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on class-membership probability are required.
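
The Laplace correction named in this abstract can be illustrated with a minimal sketch. It assumes the standard form of the correction, (k + 1) / (n + C), for a leaf holding n training cases of which k belong to the class of interest, with C classes. The function names and the two example leaves are hypothetical and are not taken from the paper.

```python
def raw_frequency(k: int, n: int) -> float:
    """Unsmoothed leaf estimate: fraction of the n cases at a leaf that are positive."""
    return k / n


def laplace_estimate(k: int, n: int, num_classes: int = 2) -> float:
    """Laplace-corrected leaf estimate, assuming the standard form (k + 1) / (n + C)."""
    return (k + 1) / (n + num_classes)


# Two hypothetical pure leaves: one with 3 training cases, one with 30.
small_leaf = (3, 3)
large_leaf = (30, 30)

# Raw frequencies tie at 1.0, so the two leaves cannot be ranked apart.
raw_scores = [raw_frequency(*leaf) for leaf in (small_leaf, large_leaf)]

# Smoothed estimates pull the small leaf further toward 1/C than the large one.
smoothed_scores = [laplace_estimate(*leaf) for leaf in (small_leaf, large_leaf)]
```

Under this sketch the raw frequencies of both leaves are identical, while the smoothed estimates place the better-supported leaf higher, which is one plausible reason smoothing can help a probability-based ranking.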

An Empirical Comparison of Supervised Learning Algorithms

by Rich Caruana, Alexandru Niculescu-Mizil - In Proc. 23rd Intl. Conf. Machine Learning (ICML’06), 2006
Cited by 55 (3 self)
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.

Tree induction vs. logistic regression: A learning-curve analysis

by Claudia Perlich, Foster Provost, Jeffrey S. Simonoff - CEDER WORKING PAPER #IS-01-02, STERN SCHOOL OF BUSINESS, 2001
Cited by 50 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.

Revisiting the foundations of Artificial Immune Systems: a problem-oriented perspective

by Alex A. Freitas, Jon Timmis - Hart (Eds.) Artificial Immune Systems (Proc. ICARIS-2003), LNCS 2787, 2003
Cited by 39 (23 self)
This paper advocates a problem-oriented approach for the design of Artificial Immune Systems (AIS) for data mining. By problem-oriented approach we mean that, in real-world data mining applications, the design of an AIS should take into account the characteristics of the data to be mined together with the application domain: the components of the AIS – such as its representation, affinity function and immune process – should be tailored for the data and the application. This is in contrast with the majority of the literature, where a very generic AIS algorithm for data mining is developed and there is little or no concern in tailoring the components of the AIS for the data to be mined or the application domain. To support this problem-oriented approach, we provide an extensive critical review of the current literature on AIS for data mining, focusing on the data mining tasks of classification and anomaly detection. We discuss several important lessons to be taken from the natural immune system to design new AIS that are considerably more adaptive than current AIS. Finally, we conclude the paper with a summary of seven limitations of current AIS for data mining and 10 suggested research directions.

A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification

by Nigel Williams, Sebastian Z, Grenville Armitage - Computer Communication Review, 2006
Cited by 39 (1 self)
The identification of network applications through observation of associated packet traffic flows is vital to the areas of network management and surveillance. Currently popular methods such as port number and payload-based identification exhibit a number of shortfalls. An alternative is to use machine learning (ML) techniques and identify network applications based on per-flow statistics, derived from payload-independent features such as packet length and inter-arrival time distributions. The performance impact of feature set reduction, using Consistency-based and Correlation-based feature selection, is demonstrated on Naïve Bayes, C4.5, Bayesian Network and Naïve Bayes Tree algorithms. We then show that it is useful to differentiate algorithms based on computational performance rather than classification accuracy alone, as although classification accuracy between the algorithms is similar, computational performance can differ significantly.

Statistical Relational Learning for Document Mining

by Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, David M. Pennock, 2003
Cited by 35 (5 self)
A major obstacle to fully integrated deployment of statistical learners is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. In this paper, we introduce an integrated approach to building regression models from data stored in relational databases. Potential features are generated by structured search of the space of queries to the database, and then tested for inclusion in a logistic regression. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. This data includes word counts in the document, frequently cited authors or papers, co-citations, publication venues of cited papers, word co-occurrences, and word counts in cited or citing documents. Our approach results in classification accuracies superior to those achieved when using classical "flat" features. Our classification task also serves as a "where to publish?" conference/journal recommendation task.

Classification trees with unbiased multiway splits

by Hyunjoong Kim, Wei-Yin Loh - Journal of the American Statistical Association, 2001
Cited by 35 (6 self)
Two univariate split methods and one linear combination split method are proposed for the construction of classification trees with multiway splits. Examples are given where the trees are more compact and hence easier to interpret than binary trees. A major strength of the univariate split methods is that they have negligible bias in variable selection, both when the variables differ in the number of splits they offer and when they differ in number of missing values. This is an advantage because inferences from the tree structures can be adversely affected by selection bias. The new methods are shown to be highly competitive in terms of computational speed and classification accuracy of future observations. Key words and phrases: Decision tree, linear discriminant analysis, missing value, selection bias.

Well-Trained PETs: Improving Probability Estimation Trees

by Foster Provost, Pedro Domingos, 2000
Cited by 30 (5 self)
Decision trees are one of the most effective and widely used classification methods. However, many applications require class probability estimates, and probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability estimates, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger tree...

Discovering Interesting Patterns for Investment Decision Making with GLOWER - A Genetic Learner Overlaid With Entropy Reduction

by Vasant Dhar, Dashin Chou, Foster Provost, 2000
Cited by 27 (0 self)
Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open ended by forcing us to consider a large number of independent variables and thereby increasing the dimensionality of the search space. Second, the weak relationships among variables tend to be nonlinear, and may hold only in limited areas of the search space. Third, in financial practice, where analysts conduct extensive manual analysis of historically well performing indicators, a key is to find the hidden interactions among variables that perform well in combination. Unfortunately, these are exactly the patterns that the greedy search biases incorporated by many standard rule learning algorithms will miss. In this paper, we describe and evaluate several variations of a new genetic learning algorithm (GLOWER) on a variety of data sets. The design of GLOWER has been motivated by financial prediction problems, but incorpo...

A novel approach to design classifiers using genetic programming

by Durga Prasad Muni, Nikhil R. Pal, Jyotirmoy Das - IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, 2004
Cited by 16 (0 self)
We propose a new approach for designing classifiers for a c-class (c ≥ 2) problem using genetic programming (GP). The proposed approach takes an integrated view of all classes when the GP evolves. A multitree representation of chromosomes is used. In this context, we propose a modified crossover operation and a new mutation operation that reduces the destructive nature of conventional genetic operations. We use a new concept of unfitness of a tree to select trees for genetic operations. This gives more opportunity to unfit trees to become fit. A new concept of OR-ing chromosomes in the terminal population is introduced, which enables us to get a classifier with better performance. Finally, a weight-based scheme and some heuristic rules characterizing typical ambiguous situations are used for conflict resolution. The classifier is capable of saying “don’t know” when faced with unfamiliar examples. The effectiveness of our scheme is demonstrated on several real data sets.