Large margin DAGs for multiclass classification
Advances in Neural Information Processing Systems 12, 2000
Cited by 374 (1 self)
We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifier. For an N-class problem, the DDAG contains N(N-1)/2 classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on N and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision-node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.
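As a rough illustration of how a decision DAG built from pairwise classifiers can be evaluated, the sketch below keeps a list of candidate classes and lets each node's two-class decision eliminate one of them, so an N-class input needs N-1 node evaluations out of the N(N-1)/2 available classifiers. The names `ddag_predict` and `make_toy_pairwise` are hypothetical, and the nearest-center rules are a toy stand-in for the maximal margin hyperplanes of DAGSVM, not the authors' implementation.

```python
from itertools import combinations


def ddag_predict(x, classes, pairwise):
    """Evaluate a decision DAG by eliminating one candidate class per node.

    pairwise[(a, b)] is a hypothetical callable that returns a or b for input x.
    """
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        key = (a, b) if (a, b) in pairwise else (b, a)
        winner = pairwise[key](x)
        # Drop the class that lost this two-class decision.
        if winner == a:
            remaining.pop()
        else:
            remaining.pop(0)
        evaluations += 1
    return remaining[0], evaluations


def make_toy_pairwise(centers):
    """Build a nearest-center rule for every pair of classes (toy stand-in for SVMs)."""
    def rule(a, b):
        return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b
    return {(a, b): rule(a, b) for a, b in combinations(sorted(centers), 2)}


centers = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}   # four classes on a line
pairwise = make_toy_pairwise(centers)         # 4 * 3 / 2 = 6 classifiers
label, used = ddag_predict(2.2, sorted(centers), pairwise)
print(len(pairwise), label, used)
```

With four classes the dictionary holds six pairwise rules, yet a single prediction consults only three of them.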
Inducing Oblique Decision Trees with Evolutionary Algorithms
IEEE Transactions on Evolutionary Computation, 2003
Cited by 45 (0 self)
This paper illustrates the application of evolutionary algorithms (EAs) to the problem of oblique decision-tree (DT) induction. The objectives are to demonstrate that EAs can find classifiers whose accuracy is competitive with other oblique tree construction methods, and that, at least in some cases, this can be accomplished in a shorter time. We performed experiments with a (1+1) evolution strategy and a simple genetic algorithm on public domain and artificial data sets, and compared the results with three other oblique and one axis-parallel DT algorithms. The empirical results suggest that the EAs quickly find competitive classifiers, and that EAs scale up better than traditional methods to the dimensionality of the domain and the number of instances used in training. In addition, we show that the classification accuracy improves when the trees obtained with the EAs are combined in ensembles, and that sometimes it is possible to build the ensemble of evolutionary trees in less time than a single traditional oblique tree. Index Terms: Classification, decision trees, ensembles, machine learning, sampling.
Minimax-optimal classification with dyadic decision trees
IEEE Transactions on Information Theory, 2006
Cited by 36 (5 self)
Decision trees are among the most popular types of classifiers, with interpretability and ease of implementation being among their chief attributes. Despite the widespread use of decision trees, theoretical analysis of their performance has only begun to emerge in recent years. In this paper it is shown that a new family of decision trees, dyadic decision trees (DDTs), attain nearly optimal (in a minimax sense) rates of convergence for a broad range of classification problems. Furthermore, DDTs are surprisingly adaptive in three important respects: they automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower-dimensional manifolds; and (3) reject irrelevant features. DDTs are constructed by penalized empirical risk minimization using a new data-dependent penalty and may be computed exactly with computational complexity that is nearly linear in the training sample size. DDTs are the first classifier known to achieve nearly optimal rates for the diverse class of distributions studied here while also being practical and implementable. This is also the first study (of which we are aware) to consider rates for adaptation to intrinsic data dimension and relevant features.
A hierarchical method for multiclass support vector machines
In ICML’2004, 2004
Cited by 28 (0 self)
We introduce a framework, which we call Divide-by-2 (DB2), for extending support vector machines (SVM) to multiclass problems. DB2 offers an alternative to the standard one-against-one and one-against-rest algorithms. For an N-class problem, DB2 produces an N − 1 node binary decision tree where nodes represent decision boundaries formed by N − 1 SVM binary classifiers. This tree structure allows us to present a generalization and a time complexity analysis of DB2. Our analysis and related experiments show that DB2 is faster than one-against-one and one-against-rest algorithms in terms of testing time, significantly faster than one-against-rest in terms of training time, and that the cross-validation accuracy of DB2 is comparable to these two methods.
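The node count quoted in this abstract follows from the tree shape alone: any binary tree that separates N classes into single-class leaves has N − 1 internal nodes, one binary classifier each. The sketch below shows this with a placeholder split that simply halves the class list; that split rule and the function names are assumptions for illustration, not the division criterion used by DB2.

```python
def build_db2_tree(classes):
    """Recursively split a class list in two until each leaf holds one class."""
    if len(classes) == 1:
        return classes[0]            # leaf: a single class label
    mid = len(classes) // 2          # placeholder split, not the paper's criterion
    return (build_db2_tree(classes[:mid]), build_db2_tree(classes[mid:]))


def count_internal_nodes(tree):
    """Each internal node (a 2-tuple) would hold one binary SVM."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return 1 + count_internal_nodes(left) + count_internal_nodes(right)


def depth(tree):
    """Longest root-to-leaf path, i.e. the most classifiers one test point visits."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))


tree = build_db2_tree(list(range(10)))
print(count_internal_nodes(tree), depth(tree))
```

For ten classes this gives nine internal nodes, and with a balanced split a test point passes through at most four of them.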
Fast perceptron decision tree learning from evolving data streams
In PAKDD
Cited by 17 (9 self)
Mining of data streams must balance three evaluation dimensions: accuracy, time and memory. Excellent accuracy on data streams has been obtained with Naive Bayes Hoeffding Trees (Hoeffding Trees with naive Bayes models at the leaf nodes), albeit with increased runtime compared to standard Hoeffding Trees. In this paper, we show that runtime can be reduced by replacing naive Bayes with perceptron classifiers, while maintaining highly competitive accuracy. We also show that accuracy can be increased even further by combining majority vote, naive Bayes, and perceptrons. We evaluate four perceptron-based learning strategies and compare them against appropriate baselines: simple perceptrons, Perceptron Hoeffding Trees, hybrid Naive Bayes Perceptron Trees, and bagged versions thereof. We implement a perceptron that uses the sigmoid activation function instead of the threshold activation function and optimizes the squared error, with one perceptron per class value. We test our methods by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
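A perceptron with a sigmoid activation trained on squared error, one per class value, can be sketched as an online delta-rule learner. The class name `SigmoidPerceptrons`, the learning rate, the initialization, and the synthetic stream below are all assumptions made for illustration; they are not taken from the paper or from any stream-mining library.

```python
import math
import random


class SigmoidPerceptrons:
    """One sigmoid perceptron per class, updated one example at a time."""

    def __init__(self, n_features, n_classes, learning_rate=0.5, seed=0):
        rng = random.Random(seed)
        # One weight vector per class; the last weight is a bias term.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features + 1)]
                        for _ in range(n_classes)]
        self.learning_rate = learning_rate

    def _outputs(self, x):
        xb = list(x) + [1.0]
        return [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(ws, xb))))
                for ws in self.weights]

    def predict(self, x):
        outs = self._outputs(x)
        return max(range(len(outs)), key=outs.__getitem__)

    def learn_one(self, x, label):
        xb = list(x) + [1.0]
        for k, (ws, o) in enumerate(zip(self.weights, self._outputs(x))):
            target = 1.0 if k == label else 0.0
            # Gradient step on 0.5 * (target - o)^2 through the sigmoid.
            delta = self.learning_rate * (target - o) * o * (1.0 - o)
            for i, v in enumerate(xb):
                ws[i] += delta * v


# Synthetic two-class stream (an assumption): label 1 when x0 + x1 > 1.
rng = random.Random(42)
model = SigmoidPerceptrons(n_features=2, n_classes=2)
for _ in range(2000):
    x = (rng.random(), rng.random())
    model.learn_one(x, int(x[0] + x[1] > 1.0))
print(model.predict((0.1, 0.1)), model.predict((0.9, 0.9)))
```

Each example is seen once and then discarded, which matches the single-pass constraint of stream mining; the predicted class is the one whose perceptron gives the largest output.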
On the Generalisation of Soft Margin Algorithms
IEEE Transactions on Information Theory, 2000
Cited by 13 (6 self)
Generalisation bounds depending on the margin of a classifier are a relatively recent development. They provide an explanation of the performance of state-of-the-art learning systems such as Support Vector Machines (SVM) [12] and Adaboost [24]. The difficulty with these bounds has been either their lack of robustness or their looseness. The question of whether the generalisation of a classifier can be more tightly bounded in terms of a robust measure of the distribution of margin values has remained open for some time. The paper answers this open question in the affirmative and furthermore the analysis leads to bounds that motivate the previously heuristic soft margin SVM algorithms as well as justifying the use of the quadratic loss in neural network training algorithms. The results are extended to give bounds for the probability of failing to achieve a target accuracy in regression prediction, with a statistical analysis of Ridge Regression and Gaussian Processes as a special case. The analysis presented in the paper has also led to new boosting algorithms described elsewhere [7].
Tree Decomposition for Large-Scale SVM Problems: Experimental and Theoretical Results
2009
Cited by 13 (2 self)
To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a data space and trains SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training applied to the remaining data points. Second, it is efficient for seeking the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, we can provide a generalization error bound for the classifier derived by the tree decomposition method. For experimental data sets whose size can be handled by current nonlinear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy.
Fast Support Vector Machine Classification of very large Datasets
University of Freiburg, Department of Computer
Cited by 11 (0 self)
In many classification applications, Support Vector Machines (SVMs) have proven to be high-performing and easy-to-handle classifiers with very good generalization abilities. However, one drawback of the SVM is its rather high classification complexity, which scales linearly with the number of Support Vectors (SVs). This is due to the fact that for the classification of one sample, the kernel function has to be evaluated for all SVs. To speed up classification, different approaches have been published, most of which try to reduce the number of SVs. In our work, which is especially suitable for very large datasets, we follow a different approach: as we showed in [12], it is effectively possible to approximate large SVM problems by decomposing the original problem into linear subproblems, where each subproblem can be evaluated in Ω(1). This approach is especially successful when the assumption holds that a large classification problem can be split into mainly easy and only a few hard subproblems. On standard benchmark datasets, this approach achieved great speedups while suffering only slightly in terms of classification accuracy and generalization ability. In this contribution, we extend the methods introduced in [12] using not only linear, but also nonlinear subproblems for the decomposition of the original problem, which further increases the classification performance with only a little loss in terms of speed. An implementation of our method is available in [13]. Due to page limitations, we had to move some of the theoretical details (e.g. proofs) and extensive experimental results to a technical report [14].
Near-minimax optimal classification with dyadic classification trees
2003
Cited by 9 (3 self)
This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X,Y) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n^(-1/2), the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed.
L2 kernel classification
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009
Cited by 8 (2 self)
Nonparametric kernel methods are widely used and proven to be successful in many statistical learning problems. Well-known examples include the kernel density estimate (KDE) for density estimation and the support vector machine (SVM) for classification. We propose a kernel classifier that optimizes the L2 or integrated squared error (ISE) of a “difference of densities”. We focus on the Gaussian kernel, although the method applies to other kernels suitable for density estimation. Like a support vector machine (SVM), the classifier is sparse and results from solving a quadratic program. We provide statistical performance guarantees for the proposed L2 kernel classifier in the form of a finite sample oracle inequality, and strong consistency in the sense of both ISE and probability of error. A special case of our analysis applies to a previously introduced ISE-based method for kernel density estimation. For dimensionality greater than 15, the basic L2 kernel classifier performs poorly in practice. Thus, we extend the method through the introduction of a natural regularization parameter, which allows it to remain competitive with the SVM in high dimensions. Simulation results for both synthetic and real-world data are presented.