Results 21-30 of 663
Support vector machines classification with very large scale taxonomy
 SIGKDD Explorations
Cited by 82 (5 self)
Very large-scale classification taxonomies typically have hundreds of thousands of categories, deep hierarchies, and skewed category distribution over documents. However, it is still an open question whether the state-of-the-art technologies in automated text categorization can scale to (and perform well on) such large taxonomies. In this paper, we report the first evaluation of Support Vector Machines (SVMs) in web-page classification over the full taxonomy of the Yahoo! categories. Our accomplishments include: 1) a data analysis on the Yahoo! taxonomy; 2) the development of a scalable system for large-scale text categorization; 3) theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification; 4) an investigation of threshold tuning algorithms with respect to time complexity and their effect on the classification accuracy of SVMs. We found that, in terms of scalability, the hierarchical use of SVMs is efficient enough for very large-scale classification; however, in terms of effectiveness, the performance of SVMs over the Yahoo! Directory is still far from satisfactory, which indicates that more substantial investigation is needed.
Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions
Cited by 80 (7 self)
We describe efficient algorithms for projecting a vector onto the ℓ1-ball. We present two methods for projection. The first performs exact projection in O(n) expected time, where n is the dimension of the space. The second works on vectors k of whose elements are perturbed outside the ℓ1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous batch and online learning tasks. We show that variants of stochastic gradient projection methods augmented with our efficient projection procedures outperform interior point methods, which are considered state-of-the-art optimization techniques. We also show that in online settings gradient updates with ℓ1 projections outperform the exponentiated gradient algorithm while obtaining models with high degrees of sparsity.
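For orientation, here is a minimal sketch of Euclidean projection onto the ℓ1-ball. It uses the simpler sort-based O(n log n) approach, not the expected O(n) or O(k log(n)) procedures the abstract refers to. The function name `project_onto_l1_ball` and the default radius are illustrative assumptions, not taken from the paper.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Return the point of {w : ||w||_1 <= radius} closest to v (sort-based sketch)."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball, nothing to do

    # Scan magnitudes in decreasing order to find the soft-threshold theta.
    theta = 0.0
    cumulative = 0.0
    for k, value in enumerate(sorted(magnitudes, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - radius) / k
        if value > candidate:
            theta = candidate  # the last k that passes fixes theta

    # Shrink every coordinate towards zero by theta, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]
```

For example, projecting `[3.0, -1.0, 0.5]` onto the ball of radius 2 shrinks each magnitude by 1 and yields `[2.0, 0.0, 0.0]` up to the sign of zero, which illustrates how the projection produces sparse vectors.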
Kernel-Based Learning of Hierarchical Multilabel Classification Models
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
Cited by 78 (8 self)
We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show
Large Scale Semi-supervised Linear SVMs
, 2006
Cited by 75 (9 self)
Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large number of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minimum in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.
Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5
 In ICML’04
, 2004
Cited by 75 (4 self)
Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed.
A stochastic gradient method with an exponential convergence rate for finite training sets.
 In NIPS
, 2012
Cited by 73 (10 self)
We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. Numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms.
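The "memory of previous gradient values" can be illustrated with a small sketch in the style of a stochastic average gradient update. Each step recomputes the gradient of one randomly chosen term, stores it, and moves along the average of all stored gradients. The scalar least-squares objective, the function name `sag_least_squares`, and the conservative step size 1/(16 L) with L = max a_i^2 are assumptions chosen for illustration, not details taken from the abstract.

```python
import random


def sag_least_squares(a, b, step, iterations, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w."""
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    stored = [0.0] * n  # most recent gradient seen for each term
    total = 0.0         # running sum of the stored gradients

    for _ in range(iterations):
        i = rng.randrange(n)
        grad_i = a[i] * (a[i] * w - b[i])  # fresh gradient of term i only
        total += grad_i - stored[i]        # swap the old entry for the new one
        stored[i] = grad_i
        w -= step * total / n              # step along the averaged memory
    return w


a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 2.0]
lipschitz = max(x * x for x in a)
w_hat = sag_least_squares(a, b, step=1.0 / (16.0 * lipschitz), iterations=5000)
```

The exact minimiser of this toy problem is sum(a_i * b_i) / sum(a_i^2) = 11/14, which gives a reference value to compare `w_hat` against. Each iteration touches a single term, as in plain stochastic gradient descent, but the stored gradients let the step size stay constant.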
Adaptive Regularization of Weight Vectors
 Advances in Neural Information Processing Systems 22
, 2009
Cited by 71 (17 self)
We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
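As a rough illustration of what a per-instance update of this kind can look like, the sketch below keeps a mean weight vector `mu` and a covariance matrix `sigma`, and updates both when an example has hinge loss. The abstract does not spell out the update rule, so the formulas here follow the commonly cited form of AROW and should be read as an assumption. The function name `arow_update` and the regularization parameter `r` are likewise illustrative.

```python
def arow_update(mu, sigma, x, y, r=1.0):
    """One AROW-style step on (x, y) with y in {+1, -1}; returns new (mu, sigma)."""
    n = len(x)
    # sigma_x = sigma @ x, reused for both the mean and the covariance update.
    sigma_x = [sum(sigma[i][j] * x[j] for j in range(n)) for i in range(n)]
    margin = sum(m * xi for m, xi in zip(mu, x))
    confidence = sum(xi * sx for xi, sx in zip(x, sigma_x))  # x^T sigma x

    loss = max(0.0, 1.0 - y * margin)
    if loss == 0.0:
        # Margin already at least 1: leave the model unchanged.
        return list(mu), [row[:] for row in sigma]

    beta = 1.0 / (confidence + r)
    alpha = loss * beta
    new_mu = [m + alpha * y * sx for m, sx in zip(mu, sigma_x)]
    # sigma - beta * (sigma x)(sigma x)^T, valid because sigma is symmetric.
    new_sigma = [
        [sigma[i][j] - beta * sigma_x[i] * sigma_x[j] for j in range(n)]
        for i in range(n)
    ]
    return new_mu, new_sigma


mu1, sigma1 = arow_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], 1)
```

Starting from a zero mean and identity covariance, a single positive example along the first axis moves the mean to `[0.5, 0.0]` and halves the variance of that coordinate, so later updates in that direction become more cautious.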
Multilabel classification via calibrated label ranking
 MACH LEARN
, 2008
Cited by 69 (10 self)
Label ranking studies the problem of learning a mapping from instances to rankings over a predefined set of labels. Hitherto existing approaches to label ranking implicitly operate on an underlying (utility) scale which is not calibrated in the sense that it lacks a natural zero point. We propose a suitable extension of label ranking that incorporates the calibrated scenario and substantially extends the expressive power of these approaches. In particular, our extension suggests a conceptually novel technique for extending the common learning by pairwise comparison approach to the multilabel scenario, a setting previously not being amenable to the pairwise decomposition technique. The key idea of the approach is to introduce an artificial calibration label that, in each example, separates the relevant from the irrelevant labels. We show that this technique can be viewed as a combination of pairwise preference learning and the conventional relevance classification technique, where a separate classifier is trained to predict whether a label is relevant or not. Empirical results in the area of text categorization, image classification and gene analysis underscore the merits of the calibrated model in comparison to state-of-the-art multilabel learning methods.
Replicated softmax: an undirected topic model
 In Advances in Neural Information Processing Systems
Cited by 67 (14 self)
We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
Classifying Software Changes: Clean or Buggy?
, 2008
Cited by 65 (11 self)
This paper introduces a new technique for predicting latent software bugs, called change classification. Change classification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes or clean changes. In this manner, change classification predicts the existence of bugs in software changes. The classifier is trained using features (in the machine learning sense) extracted from the revision history of a software project stored in its software configuration management repository. The trained classifier can classify changes as buggy or clean, with a 78 percent accuracy and a 60 percent buggy change recall on average. Change classification has several desirable qualities: 1) The prediction granularity is small (a change to a single file), 2) predictions do not require semantic information about the source code, 3) the technique works for a broad array of project types and programming languages, and 4) predictions can be made immediately upon the completion of a change. Contributions of this paper include a description of the change classification approach, techniques for extracting features from the source code and change histories, a characterization of the performance of change classification across 12 open source projects, and an evaluation of the predictive power of different groups of features.