Results 1–10 of 93
Large Scale Semi-supervised Linear SVMs, 2006
Abstract

Cited by 75 (9 self)
Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially labeled sparse datasets with a possibly very large number of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minima in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.
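The multiple-switching idea of contribution (b) can be sketched as follows: swap the tentative labels of (+1, -1) pairs of unlabeled points whenever the swap lowers the hinge loss, which preserves the class balance. The pairing heuristic and names below are illustrative, not the authors' implementation.

```python
import numpy as np

def hinge(z):
    # standard hinge loss max(0, 1 - margin)
    return np.maximum(0.0, 1.0 - z)

def multiple_switch(f_u, y_u):
    """One multiple-switching pass: given classifier outputs f_u on the
    unlabeled points and their tentative labels y_u, exchange labels of
    (+1, -1) pairs whenever the exchange lowers the summed hinge loss.
    Class balance is preserved because labels are swapped in pairs."""
    y = y_u.copy()
    pos = [i for i in range(len(y)) if y[i] == 1]
    neg = [j for j in range(len(y)) if y[j] == -1]
    pos.sort(key=lambda i: f_u[i])    # positives with the smallest outputs
    neg.sort(key=lambda j: -f_u[j])   # negatives with the largest outputs
    for i, j in zip(pos, neg):
        before = hinge(+f_u[i]) + hinge(-f_u[j])
        after = hinge(-f_u[i]) + hinge(+f_u[j])
        if after < before:            # swap only if the loss decreases
            y[i], y[j] = -1, 1
    return y
```

For example, with outputs `[-0.9, 0.8]` and tentative labels `[+1, -1]`, one pass swaps the pair, since the classifier strongly disagrees with both tentative labels.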
Maximum margin clustering made practical.
 IEEE Transactions on Neural Networks, 2009
Optimization Techniques for Semi-Supervised Support Vector Machines
Abstract

Cited by 68 (6 self)
Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S³VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms has recently been proposed for solving S³VMs. This paper reviews key ideas in this literature. The performance and behavior of various S³VM algorithms are studied together, under a common experimental setting.
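The shared objective these algorithms attack combines a regularizer, the hinge loss on labeled points, and a symmetric hinge on unlabeled points (each unlabeled example is implicitly labeled by the sign of its output, which is what makes the problem non-convex). A minimal sketch of one common formulation; the weights `lam`, `lam_u` and the names are illustrative, not any paper's notation:

```python
import numpy as np

def s3vm_objective(w, b, XL, yL, XU, lam=1.0, lam_u=0.5):
    """Evaluate the (non-convex) S3VM objective for a linear classifier
    f(x) = x.w + b: regularizer + hinge loss on labeled examples
    + symmetric hinge max(0, 1 - |f(x)|) on unlabeled examples."""
    fL = XL @ w + b
    fU = XU @ w + b
    reg = 0.5 * np.dot(w, w)
    loss_l = np.maximum(0.0, 1.0 - yL * fL).sum()
    loss_u = np.maximum(0.0, 1.0 - np.abs(fU)).sum()  # symmetric hinge
    return reg + lam * loss_l + lam_u * loss_u
```

The symmetric hinge is small when unlabeled points lie far from the decision boundary, so minimizing it pushes the boundary into low-density regions.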
Deep learning via semi-supervised embedding
 International Conference on Machine Learning, 2008
Abstract

Cited by 67 (5 self)
We show how nonlinear embedding algorithms popular for use with shallow semi-supervised learning techniques such as kernel methods can be applied to deep multi-layer architectures, either as a regularizer at the output layer, or on each layer of the architecture. This provides a simple alternative to existing approaches to deep learning while yielding competitive error rates compared to those methods and to existing shallow semi-supervised techniques.
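The embedding regularizer attached to a layer is a graph-Laplacian-style penalty that pulls the representations of similar (labeled or unlabeled) examples together. A minimal sketch, with the dense double loop chosen for clarity over speed:

```python
import numpy as np

def laplacian_penalty(H, W):
    """Embedding penalty sum_ij W_ij * ||h_i - h_j||^2 over the
    representations H (one row per example) of some layer, where W is a
    similarity/neighbor matrix over labeled and unlabeled examples.
    The total training loss would be supervised_loss + lam * penalty;
    this is a sketch of the idea, not the paper's training code."""
    n = H.shape[0]
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            if W[i, j] != 0.0:
                diff = H[i] - H[j]
                penalty += W[i, j] * np.dot(diff, diff)
    return penalty
```

Applied at the output layer this recovers a shallow graph regularizer; applied at hidden layers it shapes the intermediate representations as well, which is the paper's deep variant.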
Large Graph Construction for Scalable Semi-Supervised Learning
Abstract

Cited by 53 (14 self)
In this paper, we address the scalability issue plaguing graph-based semi-supervised learning via a small number of anchor points which adequately cover the entire point cloud. Critically, these anchor points enable nonparametric regression that predicts the label for each data point as a locally weighted average of the labels on anchor points. Because conventional graph construction is inefficient at large scale, we propose to construct a tractable large graph by coupling anchor-based label prediction and adjacency matrix design. Contrary to the Nyström approximation of adjacency matrices, which results in indefinite graph Laplacians and in turn leads to potentially non-convex optimization over graphs, the proposed graph construction approach, based on a unique idea called AnchorGraph, provides nonnegative adjacency matrices that guarantee positive semi-definite graph Laplacians. Our approach scales linearly with the data size and in practice usually produces a large sparse graph.s Experiments on large datasets demonstrate the significant accuracy improvement and scalability of the proposed approach.
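The construction can be sketched in a few lines: build a nonnegative matrix Z of local weights from each point to its s nearest anchors, then form the adjacency W = Z diag(Zᵀ1)⁻¹ Zᵀ. The inverse-distance weighting below stands in for the paper's local regression; what matters for the positive semi-definite Laplacian guarantee is only that Z is nonnegative with rows summing to one.

```python
import numpy as np

def anchor_graph(X, anchors, s=2):
    """AnchorGraph sketch: Z holds nonnegative weights from each point to
    its s nearest anchors (rows sum to one); W = Z diag(Lam)^-1 Z^T is
    then a nonnegative adjacency matrix over all points."""
    n, m = X.shape[0], anchors.shape[0]
    Z = np.zeros((n, m))
    for i in range(n):
        d = np.linalg.norm(anchors - X[i], axis=1)
        nearest = np.argsort(d)[:s]
        w = 1.0 / (d[nearest] + 1e-12)       # illustrative local weights
        Z[i, nearest] = w / w.sum()          # nonnegative, rows sum to one
    Lam = Z.sum(axis=0)                      # anchor "degrees"
    W = Z @ np.diag(1.0 / Lam) @ Z.T         # nonnegative adjacency
    return Z, W

def predict_labels(Z, anchor_labels):
    # each point's label = locally weighted average of anchor labels
    return Z @ anchor_labels
```

Only the n-by-m matrix Z (with s nonzeros per row) need be stored, which is where the linear scaling in the data size comes from.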
On the convergence of the concave-convex procedure
 In NIPS Workshop on Optimization for Machine Learning, 2009
Abstract

Cited by 48 (1 self)
The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms, including sparse support vector machines (SVMs), transductive SVMs and sparse principal component analysis. Though widely used in many applications, the convergence behavior of CCCP has not received much specific attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is not complete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), its proof is more specialized and technical than actually required for the specific case of CCCP. In this paper, we follow a different reasoning and show how Zangwill's global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simple proof. This underlines Zangwill's theory as a powerful and general framework for the convergence issues of iterative algorithms, it having also been used to prove the convergence of algorithms such as expectation-maximization and generalized alternating minimization. We provide a rigorous analysis of the convergence of CCCP by addressing two questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.
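The procedure itself fits in a few lines on a toy d.c. program, f(x) = u(x) - v(x) with u(x) = x⁴ and v(x) = 2x². Each CCCP step linearizes the concave part -v at the current iterate and minimizes the resulting convex upper bound, here in closed form. This is purely an illustration of the iteration, not code from the paper:

```python
def f(x):
    # d.c. objective: convex x^4 minus convex 2x^2
    return x**4 - 2 * x**2

def cccp_step(x):
    """One CCCP update: x_next = argmin_y  y^4 - v'(x) * y, with
    v'(x) = 4x. Setting the derivative 4y^3 - 4x to zero gives
    y = x^(1/3) (sign-preserving cube root)."""
    return x ** (1.0 / 3.0) if x >= 0 else -((-x) ** (1.0 / 3.0))

x = 0.5
values = [f(x)]
for _ in range(60):
    x = cccp_step(x)
    values.append(f(x))
# the objective decreases monotonically and the iterates converge to the
# stationary point x = 1 of f
```

The monotone decrease is the majorization-minimization property; whether the limit is a local minimum or merely a stationary point is exactly question (i) the paper analyzes.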
A continuation method for semi-supervised SVMs
 In International Conference on Machine Learning, 2006
Abstract

Cited by 42 (4 self)
Semi-Supervised Support Vector Machines (S³VMs) are an appealing method for using unlabeled data in classification: their objective function favors decision boundaries which do not cut clusters. However, their main problem is that the optimization problem is non-convex and has many local minima, which often results in suboptimal performance. In this paper we propose to use a global optimization technique known as continuation to alleviate this problem. Compared to other algorithms minimizing the same objective function, our continuation method often leads to lower test errors.
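The continuation idea can be demonstrated on a one-dimensional stand-in for the S³VM objective: minimize Gaussian-smoothed versions E[f(x + σe)], e ~ N(0,1), tracking the minimizer as σ shrinks to zero. At large σ the smoothed function is unimodal, so the path avoids the local minimum that traps plain descent. The test function and schedule below are illustrative, not the paper's setup; the smoothed gradient is computed with Gauss-Hermite quadrature.

```python
import numpy as np

def fprime(x):
    # f(x) = (x^2 - 1)^2 + 0.3x has a local minimum near +0.96
    # and a (lower) global minimum near -1.04
    return 4.0 * x * (x * x - 1.0) + 0.3

nodes, weights = np.polynomial.hermite.hermgauss(8)

def smoothed_grad(x, sigma):
    # E[f'(x + sigma*e)], e ~ N(0,1), via Gauss-Hermite quadrature
    return np.sum(weights * fprime(x + sigma * np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

def descend(x, sigma, steps=2000, lr=0.01):
    for _ in range(steps):
        x -= lr * smoothed_grad(x, sigma)
    return x

x = 1.2                                   # starts in the local-minimum basin
for sigma in [2.0, 1.0, 0.5, 0.2, 0.05, 0.0]:
    x = descend(x, sigma)                 # warm-start from the previous sigma
# x ends near the global minimum ~ -1.04, whereas plain descent from 1.2
# (sigma = 0 throughout) stalls near the local minimum ~ 0.96
```

For S³VMs, σ plays the role of the continuation parameter applied to the non-convex unlabeled-loss term, and each stage is warm-started from the previous solution in the same way.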
Semi-supervised classification with hybrid generative/discriminative methods
 In KDD, 2007
Abstract

Cited by 32 (3 self)
We compare two recently proposed frameworks for combining generative and discriminative probabilistic classifiers and apply them to semi-supervised classification. In both cases we explore the tradeoff between maximizing a discriminative likelihood of labeled data and a generative likelihood of labeled and unlabeled data. While prominent semi-supervised learning methods assume low density regions between classes or are subject to generative modeling assumptions, we conjecture that hybrid generative/discriminative methods allow semi-supervised learning in the presence of strongly overlapping classes and reduce the risk of modeling structure in the unlabeled data that is irrelevant for the specific classification task of interest. We apply both hybrid approaches within naively structured Markov random field models and provide a thorough empirical comparison with two well-known semi-supervised learning methods on six text classification tasks. A semi-supervised hybrid generative/discriminative method provides the best accuracy in 75% of the experiments, and the multi-conditional learning hybrid approach achieves the highest overall mean accuracy across all tasks.
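The tradeoff both frameworks explore can be written as one interpolated objective: a discriminative term, log p(y|x) on labeled data, plus λ times a generative term, log p(x) on labeled and unlabeled data. The two-class 1-D Gaussian model below is a hypothetical stand-in for either paper's model; only the shape of the objective is the point.

```python
import numpy as np

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def hybrid_objective(theta, XL, yL, XU, lam=1.0):
    """Hybrid objective sketch for theta = (pi1, mu0, mu1, var):
    sum of conditional log-likelihood log p(y|x) on labeled points
    plus lam * marginal log-likelihood log p(x) on all points.
    lam = 0 is purely discriminative; large lam is mostly generative."""
    pi1, mu0, mu1, var = theta
    def joint(x):  # log p(x, y) for y = 0 and y = 1, shape (2, n)
        return np.stack([np.log(1 - pi1) + log_gauss(x, mu0, var),
                         np.log(pi1) + log_gauss(x, mu1, var)])
    jl = joint(XL)
    marg_l = np.logaddexp(jl[0], jl[1])                    # log p(x), labeled
    disc = (jl[yL, np.arange(len(yL))] - marg_l).sum()     # log p(y|x)
    ju = joint(XU)
    gen = marg_l.sum() + np.logaddexp(ju[0], ju[1]).sum()  # log p(x), all
    return disc + lam * gen
```

Sweeping λ traces the path between the purely discriminative and purely generative ends of the spectrum that the empirical comparison examines.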
Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
Abstract

Cited by 31 (1 self)
We describe a new scalable algorithm for semi-supervised training of conditional random fields (CRFs) and its application to part-of-speech (POS) tagging. The algorithm uses a similarity graph to encourage similar n-grams to have similar POS tags. We demonstrate the efficacy of our approach on a domain adaptation task, where we assume access to large amounts of unlabeled data from the target domain, but no additional labeled data. The similarity graph is used during training to smooth the state posteriors on the target domain; standard inference can be used at test time. Our approach scales to very large problems and yields significantly improved target domain accuracy.
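The graph-smoothing step can be sketched as repeated interpolation of each node's tag posterior with the weighted average of its neighbors' posteriors. The update rule and parameter names below illustrate the idea of smoothing posteriors over a similarity graph; they are not the paper's exact update.

```python
import numpy as np

def smooth_posteriors(P, W, alpha=0.5, iters=10):
    """Smooth per-node tag posteriors over a similarity graph.
    P is (n_nodes, n_tags), one posterior distribution per n-gram type;
    W is a nonnegative similarity matrix with at least one neighbor per
    node. Each iteration mixes the original posterior with the weighted
    neighbor average, then renormalizes the rows."""
    Wn = W / W.sum(axis=1, keepdims=True)      # row-normalize neighbor weights
    Q = P.copy()
    for _ in range(iters):
        Q = (1 - alpha) * P + alpha * (Wn @ Q)
        Q = Q / Q.sum(axis=1, keepdims=True)   # keep rows as distributions
    return Q
```

An uncertain node surrounded by confident neighbors is pulled toward the neighbors' tag distribution, which is the effect used to adapt the CRF's posteriors to the target domain.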
Branch and Bound for Semi-Supervised Support Vector Machines
Abstract

Cited by 29 (5 self)
Semi-supervised SVMs (S³VMs) attempt to learn low-density separators by maximizing the margin over labeled and unlabeled examples. The associated optimization problem is non-convex. To examine the full potential of S³VMs modulo the local-minima problems in current implementations, we apply branch and bound techniques to obtain exact, globally optimal solutions. Empirical evidence suggests that the globally optimal solution can return excellent generalization performance in situations where other implementations fail completely. While our current implementation is only applicable to small datasets, we discuss variants that can potentially lead to practically useful algorithms.
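The search space that branch and bound explores is the set of labelings of the unlabeled points; on a toy 1-D problem it is small enough to enumerate outright, which is what this sketch does (real branch and bound prunes this tree with lower bounds instead of enumerating it). The data, the grid-search inner minimization over the scalar weight, and all names here are illustrative, not the paper's implementation.

```python
import itertools
import numpy as np

XL = np.array([-2.0, 2.0]); yL = np.array([-1.0, 1.0])   # labeled points
XU = np.array([-1.8, 1.9, 2.1])                          # unlabeled points

def objective(yU):
    """S3VM objective for one candidate labeling yU of the unlabeled
    points, with the inner minimization over the 1-D classifier
    f(x) = w*x done by grid search over w."""
    ws = np.linspace(-3.0, 3.0, 601)
    hin_l = np.maximum(0.0, 1.0 - yL[:, None] * ws * XL[:, None]).sum(axis=0)
    hin_u = np.maximum(0.0, 1.0 - yU[:, None] * ws * XU[:, None]).sum(axis=0)
    return (0.5 * ws**2 + hin_l + hin_u).min()

# exhaustive search over all 2^3 labelings of the unlabeled points
best_y = min(itertools.product([-1.0, 1.0], repeat=3),
             key=lambda y: objective(np.array(y)))
# the globally optimal labeling follows the cluster structure: (-1, 1, 1)
```

Branch and bound reaches the same global optimum while fixing one unlabeled label per tree level and discarding subtrees whose lower bound already exceeds the best objective found so far.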