Results 1–10 of 21
A reliable effective terascale linear learning system, 2011
Cited by 72 (6 self)
We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features, billions of training examples, and millions of parameters, in an hour using a cluster of 1000 machines. Individually, none of the component techniques is new, but the careful synthesis required to obtain an efficient implementation is. The result is, to the best of our knowledge, the most scalable and efficient linear learning system reported in the literature. We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.
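The abstract does not spell out the system's components, but the general pattern behind this kind of cluster-scale convex linear learning can be sketched: each node computes the gradient of a convex loss over its own data shard, and only the d-dimensional gradients are combined across nodes (e.g., via an allreduce), so per-round communication is independent of the number of examples. A minimal single-process simulation of that pattern, with logistic loss and illustrative names chosen for concreteness (not the exact design of this system):

```python
import math

def local_gradient(w, block):
    """Logistic-loss gradient over one node's data block (labels in
    {-1, +1}).  Each node produces only a d-dimensional vector, so
    combining the shards costs O(d) communication per round,
    regardless of how many training examples each shard holds."""
    g = [0.0] * len(w)
    for x, y in block:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coef = -y / (1.0 + math.exp(margin))
        for i, xi in enumerate(x):
            g[i] += coef * xi
    return g

def distributed_gd(blocks, dim, eta=0.5, rounds=50):
    """Simulated distributed gradient descent: sum the per-node
    gradients (in a real cluster this would be one allreduce) and
    take a global step."""
    w = [0.0] * dim
    n = sum(len(b) for b in blocks)
    for _ in range(rounds):
        total = [0.0] * dim
        for block in blocks:  # in reality: computed in parallel, then allreduced
            g = local_gradient(w, block)
            total = [a + b for a, b in zip(total, g)]
        w = [wi - eta * gi / n for wi, gi in zip(w, total)]
    return w
```

On a small separable dataset split into two "nodes", this recovers a separating weight vector.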
Recent Advances of Large-Scale Linear Classification
Cited by 32 (6 self)
Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers such as kernel methods, while training and testing are much faster. Recently, much research has developed efficient optimization methods to construct linear classifiers and applied them to large-scale applications. In this paper, we give a comprehensive survey of the recent development of this active research area.
Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification
Cited by 10 (3 self)
Support Vector Machines (SVMs) are among the most popular and successful classification algorithms. Kernel SVMs often reach state-of-the-art accuracies, but suffer from the curse of kernelization due to linear model growth with data size on noisy data. Linear SVMs can efficiently learn from truly large data, but they are applicable to a limited number of domains due to their low representational power. To fill the representability and scalability gap between linear and nonlinear SVMs, we propose the Adaptive Multi-hyperplane Machine (AMM) algorithm, which accomplishes fast training and prediction and has the capability to solve nonlinear classification problems. The AMM model consists of a set of hyperplanes (weights), each assigned to one of the classes, and predicts the class associated with the weight that produces the largest score.
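The prediction rule described in this abstract is simple enough to sketch directly: each class owns several weight vectors, a class's score is the largest dot product among its weights, and the highest-scoring class wins. A minimal sketch (the training procedure, which the abstract only alludes to, is not shown; all names are illustrative):

```python
def amm_predict(x, class_weights):
    """Multi-hyperplane prediction as the abstract describes it:
    `class_weights` maps a class label to a list of weight vectors;
    the score of a class is the maximum dot product over its weights,
    and the predicted label is the class with the highest score."""
    def dot(w, v):
        return sum(wi * vi for wi, vi in zip(w, v))
    best_label, best_score = None, float("-inf")
    for label, weights in class_weights.items():
        score = max(dot(w, x) for w in weights)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With several weights per class, the decision boundary is piecewise linear, which is what gives the model its nonlinear capability while keeping linear-model prediction cost.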
Normalized online learning
Cited by 5 (3 self)
We introduce online learning algorithms which are independent of feature scales, proving regret bounds dependent on the ratio of scales present in the data rather than on the absolute scale. This has several useful effects: there is no need to pre-normalize data, the test-time and test-space complexity are reduced, and the algorithms are more robust.
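One simple way to illustrate the idea of learning without pre-normalizing features is to track the largest magnitude seen so far for each feature and scale that feature's gradient step by it. This is only a sketch of the scale-independence idea, not the exact algorithm analyzed in the paper:

```python
def scale_free_sgd_step(w, s, x, y, eta=0.1):
    """One squared-loss SGD update that maintains, per feature, the
    largest magnitude seen so far (s[i]) and divides the gradient
    step by s[i]^2.  Illustrative only -- a stand-in for the paper's
    normalized update rules."""
    for i, xi in enumerate(x):
        s[i] = max(s[i], abs(xi))
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - y
    for i, xi in enumerate(x):
        if s[i] > 0:
            w[i] -= eta * err * xi / (s[i] * s[i])
    return w, s
```

Feeding the same stream with every feature multiplied by a constant produces the same sequence of predictions, which is the kind of invariance the abstract claims.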
Bounded Coordinate-Descent for Biological Sequence Classification in High-Dimensional Predictor Space
Cited by 5 (0 self)
We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including the binomial log-likelihood loss of logistic regression and the squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves performance comparable to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem – a crucial requirement for the bioinformatics and medical communities.
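The core idea of coordinate descent that touches only the most promising feature at each step can be illustrated on an explicit feature matrix. This toy version picks the coordinate with the largest gradient magnitude under the squared hinge loss; the paper's gradient bound, which is what makes such a search tractable over the space of all subsequences, is not reproduced here:

```python
def greedy_cd_squared_hinge(X, y, n_iters=30, step=0.05):
    """Toy greedy coordinate descent on an explicit feature matrix
    with the squared hinge loss L(w) = sum_i max(0, 1 - y_i w.x_i)^2.
    Each iteration updates only the coordinate with the largest
    gradient magnitude -- a stand-in for the paper's bounded search
    over subsequence features."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        # hinge slacks max(0, 1 - y_i w.x_i) for every example
        slack = [max(0.0, 1.0 - y[i] * sum(w[j] * X[i][j] for j in range(d)))
                 for i in range(n)]
        # full gradient, then keep only its largest coordinate
        grad = [sum(-2.0 * y[i] * X[i][j] * slack[i] for i in range(n))
                for j in range(d)]
        best = max(range(d), key=lambda j: abs(grad[j]))
        w[best] -= step * grad[best]
    return w
```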
Solving Large Scale Linear SVM with Distributed Block Minimization
Cited by 4 (0 self)
Over recent years we have seen the appearance of huge datasets that do not fit into memory and do not even fit on the hard disk of a single computer. Moreover, even when processed on a cluster of machines, data are usually stored in a distributed way. The transfer of significant subsets of such datasets from one node to another is very slow. We present a new algorithm for training linear Support Vector Machines over such large datasets. Our algorithm assumes that the dataset is partitioned over several nodes of a cluster and performs distributed block minimization followed by a line search. The communication complexity of our algorithm is independent of the number of training examples. With our MapReduce/Hadoop implementation of this algorithm, accurate training of an SVM over datasets of tens of millions of examples takes less than 11 minutes.
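The structure the abstract describes can be sketched in miniature: each node derives a d-dimensional update direction from its own block, the directions are combined, and a line search picks the step size. Here the per-block direction is a plain hinge subgradient, standing in for the paper's per-block dual minimization, and the global objective is evaluated directly since everything runs in one process; all names are illustrative:

```python
def svm_objective(w, data, lam=0.1):
    """L2-regularized average hinge loss, used only to score the
    candidate step sizes in the toy line search below."""
    hinge = sum(max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))
                for x, y in data)
    return 0.5 * lam * sum(wi * wi for wi in w) + hinge / len(data)

def block_round(w, blocks, lam=0.1):
    """One simplified round of distributed block minimization with a
    line search: every node sends back only a d-dimensional direction,
    so communication does not grow with the number of examples."""
    d = len(w)
    dirs = []
    for block in blocks:
        g = [0.0] * d
        for x, y in block:
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0:
                for i, xi in enumerate(x):
                    g[i] += y * xi / len(block)
        dirs.append(g)
    step_dir = [sum(g[i] for g in dirs) / len(blocks) for i in range(d)]
    data = [ex for b in blocks for ex in b]  # simulation only
    best = min((0.1 * k for k in range(1, 11)),
               key=lambda a: svm_objective(
                   [wi + a * di for wi, di in zip(w, step_dir)], data, lam))
    return [wi + best * di for wi, di in zip(w, step_dir)]
```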
How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets
Cited by 3 (0 self)
† and ‡: shared first and second coauthorships, respectively ¶: to whom questions and comments should be sent
Explicit approximations of the Gaussian kernel, 2011
Cited by 2 (0 self)
We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite-dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a better approximation in terms of the computational cost involved. This makes our “Taylor features” especially attractive for use on very large data sets, in conjunction with online or stochastic training.
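The construction can be sketched concretely: write the Gaussian kernel as exp(-||x||²/2σ²) · exp(-||y||²/2σ²) · exp(⟨x,y⟩/σ²) and truncate the Taylor series of the last factor; each term (⟨x,y⟩/σ²)ᵏ/k! is the inner product of scaled degree-k tensor powers, giving an explicit finite-dimensional feature map. A small sketch (the feature count grows as dᵏ, so this naive enumeration is for illustration only; the paper's efficient representation is not reproduced):

```python
import math
from itertools import product

def taylor_features(x, degree, sigma=1.0):
    """Explicit feature map whose inner product approximates the
    Gaussian kernel: concatenate, for k = 0..degree, all ordered
    degree-k monomials of x scaled by
    exp(-||x||^2 / 2 sigma^2) / (sigma^k sqrt(k!)),
    so <phi(x), phi(y)> is the truncated Taylor expansion of
    exp(<x, y> / sigma^2) times the two norm factors."""
    s2 = sigma * sigma
    norm = math.exp(-sum(xi * xi for xi in x) / (2 * s2))
    feats = []
    for k in range(degree + 1):
        coef = norm / (sigma ** k * math.sqrt(math.factorial(k)))
        for idx in product(range(len(x)), repeat=k):
            mono = 1.0
            for i in idx:
                mono *= x[i]
            feats.append(coef * mono)
    return feats

def gaussian_kernel(x, y, sigma=1.0):
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-d2 / (2 * sigma * sigma))
```

For low-dimensional inputs a modest truncation degree already matches the exact kernel value very closely.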
Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2016
Cited by 1 (1 self)
Polynomial networks and factorization machines are two recently-proposed models that can efficiently use feature interactions in classification and regression tasks. In this paper, we revisit both models from a unified perspective. Based on this new view, we study the properties of both models and propose new efficient training algorithms. Key to our approach is to cast parameter learning as a low-rank symmetric tensor estimation problem, which we solve by multi-convex optimization. We demonstrate our approach on regression and recommender system tasks.
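Factorization machines in their standard form can be evaluated cheaply despite modeling all pairwise feature interactions, via the well-known O(dk) rewriting of the quadratic term. A sketch of the model class the abstract discusses (the paper's tensor-based training algorithms are not shown):

```python
def fm_predict(x, w0, w, V):
    """Factorization machine prediction in Rendle's standard form:
    y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    where V[i] is the rank-k factor vector of feature i.  The
    pairwise sum is computed per factor dimension f as
    0.5 * ((sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2),
    which costs O(d k) instead of O(d^2)."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pair = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pair += 0.5 * (s * s - sq)
    return linear + pair
```

The rewriting agrees exactly with the brute-force sum over pairs, which is easy to check on a small example.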
Smoothing Multivariate Performance Measures
Cited by 1 (0 self)
A Support Vector Method for multivariate performance measures was recently introduced by Joachims (2005). The underlying optimization problem is currently solved using cutting plane methods such as SVMPerf and BMRM. One can show that these algorithms converge to an ɛ-accurate solution in O(1/(λɛ)) iterations, where λ is the trade-off parameter between the regularizer and the loss function. We present a smoothing strategy for multivariate performance scores, in particular the precision/recall break-even point and ROCArea. When combined with Nesterov’s accelerated gradient algorithm, our smoothing strategy yields an optimization algorithm which converges to an ɛ-accurate solution in ...