Results 1  10
of
542
LIBLINEAR: A Library for Large Linear Classification
, 2008
"... LIBLINEAR is an open source library for largescale linear classification. It supports logistic regression and linear support vector machines. We provide easytouse commandline tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced u ..."
Abstract

Cited by 1407 (40 self)
 Add to MetaCart
(Show Context)
LIBLINEAR is an open source library for largescale linear classification. It supports logistic regression and linear support vector machines. We provide easytouse commandline tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
Pegasos: Primal Estimated subgradient solver for SVM
"... We describe and analyze a simple and effective stochastic subgradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a singl ..."
Abstract

Cited by 541 (21 self)
 Add to MetaCart
We describe and analyze a simple and effective stochastic subgradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ɛ2) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total runtime of our method is Õ(d/(λɛ)), where d is a bound on the number of nonzero features in each example. Since the runtime does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to nonlinear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an orderofmagnitude speedup over previous SVM learning methods.
Anomaly Detection: A Survey
, 2007
"... Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and c ..."
Abstract

Cited by 538 (5 self)
 Add to MetaCart
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the di®erent directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Improving the Fisher kernel for largescale image classification.
 In ECCV,
, 2010
"... Abstract. The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bagofvisualwords (BOV) by going beyond count statistics. However, in practice, this enric ..."
Abstract

Cited by 357 (20 self)
 Add to MetaCart
(Show Context)
Abstract. The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bagofvisualwords (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several wellmotivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate stateoftheart accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
Random features for largescale kernel machines
 In Neural Infomration Processing Systems
, 2007
"... To accelerate the training of kernel machines, we propose to map the input data to a randomized lowdimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the f ..."
Abstract

Cited by 256 (4 self)
 Add to MetaCart
To accelerate the training of kernel machines, we propose to map the input data to a randomized lowdimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in largescale classification and regression tasks linear machine learning algorithms that use these features outperform stateoftheart largescale kernel machines. 1
Efficient Additive Kernels via Explicit Feature Maps
"... Maji and Berg [13] have recently introduced an explicit feature map approximating the intersection kernel. This enables efficient learning methods for linear kernels to be applied to the nonlinear intersection kernel, expanding the applicability of this model to much larger problems. In this paper ..."
Abstract

Cited by 244 (9 self)
 Add to MetaCart
(Show Context)
Maji and Berg [13] have recently introduced an explicit feature map approximating the intersection kernel. This enables efficient learning methods for linear kernels to be applied to the nonlinear intersection kernel, expanding the applicability of this model to much larger problems. In this paper we generalize this idea, and analyse a large family of additive kernels, called homogeneous, in a unified framework. The family includes the intersection, Hellinger’s, and χ2 kernels commonly employed in computer vision. Using the framework we are able to: (i) provide explicit feature maps for all homogeneous additive kernels along with closed form expression for all common kernels; (ii) derive corresponding approximate finitedimensional feature maps based on the Fourier sampling theorem; and (iii) quantify the extent of the approximation. We demonstrate that the approximations have indistinguishable performance from the full kernel on a number of standard datasets, yet greatly reduce the train/test times of SVM implementations. We show that the χ2 kernel, which has been found to yield the best performance in most applications, also has the most compact feature representation. Given these train/test advantages we are able to obtain a significant performance improvement over current state of the art results based on the intersection kernel. 1.
A dual coordinate descent method for largescale linear SVM.
 In ICML,
, 2008
"... Abstract In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such largescale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1and L2l ..."
Abstract

Cited by 207 (19 self)
 Add to MetaCart
(Show Context)
Abstract In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such largescale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1and L2loss functions. The proposed method is simple and reaches an accurate solution in O(log(1/ )) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVM perf , and a recent primal coordinate descent implementation.
Hogwild!: A lockfree approach to parallelizing stochastic gradient descent
, 2011
"... Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateoftheart performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work a ..."
Abstract

Cited by 161 (9 self)
 Add to MetaCart
(Show Context)
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateoftheart performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude. 1
Trust region Newton method for largescale logistic regression
 In Proceedings of the 24th International Conference on Machine Learning (ICML
, 2007
"... Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in ..."
Abstract

Cited by 99 (22 self)
 Add to MetaCart
(Show Context)
Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also compare it with existing linear SVM implementations. 1
Exponentiated gradient algorithms for conditional random fields and maxmargin Markov networks
, 2008
"... Loglinear and maximummargin models are two commonlyused methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large dat ..."
Abstract

Cited by 94 (2 self)
 Add to MetaCart
(Show Context)
Loglinear and maximummargin models are two commonlyused methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the loglinear or maxmargin objective function; the dual in both the loglinear and maxmargin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the maxmargin case, O ( 1 ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for loglinear models only O(log (1/ε)) updates are required. For both the maxmargin and loglinear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be