Results 1  10
of
821
An interiorpoint method for largescale l1regularized logistic regression
 Journal of Machine Learning Research
, 2007
"... Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interiorpoint method for solving largescale ℓ1regularized logistic regression problems. Small problems with up to a thousand ..."
Abstract

Cited by 284 (8 self)
 Add to MetaCart
(Show Context)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interiorpoint method for solving largescale ℓ1regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warmstart techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
Working set selection using second order information for training SVM
 Journal of Machine Learning Research
"... Working set selection is an important step in decomposition methods for training support vector machines (SVMs). This paper develops a new technique for working set selection in SMOtype decomposition methods. It uses second order information to achieve fast convergence. Theoretical properties such ..."
Abstract

Cited by 276 (11 self)
 Add to MetaCart
(Show Context)
Working set selection is an important step in decomposition methods for training support vector machines (SVMs). This paper develops a new technique for working set selection in SMOtype decomposition methods. It uses second order information to achieve fast convergence. Theoretical properties such as linear convergence are established. Experiments demonstrate that the proposed method is faster than existing selection methods using first order information.
A Note on Platt's Probabilistic Outputs for Support Vector Machines
, 2003
"... Platt's probabilistic outputs for Support Vector Machines [6] has been popular for applications that require posterior class probabilities. In this note, we propose an improvement which theoretically converges and avoids numerical difficulties. A simpler and readytouse pseudo code is includ ..."
Abstract

Cited by 185 (5 self)
 Add to MetaCart
(Show Context)
Platt's probabilistic outputs for Support Vector Machines [6] has been popular for applications that require posterior class probabilities. In this note, we propose an improvement which theoretically converges and avoids numerical difficulties. A simpler and readytouse pseudo code is included.
Trust region Newton method for largescale logistic regression
 In Proceedings of the 24th International Conference on Machine Learning (ICML
, 2007
"... Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in ..."
Abstract

Cited by 96 (22 self)
 Add to MetaCart
(Show Context)
Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also compare it with existing linear SVM implementations. 1
Multitask feature selection
 In the workshop of structural Knowledge Transfer for Machine Learning in the 23rd International Conference on Machine Learning (ICML 2006). Citeseer
, 2006
"... We address joint feature selection across a group of classification or regression tasks. In many multitask learning scenarios, different but related tasks share a large proportion of relevant features. We propose a novel type of joint regularization for the parameters of support vector machines in ..."
Abstract

Cited by 93 (1 self)
 Add to MetaCart
(Show Context)
We address joint feature selection across a group of classification or regression tasks. In many multitask learning scenarios, different but related tasks share a large proportion of relevant features. We propose a novel type of joint regularization for the parameters of support vector machines in order to couple feature selection across tasks. Intuitively, we extend the ℓ1 regularization for singletask estimation to the multitask setting. By penalizing the sum of ℓ2norms of the blocks of coefficients associated with each feature across different tasks, we encourage multiple predictors to have similar parameter sparsity patterns. This approach yields convex, nondifferentiable optimization problems that can be solved efficiently using a simple and scalable extragradient algorithm. We show empirically that our approach outperforms independent ℓ1based feature selection on several datasets. 1.
Bayesian Treed Gaussian Process Models with an Application to Computer Modeling
 Journal of the American Statistical Association
, 2007
"... This paper explores nonparametric and semiparametric nonstationary modeling methodologies that couple stationary Gaussian processes and (limiting) linear models with treed partitioning. Partitioning is a simple but effective method for dealing with nonstationarity. Mixing between full Gaussian proce ..."
Abstract

Cited by 87 (19 self)
 Add to MetaCart
This paper explores nonparametric and semiparametric nonstationary modeling methodologies that couple stationary Gaussian processes and (limiting) linear models with treed partitioning. Partitioning is a simple but effective method for dealing with nonstationarity. Mixing between full Gaussian processes and simple linear models can yield a more parsimonious spatial model while significantly reducing computational effort. The methodological developments and statistical computing details which make this approach efficient are described in detail. Illustrations of our model are given for both synthetic and real datasets. Key words: recursive partitioning, nonstationary spatial model, nonparametric regression, Bayesian model averaging 1
Classifier technology and the illusion of progress. Statist
 Sci
, 2006
"... Abstract. A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to ..."
Abstract

Cited by 83 (2 self)
 Add to MetaCart
(Show Context)
Abstract. A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm.
A WEAKLY INFORMATIVE DEFAULT PRIOR DISTRIBUTION FOR LOGISTIC AND OTHER REGRESSION MODELS
"... We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Studentt prior distributions on the coefficients. As a default choice, we reco ..."
Abstract

Cited by 78 (12 self)
 Add to MetaCart
We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Studentt prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longertailed version of the distribution attained by assuming onehalf additional success and onehalf additional failure in a logistic regression. Crossvalidation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors. We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higherorder interactions. This can
Assessing approximate inference for binary Gaussian process classification
 Journal of Machine Learning Research
, 2005
"... Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for ..."
Abstract

Cited by 73 (3 self)
 Add to MetaCart
Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC
Proactive learning: Costsensitive active learning with multiple imperfect oracles
 In Proceedings of CIKM08
, 2008
"... Proactive learning is a generalization of active learning designed to relax unrealistic assumptions and thereby reach practical applications. Active learning seeks to select the most informative unlabeled instances and ask an omniscient oracle for their labels, so as to retrain the learning algorith ..."
Abstract

Cited by 70 (6 self)
 Add to MetaCart
(Show Context)
Proactive learning is a generalization of active learning designed to relax unrealistic assumptions and thereby reach practical applications. Active learning seeks to select the most informative unlabeled instances and ask an omniscient oracle for their labels, so as to retrain the learning algorithm maximizing accuracy. However, the oracle is assumed to be infallible (never wrong), indefatigable (always answers), individual (only one oracle), and insensitive to costs (always free or always charges the same). Proactive learning relaxes all four of these assumptions, relying on a decisiontheoretic approach to jointly select the optimal oracle and instance, by casting the problem as a utility optimization problem subject to a budget constraint. Results on multioracle optimization over several data sets demonstrate the superiority of our approach over the singleimperfectoracle baselines in most cases. Categories and Subject Descriptors I.2.6 [Artificial Intelligence]: Learning—concept learning, knowledge acquisition; H.0 [General]: [Classification, costsensitive