Results 1–10 of 46
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
Structured variable selection with sparsity-inducing norms
, 2011
Abstract

Cited by 187 (27 self)
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1-norm and the group ℓ1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low- and high-dimensional settings.
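A minimal sketch of the norm this abstract describes, assuming a plain list-of-index-lists representation for the groups (`group_norm` is an illustrative name, not from the paper):

```python
import math

def group_norm(w, groups):
    """Structured sparsity-inducing norm: a sum of Euclidean norms over
    (possibly overlapping) subsets of the variables."""
    return sum(math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups)

w = [3.0, -4.0, 0.0]
assert group_norm(w, [[0], [1], [2]]) == 7.0   # singleton groups: the usual l1 norm
assert group_norm(w, [[0, 1, 2]]) == 5.0       # one full group: the l2 norm
assert group_norm(w, [[0, 1], [1, 2]]) > 0.0   # overlapping groups are allowed
```

The special cases show how the construction interpolates between the ℓ1 norm (disjoint singletons) and the group ℓ1 norm, with overlap giving the richer nonzero patterns the abstract discusses.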
Structured Sparsity through Convex Optimization
, 2012
Abstract

Cited by 47 (6 self)
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through the regularization by the ℓ1-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the ℓ1-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of nonlinear variable selection.
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Abstract

Cited by 47 (10 self)
We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
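The decay-plus-averaging recipe the abstract highlights can be sketched on a toy one-dimensional problem; the `sgd` helper and the deterministic ±1 "noise" cycle are illustrative assumptions, not the paper's setup:

```python
import itertools

def sgd(grad, x0, steps, rate, average=False):
    """Stochastic gradient descent with optional Polyak-Ruppert iterate
    averaging; rate(t) gives the learning rate at step t."""
    x, avg = x0, 0.0
    for t in range(1, steps + 1):
        x -= rate(t) * grad(x)
        avg += (x - avg) / t          # running mean of the iterates
    return avg if average else x

# Toy objective f(x) = (x - 2)^2 / 2, observed through gradient
# estimates corrupted by zero-mean noise (a deterministic +1/-1 cycle).
noise = itertools.cycle([1.0, -1.0])
grad = lambda x: (x - 2.0) + next(noise)

# The slower 1/sqrt(t) decay combined with averaging is the robust
# choice the abstract describes.
x_hat = sgd(grad, 0.0, 20000, lambda t: 1.0 / t ** 0.5, average=True)
assert abs(x_hat - 2.0) < 0.1
```

Without averaging, the final iterate oscillates at the scale of the step size; averaging smooths out both the oscillation and the sensitivity to the step-size constant.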
Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. Preprint. Available at http://arxiv.org/abs/1202.1212
Abstract

Cited by 44 (4 self)
This paper develops theoretical results regarding noisy 1-bit compressed sensing and sparse binomial regression. We demonstrate that a single convex program gives an accurate estimate of the signal, or coefficient vector, for both of these models. We show that an s-sparse signal in R^n can be accurately estimated from m = O(s log(n/s)) single-bit measurements using a simple convex program. This remains true even if each measurement bit is flipped with probability nearly 1/2. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show that O(s log(2n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n which is approximately s-sparse. Moreover, the same convex program works for virtually all generalized linear models, in which the link function may be unknown. To our knowledge, these are the first results that tie together the theory of sparse logistic regression to 1-bit compressed sensing. Our results apply to general signal structures aside from sparsity; one only needs to know the size of the set K where signals reside. The size is given by the mean width of K, a computable quantity whose square serves as a robust extension of the dimension.
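The measurement model can be sketched as follows; the back-projection-plus-hard-thresholding estimator below is a crude stand-in for the paper's convex program, used only to illustrate that sign measurements retain directional information (all data here is toy data):

```python
def one_bit_measure(A, x):
    """Single-bit measurements: only the sign of each <a_i, x> survives."""
    return [1.0 if sum(a * v for a, v in zip(row, x)) >= 0 else -1.0
            for row in A]

def backproject(A, y, s):
    """Crude s-sparse estimate: back-project the signs and keep the s
    largest-magnitude coordinates."""
    n = len(A[0])
    z = [sum(row[j] * yi for row, yi in zip(A, y)) for j in range(n)]
    keep = sorted(range(n), key=lambda j: -abs(z[j]))[:s]
    return [z[j] if j in keep else 0.0 for j in range(n)]

A = [[1.0, 0.2, -0.1],
     [-0.5, 0.3, 0.2],
     [0.8, -0.4, 0.1],
     [0.3, 0.9, -0.7]]
x = [1.0, 0.0, 0.0]                  # 1-sparse signal in R^3
y = one_bit_measure(A, x)            # -> [1.0, -1.0, 1.0, 1.0]
x_hat = backproject(A, y, s=1)
assert [j for j, v in enumerate(x_hat) if v != 0.0] == [0]  # support found
```

Note that `one_bit_measure(A, [2.0, 0.0, 0.0])` yields the same bits: sign measurements lose the scale of the signal, so only its direction can be recovered.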
On learning discrete graphical models using greedy methods
 In Neural Information Processing Systems (NIPS), currently under review
, 2011
Abstract

Cited by 28 (5 self)
In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d^2 log(p)), in contrast to existing convex-optimization-based algorithms that require a sample complexity of Ω(d^3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
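A minimal sketch of a forward-backward greedy loop of this flavor; the toy loss, the `eps` safeguard constant, and the stopping tolerance are illustrative assumptions, not the paper's exact algorithm:

```python
def forward_backward(loss, p, eps=0.5, tol=1e-8):
    """Forward-backward greedy selection over features 0..p-1.
    Forward: add the feature giving the largest loss decrease.
    Backward: drop any earlier feature whose removal raises the loss by
    less than eps times the last forward gain."""
    support = set()
    while True:
        best_j, best_gain = None, tol
        for j in range(p):                         # forward step
            if j not in support:
                gain = loss(support) - loss(support | {j})
                if gain > best_gain:
                    best_j, best_gain = j, gain
        if best_j is None:                         # no useful feature left
            return sorted(support)
        support.add(best_j)
        changed = True
        while changed:                             # backward steps
            changed = False
            for j in list(support):
                if j != best_j and loss(support - {j}) - loss(support) < eps * best_gain:
                    support.discard(j)
                    changed = True

# Toy objective: squared coefficients left outside the support, as with
# an orthonormal design (illustrative, not the paper's experiments).
beta = [5.0, 0.0, 3.0, 0.0]
toy_loss = lambda S: sum(b * b for j, b in enumerate(beta) if j not in S)
assert forward_backward(toy_loss, 4) == [0, 2]
```

The backward step is what distinguishes this scheme from plain forward greedy selection: it can undo an early mistaken inclusion once later features explain it away.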
Learning Exponential Families in High-Dimensions:
Abstract

Cited by 17 (0 self)
The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model. A central issue is learning these models in high dimensions when the optimal parameter vector is sparse. This work characterizes a certain strong convexity property of general exponential families, which allows their generalization ability to be quantified. In particular, we show how this property can be used to analyze generic exponential families under L1 regularization.
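The basic computational building block of the L1-regularized estimation mentioned here is coordinate-wise soft-thresholding; this is a standard textbook operator, sketched below, not something specific to this paper:

```python
def soft_threshold(w, lam):
    """Coordinate-wise soft-thresholding: the proximal operator of
    lam * ||.||_1, which shrinks each entry toward zero by lam and
    zeroes out anything smaller than lam in magnitude."""
    return [max(abs(wi) - lam, 0.0) * (1.0 if wi >= 0 else -1.0) for wi in w]

assert soft_threshold([3.0, -0.5, 1.0], 1.0) == [2.0, 0.0, 0.0]
assert soft_threshold([-3.0], 1.0) == [-2.0]
```

Proximal-gradient methods for L1-regularized exponential-family models alternate a gradient step on the (convex) negative log-likelihood with this shrinkage step, which is how sparsity in the parameter vector arises.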
Strong oracle optimality of folded concave penalized estimation
 Ann. Statist.
, 2014
Abstract

Cited by 13 (4 self)
Sharp analysis of low-rank kernel matrix approximations
 JMLR: Workshop and Conference Proceedings Vol. 30 (2013) 1–25
, 2013
Abstract

Cited by 13 (1 self)
We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations n, i.e., O(n^2). Low-rank approximations of the kernel matrix are often considered, as they allow the reduction of running-time complexities to O(p^2 n), where p is the rank of the approximation. The practicality of such methods thus depends on the required rank p. In this paper, we show that in the context of kernel ridge regression, for approximations based on a random subset of columns of the original kernel matrix, the rank p may be chosen to be linear in the degrees of freedom associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of nonparametric estimators. This result enables simple algorithms that have sub-quadratic running-time complexity, but provably exhibit the same predictive performance as existing algorithms, for any given problem instance, and not only for worst-case situations.
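The column-sampling (Nyström-type) approximation the abstract analyzes can be sketched in the simplest rank-1, single-column case; this toy uses one column, whereas real implementations sample p columns and apply a pseudo-inverse of the subsampled block:

```python
def nystrom(K, j):
    """One-column Nystrom approximation K ~ C W^{-1} C^T, where C is
    column j of K and W = K[j][j] is the corresponding 1x1 block."""
    n = len(K)
    return [[K[a][j] * K[j][b] / K[j][j] for b in range(n)] for a in range(n)]

# Rank-1 kernel K = v v^T: one well-chosen column reconstructs K exactly.
v = [1.0, 2.0, -1.0]
K = [[vi * vj for vj in v] for vi in v]
K_hat = nystrom(K, 0)
assert all(abs(K_hat[a][b] - K[a][b]) < 1e-12
           for a in range(3) for b in range(3))
```

Storing only the sampled columns and the small block costs O(np) memory instead of O(n^2), which is the source of the running-time savings discussed in the abstract.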
Program Evaluation with High-Dimensional Data
 Working paper
, 2013
Abstract

Cited by 9 (2 self)
In the first part of the paper, we consider estimation and inference on policy-relevant treatment effects, such as local average and quantile treatment effects, in a data-rich environment where there may be many more control variables available than there are observations. In addition to allowing many control variables, the setting we consider allows endogenous receipt of treatment, heterogeneous treatment effects, and function-valued outcomes. To make informative inference possible, we assume that some reduced-form predictive relationships are approximately sparse. That is, we require that the relationship between the control variables and the outcome, treatment status, and instrument status can be captured up to a small approximation error using a small number of the control variables whose identities are unknown to the researcher. This condition allows estimation and inference for a wide variety of treatment parameters to proceed after data-driven selection of control variables. We provide conditions under which post-selection inference is uniformly valid across a wide range of models and show that a key condition underlying the uniform validity of post-selection inference allowing for imperfect model selection is the use of approximately unbiased estimating equations. We illustrate the use of the proposed methods with an application to estimating the effect of 401(k) participation on accumulated assets.