## An introduction to variable and feature selection (2003)

### Download Links

- [www.clopinet.com]
- [www.ai.mit.edu]
- [www.cs.sunyit.edu]
- [www.cs.utah.edu]
- [clopinet.com]
- [web.cs.sunyit.edu]
- [www.eecs.wsu.edu]
- [perso.telecom-paristech.fr]
- [www.yaroslavvb.com]
- [people.sabanciuniv.edu]
- [www.cs.xu.edu]
- [tsam-fich.wdfiles.com]
- [cbio.ensmp.fr]
- [www.kernel-machines.org]
- [www.jmlr.org]
- [jmlr.csail.mit.edu]
- [jmlr.org]
- [axon.cs.byu.edu]
- [www2.mta.ac.il]
- [vis.lbl.gov]

### Other Repositories/Bibliography

- CiteULike
- DBLP

Venue: Journal of Machine Learning Research

Citations: 1352 (16 self)

### Citations

13234 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context: ...treated. Approximations for nonlinear least-squares have also been computed elsewhere (Monari and Dreyfus, 2000). The proposal of Rakotomamonjy (2003) is to train non-linear SVMs (Boser et al., 1992, Vapnik, 1998) with a regular training procedure and select features with backward elimination like in RFE (Guyon et al., 2002). The variable ranking criterion however is not computed using the sensitivity of the ...

5968 | Classification and regression trees
- Breiman, Friedman, et al.
- 1984
Citation Context: ...ning a predictor from scratch for every variable subset investigated. Embedded methods are not new: decision trees such as CART, for instance, have a built-in mechanism to perform variable selection (Breiman et al., 1984). The next two sections are devoted to two families of embedded methods illustrated by algorithms published in this issue. 4.2 Nested Subset Methods Some embedded methods guide their search by estima...

2848 | Pattern Classification
- Duda, Hart, et al.
- 2001
Citation Context: ...given predictor. For instance, using Fisher’s criterion to rank variables in a classification problem where the covariance matrix is diagonal is optimum for Fisher’s linear discriminant classifier (Duda et al., 2001). Even when variable ranking is not optimal, it may be preferable to other variable subset selection methods because of its computational and statistical scalability: Computationally, it is efficient...

2831 | Online learning with kernels - Kivinen, Smola, et al. - 2002

2485 | Significance analysis of microarrays applied to transcriptional responses to ionizing radiation
- Tusher, Tibshirani, et al.
- 2001
Citation Context: ...a given value of y, e.g., ±1. R(i)² can then be shown to be closely related to Fisher’s criterion (Furey et al., 2000), to the T-test criterion, and other similar criteria (see, e.g., Golub et al., 1999, Tusher et al., 2001, Hastie et al., 2001). As further developed in Section 6, the link to the T-test shows that the score R(i) may be used as a test statistic to assess the significance of a variable. Correlation criter...
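The R(i) correlation/T-test style criterion quoted in this excerpt can be sketched as a simple ranking score. Below is a minimal illustration using a two-sample t-like statistic, assuming NumPy; the function name and toy data are illustrative, not from the paper:

```python
import numpy as np

def tstat_ranking(X, y):
    """Rank variables by a two-sample t-like score: larger |t| means the
    variable separates the two classes (labels +1/-1) more strongly."""
    pos, neg = X[y == 1], X[y == -1]
    m_pos, m_neg = pos.mean(axis=0), neg.mean(axis=0)
    v_pos, v_neg = pos.var(axis=0, ddof=1), neg.var(axis=0, ddof=1)
    t = (m_pos - m_neg) / np.sqrt(v_pos / len(pos) + v_neg / len(neg))
    return np.argsort(-np.abs(t))  # best-scoring variable first

# toy data: variable 0 separates the classes, variable 1 is pure noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 1)), rng.normal(-2.0, 1.0, (20, 1))])
X = np.hstack([X, rng.normal(0.0, 1.0, (40, 1))])
y = np.array([1] * 20 + [-1] * 20)
print(tstat_ranking(X, y)[0])  # variable 0 ranks first
```

As the excerpt notes, such a score can double as a test statistic to assess the significance of each variable.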

1865 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context: ...omamonjy, 2003) are treated. Approximations for nonlinear least-squares have also been computed elsewhere (Monari and Dreyfus, 2000). The proposal of Rakotomamonjy (2003) is to train non-linear SVMs (Boser et al., 1992, Vapnik, 1998) with a regular training procedure and select features with backward elimination like in RFE (Guyon et al., 2002). The variable ranking criterion however is not computed using the sensi...

1115 | Gene selection for cancer classification using support vector machines
- Guyon, Weston, et al.
- 2002
Citation Context: ...unction, which measures the similarity between x and xk (Schoelkopf and Smola, 2002). The variation in J(s) is computed by keeping the αk values constant. This procedure originally proposed for SVMs (Guyon et al., 2002) is used in this issue as a baseline method (Rakotomamonjy, 2003, Weston et al., 2003). The “optimum brain damage” (OBD) procedure (method 2) is mentioned in this issue in the paper of Rivals and Per...
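The backward-elimination baseline this excerpt refers to (RFE) can be sketched in a few lines: train a linear model, drop the variable with the smallest weight magnitude, repeat. This is a hedged illustration, using ordinary least squares as a stand-in for the SVM of Guyon et al. (2002); the `fit` interface and toy data are assumptions:

```python
import numpy as np

def rfe(X, y, n_keep, fit):
    """Recursive feature elimination sketch: `fit(X, y)` must return a
    weight vector; the variable with the smallest |w_i| is removed each round."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w = fit(X[:, active], y)
        worst = int(np.argmin(np.abs(w)))  # least important surviving variable
        del active[worst]
    return active

# least-squares weights stand in for an SVM here, purely for illustration
fit_ls = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=50)  # only variable 2 matters
print(rfe(X, y, 1, fit_ls))  # -> [2]
```

Retraining after each removal is what distinguishes RFE from ranking the weights of a single model once.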

723 | Approximate statistical tests for comparing supervised classification learning algorithms - Dietterich - 1998

629 | Distributional clustering of English words - Pereira, Tishby, et al. - 1993

606 | Selection of relevant features and examples in machine learning
- Blum, Langley
- 1997
Citation Context: ...s, proteomics, QSAR, text classification, information retrieval. 1 Introduction As of 1997, when a special issue on relevance including several papers on variable and feature selection was published (Blum and Langley, 1997, Kohavi and John, 1997), few domains explored used more than 40 features. The situation has changed considerably in the past few years and, in this special issue, most papers explore domains with hun...

569 | Support vector machine classification and validation of cancer tissue samples using microarray expression data - Furey, Cristianini, et al. - 2000

540 | The information bottleneck method - Tishby, Pereira, et al. - 1999

531 | A practical approach to feature selection
- Kira, Rendell
- 1992
Citation Context: ...ented in Section 7.1, may be instrumental in producing a good variable ranking incorporating the context of others. The relief algorithm uses another approach based on the nearest-neighbor algorithm (Kira and Rendell, 1992). For each example, the closest example of the same class (nearest hit) and the closest example of a different class (nearest miss) are selected. The score S(i) of the i-th variable is computed as th...
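The Relief procedure described in this excerpt (nearest hit, nearest miss, per-variable score) can be sketched directly. A minimal illustration assuming NumPy and L1 distances; the function name and toy data are illustrative:

```python
import numpy as np

def relief_scores(X, y):
    """Relief sketch (Kira and Rendell, 1992): for each example, find the
    nearest hit (same class) and nearest miss (other class), then score each
    variable by its average miss-difference minus hit-difference."""
    m, n = X.shape
    S = np.zeros(n)
    for k in range(m):
        d = np.abs(X - X[k]).sum(axis=1)  # L1 distances to example k
        d[k] = np.inf                     # exclude the example itself
        hit = np.where(y == y[k], d, np.inf).argmin()   # nearest hit
        miss = np.where(y != y[k], d, np.inf).argmin()  # nearest miss
        S += np.abs(X[k] - X[miss]) - np.abs(X[k] - X[hit])
    return S / m

# toy data: variable 0 separates the classes, variable 1 is noise
X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [1.1, 0.8]])
y = np.array([1, 1, -1, -1])
S = relief_scores(X, y)
print(S[0] > S[1])  # variable 0 gets the higher score
```

A variable that separates the classes has large nearest-miss differences and small nearest-hit differences, hence a high score.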

510 | Optimal brain damage
- LeCun, Denker, et al.
- 1990
Citation Context: ...1) is computed for the variables that are candidates for addition or removal. 2. Quadratic approximation of the cost function: This method was originally proposed to prune weights in neural networks (LeCun et al., 1990). It can be used for backward elimination of variables, via the pruning of the input variable weights wi. A second order Taylor expansion of J is made. At the optimum of J, the first-order term can b...

496 | An extensive empirical study of feature selection metrics for text classification
- Forman
- 2003
Citation Context: ...because of its simplicity, scalability, and good empirical success. Several papers in this issue use variable ranking as a baseline method (see, e.g., Bekkerman et al., 2003, Caruana and de Sa, 2003, Forman, 2003, Weston et al., 2003). Variable ranking is not necessarily used to build predictors. One of its common uses in the microarray analysis domain is to discover a set of drug leads (see, e.g., Golub et al., 19...

480 | Toward Optimal Feature Selection
- Koller, Sahami
- 1996
Citation Context: ...roarray data for instance) one may need to resort to selecting variables with correlation coefficients (see Section 2.2). Information theoretic filtering methods such as Markov blanket algorithms (Koller and Sahami, 1996) constitute another broad family. The justification for classification problems is that the measure of mutual information does not rely on any prediction process, but provides a bound on the error ra...

432 | Stacked Regression - Breiman - 1996

330 | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring - Golub, Slonim, et al. - 1999

282 | Feature selection for SVMs
- Weston, Mukherjee, et al.
- 2000
Citation Context: ...rectly minimize the number of variables for non-linear predictors. Instead, several authors have substituted for the problem of variable selection that of variable scaling (Jebara and Jaakkola, 2000, Weston et al., 2000, Grandvalet and Canu, 2002). The variable scaling factors are “hyper-parameters” adjusted by model selection. The scaling factors obtained are used to assess variable relevance. A variant of the meth...

184 | Inference for the generalization error - Nadeau, Bengio - 2003

174 | Use of the zero-norm with linear models and kernel methods
- Weston, Elisseeff, et al.
- 2003
Citation Context: ...simplicity, scalability, and good empirical success. Several papers in this issue use variable ranking as a baseline method (see, e.g., Bekkerman et al., 2003, Caruana and de Sa, 2003, Forman, 2003, Weston et al., 2003). Variable ranking is not necessarily used to build predictors. One of its common uses in the microarray analysis domain is to discover a set of drug leads (see, e.g., Golub et al., 1999): A ranking criter...

162 | Feature extraction by nonparametric mutual information maximization
- Torkkola
- 2003
Citation Context: ...Several approaches to the variable selection problem using information theoretic criteria have been proposed (as reviewed in this issue by Bekkerman et al., 2003, Dhillon et al., 2003, Forman, 2003, Torkkola, 2003). Many rely on empirical estimates of the mutual information between each variable and the target: I(i) = ∫_{x_i} ∫_y p(x_i, y) log [ p(x_i, y) / (p(x_i) p(y)) ] dx dy (3), where p(xi) and p(y) are the probability ...
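The mutual information criterion in Equation (3) above reduces, for discrete data, to a plug-in estimate over empirical frequencies. A minimal sketch assuming NumPy and discrete-valued inputs (continuous variables would need binning or density estimation, as the surrounding literature discusses); names and data are illustrative:

```python
import numpy as np

def empirical_mi(xi, y):
    """Plug-in estimate of I(i) = sum_{a,b} p(a,b) log(p(a,b)/(p(a)p(b)))
    between one discrete variable and the target, in nats."""
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(y):
            p_xy = np.mean((xi == a) & (y == b))  # joint frequency
            p_x, p_y = np.mean(xi == a), np.mean(y == b)
            if p_xy > 0:  # 0 log 0 contributes nothing
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# a variable that perfectly predicts a balanced binary target carries log 2 nats
x = np.array([0, 0, 1, 1])
y = np.array([0, 0, 1, 1])
print(round(empirical_mi(x, y), 4))  # -> 0.6931
```

Ranking variables by this score is the filter approach the excerpt describes: no predictor is trained, yet the score bounds the achievable error rate.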

138 | A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
- Dhillon, Mallela, et al.
- 2003
Citation Context: ...formation Theoretic Ranking Criteria Several approaches to the variable selection problem using information theoretic criteria have been proposed (as reviewed in this issue by Bekkerman et al., 2003, Dhillon et al., 2003, Forman, 2003, Torkkola, 2003). Many rely on empirical estimates of the mutual information between each variable and the target: I(i) = ∫_{x_i} ∫_y p(x_i, y) log [ p(x_i, y) / (p(x_i) p(y)) ] dx dy (3), where p(xi) and p(y)...

125 | On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems
- Amaldi, Kann
- 1998
Citation Context: ...uide the search and halt it; and (iii) which predictor to use. An exhaustive search can conceivably be performed, if the number of variables is not too large. But, the problem is known to be NP-hard (Amaldi and Kann, 1998) and the search becomes quickly computationally intractable. A wide range of search strategies can be used, including best-first, branch-and-bound, simulated annealing, genetic algorithms (see Kohavi...
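Because exhaustive subset search is NP-hard, the greedy strategies this excerpt lists are the practical route. Greedy forward selection, one of the simplest, can be sketched as follows; the scoring function (negative residual sum of squares of a least-squares fit) is an assumed choice for illustration, not a prescribed one:

```python
import numpy as np

def forward_selection(X, y, n_keep, score):
    """Grow the subset one variable at a time, always adding the candidate
    whose inclusion most improves score(X_subset, y) (higher is better)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def neg_rss(Xs, y):
    # negative residual sum of squares of an ordinary least-squares fit
    w = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return -np.sum((Xs @ w - y) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 1] - 1.0 * X[:, 3]  # only variables 1 and 3 matter
print(sorted(forward_selection(X, y, 2, neg_rss)))  # -> [1, 3]
```

This evaluates only O(n · n_keep) subsets instead of 2^n, which is the whole point of a greedy search strategy.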

125 | Variable Selection using SVM-based criteria
- Rakotomamonjy
- 2003
Citation Context: ...opf and Smola, 2002). The variation in J(s) is computed by keeping the αk values constant. This procedure originally proposed for SVMs (Guyon et al., 2002) is used in this issue as a baseline method (Rakotomamonjy, 2003, Weston et al., 2003). The “optimum brain damage” (OBD) procedure (method 2) is mentioned in this issue in the paper of Rivals and Personnaz (2003). The case of linear predictors f(x) = w · x + b is ...

121 | Dimensionality Reduction via Sparse Support Vector Machines
- Bi, Bennett, et al.
- 2003
Citation Context: ...well; (ii) results are not reproducible; and (iii) one subset fails to capture the “whole picture”. One method to “stabilize” variable selection explored in this issue is to use several “bootstraps” (Bi et al., 2003). The variable selection process is repeated with sub-samples of the training data. The union of the subsets of variables selected in the various bootstraps is taken as the final “stable” subset. Thi...
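The bootstrap-stabilization idea in this excerpt (rerun the selector on resamples, keep the union) can be sketched generically. A hedged illustration in the spirit of, but not identical to, Bi et al. (2003); the `select(X, y)` interface, the top-1 correlation selector, and the toy data are all assumptions:

```python
import numpy as np

def bootstrap_union(X, y, select, n_boot=20, seed=0):
    """Run `select` (returning a list of variable indices) on n_boot
    bootstrap resamples and return the union of all selected subsets."""
    rng = np.random.default_rng(seed)
    chosen = set()
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # sample with replacement
        chosen.update(select(X[idx], y[idx]))
    return sorted(chosen)

# assumed selector: keep the single variable most correlated with the target
top1 = lambda X, y: [int(np.argmax(np.abs(X.T @ (y - y.mean()))))]
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X[:, 0] + 0.05 * rng.normal(size=40)  # variable 0 carries the signal
print(bootstrap_union(X, y, top1))  # -> [0]
```

Counting how often each variable appears across the bootstraps, rather than taking the plain union, gives the ranking variant the surrounding text mentions.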

106 | Grafting: Fast, incremental feature selection by gradient descent in function space
- Perkins, Lacker, et al.
- 2003
Citation Context: ...res of uniformly distributed examples with alternating class labels. The latter problem is also generalizable to the multi-dimensional case. Similar examples are used in several papers in this issue (Perkins et al., 2003, Stoppiglia et al., 2003). 8. Incidentally, the two variables are also uncorrelated with one another. ...

89 | Distributional Word Clusters vs. Words for Text Categorization - Bekkerman, El-Yaniv, et al. - 2003

81 | Overfitting in making comparisons between variable selection methods - Reunanen - 2003

73 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context: ...t classification, information retrieval. 1 Introduction As of 1997, when a special issue on relevance including several papers on variable and feature selection was published (Blum and Langley, 1997, Kohavi and John, 1997), few domains explored used more than 40 features. The situation has changed considerably in the past few years and, in this special issue, most papers explore domains with hundreds to tens of thousa...

65 | Ranking a Random Feature for Variable and Feature Selection
- Stoppiglia, Dreyfus, et al.
- 2003
Citation Context: ...ibuted examples with alternating class labels. The latter problem is also generalizable to the multi-dimensional case. Similar examples are used in several papers in this issue (Perkins et al., 2003, Stoppiglia et al., 2003). 8. Incidentally, the two variables are also uncorrelated with one another. ...

57 | The elements of statistical learning. Springer Series in Statistics
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...cient since it requires only the computation of n scores and sorting the scores; Statistically, it is robust against overfitting because it introduces bias but it may have considerably less variance (Hastie et al., 2001). We introduce some additional notation: If the input vector x can be interpreted as the realization of a random vector drawn from an underlying unknown distribution, we denote by Xi the random var...

53 | Adaptive scaling for feature selection in SVMs
- Grandvalet, Canu
- 2002
Citation Context: ...umber of variables for non-linear predictors. Instead, several authors have substituted for the problem of variable selection that of variable scaling (Jebara and Jaakkola, 2000, Weston et al., 2000, Grandvalet and Canu, 2002). The variable scaling factors are “hyper-parameters” adjusted by model selection. The scaling factors obtained are used to assess variable relevance. A variant of the method consists of adjusting th...

50 | Feature selection and dualities in maximum entropy discrimination
- Jebara, Jaakkola
- 2000
Citation Context: ...thm has been proposed to directly minimize the number of variables for non-linear predictors. Instead, several authors have substituted for the problem of variable selection that of variable scaling (Jebara and Jaakkola, 2000, Weston et al., 2000, Grandvalet and Canu, 2002). The variable scaling factors are “hyper-parameters” adjusted by model selection. The scaling factors obtained are used to assess variable relevance. ...

47 | On feature selection: Learning with exponentially many irrelevant features as training examples
- Ng
- 1998
Citation Context: ...the m dimensional vector containing all the target values. 4. The ratio of the between class variance to the within-class variance. 5. The similarity of variable ranking to the ORDERED-FS algorithm (Ng, 1998) indicates that its sample complexity may be logarithmic in the number of irrelevant features, compared to a power law for “wrapper” subset selection methods. This would mean that variable ranking ca...

45 | A new metric-based approach to model selection
- Schuurmans
- 1997
Citation Context: ...ion of the training error have been proposed in the literature (see, e.g., Vapnik, 1998, Hastie et al., 2001). Recently, a new family of such methods called “metric-based methods” have been proposed (Schuurmans, 1997). The paper of Bengio and Chapados (2003) in this issue illustrates their application to variable selection. The authors make use of unlabelled data, which are readily available in the application co...

45 | Estimation of dependences based on empirical data. Springer Series in Statistics
- Vapnik
- 1982
Citation Context: ...s the case, for instance, for the linear least square model using J = Σ_{k=1}^{m} (w·x_k + b − y_k)² and for the linear SVM or optimum margin classifier, which minimizes J = (1/2)||w||², under constraints (Vapnik, 1982). Interestingly, for linear SVMs the finite difference method (method 1) and the sensitivity method (method 3) also boil down to selecting the variable with smallest |wi| for elimination at each step...
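For a linear predictor f(x) = w·x + b, the elimination rule this excerpt describes (remove the variable with the smallest |w_i|) is easy to demonstrate. A minimal sketch assuming NumPy and a least-squares fit of J = Σ_k (w·x_k + b − y_k)², standing in for the SVM case; data and names are illustrative:

```python
import numpy as np

def smallest_weight_first(X, y):
    """Fit w, b by least squares and return variable indices ordered by
    |w_i|, smallest first, i.e., the order in which they would be pruned."""
    A = np.hstack([X, np.ones((len(y), 1))])  # extra column absorbs the bias b
    wb = np.linalg.lstsq(A, y, rcond=None)[0]
    w = wb[:-1]                               # drop the bias coefficient
    return np.argsort(np.abs(w))              # first entry = first to prune

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
y = 0.05 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2]
order = smallest_weight_first(X, y)
print(order[0])  # variable 0, with the smallest weight, is pruned first
```

For linear SVMs the finite-difference and sensitivity criteria reduce to this same |w_i| rule, which is why it serves as a common baseline.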

36 | Sufficient dimensionality reduction
- Globerson, Tishby
- 2003
Citation Context: ...heoretic approach to the problem. Two of them illustrate the use of clustering to construct features (Bekkerman et al., 2003, Dhillon et al., 2003), one provides a new matrix factorization algorithm (Globerson and Tishby, 2003), and one provides a supervised means of learning features from a variety of models (Torkkola, 2003). In addition, two papers whose main focus is directed to variable selection also address the selec...

35 | Regression shrinkage and selection via the lasso - Tibshirani - 1994

21 | MLPs (mono-layer polynomials and multi-layer perceptrons) for nonlinear modeling
- Rivals, Personnaz
- 2003
Citation Context: ...formance of forward variable selection by adding at each step the variable that most decreases the mean-squared-error. Two papers in this issue are devoted to this technique (Stoppiglia et al., 2003, Rivals and Personnaz, 2003). For other algorithms like kernel methods, approximations of the difference can be computed efficiently. Kernel methods are learning machines of the form f(x) = Σ_{k=1}^{m} α_k K(x, x_k), where K is the ke...

19 | Extensions to Metric-Based Model Selection
- Bengio, Chapados
- 2003
Citation Context: ...ally expensive procedure is used in cases where data is extremely scarce. Cross-validation can be extended to time-series data and, while i.i.d. assumptions do not hold anymore, it is still possible to estimate generalization error confidence intervals (see Bengio and Chapados, 2003, in this issue). Choosing what fraction of the data should be used for training and for validation is an open problem. Many authors resort to using the leave-one-out cross-validation procedure, even ...

17 | Withdrawing an example from the training set: an analytic estimation of its effect on a nonlinear parameterized model - Monari, Dreyfus

12 | Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection
- Ng, Jordan
- 2001
Citation Context: ...ariables can be created considering how frequently they appear in the bootstraps. Related ideas have been described elsewhere in the context of Bayesian variable selection (Jebara and Jaakkola, 2000, Ng and Jordan, 2001, Vehtari and Lampinen, 2002). A distribution over a population of models using various variable subsets is estimated. Variables are then ranked according to the marginal distribution, reflecting how ...

10 | Benefitting from the variables that variable selection discards - Caruana, de Sa - 2003

9 | Bayesian input variable selection using posterior probabilities and expected utilities
- Vehtari, Lampinen
- 2002
Citation Context: ...ed considering how frequently they appear in the bootstraps. Related ideas have been described elsewhere in the context of Bayesian variable selection (Jebara and Jaakkola, 2000, Ng and Jordan, 2001, Vehtari and Lampinen, 2002). A distribution over a population of models using various variable subsets is estimated. Variables are then ranked according to the marginal distribution, reflecting how often they appear in importa...
