Results 1 - 10 of 569
An introduction to variable and feature selection
- Journal of Machine Learning Research, 2003
Cited by 1352 (16 self)
Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available.
Gene selection for cancer classification using support vector machines
- Machine Learning
Cited by 1115 (24 self)
DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia, our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.
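The Recursive Feature Elimination loop named in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear SVM is trained with plain full-batch subgradient descent on the hinge loss (a stand-in chosen to keep the example dependency-free), the squared weight is assumed as the ranking criterion, one feature is removed per round, and the function names, hyperparameters, and toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    A simple stand-in for a real SVM solver; labels must be +1 or -1.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute to the subgradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rfe_ranking(X, y):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Assumed ranking criterion: drop the feature with the smallest squared weight.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # the last feature eliminated ranks first


# Toy data invented for illustration: only feature 0 tracks the label.
X = [[2.0, 0.1, -0.3], [1.5, -0.2, 0.2], [2.5, 0.3, 0.1],
     [-2.0, 0.2, -0.1], [-1.5, -0.1, 0.3], [-2.5, -0.3, -0.2]]
y = [1, 1, 1, -1, -1, -1]
ranking = rfe_ranking(X, y)
print(ranking[0])  # → 0
```

Retraining after every removal is what separates this from a one-shot ranking: the weights of the surviving features are re-estimated once a discarded feature no longer competes with them.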
A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-Test and Statistical Inferences of Gene Changes
- Bioinformatics, 2001
Cited by 491 (6 self)
Motivation: DNA microarrays are now capable of providing genome-wide patterns of gene expression across many different conditions. The first level of analysis of these patterns requires determining whether observed differences in expression are significant or not. Current methods are unsatisfactory due to the lack of a systematic framework that can accommodate noise, variability, and low replication often typical of microarray data.
Results: We develop a Bayesian probabilistic framework for microarray data analysis. At the simplest level, we model log-expression values by independent normal distributions, parameterized by corresponding means and variances with hierarchical prior distributions. We derive point estimates for both parameters and hyperparameters, and regularized expressions for the variance of each gene by combining the empirical variance with a local background variance associated with neighboring genes. An additional hyperparameter, inversely related to the number of empirical observations, determines the strength of the background variance. Simulations show that these point estimates, combined with a t-test, provide a systematic inference approach that compares favorably with simple t-test or fold methods, and partly compensate for the lack of replication.
Availability: The approach is implemented in software called Cyber-T, accessible through a Web interface at www.genomics.uci.edu/software.html. The code is available as Open Source and is written in the freely available statistical language R.
Contact: pfbaldi@ics.uci.edu, tdlong@uci.edu.
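The idea of combining an empirical variance with a background variance before running a t-test can be illustrated with a short sketch. The abstract does not state the formula, so the weighted form used here, (nu0 * background_var + (n - 1) * s^2) / (nu0 + n - 2), is an assumption and should be checked against the paper; the function names, the weight `nu0`, and the numbers are hypothetical, and this is not the Cyber-T code.

```python
import math
from statistics import mean, variance


def regularized_variance(sample, background_var, nu0):
    """Shrink the sample variance toward a background variance.

    Assumed form: (nu0 * background_var + (n - 1) * s^2) / (nu0 + n - 2).
    Requires nu0 + n > 2 and at least two observations.
    """
    n = len(sample)
    return (nu0 * background_var + (n - 1) * variance(sample)) / (nu0 + n - 2)


def regularized_t(control, treatment, bg_control, bg_treatment, nu0):
    """Two-sample t statistic computed with the shrunken variances."""
    var_c = regularized_variance(control, bg_control, nu0)
    var_t = regularized_variance(treatment, bg_treatment, nu0)
    se = math.sqrt(var_c / len(control) + var_t / len(treatment))
    return (mean(treatment) - mean(control)) / se


# Invented log-expression values for one gene, two replicates per condition.
control = [2.0, 2.1]
treatment = [2.3, 2.4]
t_reg = regularized_t(control, treatment, bg_control=0.04, bg_treatment=0.04, nu0=8)
print(round(t_reg, 2))  # → 1.49
```

With only two replicates the raw sample variances here are 0.005, which would give a plain t statistic above 4; pulling the variances toward a background value of 0.04 tempers that, which is the kind of compensation for low replication the abstract describes.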
Minimum redundancy feature selection from microarray gene expression data
- 2003
Cited by 239 (8 self)
Selecting a small subset of genes out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expression among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. Feature sets obtained through the minimum redundancy – maximum relevance framework represent a broader spectrum of characteristics of phenotypes than those obtained through standard ranking methods; they are more robust, generalize well to unseen data, and lead to significantly improved classifications in extensive experiments on 5 gene expression data sets.
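A greedy selection in the spirit of the minimum redundancy, maximum relevance framework can be sketched as below. It assumes discrete features, mutual information as the measure of both relevance and redundancy, and a difference score (relevance minus mean redundancy); these choices, the function names, and the toy data are illustrative assumptions rather than details taken from the abstract.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def mrmr(features, labels, k):
    """Greedily pick k feature indices by relevance minus mean redundancy."""
    k = min(k, len(features))
    relevance = [mutual_information(f, labels) for f in features]
    selected = [max(range(len(features)), key=lambda j: relevance[j])]
    while len(selected) < k:
        def score(j):
            redundancy = sum(
                mutual_information(features[j], features[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        candidates = [j for j in range(len(features)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Invented binary data: feature 1 nearly duplicates feature 0,
# feature 2 is equally relevant to the label but far less redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 1],
]
print(mrmr(features, labels, 2))  # → [0, 2]
```

Features 1 and 2 have the same relevance here, so a pure ranking cannot tell them apart; the redundancy term is what steers the second pick toward feature 2.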
Kernel Methods for Relation Extraction
- 2002
Cited by 219 (0 self)
We present an application of kernel methods to extracting relations from unstructured natural language sources.
BagBoosting for tumor classification with gene expression data
- Bioinformatics, 2004
Cited by 194 (2 self)
Motivation: Microarray experiments are expected to contribute significantly to the progress in cancer treatment by enabling a precise and early diagnosis. They create a need for class prediction tools that can deal with a large number of highly correlated input variables, perform feature selection, and provide class probability estimates that serve as a quantification of the predictive uncertainty. A very promising solution is to combine the two ensemble schemes bagging and boosting into a novel algorithm called BagBoosting.
Results: When bagging is used as a module in boosting, the resulting classifier consistently improves the predictive performance and the probability estimates of both bagging and boosting on real and simulated gene expression data. This quasi-guaranteed improvement can be obtained by simply making a bigger computing effort. The advantageous predictive potential is also confirmed by comparing BagBoosting to several established class prediction tools for microarray data.
Fast Binary Feature Selection with Conditional Mutual Information
- Journal of Machine Learning Research, 2004
Cited by 176 (1 self)
We propose in this paper a very fast feature selection technique based on conditional mutual information.
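Feature selection driven by conditional mutual information can be sketched as follows. The abstract does not spell out the criterion, so this sketch assumes a max-min rule: each candidate is scored by the smallest information it carries about the label given any single feature already chosen. It is a naive version with no attempt at the speed the paper emphasises, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def conditional_mi(y, x, z):
    """I(Y; X | Z) in nats for discrete sequences of equal length."""
    n = len(y)
    c_z, c_yz, c_xz = Counter(z), Counter(zip(y, z)), Counter(zip(x, z))
    c_yxz = Counter(zip(y, x, z))
    return sum(
        (c / n) * math.log(c * c_z[zv] / (c_yz[(yv, zv)] * c_xz[(xv, zv)]))
        for (yv, xv, zv), c in c_yxz.items()
    )


def cmim(features, labels, k):
    """Greedy max-min selection on conditional mutual information."""
    constant = [0] * len(labels)  # conditioning on a constant gives plain I(Y; X)
    score = [conditional_mi(labels, f, constant) for f in features]
    selected = []
    for _ in range(min(k, len(features))):
        candidates = [j for j in range(len(features)) if j not in selected]
        best = max(candidates, key=lambda j: score[j])
        selected.append(best)
        # A candidate's score can only drop as more features are chosen.
        for j in candidates:
            if j != best:
                score[j] = min(
                    score[j],
                    conditional_mi(labels, features[j], features[best]),
                )
    return selected


# Invented binary data: feature 1 adds little once feature 0 is known.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 1],
]
print(cmim(features, labels, 2))  # → [0, 2]
```

Feature 1 is individually as informative as feature 2, but most of that information is already carried by feature 0, so its conditional score collapses and feature 2 is picked second.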
From patterns to pathways: gene expression data analysis comes of age.
- Nature Genetics, 2002
Support Vector Machines: Hype or Hallelujah?
- SIGKDD Explorations, 2003
Cited by 119 (1 self)
Support Vector Machines (SVMs) and related kernel methods have become increasingly popular tools for data mining tasks such as classification, regression, and novelty detection. The goal of this tutorial is to provide an intuitive explanation of SVMs from a geometric perspective. The classification problem is used to investigate the basic concepts behind SVMs and to examine their strengths and weaknesses from a data mining perspective. While this overview is not comprehensive, it does provide resources for those interested in further exploring SVMs.
Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data
- Journal of the American Statistical Association, 2002
Cited by 118 (4 self)
Monitoring gene expression profiles is a novel approach in cancer diagnosis. Several studies showed that prediction of cancer types using gene expression data is promising and very informative. The Support Vector Machine (SVM) is one of the classification methods successfully applied to the cancer diagnosis problems using gene expression data. However, its optimal extension to more than two classes was not obvious, which might impose limitations in its application to multiple tumor types. In this paper, we analyze a couple of published multiple cancer types data sets by the multicategory SVM, which is a recently proposed extension of the binary SVM.