Efficient and robust feature selection via joint ℓ2,1-norms minimization. NIPS, 2010
Cited by 71 (24 self)
Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method that emphasizes joint ℓ2,1-norm minimization on both the loss function and the regularization. The ℓ2,1-norm based loss function is robust to outliers in data points, and the ℓ2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proven convergence. Our regression-based objective makes the feature selection process more efficient. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method.
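As a rough illustration of the ℓ2,1-norm mentioned in this abstract, the sketch below computes it for a small matrix using the usual row-wise definition (the sum of the Euclidean norms of the rows). The matrix `W`, its interpretation as a feature-by-target weight matrix, and the helper names `l21_norm` and `zero_rows` are assumptions made for illustration; this is not the authors' algorithm.

```python
import math

def row_norm(row):
    """Euclidean (l2) norm of one row."""
    return math.sqrt(sum(v * v for v in row))

def l21_norm(matrix):
    """l2,1-norm: sum of the l2 norms of the rows of a matrix (list of rows)."""
    return sum(row_norm(row) for row in matrix)

def zero_rows(matrix, tol=1e-12):
    """Indices of rows whose l2 norm is numerically zero."""
    return [i for i, row in enumerate(matrix) if row_norm(row) <= tol]

# Hypothetical weight matrix: one row per feature, one column per target.
W = [
    [3.0, 4.0],  # row norm 5
    [0.0, 0.0],  # row norm 0: this feature is unused for every target
    [0.0, 5.0],  # row norm 5
]

total = l21_norm(W)
unused = zero_rows(W)
print(total, unused)
```

Because each row enters through its unsquared norm, a penalty of this form tends to drive entire rows to zero at once, which is one way to read the "joint sparsity" in the abstract. Applied to a residual matrix with one row per data point, the same unsquared norm limits how much a single outlying point can dominate the loss.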
Mutual information estimation reveals global associations between stimuli and biological processes - BMC Bioinformatics, 2009
Cited by 42 (37 self)
Background: Although gene expression analysis with microarrays has become popular, it remains difficult to interpret the biological changes caused by stimuli or variation of conditions. Clustering genes and associating each group with biological functions are commonly used methods. However, such methods only detect partial changes within cell processes. Herein, we propose a method for discovering global changes within a cell by associating observed conditions of gene expression with gene functions. Results: To elucidate the association, we introduce a novel feature selection method called Least-Squares Mutual Information (LSMI), which computes the relation based on mutual information, and therefore LSMI can detect nonlinear associations within a cell. We demonstrate the effectiveness of LSMI through comparison with existing methods. The results of the application to yeast microarray datasets reveal that non-natural stimuli affect various biological processes, whereas others show no significant relation to specific cell processes. Furthermore, we discover that biological processes can be categorized into four types according to their responses to various stimuli: those related to DNA/RNA metabolic processes, gene expression, protein metabolic processes, and protein localization. Conclusions: We proposed a novel feature selection method called LSMI, and applied LSMI to mining the association between conditions of yeast and biological processes through microarray datasets. In fact, LSMI allows us to elucidate the global organization of cellular process control.
Robust feature selection using ensemble feature selection techniques - in Machine Learning and Knowledge Discovery in Databases, 2008
Cited by 36 (5 self)
Robustness or stability of feature selection techniques is a topic of recent interest, and is an important issue when selected feature subsets are subsequently analysed by domain experts to gain more insight into the problem modelled. In this work, we investigate the use of ensemble feature selection techniques, where multiple feature selection methods are combined to yield more robust results. We show that these techniques hold great promise for high-dimensional domains with small sample sizes, and provide more robust feature subsets than a single feature selection technique. In addition, we also investigate the effect of ensemble feature selection techniques on classification performance, giving rise to a new model selection strategy.
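One simple way to combine several feature selectors, as this abstract describes in general terms, is to aggregate their rankings. The sketch below sums each feature's rank position across rankings and keeps the features with the lowest totals. The aggregation rule, the function name `aggregate_rankings`, and the gene labels are assumptions for illustration; the abstract does not say which combination scheme the authors use.

```python
def aggregate_rankings(rankings, k):
    """Combine feature rankings by summed rank position and return the top k.

    Each ranking lists feature names from most to least relevant.
    Assumes every ranking contains the same set of features, in which
    case ordering by summed position equals ordering by mean rank.
    """
    totals = {}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] = totals.get(feature, 0) + position
    # A lower total means the feature is consistently ranked near the top.
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(totals, key=lambda f: (totals[f], f))
    return ordered[:k]

# Hypothetical rankings, e.g. from different selectors or bootstrap samples.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
]

top = aggregate_rankings(rankings, 2)
print(top)
```

A feature that only one selector ranks highly ends up with a large total, so the aggregated subset changes less when any single ranking is perturbed.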
Scorpion: Explaining Away Outliers in Aggregate Queries
Cited by 18 (4 self)
Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. We propose Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, this explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
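The idea of scoring a predicate by its effect on an aggregate can be sketched as follows. This is a deliberately simplified stand-in, not the influence definition from the Scorpion paper: here the score is just the drop in an average when the matching tuples are removed, and the search is a brute-force scan over a handful of candidate predicates. The `sensor` and `temp` fields, the data, and the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def influence(rows, predicate, value_key="temp"):
    """Drop in the average of value_key when rows matching predicate are removed."""
    kept = [r[value_key] for r in rows if not predicate(r)]
    if not kept or len(kept) == len(rows):
        # The predicate removes everything or nothing: no usable explanation.
        return 0.0
    return mean([r[value_key] for r in rows]) - mean(kept)

# Hypothetical input tuples behind one unusually high average reading.
rows = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "b", "temp": 20.0},
    {"sensor": "c", "temp": 80.0},
]

# Candidate predicates of the form sensor == s (default argument binds s per lambda).
candidates = {s: (lambda r, s=s: r["sensor"] == s) for s in ("a", "b", "c")}

scores = {name: influence(rows, pred) for name, pred in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Removing the tuples from sensor `c` lowers the average the most, so that predicate is the best explanation under this toy score. The abstract's point is that searching such predicates naively is expensive, which is what the paper's algorithms are designed to avoid.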
E.: Integrating the Content and Process of Strategic MIS Planning with Competitive Strategy. Decision Sciences 22 (5), 1991
Cited by 12 (0 self)
We review here the recent success in quantum annealing, i.e., optimization of the cost or energy functions of complex systems utilizing quantum fluctuations. The concept is introduced in successive steps through the studies of mapping of such computationally hard problems to the classical spin glass problems. The quantum spin glass problems arise with the introduction of quantum
P.: Partially supervised feature selection with regularized linear models - In: Proceedings of the 26th International Conference on Machine Learning, 2009
Cited by 12 (3 self)
Advancing Feature Selection Research − ASU Feature Selection Repository
Cited by 12 (1 self)
The rapid advance of computer-based high-throughput techniques has provided unparalleled opportunities for humans to expand capabilities in production, services, communications, and research. Meanwhile, immense quantities of high-dimensional data are accumulated, challenging state-of-the-art data mining techniques. Feature selection is an essential step in successful data mining applications, which can effectively reduce data dimensionality by removing the irrelevant (and the redundant) features. In the past few decades, researchers have developed a large number of feature selection algorithms. These algorithms are designed to serve different purposes, are of different models, and all have their own advantages and disadvantages. Although there have been intensive efforts on surveying existing feature selection algorithms, to the best of our knowledge, there is still no dedicated repository that collects the representative feature selection algorithms to facilitate their comparison and joint study. To fill this gap, in this work we present a feature selection repository, which is designed to collect the most popular algorithms that have been developed in feature selection research and to serve as a platform for facilitating their application, comparison and joint study. The repository also effectively assists researchers in achieving more reliable evaluation in the process of developing new feature selection algorithms.
A Regularized Method for Selecting Nested Groups of Relevant Genes from Microarray Data
, 809
Cited by 11 (7 self)
Gene expression analysis aims at identifying the genes able to accurately predict biological parameters such as disease subtype or progression. While accurate prediction can be achieved by means of many different techniques, gene identification, due to gene correlation and the limited number of available samples, is a much more elusive problem. Small changes in the expression values often produce different gene lists, and solutions which are both sparse and stable are difficult to obtain. We propose a two-stage regularization method able to learn linear models characterized by a high prediction performance. By varying a suitable parameter, these linear models allow one to trade sparsity for the inclusion of correlated genes and to produce gene lists which are almost perfectly nested. Experimental results on synthetic and microarray data confirm the interesting properties of the proposed method and its potential as a starting point for further biological investigations. Matlab code is available upon request.
K: Travelling the world of gene–gene interactions - Brief Bioinform
Cited by 11 (0 self)
Over the last few years, main effect genetic association analysis has proven to be a successful tool to unravel genetic risk components to a variety of complex diseases. In the quest for disease susceptibility factors and the search for the ‘missing heritability’, supplementary and complementary efforts have been undertaken. These include the inclusion of several genetic inheritance assumptions in model development, the consideration of different sources of information, and the acknowledgement of disease underlying pathways of networks. The search for epistasis or gene–gene interaction effects on traits of interest is marked by an exponential growth, not only in terms of methodological development, but also in terms of practical applications, translation of statistical epistasis to biological epistasis and integration of omics information sources. The current popularity of the field, as well as its attraction to interdisciplinary teams, each making valuable contributions with sometimes rather unique viewpoints, renders it impossible to give an exhaustive review of to-date available approaches for epistasis screening. The purpose of this work is to give a perspective view on a selection of currently active analysis strategies and concerns in the context of epistasis detection, and to provide an eye to the future of gene–gene interaction analysis.