Results 1 - 10 of 41
Cluster Analysis for Gene Expression Data: A Survey
- IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 48 (3 self)
Abstract: DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have recently been proposed specifically for gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field.
Index Terms: Microarray technology, gene expression data, clustering.
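As an illustration of the partitioning idea in the abstract above (points within a group more similar to each other than to points in other groups), here is a minimal sketch of one conventional algorithm, k-means (Lloyd's algorithm). The survey does not single out this algorithm; the toy expression profiles, the function name `kmeans`, and the choice of starting centroids are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with explicit starting centroids, so the run is deterministic."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: one row per gene, one column per condition.
profiles = [
    (1.0, 1.1, 0.9),
    (0.9, 1.0, 1.0),
    (1.1, 0.9, 1.0),
    (5.0, 5.2, 4.9),
    (5.1, 4.9, 5.0),
]
labels, centroids = kmeans(profiles, centroids=[profiles[0], profiles[3]])
print(labels)  # [0, 0, 0, 1, 1]
```

Real expression data would usually call for normalization and a correlation-based similarity measure before clustering; the sketch uses raw Euclidean distance only to keep it short.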
Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions
- Bioinformatics, 2003
Cited by 38 (1 self)
Motivation: Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the ‘curse of dimensionality’: the number of features characterizing these data is in the thousands or tens of thousands. The other is the ‘curse of dataset sparsity’: the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease. Results: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample per feature ratio is increased to the recommended 5–10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several ‘optimal’ feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these ‘optimal’ feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
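The non-uniqueness described in the abstract above can be made concrete with a small back-of-the-envelope calculation. The sketch below is not the paper's procedure; it assumes pure-noise features drawn independently from a continuous distribution (no ties), two classes of equal size, and a single-threshold classifier. Under those assumptions, a noise feature separates the classes perfectly with probability 2 / C(2k, k), where k is the number of samples per class. The function name is hypothetical.

```python
from math import comb  # binomial coefficient, Python 3.8+


def expected_perfect_noise_features(n_per_class, n_features):
    """Expected count of pure-noise features that split two equal-size
    classes perfectly with one threshold (assumes continuous i.i.d. noise)."""
    # All class-A values fall below all class-B values with probability
    # 1 / C(2k, k); the factor 2 covers the reversed ordering.
    p_perfect = 2 / comb(2 * n_per_class, n_per_class)
    return n_features * p_perfect


# 5 samples per class and 10,000 features: roughly 79 noise features
# are expected to classify the training samples perfectly by chance.
print(round(expected_perfect_noise_features(5, 10_000), 2))  # 79.37
```

The expected count falls quickly as the number of samples per class grows, which is one way to read the abstract's warning about sparse datasets.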
Pattern Recognition Techniques in Microarray Data Analysis: A Survey
- Annals of the New York Academy of Sciences, Techniques in Bioinformatics and Medical Informatics, 2002
Cited by 21 (0 self)
Abstract: Recent development of technologies (e.g. microarray technology) that are capable of producing massive amounts of genetic data has highlighted the need for new pattern recognition techniques that can mine and discover "biologically meaningful" knowledge in large data sets. Many researchers have begun an endeavor in this direction to devise such data mining techniques. As such, there is a need for survey articles that periodically review and summarize the work that has been done in the area. This article presents one such survey. The first portion of the paper is meant to provide the basic biology (mostly for non-biologists) that is required in such a project. This part is only meant to be a starting point for those experts in the technical fields who wish to embark on this new area of bioinformatics. The second portion of the paper is a survey of various data mining techniques that have been used in mining microarray data for biological knowledge and information (such as sequence information). This survey is not meant to be treated as complete in any form, as the area is currently one of the most active, and the body of research is very large. Furthermore, the applications of the techniques mentioned here are not meant to be taken as the most significant applications of the techniques, but simply as some examples among many.
A hybrid ga/svm approach for gene selection and classification of microarray data
- EvoWorkshops 2006, LNCS 3907, 2006
Cited by 15 (3 self)
Abstract. We propose a Genetic Algorithm (GA) approach combined with Support Vector Machines (SVM) for the classification of high dimensional Microarray data. This approach is associated with a fuzzy logic based pre-filtering technique. The GA is used to evolve gene subsets whose fitness is evaluated by an SVM classifier. Using archive records of "good" gene subsets, a frequency based technique is introduced to identify the most informative genes. Our approach is assessed on two well-known cancer datasets and shows competitive results with six existing methods.
Genetic programming for mining DNA chip data from cancer patients
- Genetic Programming and Evolvable Machines, 2004
Cited by 14 (7 self)
Abstract. In machine learning terms, DNA (gene) chip data is unusual in having thousands of attributes (the gene expression values) but few (< 100) records (the patients). A GP based method for both feature selection and generating simple models based on a few genes is demonstrated on cancer data.
PLS dimension reduction for classification with microarray data
- Statistical Applications in Genetics and Molecular Biology 3, Issue 1, Article 33, 2004
Cancer gene search with data-mining and genetic algorithms
- www.intl.elsevierhealth.com/journals/cobm, 2005
Cited by 6 (1 self)
Cancer leads to approximately 25% of all mortalities, making it the second leading cause of death in the United States. Early and accurate detection of cancer is critical to the well-being of patients. Analysis of gene expression data leads to cancer identification and classification, which will facilitate proper treatment selection and drug development. Gene expression data sets for ovarian, prostate, and lung cancer were analyzed in this research. An integrated gene-search algorithm for genetic expression data analysis was proposed. This integrated algorithm involves a genetic algorithm and correlation-based heuristics for data preprocessing (on partitioned data sets) and data mining (decision tree and support vector machines algorithms) for making predictions. Knowledge derived by the proposed algorithm has high classification accuracy with the ability to identify the most significant genes. Bagging and stacking algorithms were applied to further enhance the classification accuracy. The results were compared with those reported in the literature. Mapping of genotype information to the phenotype parameters will ultimately reduce the cost and complexity of cancer detection and classification.
A Bayesian approach to nonlinear probit gene selection and classification
- 2004
Cited by 6 (2 self)
We consider the problem of gene selection and classification based on the expression data. Specifically, we propose a bootstrap Bayesian gene selection method for nonlinear probit regression. A binomial probit regression model with data augmentation is used to transform the binomial problem into a sequence of smoothing problems. The probit regressor is approximated as a nonlinear combination of the genes. A Gibbs sampler is employed to find the strongest genes. Some numerical techniques to speed up the computation are discussed. We then develop a nonlinear probit Bayesian classifier consisting of a linear term plus a nonlinear term, the parameters of which are estimated using the sequential Monte Carlo technique. These new methods are applied to analyze several data sets, including the hereditary breast cancer data, the small round blue-cell tumor data, and the acute leukemia tumor data. The experimental results show the proposed methods can effectively find important genes which are consistent with the existing biological belief, and the classification accuracies are very high. Some robustness and sensitivity properties of the proposed methods are also discussed to deal with noisy microarray data.
Gene Selection via Discretized Gene-Expression Profiles and Greedy Feature-Elimination
- LNAI 3025, 2004
Cited by 5 (4 self)
Abstract. Analysis and interpretation of gene-expression profiles, and the identification of the respective molecular or gene markers, is the key to understanding the genetic basis of major diseases. The problem is challenging because of the huge number of genes (thousands to tens of thousands) and the small number of samples (about 50 to 100 cases). In this paper we present a novel gene-selection methodology, based on the discretization of the continuous gene-expression values. With a specially devised gene-ranking metric we measure the strength of each gene with respect to its power to discriminate between sample categories. Then, a greedy feature-elimination algorithm is applied to the rank-ordered genes to form the final set of selected genes. Unseen samples are classified according to a specially devised prediction/matching metric. The methodology was applied to a number of real-world gene-expression studies, yielding very good results.
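The pipeline outlined in the abstract above (discretize, rank, greedily eliminate) can be sketched as follows. The abstract does not specify its "specially devised" metrics, so everything concrete here is a stand-in assumption: a median split for discretization, a match count against the class labels for ranking, and a score-weighted vote on the training samples as the elimination criterion. All function names and the toy data are hypothetical.

```python
from statistics import median


def discretize(values, labels):
    """Binarize one gene at its median (assumed cut point) and orient the
    bits so that 1 tends to mean class 1. Returns (bits, match_count)."""
    cut = median(values)
    bits = [1 if v > cut else 0 for v in values]
    matches = sum(b == y for b, y in zip(bits, labels))
    if matches < len(labels) - matches:  # inverted gene: flip it
        bits = [1 - b for b in bits]
        matches = len(labels) - matches
    return bits, matches


def accuracy(genes, bits, weight, labels):
    """Training accuracy of a vote weighted by each gene's match count."""
    total = sum(weight[g] for g in genes)
    correct = 0
    for i, y in enumerate(labels):
        vote = sum(weight[g] for g in genes if bits[g][i] == 1)
        pred = 1 if 2 * vote > total else 0
        correct += pred == y
    return correct / len(labels)


def select_genes(expression, labels):
    """Rank genes by match count, then drop genes from the bottom of the
    ranking whenever removal does not reduce training accuracy."""
    bits, weight = {}, {}
    for gene, values in expression.items():
        bits[gene], weight[gene] = discretize(values, labels)
    ranked = sorted(expression, key=lambda g: -weight[g])  # stable on ties
    selected = list(ranked)
    for gene in reversed(ranked):
        if len(selected) == 1:
            break
        trial = [g for g in selected if g != gene]
        if accuracy(trial, bits, weight, labels) >= accuracy(selected, bits, weight, labels):
            selected = trial
    return ranked, selected


# Hypothetical toy data: three genes measured on six samples.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # separates the classes cleanly
    "g2": [2.0, 2.1, 3.0, 1.9, 3.2, 3.1],  # partly informative
    "g3": [5.0, 1.0, 6.0, 2.0, 4.0, 3.0],  # close to noise
}
ranked, selected = select_genes(expression, labels)
print(ranked)    # ['g1', 'g2', 'g3']
print(selected)  # ['g1']
```

Judging elimination by training accuracy alone is optimistic with so few samples; a held-out or cross-validated estimate would be the safer choice in practice.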
Nonlinear Probit Gene Classification Using Mutual Information And Wavelet-Based Feature Selection
- Biological Systems, 2004
"... this paper, we consider both ..."

