Cluster Analysis for Gene Expression Data: A Survey
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 149 (5 self)
Abstract: DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms aimed specifically at gene expression data have recently been proposed. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.
Index Terms: Microarray technology, gene expression data, clustering.
A methodology for clustering XML documents by structure
Information Systems, 2006
Cited by 50 (0 self)
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.
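The idea of clustering XML documents by structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework from the paper: the set of root-to-node label paths is used as a crude stand-in for a structural summary, the Jaccard distance between path sets replaces a true tree edit distance, and the function names (`label_paths`, `path_distance`, `single_link_clusters`) and the threshold are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def label_paths(xml_text):
    """Return the set of root-to-node label paths of an XML document.

    Repeated siblings collapse into one path, so this acts as a very
    rough structural summary (an assumption made for this sketch).
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def path_distance(a, b):
    """Jaccard distance between two path sets (stand-in for a tree distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_link_clusters(docs, threshold):
    """Agglomerative single-link clustering of XML strings.

    Repeatedly merges the two closest clusters and stops once the
    smallest inter-cluster distance exceeds `threshold`.
    """
    summaries = [label_paths(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        return min(path_distance(summaries[i], summaries[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: x < y, so index x is unaffected
    return sorted(sorted(c) for c in clusters)
```

Called on two `book` documents and one `order` document with a threshold of 0.5, the two `book` documents end up in one cluster and the `order` document in another.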
Spatial Data Mining
2003
Cited by 35 (8 self)
Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. This chapter will discuss some of the accomplishments and research needs of spatial data mining in the following categories: location prediction, spatial outlier detection, co-location mining, and clustering.
Algorithms for clustering expressed sequence tags: the wcd tool
2008
Cited by 6 (3 self)
Understanding which genes are active, and when and why, is an important question for molecular biology. Expressed Sequence Tags (ESTs) are a technology used to explore the transcriptome (a record of this gene activity). ESTs are short fragments of DNA created in the laboratory from mRNA extracted from a cell. The key computational step in their processing is clustering: putting all ESTs associated with the same RNA together. Accurate clustering is quadratic in time in both the average EST length and the number of ESTs, which makes naïve algorithms infeasible for real data sets. The wcd EST clustering system is an open source clustering system that provides efficient implementations of key distance measures, heuristics for speeding up clustering, a pre-clustering booster based on suffix arrays, as well as parallelised implementations based on MPI and Pthreads. This paper presents the underlying algorithms in wcd. The code is available from
A Fine-Grained XML Structural Comparison Approach
Cited by 4 (3 self)
Abstract. As the Web continues to grow and evolve, more and more information is being placed in structurally rich documents, XML documents in particular, so as to improve the efficiency of similarity clustering, information retrieval and data management applications. Various algorithms for comparing hierarchically structured data, e.g., XML documents, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being modeled as ordered labeled trees. Nevertheless, a thorough investigation of current approaches led us to identify several structural similarity aspects, i.e. sub-tree related similarities, which are not sufficiently addressed while comparing XML documents. In this paper, we provide an improved comparison method to deal with fine-grained sub-trees and leaf node repetitions, without increasing overall complexity with respect to current XML comparison methods. Our approach consists of two main algorithms for discovering the structural commonality between sub-trees and computing tree-based edit operations costs. A prototype has been developed to evaluate the optimality and performance of our method. Experimental results, on both real and synthetic XML data, demonstrate better performance with respect to alternative XML comparison methods.
Keywords: XML, Semi-structured data, Structural similarity, Tree edit distance.
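For readers unfamiliar with edit distances on ordered labeled trees, the following Python sketch shows one simple member of that family: a top-down, Selkow-style distance in which whole sub-trees are inserted or deleted at a cost equal to their size and a root relabel costs 1. It is not the fine-grained method of the paper; the `Node` class, the cost model, and the function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An ordered labeled tree node (hypothetical helper type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t):
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a, b):
    """Top-down ordered tree edit distance (Selkow-style).

    Assumed costs: relabelling a node costs 1; inserting or deleting
    a sub-tree costs its number of nodes.
    """
    cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),      # delete sub-tree
                dp[i][j - 1] + size(b.children[j - 1]),      # insert sub-tree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return cost + dp[m][n]
```

Under this cost model, a `book` tree with children `title, author` is at distance 1 from the same tree with a second `author` leaf, since one leaf insertion suffices.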
Efficient Image Segmentation Using Pairwise Pixel Similarities
Cited by 2 (1 self)
Abstract. Image segmentation based on pairwise pixel similarities has been a very active field of research in recent years. The drawbacks common to these segmentation methods are the enormous space and processor requirements. The contribution of this paper is a general purpose two-stage preprocessing method that substantially reduces the involved costs. Initially, an oversegmentation into small coherent image patches, or superpixels, is obtained through an iterative process guided by pixel similarities. A suitable pairwise superpixel similarity measure is then defined which may be plugged into an arbitrary segmentation method based on pairwise pixel similarities. To illustrate our ideas we integrated the algorithm into a spectral graph-partitioning method using the Normalized Cut criterion. Our experiments show that the time and memory requirements are reduced drastically (> 99%), while segmentations of adequate quality are obtained.
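The Normalized Cut criterion mentioned above scores a two-way partition (A, B) of a similarity graph as cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), where lower values indicate a better split. The Python sketch below evaluates that score for a given partition of a small similarity matrix; it does not perform the spectral optimisation, and the function name and the toy matrix are assumptions for illustration. Nodes could equally stand for pixels or superpixels.

```python
def normalized_cut(W, in_a):
    """Normalized Cut value of a two-way partition.

    W     -- symmetric similarity matrix as a list of lists
    in_a  -- list of booleans, True where node i belongs to segment A

    Assumes both segments are non-empty and have positive total
    association, so neither denominator is zero.
    """
    n = len(W)
    # Total similarity crossing from A to B.
    cut = sum(W[i][j] for i in range(n) for j in range(n)
              if in_a[i] and not in_a[j])
    # Total similarity from each segment to all nodes.
    assoc_a = sum(W[i][j] for i in range(n) for j in range(n) if in_a[i])
    assoc_b = sum(W[i][j] for i in range(n) for j in range(n) if not in_a[i])
    return cut / assoc_a + cut / assoc_b


# Toy graph: nodes 0-1 and 2-3 are strongly similar, weakly linked across.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]
good = normalized_cut(W, [True, True, False, False])
bad = normalized_cut(W, [True, False, True, False])
```

Here `good` is much smaller than `bad`, because the first partition cuts only the two weak links while the second cuts both strong ones.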
A Multi-Clustering Fusion Scheme For Data Partitioning
2005
Cited by 2 (0 self)
A multi-clustering fusion method is presented based on combining several runs of a clustering algorithm resulting in a common partition. More specifically
Approaches to Partition Medical Data using Clustering Algorithms
Cited by 1 (0 self)
The successful application of data mining in fields like e-business, marketing and retail has led to the popularity of its use in knowledge discovery in databases (KDD) in other industries and sectors. Data is a great asset for meeting the long-term goals of any organization and can help to improve customer relationship management. It can also benefit healthcare providers like hospitals, clinics and physicians, and patients, for example, by identifying effective treatments and best practices. Efficient clustering tools reduce demand on costly healthcare resources. They can help physicians cope with information overload and can assist in future planning for improved services. Clustering results are used to study independence or correlation between diseases and to gain better insight into medical survey data. To achieve this, clustering algorithms are created that enhance the traditional K-Means, DBSCAN and Fuzzy C-Means algorithms.
Approved: 2005-02-18
How well can clustering methods capture a phonetic classification? Can the "appropriate" number of clusters be determined automatically? Which kinds of phonetic features group together naturally? How can clustering quality be measured? To what extent is an automatic clustering method reliable in this case? This study tries to answer these questions with a number of experiments conducted on speech data using K-means and fuzzy K-means clustering. The optimal number of clusters was determined with the Davies-Bouldin and I family indexes. For this report, the data considered was extracted from the TIMIT database, a corpus of read speech with phoneme transcription. A small clustering toolbox for Matlab was implemented. It computed clusterings using various classical methods and cluster validity indexes to assess quality. A number of benchmark tests were run on the IRIS data as well as synthetic data. The speech examples show that when clustering phonemes, certain acoustical and articulatory features can be captured. Fuzzy clustering can improve cluster quality. The
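As a rough illustration of how a validity index such as Davies-Bouldin can rank candidate clusterings, the Python sketch below computes the index for a given partition: for each cluster it takes the worst ratio of summed within-cluster scatter to centroid separation, then averages over clusters, so lower values indicate more compact and better separated clusters. This is not the Matlab toolbox described above; the function names and toy points are assumptions, and Euclidean distance is assumed throughout.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition (lower is better).

    clusters -- list of clusters, each a non-empty list of points.
    Assumes at least two clusters with distinct centroids.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


# Two candidate partitions of the same four points.
compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
mixed = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
db_compact = davies_bouldin(compact)
db_mixed = davies_bouldin(mixed)
```

The compact partition scores far lower than the mixed one. To pick a number of clusters, one would compute the index for each candidate count and prefer the minimum.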