  A Method for Evaluating the Quality of String Dissimilarity Measures and Clustering Algorithms for EST Clustering

by Judith Zimmermann, Zsuzsanna Lipták, Scott Hazelhurst
ftp://ftp.cs.wits.ac.za/pub/research/reports/TR-Wits-CS-2004-0.pdf

Abstract:

We present a method for evaluating the suitability of different string dissimilarity measures and clustering algorithms for EST clustering, one of the main techniques used in transcriptome projects. Our method consists of first generating simulated ESTs according to user-specified parameters, and then evaluating the quality of the clusterings produced when different dissimilarity measures and different clustering algorithms are used. We have implemented two tools for this purpose: (i) ESTSim (EST Simulator), which generates simulated EST sequences from mRNAs/cDNAs according to user-specified parameters, and (ii) ECLEST (Evaluator for CLusterings of ESTs), which computes and evaluates a clustering of a set of input ESTs, where the dissimilarity measure, the clustering algorithm, and the clustering validity index can be specified independently. We demonstrate the method on a sample set of 699 cDNAs taken from a public mammalian gene collection: we generated approximately 16,000 simulated ESTs from this set according to parameters that follow the guidelines laid down by NCBI, and compared the clusterings produced using five different dissimilarity measures, while fixing the clustering algorithm and clustering validity index. We then repeated the experiment with higher error parameters. We have been able to derive
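The workflow described in the abstract (simulate ESTs from known cDNAs, cluster them under a chosen dissimilarity measure, then score the clustering against the known origin of each EST) can be illustrated with a minimal Python sketch. This is not the ESTSim or ECLEST implementation. The function names, the substitution-only error model, the q-gram distance, the single-linkage threshold clustering, and the plain Rand index are all assumptions chosen for illustration; the abstract does not say which measures, algorithm, or validity index the authors used.

```python
import random
from collections import Counter
from itertools import combinations


def simulate_ests(cdnas, per_cdna, length, error_rate, rng):
    """Hypothetical simulator: sample fixed-length substrings of each cDNA
    and replace each base with a random one with probability error_rate."""
    ests, labels = [], []
    for label, seq in enumerate(cdnas):
        for _ in range(per_cdna):
            start = rng.randrange(0, len(seq) - length + 1)
            read = list(seq[start:start + length])
            for i in range(len(read)):
                if rng.random() < error_rate:
                    read[i] = rng.choice("ACGT")
            ests.append("".join(read))
            labels.append(label)  # true cluster = index of the source cDNA
    return ests, labels


def qgram_profile(s, q):
    """Count every length-q substring of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=4):
    """Sum of absolute differences between the q-gram counts of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def single_linkage(items, dist, threshold):
    """Threshold single linkage via union-find: two items share a cluster
    if a chain of pairs with distance <= threshold connects them."""
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(items)), 2):
        if dist(items[i], items[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two made-up "cDNAs"; real input would be actual mRNA/cDNA sequences.
    cdnas = ["".join(rng.choice("ACGT") for _ in range(300)) for _ in range(2)]
    ests, truth = simulate_ests(cdnas, per_cdna=10, length=200,
                                error_rate=0.02, rng=rng)
    predicted = single_linkage(ests, qgram_distance, threshold=250)
    print("Rand index:", rand_index(truth, predicted))
```

Because the dissimilarity function, the clustering routine, and the scoring function are passed around as separate pieces, any one of them can be swapped while the others stay fixed, which mirrors the independence the abstract describes for ECLEST. The threshold of 250 is an arbitrary value for the toy data, not a recommended setting.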
