We present a method for evaluating the suitability of different string dissimilarity measures and clustering algorithms for EST clustering, one of the main techniques used in transcriptome projects. The method first generates simulated ESTs according to user-specified parameters, and then evaluates the quality of the clusterings produced when different dissimilarity measures and different clustering algorithms are used. We have implemented two tools for this purpose: (i) ESTSim (EST Simulator), which generates simulated EST sequences from mRNAs/cDNAs according to user-specified parameters, and (ii) ECLEST (Evaluator for CLusterings of ESTs), which computes and evaluates a clustering of a set of input ESTs, where the dissimilarity measure, the clustering algorithm, and the clustering validity index can be specified independently. We demonstrate the method on a sample set of 699 cDNAs taken from a public mammalian gene collection: we generated approximately 16,000 simulated ESTs from this set according to parameters that follow the guidelines laid down by NCBI, and compared the clusterings produced using five different dissimilarity measures, while fixing the clustering algorithm and clustering validity index. We then repeated the experiment with higher error parameters. We have been able to derive
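The simulate-cluster-evaluate loop described in the abstract can be illustrated with a minimal Python sketch. It is not the ESTSim or ECLEST implementation: the function names, the substitution-only error model, the q-gram distance, the single-linkage threshold clustering, and every parameter value below are assumptions chosen for illustration. The adjusted Rand index is used here as one possible clustering validity index.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def simulate_ests(cdna, n, min_len, max_len, error_rate, rng):
    """Sample n substrings of cdna and apply random substitutions.

    Hypothetical error model: substitutions only, uniform along the read.
    """
    ests = []
    for _ in range(n):
        length = rng.randint(min_len, min(max_len, len(cdna)))
        start = rng.randint(0, len(cdna) - length)
        read = [
            rng.choice("ACGT".replace(base, "")) if rng.random() < error_rate else base
            for base in cdna[start:start + length]
        ]
        ests.append("".join(read))
    return ests


def qgram_profile(s, q):
    """Count every length-q substring of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the q-gram counts of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def single_linkage(items, dist, threshold):
    """Merge items whose pairwise distance is <= threshold (union-find)."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if dist(items[i], items[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same items (needs >= 2 items)."""
    n = len(labels_true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivially identical
    return (sum_ij - expected) / (max_index - expected)


if __name__ == "__main__":
    rng = random.Random(42)
    # Three random "cDNAs" stand in for real sequences (assumption).
    cdnas = ["".join(rng.choice("ACGT") for _ in range(300)) for _ in range(3)]
    ests, truth = [], []
    for gene_id, cdna in enumerate(cdnas):
        reads = simulate_ests(cdna, 10, 250, 300, 0.02, rng)
        ests.extend(reads)
        truth.extend([gene_id] * len(reads))
    predicted = single_linkage(ests, lambda a, b: qgram_distance(a, b, q=6), threshold=300)
    print("adjusted Rand index:", adjusted_rand_index(truth, predicted))
```

Because the simulator records which cDNA each read came from, the true clustering is known and any dissimilarity function can be swapped into `single_linkage` and scored the same way. Raising `error_rate` mimics repeating the experiment with higher error parameters.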