  Title of thesis: A TOP-DOWN APPROACH FOR MINING MOST SPECIFIC FREQUENT PATTERNS IN BIOLOGICAL SEQUENCE DATA (2003)

by Xiang Zhang, Simon Fraser University
http://fas.sfu.ca/pub/cs/TH/2003/XiangZhangMSc.pdf

Abstract:

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase in the number of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, each of which subsumes a large number of more general patterns. In the biological domain, a wealth of knowledge has been acquired on the relationships between the symbols of the underlying alphabets of the sequences (in particular, amino acids), and this knowledge can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
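
The top-down idea described in the abstract can be illustrated with a small sketch. This is not the ToMMS algorithm from the thesis, whose details are not given here; it is a toy breadth-first search that starts from the database sequences themselves, generalizes only infrequent candidates by one minimal step at a time, and keeps the frequent patterns that no other frequent pattern specializes. The concept graph `PARENTS`, the class names `aliphatic` and `acidic`, and all function names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy concept graph: each symbol maps to its direct parent concepts.
PARENTS = {
    "L": ["aliphatic"], "I": ["aliphatic"], "V": ["aliphatic"],
    "D": ["acidic"], "E": ["acidic"],
}

def ancestors(sym):
    """All concepts reachable from sym by following parent links."""
    seen, stack = set(), [sym]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def covers(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    return general == specific or general in ancestors(specific)

def matches(pattern, seq):
    """True if pattern is a (gapped) subsequence of seq under `covers`."""
    i = 0
    for sym in seq:
        if i < len(pattern) and covers(pattern[i], sym):
            i += 1
    return i == len(pattern)

def support(pattern, db):
    """Number of database sequences that the pattern matches."""
    return sum(matches(pattern, seq) for seq in db)

def minimal_generalizations(pattern):
    """Replace one element by a parent concept, or drop it if it has none."""
    for i, sym in enumerate(pattern):
        parents = PARENTS.get(sym, [])
        if parents:
            for parent in parents:
                yield pattern[:i] + (parent,) + pattern[i + 1:]
        else:
            yield pattern[:i] + pattern[i + 1:]

def most_specific_frequent(db, min_support):
    """Top-down search: specific candidates are tested before general ones."""
    queue = deque(dict.fromkeys(tuple(seq) for seq in db))
    seen = set(queue)
    frequent = set()
    while queue:
        cand = queue.popleft()
        if not cand:
            continue  # the empty pattern carries no information
        if support(cand, db) >= min_support:
            frequent.add(cand)  # frequent: do not generalize further
            continue
        for gen in minimal_generalizations(cand):
            if gen not in seen:
                seen.add(gen)
                queue.append(gen)
    # Drop any frequent pattern that is a generalization of another one.
    return {p for p in frequent
            if not any(q != p and matches(p, q) for q in frequent)}

if __name__ == "__main__":
    print(most_specific_frequent(["LDG", "IEG"], min_support=2))
```

With the two toy sequences `LDG` and `IEG` and a minimum support of 2, neither sequence is frequent as written, so both are generalized step by step until the search reaches a pattern that matches both. The search recomputes support by scanning the whole database for every candidate, which is acceptable only for tiny inputs.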
