Download:
by Xiang Zhang, Simon Fraser University
http://fas.sfu.ca/pub/cs/TH/2003/XiangZhangMSc.pdf
Abstract:
The emergence of automated high-throughput sequencing technologies has resulted in a huge increase in the amount of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge has been acquired on the relationships between the symbols of the underlying alphabets of the sequences (in particular, amino acids), and this knowledge can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
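To make the abstract's vocabulary concrete, the following is a minimal Python sketch of pattern support under a concept graph and of a single generalization step. It is not the ToMMS algorithm. The concept names, the toy sequences, the function names, and the choice of gapped (non-contiguous) subsequence matching are all assumptions made for illustration and are not taken from the thesis.

```python
# Hypothetical toy concept graph: each concept covers a set of amino-acid symbols.
# These groupings are illustrative assumptions, not the thesis's concept graph.
CONCEPTS = {
    "aliphatic": {"I", "L", "V"},
    "charged": {"D", "E", "K", "R"},
}

# Hypothetical toy database of protein-like sequences.
SEQUENCES = ["AKLD", "AKID", "AKVE"]


def matches(element, symbol):
    """A pattern element matches a symbol if they are equal,
    or if the element is a concept that covers the symbol."""
    return element == symbol or symbol in CONCEPTS.get(element, ())


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a subsequence
    (gaps allowed; an assumption of this sketch)."""
    pos = 0
    for symbol in sequence:
        if pos < len(pattern) and matches(pattern[pos], symbol):
            pos += 1
    return pos == len(pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def one_step_generalizations(pattern):
    """Replace exactly one concrete symbol by a concept covering it.
    A simplified stand-in for the 'minimal generalizations' in the abstract."""
    result = []
    for i, element in enumerate(pattern):
        for concept, members in sorted(CONCEPTS.items()):
            if element in members:
                result.append(pattern[:i] + (concept,) + pattern[i + 1:])
    return result


if __name__ == "__main__":
    specific = ("K", "L", "D")
    print(specific, support(SEQUENCES, specific))
    for general in one_step_generalizations(specific):
        print(general, support(SEQUENCES, general))
```

In this toy data the fully specific pattern is contained in only one sequence, while patterns generalized along the concept graph are contained in more. This is the sense in which a top-down method can start from specific candidates and generalize only those that turn out to be infrequent.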
Citations
1449 | Mining association rules between sets of items in large databases – Agrawal, Imielinski, et al. - 1993
659 | Mining sequential patterns – Agrawal, Srikant - 1995
490 | Generalization as search – Mitchell - 1982
319 | Amino acid substitution matrices from protein blocks – Henikoff, Henikoff - 1989
149 | Introduction to Protein Structure – Branden, Tooze - 1991
134 | Version spaces: An approach to concept learning – Mitchell - 1978
103 | Approaches to the automatic discovery of patterns in biosequences – Brazma, Jonassen, et al. - 1998
103 | SPADE: An Efficient Algorithm for Mining Frequent Sequences – Zaki - 2000
98 | The Complexity of Some Problems on Subsequences and Supersequences – Maier - 1978
90 | Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm – Rigoutsos, Floratos - 1998
47 | Sequential PAttern Mining using a Bitmap Representation – Yiu, Flannick
36 | The classification of amino acid conservation – Taylor - 1986
35 | Automatic generation of primary sequence patterns from sets of related protein sequences – Smith, Smith - 1990
29 | The PSP approach for mining sequential patterns – Masseglia, et al.
29 | Finding sequence motifs in groups of functionally related proteins – Smith, Annau, et al. - 1990
22 | Efficient mining of spatiotemporal patterns – Tsoukatos, Gunopulos - 2001
16 | Enumerating and ranking discrete motifs – Nevill-Manning, Sethi, et al. - 1997
13 | A scalable algorithm for clustering sequential data – Guralnik - 2001
4 | Discovering empirically conserved amino acid substitution groups in databases of protein families – Wu, Brutlag - 1996
3 | Pattern Discovery in Biology: Theory and Applications – Floratos - 1999
2 | The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003 – Blatter, Gasteiger, et al. - 2003
2 | Fuzzy cluster analysis of simple physicochemical properties of amino acids for recognizing secondary structure in proteins. Protein Sci 4:1178–1187 – Mocz - 1995
2 | PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth – Hsu - 2001
1 | FreeSpan: frequent pattern-projected sequential pattern mining – Hsu
1 | Version spaces: A candidate elimination approach to rule learning – Mitchell - 1977
1 | Frequent-subsequence-based prediction of outer membrane proteins – Gardy, Brinkman - 2003
1 | Mining sequential patterns: Generalizations and performance improvements – Srikant, Agrawal - 1996