Results 1 - 10 of 278
Fitting a mixture model by expectation maximization to discover motifs in biopolymers.
 Proc Int Conf Intell Syst Mol Biol
, 1994
Cited by 947 (5 self)
The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences. Multiple motifs are found by fitting a mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of unaligned sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayes-optimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in each sequence in the dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset.
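The two-component mixture idea in this abstract can be illustrated with a minimal sketch. It assumes a DNA alphabet, treats every width-w window as an independent sample drawn either from a position-specific motif model or from a single background distribution, and performs one EM iteration. The names `windows`, `em_step`, `consensus`, and the pseudocount value are hypothetical choices for illustration; the erasing step and the search over starting points described in the abstract are not shown, and this is not the authors' implementation.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA input

def windows(seqs, w):
    """All overlapping width-w substrings of the input sequences."""
    return [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]

def em_step(subs, motif, background, lam, pseudo=0.01):
    """One EM iteration for a motif/background mixture.

    motif:      list of w dicts, motif[j][c] = P(letter c at motif position j)
    background: dict, background[c] = P(letter c) outside the motif
    lam:        prior probability that a window is a motif occurrence
    pseudo:     illustrative pseudocount that keeps probabilities non-zero
    """
    w = len(motif)
    # E-step: posterior probability that each window came from the motif.
    z = []
    for x in subs:
        p_motif = lam * math.prod(motif[j][x[j]] for j in range(w))
        p_bg = (1 - lam) * math.prod(background[c] for c in x)
        z.append(p_motif / (p_motif + p_bg))
    # M-step: re-estimate the motif columns from expected letter counts.
    new_motif = []
    for j in range(w):
        counts = {c: pseudo for c in ALPHABET}
        for x, zi in zip(subs, z):
            counts[x[j]] += zi
        total = sum(counts.values())
        new_motif.append({c: counts[c] / total for c in ALPHABET})
    # M-step: background counts use the complementary weight 1 - z.
    bg_counts = {c: pseudo for c in ALPHABET}
    for x, zi in zip(subs, z):
        for c in x:
            bg_counts[c] += 1 - zi
    bg_total = sum(bg_counts.values())
    new_bg = {c: bg_counts[c] / bg_total for c in ALPHABET}
    new_lam = sum(z) / len(z)
    return new_motif, new_bg, new_lam, z

def consensus(motif):
    """Most probable letter at each motif position."""
    return "".join(max(col, key=col.get) for col in motif)
```

Starting from a motif model weakly biased toward a guess and calling `em_step` repeatedly sharpens the columns around the windows that fit best; the posterior weights `z` play the role of the per-window occurrence estimates mentioned in the abstract.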
Finding motifs using random projections
, 2001
Cited by 285 (6 self)
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input’s substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.
1. CHALLENGING MOTIF PROBLEMS
Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge:
Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.
One instantiation that they labeled “The Challenge Problem” was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are
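The projection step named in this abstract can be sketched as follows. The sketch assumes that a "random projection" of an l-mer means keeping only the letters at k randomly chosen positions, so that planted variants whose substitutions fall outside those positions hash to the same bucket. The function names, the parameter k, and the bucket-size threshold are hypothetical, and any later refinement of candidate buckets into a motif is omitted.

```python
import random
from collections import defaultdict

def project(lmer, positions):
    """Keep only the letters of lmer at the chosen positions."""
    return "".join(lmer[p] for p in positions)

def bucket_lmers(seqs, l, k, rng):
    """Hash every l-mer of every sequence by k randomly chosen positions.

    Returns the chosen positions and a dict mapping each projected key to
    the (sequence index, offset) pairs of the l-mers that hashed to it.
    """
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for si, s in enumerate(seqs):
        for i in range(len(s) - l + 1):
            buckets[project(s[i:i + l], positions)].append((si, i))
    return positions, buckets

def enriched(buckets, threshold):
    """Buckets holding at least `threshold` l-mers (candidate motif sites)."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= threshold}
```

Because a single projection can miss the planted instances, a caller would typically repeat `bucket_lmers` with fresh random positions, for example by passing `random.Random(seed)` with different seeds, and inspect the enriched buckets from each trial.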
Combinatorial approaches to finding subtle signals in DNA sequences
 in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology
, 2000
Cited by 258 (5 self)
Signal finding (pattern discovery in unaligned DNA sequences) is a fundamental problem in both computer science and molecular biology with important applications in locating regulatory sites and drug target identification. Despite many studies, this problem is far from being resolved: most signals in DNA sequences are so complicated that we don't yet have good models or reliable algorithms for their recognition. We complement existing statistical and machine learning approaches to this problem by a combinatorial approach that proved to be successful in identifying very subtle signals.
Combining evidence using p-values: Application to sequence homology searches
 Bioinformatics
, 1998
Cited by 228 (13 self)
Motivation: To illustrate an intuitive and statistically valid method for combining independent sources of evidence that yields a p-value for the complete evidence, and to apply it to the problem of detecting simultaneous matches to multiple patterns in sequence homology searches. Results: In sequence analysis, two or more (approximately) independent measures of the membership of a sequence (or sequence region) in some class are often available. We would like to estimate the likelihood of the sequence being a member of the class in view of all the available evidence. An example is estimating the significance of the observed match of a macromolecular sequence (DNA or protein) to a set of patterns (motifs) that characterize a biological sequence family. An intuitive way to do this is to express each piece of evidence as a p-value, and then use the product of these p-values as the measure of membership in the family. We derive a formula and algorithm (QFAST) for calculating the statistic...
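The abstract is cut off before the formula, so the following is only a sketch of a standard result rather than the paper's own algorithm. Assuming the n p-values are independent and uniformly distributed on (0, 1) under the null hypothesis, the probability that their product is at most p is p · Σ_{i=0}^{n-1} (−ln p)^i / i!. The function name `combined_p_value` is hypothetical.

```python
import math

def combined_p_value(p_values):
    """P-value of the product of independent Uniform(0,1) p-values.

    Evaluates p * sum_{i=0}^{n-1} (-ln p)^i / i!, where p is the product
    of the n input p-values.
    """
    n = len(p_values)
    product = math.prod(p_values)
    if product == 0.0:
        return 0.0  # a zero p-value leaves nothing to correct
    log_term = -math.log(product)
    total = 0.0
    term = 1.0  # (-ln p)^i / i!, starting at i = 0
    for i in range(n):
        total += term
        term *= log_term / (i + 1)
    return product * total
```

For a single p-value the function returns it unchanged. For two p-values of 0.1 each, the raw product 0.01 is corrected upward to 0.01 · (1 + ln 100), which reflects that a small product is easier to reach by chance when more terms are multiplied.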
Learning Structural SVMs with Latent Variables
Cited by 215 (2 self)
It is well known in statistics and machine learning that the combination of latent (or hidden) variables and observed variables offers more expressive power than models with observed variables alone. Latent variables
Probabilistic discovery of time series motifs
, 2003
Cited by 185 (26 self)
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes
 Journal of Computational Biology
, 2002
Cited by 114 (10 self)
Microarray experiments can reveal important information about transcriptional regulation. In our case, we look for potential promoter regulatory elements in the upstream region of coexpressed genes. Here we present two modifications of the original Gibbs sampling algorithm for motif finding (Lawrence et al., 1993). First, we introduce the use of a probability distribution to estimate the number of copies of the motif in a sequence. Second, we describe the technical aspects of the incorporation of a higher-order background model whose application we discussed in Thijs et al. (2001). Our implementation is referred to as the Motif Sampler. We successfully validate our algorithm on several data sets. First, we show results for three sets of upstream sequences containing known motifs: 1) the G-box light-response element in plants, 2) elements involved in methionine response in Saccharomyces cerevisiae, and 3) the FNR O2-responsive element in bacteria. We use these data sets to explain the influence of the parameters on the performance of our algorithm. Second, we show results for upstream sequences from four clusters of coexpressed genes identified in a microarray experiment on wounding in Arabidopsis thaliana. Several motifs could be matched to regulatory elements from plant defence pathways in our database of plant cis-acting regulatory elements (PlantCARE). Some other strong motifs do not have corresponding motifs in PlantCARE but are promising candidates for further analysis.
Finding composite regulatory patterns in DNA sequences
 Bioinformatics
, 2002
Cited by 108 (4 self)
Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be “too weak”. Since the traditional monad-based motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present a MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments over biological and synthetic data. Availability: MITRA is available at
Conservation, Regulation, Synteny, and Introns in a Large-scale C. briggsae-C. elegans Genomic Alignment
, 2000
Cited by 103 (0 self)
This paper presents a genomic DNA comparison between the nematodes Caenorhabditis briggsae and Caenorhabditis elegans. C. elegans and C. briggsae are closely related nematodes in the same genus (Baldwin et al. 1997; Blaxter 1998; Blaxter et al. 1998; Voronov et al. 1998). They are estimated to have diverged 25-50 million years ago, although without a fossil record these estimates are highly dependent on assumptions about mutation rates that vary considerably between organisms and between genes (Ayala et al. 1996). In practical terms, C. elegans and C. briggsae are separated by a nearly ideal distance for comparative genomics. In regions experiencing selective pressure, close to 80% base identity is preserved between species. In other regions base identity is close to 30% (Shabalina and Kondrashov 1999). The large-scale comparative genomic study we present here aligns 8 million bases of C. briggsae sequence from 229 cosmids with 97 million bases of C. elegans sequence covering essentially the entire genome (C. elegans Sequencing Consortium 1998). The scale of this comparison presented unique challenges and resulted in the development of new algorithms that can cope with long insertions. Our algorithms are also able to recognize homologous regions at the DNA level despite the rapid divergence in the wobble position of most codons.
Meta-MEME: motif-based hidden Markov models of protein families
 Comput Appl Biosci
, 1997
Cited by 95 (10 self)
Motivation: Modeling families of related biological sequences using Hidden Markov models (HMMs), although increasingly widespread, faces at least one major problem: because of the complexity of these mathematical models, they require a relatively large training set in order to accurately recognize a given family. For families in which there are few known sequences, a standard linear HMM contains too many parameters to be trained adequately. Results: This work attempts to solve that problem by generating smaller HMMs which precisely model only the conserved regions of the family. These HMMs are constructed from motif models generated by the EM algorithm using the MEME software. Because motif-based HMMs have relatively few parameters, they can be trained using smaller data sets. Studies of short chain alcohol dehydrogenases and 4Fe-4S ferredoxins support the claim that motif-based HMMs exhibit increased sensitivity and selectivity in database searches, especially when training sets contain few sequences.