Finding motifs using random projections
, 2001
Cited by 285 (6 self)
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.
1. CHALLENGING MOTIF PROBLEMS
Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge:
Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.
One instantiation that they labeled “The Challenge Problem” was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are
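The problem statement above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: `plant_instance` builds a synthetic planted (l,d)-motif instance over a uniform random background, and `projection_buckets` shows one plausible reading of "random projections of the input's substrings", namely hashing every l-mer by its letters at k randomly chosen positions. The function names, the uniform background, and the choice of k are all hypothetical.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def plant_instance(l, d, t, n, rng):
    """Return (motif, sequences) for a synthetic planted (l, d)-motif instance."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences = []
    for _ in range(t):
        # Assumption: background nucleotides are uniform and independent.
        background = [rng.choice(ALPHABET) for _ in range(n)]
        variant = list(motif)
        # Apply exactly d point substitutions at d distinct positions.
        for pos in rng.sample(range(l), d):
            variant[pos] = rng.choice([c for c in ALPHABET if c != variant[pos]])
        start = rng.randrange(n - l + 1)
        background[start:start + l] = variant
        sequences.append("".join(background))
    return motif, sequences

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def projection_buckets(sequences, l, k, rng):
    """Group all l-mers by their letters at k randomly chosen positions."""
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            key = "".join(seq[j + p] for p in positions)
            buckets[key].append((i, j))  # (sequence index, start offset)
    return positions, buckets
```

Planted variants that happen to agree with M at all k sampled positions fall into the same bucket, so unusually full buckets are natural candidates for further refinement. How the buckets are scored and refined is not described in the abstract and is left out here.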
Combinatorial approaches to finding subtle signals in DNA sequences
- in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology,
, 2000
Cited by 258 (5 self)
Signal finding (pattern discovery in unaligned DNA sequences) is a fundamental problem in both computer science and molecular biology with important applications in locating regulatory sites and drug target identification. Despite many studies, this problem is far from being resolved: most signals in DNA sequences are so complicated that we don't yet have good models or reliable algorithms for their recognition. We complement existing statistical and machine learning approaches to this problem by a combinatorial approach that proved to be successful in identifying very subtle signals.
Designing seeds for similarity search in genomic DNA
- Journal of Computer and System Sciences
, 2003
Cited by 103 (4 self)
Large-scale comparisons of genomic DNA are of fundamental importance in annotating functional elements in genomes. To perform large comparisons efficiently, BLAST [3, 2] and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or “seed” of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses problems arising in seed design. We give the fastest known algorithm for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, as well as theoretical results on which seeds are good choices. We also describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
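To illustrate what sharing a seed of matching bases can mean, here is a minimal sketch. It assumes a seed is written as a mask of `1` (position must match) and `0` (position is ignored); this mask notation and the function names are assumptions for illustration and are not taken from the abstract or from Mandala.

```python
from collections import defaultdict

def seed_key(s, start, seed):
    """Letters of s at the '1' positions of the seed, read from offset start."""
    return "".join(s[start + i] for i, c in enumerate(seed) if c == "1")

def seed_hits(a, b, seed):
    """Return all (i, j) where a[i:] and b[j:] agree at every '1' of the seed."""
    span = len(seed)
    index = defaultdict(list)
    # Index every window of sequence a by its seed key.
    for i in range(len(a) - span + 1):
        index[seed_key(a, i, seed)].append(i)
    hits = []
    # Look up every window of sequence b in that index.
    for j in range(len(b) - span + 1):
        for i in index.get(seed_key(b, j, seed), []):
            hits.append((i, j))
    return hits
```

With `a = "ACGTACGT"` and `b = "ACTTACGT"`, the contiguous seed `1111` misses the pair of windows at offset (0, 0) because of the single mismatch, while the mask `1101` still reports it. This is one way the choice of seed can change which similarities a seeded comparison detects.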
Computational identification of transcriptional regulatory elements in DNA sequence
, 2006
Cited by 55 (0 self)
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
VOTING ALGORITHMS FOR DISCOVERING LONG MOTIFS
Cited by 42 (10 self)
Pevzner and Sze [14] have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter region of co-regulated genes. l is the length of the motif and d is the maximum Hamming distance around the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee the motif can be found. In this paper, we introduce new algorithms to solve the motif problem. Our algorithms can find motifs in reasonable time for not only the challenging (9,2), (11,3), (15,5)-motif problems but for even longer motifs, say (20,7), (30,11) and (40,15), which have never been seriously attempted by other researchers because of heavy time and space requirements.
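The roles of l and d can be shown with a short check. This is a brute-force verification sketch under the definition given above, not the voting algorithm of the paper, whose details the abstract does not give; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs_within(candidate, seq, d):
    """True if some substring of seq is within Hamming distance d of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[j:j + l]) <= d
        for j in range(len(seq) - l + 1)
    )

def is_ld_motif(candidate, sequences, d):
    """True if every sequence contains a variant of candidate with at most d mismatches."""
    return all(occurs_within(candidate, s, d) for s in sequences)
```

Checking one candidate costs time proportional to the total input length times l. Enumerating all 4^l candidates this way quickly becomes infeasible as l grows, which is consistent with the abstract's remark about heavy time and space requirements for long motifs.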
Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification
- In Proceedings of RECOMB 2000
, 2000
Sequence alignment kernel for recognition of promoter regions
- Bioinformatics
, 2003
Cited by 31 (0 self)
In this paper we propose a new method for recognition of prokaryotic promoter regions with startpoints of transcription. The method is based on Sequence Alignment Kernel, a function reflecting the quantitative measure of match between two sequences. This kernel function is further used in Dual SVM, which performs the recognition. Several recognition methods have been trained and tested on a positive data set, consisting of 669 σ70-promoter regions with known transcription startpoints of Escherichia coli, and two negative data sets of 709 examples each, taken from coding and non-coding regions of the same genome. The results show that our method performs well and achieves a 16.5% average error rate on positive and coding negative data and an 18.6% average error rate on positive and non-coding negative data. Availability: The demo version of our method is accessible from our website
Greedy Mixture Learning for Multiple Motif Discovery in Biological Sequences
, 2003
Cited by 29 (6 self)
Motivation: This paper studies the problem of discovering subsequences, known as motifs, that are common to a given collection of related biosequences, by proposing a greedy algorithm for learning a mixture of motifs model through likelihood maximization. The approach sequentially adds a new motif to a mixture model by performing a combined scheme of global and local search for appropriately initializing its parameters. In addition, a hierarchical partitioning scheme based on kd-trees is presented for partitioning the input dataset in order to speed up the global searching procedure. The proposed method compares favorably with the well-known MEME approach and successfully addresses several drawbacks of MEME.
Separating Real Motifs From Their Artifacts
, 2001
Cited by 27 (4 self)
The typical output of many computational methods to identify binding sites is a long list of motifs containing some real motifs (those most likely to correspond to the actual binding sites) along with a large number of random variations of these. We present a statistical method to separate real motifs from their artifacts. This produces a short list of high quality motifs that is sufficient to explain the over-representation of all motifs in the given sequences. Using synthetic data sets, we show that the output of our method is very accurate. On various sets of upstream sequences in S. cerevisiae, our program identifies several known binding sites, as well as a number of significant novel motifs. Contact: {blanchem,saurabh}@cs.washington.edu
Functional Bioinformatics of Microarray Data: From Expression to Regulation
, 2002
Cited by 24 (3 self)
Microarrays are a powerful technique to monitor the expression of thousands of genes in a single experiment. From series of such experiments, it is possible to identify the mechanisms that govern the activation of genes in an organism. Short DNA patterns (called binding sites) in or around the genes serve as switches that control gene expression. As a result, similar patterns of expression can correspond to similar binding site patterns. We integrate clustering of coexpressed genes with the discovery of binding motifs. We overview several important clustering techniques and present a clustering algorithm (called adaptive quality-based clustering), which we have developed to address several shortcomings of existing methods. We overview the different techniques for motif finding, in particular the technique of Gibbs sampling, and we present several extensions of this technique in our Motif Sampler. Finally, we present an integrated web tool called INCLUSive (http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html) that allows the easy analysis of microarray data for motif finding.