Results 1-10 of 67
Finding motifs using random projections
2001
Cited by 285 (6 self)
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.

1. CHALLENGING MOTIF PROBLEMS

Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge:

Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.

One instantiation that they labeled "The Challenge Problem" was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are ...
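The planted (l,d)-motif setup can be made concrete with a short Python sketch. It generates a synthetic instance and then groups every l-mer by its letters at k randomly chosen positions, which is one plausible reading of "random projections of the input's substrings". The function names, the uniform random background, and the parameter k are assumptions made for illustration. The abstract does not specify them, and any refinement step applied to the resulting buckets is omitted.

```python
import random

ALPHABET = "ACGT"  # assumed DNA alphabet


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def plant_instance(l, d, t, n, rng):
    """Build t random sequences of length n, each hiding one variant of a
    random motif of length l with exactly d substitutions."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        variant = list(motif)
        # Substitute exactly d distinct positions with a different letter.
        for i in rng.sample(range(l), d):
            variant[i] = rng.choice([c for c in ALPHABET if c != variant[i]])
        start = rng.randrange(n - l + 1)
        background[start:start + l] = variant
        sequences.append("".join(background))
        positions.append(start)
    return motif, sequences, positions


def projection_buckets(sequences, l, k, rng):
    """Hash every l-mer by its letters at k randomly chosen positions
    (hypothetical helper; k is an illustrative parameter)."""
    chosen = sorted(rng.sample(range(l), k))
    buckets = {}
    for s_idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[i] for i in chosen)
            buckets.setdefault(key, []).append((s_idx, start))
    return chosen, buckets


rng = random.Random(0)
motif, seqs, planted_at = plant_instance(l=15, d=4, t=20, n=600, rng=rng)
chosen, buckets = projection_buckets(seqs, l=15, k=7, rng=rng)
```

Planted variants that happen to agree with the motif on all k chosen positions land in the same bucket, so buckets holding unusually many l-mers are natural starting points for a later refinement step.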
Combining phylogenetic data with coregulated genes to identify regulatory motifs
BIOINFORMATICS, 2003
Cited by 136 (11 self)
Motivation: Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a 'multiple genes, single species' approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for coregulated genes identified through experimental approaches. The second approach can be called 'single gene, multiple species'. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of coregulation among different genes and conservation among orthologous genes to improve our ability to identify motifs.

Results: Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and coregulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing nonorthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data.

Availability: Software available upon request from the authors.
Finding Subtle Motifs by Branching from Sample Strings
2003
Cited by 43 (0 self)
Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the EM algorithm to selected sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms.
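As a loose illustration of the pattern-based idea of branching from a sample string, the following Python sketch repeatedly moves from the current pattern to its best neighbor at Hamming distance 1. The scoring function (total distance to the closest window in each sequence), the fixed step count, and all function names are assumptions chosen for this sketch. The abstract does not describe the authors' actual scoring or search details, so this should not be read as their algorithm.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(pattern, sequences):
    """Sum, over all sequences, of the minimum Hamming distance between
    the pattern and any window of the same length (assumed score)."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in sequences
    )


def neighbors(pattern, alphabet="ACGT"):
    """All patterns at Hamming distance exactly 1 from the given pattern."""
    for i, old in enumerate(pattern):
        for c in alphabet:
            if c != old:
                yield pattern[:i] + c + pattern[i + 1:]


def branch_from(sample_string, sequences, steps):
    """Take `steps` greedy moves to the best-scoring neighbor and return
    the best pattern seen, including the starting sample string."""
    best = current = sample_string
    for _ in range(steps):
        current = min(neighbors(current),
                      key=lambda p: total_distance(p, sequences))
        if total_distance(current, sequences) < total_distance(best, sequences):
            best = current
    return best


seqs = ["TTACGTTT", "GGACGTGG", "CCACGTCC"]
found = branch_from("TACG", seqs, steps=2)  # "TACG" is a substring of seqs[0]
```

In practice one would run such a search from every sample string and keep the overall best pattern. The sketch shows a single starting point only.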
Motif discovery in heterogeneous sequence data
Pac. Symp. Biocomput., 2004
Cited by 40 (1 self)
This paper introduces the first integrated algorithm designed to discover novel motifs in heterogeneous sequence data, which is comprised of coregulated genes from a single genome together with the orthologs of these genes from other genomes. Results are presented for regulons in yeasts, worms, and mammals.

1 Regulatory Elements and Sequence Sources

An important and challenging question facing biologists is to understand the varied and complex mechanisms that regulate gene expression: how, when, in what cells, and at what rate is a given gene turned on and off? This paper focuses on one important aspect of this challenge, the discovery of novel binding sites in DNA (also called regulatory elements) for the proteins involved in such gene regulation. This is an important first step in determining which proteins regulate the gene and how. Until the present, nearly all regulatory element discovery algorithms
Fixed-Parameter Algorithms for Closest String and Related Problems
ALGORITHMICA, 2003
Cited by 36 (7 self)
Closest String is one of the core problems in the field of consensus word analysis with particular importance for computational biology. Given k strings
Methods in comparative genomics: Genome correspondence, gene identification, and regulatory motif discovery
Journal of Computational Biology, 2004
Cited by 35 (6 self)
In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously
Settling the Intractability of Multiple Alignment
Proc. of the 14th Ann. Int. Symp. on Algorithms and Computation (ISAAC), 2003
Cited by 31 (1 self)
In this paper some of the most fundamental problems in computational biology are proved intractable. The following problems are shown NP-hard for all binary or larger alphabets under all fixed metrics: Multiple Alignment with SP-score, Star Alignment, and Tree Alignment (for a given phylogeny). Earlier, these problems had only been shown intractable for sporadic alphabets and distances; here the intractability is settled. Moreover, Consensus Patterns and Substring Parsimony are shown NP-hard.
Structure-based prediction of C2H2 zinc-finger binding specificity: sensitivity to docking geometry
Nucleic Acids Res., 2007
Cited by 27 (1 self)
Computing the Similarity of Two Sequences with Nested Arc Annotations
Theoretical Computer Science, 2003
Cited by 22 (3 self)
We present exact algorithms for the NP-complete Longest Common Subsequence problem for sequences with nested arc annotations, a problem occurring in structure comparison of RNA. Given two sequences of length at most n and nested arc structure, one of our algorithms determines (if existent) in O(3.31^(k1+k2) · n) time an arc-preserving subsequence of both sequences, which can be obtained by deleting (together with corresponding arcs) k1 letters from the first and k2 letters from the second sequence. A second algorithm shows that (in case of a four-letter alphabet) we can find a length-l arc-annotated subsequence in O(12^l · l · n) time. This means that the problem is fixed-parameter tractable when parameterized by the number of deletions as well as when parameterized by the subsequence length. Our findings complement known approximation results which give a quadratic time factor-2 approximation for the general and polynomial time approximation schemes for restricted versions of the problem. In addition, we obtain further fixed-parameter tractability results for these restricted versions.
The Closest Substring problem with small distances
2005
Cited by 18 (2 self)
In the CLOSEST SUBSTRING problem k strings s1, ..., sk are given, and the task is to find a string s of length L such that each string si has a consecutive substring of length L whose distance is at most d from s. The problem is motivated by applications in computational biology. We present two algorithms that can be efficient for small fixed values of d and k: for some functions f and g, the algorithms have running time f(d) · n^O(log d) and g(d,k) · n^O(log log k), respectively. The second algorithm is based on connections with the extremal combinatorics of hypergraphs. The CLOSEST SUBSTRING problem is also investigated from the parameterized complexity point of view. Answering an open question from [6, 7, 11, 12], we show that the problem is W[1]-hard even if both d and k are parameters. It follows as a consequence of this hardness result that our algorithms are optimal in the sense that the exponent of n in the running time cannot be improved to o(log d) or to o(log log k) (modulo some complexity-theoretic assumptions). Another consequence is that the running time n^O(1/ε^4) of the approximation scheme for CLOSEST SUBSTRING presented in [13] cannot be improved to f(ε) · n^c, i.e., the ε has to appear in the exponent of n.
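The problem statement itself can be sketched in a few lines of Python. The checker below tests whether a candidate s is within Hamming distance d of some length-L window of every input string, and the exhaustive search simply tries all candidates over an assumed four-letter alphabet. This brute force is exponential in L and is not one of the two algorithms the abstract refers to. It only pins down the definition, and the function names are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_closest_substring(s, strings, d):
    """True if every input string has a length-len(s) window within
    Hamming distance d of s."""
    L = len(s)
    return all(
        any(hamming(s, si[j:j + L]) <= d for j in range(len(si) - L + 1))
        for si in strings
    )


def brute_force_closest_substring(strings, L, d, alphabet="ACGT"):
    """Try every length-L string over the alphabet, in lexicographic
    order of the alphabet; return the first valid one or None."""
    for candidate in product(alphabet, repeat=L):
        s = "".join(candidate)
        if is_closest_substring(s, strings, d):
            return s
    return None


strings = ["ACGTAC", "TTACGA", "GACGTT"]
solution = brute_force_closest_substring(strings, L=3, d=0)
```

With d = 0 the problem reduces to finding a common substring of length L, which makes small cases easy to verify by hand.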