Superiority and Complexity of the Spaced Seeds
 SODA
, 2006
Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. They are now serving thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions still remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a nonuniformly spaced seed is bounded by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed in both (i) the average number of nonoverlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard, and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in time exponential in the seed length, independent of the homologous region length.
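To make the notion of hit probability concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes a uniform homologous region of length n in which each position matches independently with probability p, and it writes a seed as a 0/1 string in which '1' marks a required match. The function name `hit_probability` and the example seeds are illustrative only. The dynamic program tracks the last span-1 match bits, so its running time is exponential in the seed length.

```python
from collections import defaultdict


def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits somewhere in a region of length n.

    Assumption (illustrative model): each region position is a match
    independently with probability p. A hit at some offset means every
    '1' position of the seed is aligned with a match.
    """
    span = len(seed)
    mask = int(seed, 2)                # seed[0] becomes the highest bit
    low_mask = (1 << (span - 1)) - 1   # keeps the most recent span-1 bits

    # no_hit[state] = probability of reaching `state` without any hit yet,
    # where `state` encodes the most recent span-1 match bits.
    no_hit = {0: 1.0}
    hit = 0.0
    for i in range(1, n + 1):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit   # last `span` bits of the region
                if i >= span and (window & mask) == mask:
                    hit += prob * q           # first hit ends at position i
                else:
                    nxt[window & low_mask] += prob * q
        no_hit = nxt
    return hit


if __name__ == "__main__":
    # Compare a consecutive seed and a spaced seed of the same weight.
    for s in ("111", "1101"):
        print(s, hit_probability(s, 20, 0.7))
```

The state space has 2^(span-1) entries, so this sketch is practical only for short seeds.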
Separating Real Motifs From Their Artifacts
, 2001
The typical output of many computational methods to identify binding sites is a long list of motifs containing some real motifs (those most likely to correspond to the actual binding sites) along with a large number of random variations of these. We present a statistical method to separate real motifs from their artifacts. This produces a short list of high-quality motifs that is sufficient to explain the overrepresentation of all motifs in the given sequences. Using synthetic data sets, we show that the output of our method is very accurate. On various sets of upstream sequences in S. cerevisiae, our program identifies several known binding sites, as well as a number of significant novel motifs. Contact: fblanchem,saurabhg@cs.washington.edu
Rare Events and Conditional Events on Random Strings
 DMTCS
, 2004
The aim of this paper is twofold. First, a single word is given. We study the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, we assume that a given word is overrepresented. The conditional distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are precise and can be computed efficiently. These results have applications in computational biology, where a genome is viewed as a text.
Probabilistic arithmetic automata and their application to pattern matching statistics
 Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 5029 of LNCS
, 2008
Hidden Word Statistics
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
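As a small illustration of the difference between a subsequence and a substring, the following Python sketch counts the ways a pattern can be embedded in a text as a subsequence, not necessarily at consecutive positions. Counting embeddings is an assumption made here for illustration, and the function name `count_hidden_occurrences` is hypothetical rather than taken from the paper.

```python
def count_hidden_occurrences(text: str, pattern: str) -> int:
    """Count the ways `pattern` occurs in `text` as a subsequence.

    counts[j] holds the number of ways to embed pattern[:j] in the part
    of the text read so far; the empty prefix embeds in exactly one way.
    """
    m = len(pattern)
    counts = [1] + [0] * m
    for ch in text:
        # Go right to left so each text symbol extends an embedding once.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                counts[j] += counts[j - 1]
    return counts[m]


if __name__ == "__main__":
    print(count_hidden_occurrences("abab", "ab"))
```

The loop takes time proportional to len(text) * len(pattern) and uses space proportional to len(pattern).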
Discovery of stable and significant binding motif pairs from PDB complexes and protein interaction datasets
 Bioinformatics
, 2005
Motivation: Discovery of binding sites is important in the study of protein–protein interactions. In this paper, we introduce stable and significant motif pairs to model protein-binding sites. The stability is the pattern’s resistance to some transformation. The significance is the unexpected frequency of occurrence of the pattern in a sequence dataset comprising known interacting protein pairs. Discovery of stable motif pairs is an iterative process, undergoing a chain of changing but converging patterns. Determining the starting point for such a chain is an interesting problem. We use a protein complex dataset extracted from the Protein Data Bank to help in identifying those starting points, so that the computational complexity of the problem is greatly reduced. Results: We found 913 stable motif pairs, of which 765 are significant. We evaluated these motif pairs using comprehensive comparison results against random patterns. Motifs discovered in wet-lab experiments and reported in the literature were also used to confirm the effectiveness of our method.
Mastering seeds for genomic size nucleotide BLAST searches
 Nucleic Acids Res
, 2003
