Results 1 - 10
of
50
Finding Similar Regions In Many Strings
- Journal of Computer and System Sciences
, 1999
"... Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1 ; : : : ; sn . The Consensus Patterns problem, which has been widely ..."
Abstract
-
Cited by 69 (9 self)
- Add to MetaCart
(Show Context)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1 ; : : : ; sn . The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each s i , and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NPhard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each s i . The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For super-logarithmic d, [2, 9] give efficient approximation algorithms using linear program ralaxation techniques. The best polynomial time approximation has ratio 4 3 for all d, given by [18] ([9] also independently claimed the 4 3 ratio but only for super-logarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation with ratio 2 \Gamma 2 2j\Sigmaj+1 for the more elusive Closest
A Polynomial Time Approximation Scheme for Minimum Routing Cost Spanning Trees
, 1998
"... Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle i ..."
Abstract
-
Cited by 45 (8 self)
- Add to MetaCart
(Show Context)
Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fact reducible to the metric case and present a polynomial-time approximation scheme valid for both versions of the problem. In particular, we show how to build a spanning tree of an n-vertex weighted graph with routing cost within (1 + ffl) from the minimum in time O(n O( 1 ffl ) ). Besides the obvious connection to network design, trees with small routing cost also find application in the construction of good multiple sequence alignments in computational biology. The communication cost spanning tree problem is a generalization of the minimum routing cost tree problem where the routing costs of different pairs are weighted by different r...
Finding Similar Regions in Many Sequences
- JOURNAL OF COMPUTER AND SYSTEM SCIENCES
, 1999
"... Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences s 1 ; : : : ; s n . The Consensus Patterns problem, which has been widely studied in bioinformatics research [22, 10, 7 ..."
Abstract
-
Cited by 35 (7 self)
- Add to MetaCart
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences s 1 ; : : : ; s n . The Consensus Patterns problem, which has been widely studied in bioinformatics research [22, 10, 7, 21, 2, 3, 9, 18, 19, 27], in its simplest form, asks for a region of length L in each s i , and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show that the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We then present an efficient approximation algorithm for the consensus pattern problem under the original relative entropy measure of [22, 10, 7, 21]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important consensus alignment problem [6] allowing at most constant number of gaps, each of arbitrary length, in each sequence.
The complexity of multiple sequence alignment with SP-score that is a metric
- TCS
, 2001
"... This paper analyzes the computational complexity of computing the optimal alignment of a set of sequences under the SP (sum of all pairs) score scheme. We solve an open question by showing that the problem is NP -complete in the very restricted case in which the sequences are over a binary alphabet ..."
Abstract
-
Cited by 34 (0 self)
- Add to MetaCart
(Show Context)
This paper analyzes the computational complexity of computing the optimal alignment of a set of sequences under the SP (sum of all pairs) score scheme. We solve an open question by showing that the problem is NP -complete in the very restricted case in which the sequences are over a binary alphabet and the score is a metric. This result establishes the intractability of multiple sequence alignment under a score function of mathematical interest, which has indeed received much attention in biological sequence comparison.
Opportunities for Combinatorial Optimization In Computational Biology
, 2003
"... This is a survey designed for mathematical programming people who do not know molecular biology and want to learn the kinds of combinatorial optimization problems that arise. After a brief introduction to the biology, we present optimization models pertaining to sequencing, evolutionary explanations ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
This is a survey designed for mathematical programming people who do not know molecular biology and want to learn the kinds of combinatorial optimization problems that arise. After a brief introduction to the biology, we present optimization models pertaining to sequencing, evolutionary explanations, structure prediction and recognition. Additional biology is given in the context of the problems, including some motivation for disease diagnosis and drug discovery. Open problems are cited with an extensive bibliography, and we offer a guide to getting started in this exciting frontier.
A polynomial-time approximation scheme for minimum routing cost spanning trees
- SIAM J. COMPUT
, 1999
"... Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle ..."
Abstract
-
Cited by 22 (3 self)
- Add to MetaCart
Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fact reducible to the metric case and present a polynomial-time approximation scheme valid for both versions of the problem. In particular, we show how to build a spanning tree of an n-vertex weighted graph with routing cost at most (1 + ɛ) of the minimum in time O(n O ( 1 ɛ)). Besides the obvious connection to network design, trees with small routing cost also find application in the construction of good multiple sequence alignments in computational biology. The communication cost spanning tree problem is a generalization of the minimum routing cost tree problem where the routing costs of different pairs are weighted by different requirement amounts. We observe that a randomized O(log nlog log n)-approximation for this problem follows directly from a recent result of Bartal, where n is the number of nodes in a metric graph. This also yields the same approximation for the generalized sum-of-pairs alignment problem in computational biology.
Approximation Algorithms for Multiple Sequence Alignment Under a Fixed Evolutionary Tree
, 1995
"... . We consider the problem of aligning sequences related by a given evolutionary tree: given a fixed tree with its leaves labeled with sequences, find ancestral sequences to label the internal nodes so as to minimize the total cost of all the edges in the tree. The cost of an edge is the edit distanc ..."
Abstract
-
Cited by 21 (4 self)
- Add to MetaCart
. We consider the problem of aligning sequences related by a given evolutionary tree: given a fixed tree with its leaves labeled with sequences, find ancestral sequences to label the internal nodes so as to minimize the total cost of all the edges in the tree. The cost of an edge is the edit distance between the sequences labeling its endpoints. In this paper, we consider the case when the given tree is a regular d-ary tree for some fixed d and provide a d+1 d01 -approximation algorithm for this problem that runs in time O(d(2kn) d +n 2 k 2d ) where k is the number of leaves in the tree and n is the maximum length of any of the sequences labeling the leaves. We also consider a new bottleneck objective in labeling the internal nodes. In this version, we wish to find the labeling of the internal nodes that minimizes the maximum cost of any edge in the tree. For this problem we provide a simple 2ffi + 1-approximation algorithm where ffi is the depth of the given undirected tree def...
A General Method for Fast Multiple Sequence Alignment
, 1996
"... We have developed a fast heuristic algorithm for multiple sequence alignment which provides near-to-optimal results for sufficiently homologous sequences. The algorithm makes use of the standard dynamic programming procedure by applying it to all pairs of sequences. The resulting score matrices ..."
Abstract
-
Cited by 20 (10 self)
- Add to MetaCart
We have developed a fast heuristic algorithm for multiple sequence alignment which provides near-to-optimal results for sufficiently homologous sequences. The algorithm makes use of the standard dynamic programming procedure by applying it to all pairs of sequences. The resulting score matrices for pairwise alignment give rise to secondary matrices containing the additional charges imposed by forcing the alignment path to run through a particular vertex. Such a constraint corresponds to slicing the sequences at the positions defining that vertex, and aligning the remaining pairs of prefix and suffix sequences seperately. From these secondary matrices, one can compute -- for any given family of sequences -- suitable positions for cutting all of these sequences simultaneously, thus reducing the problem of aligning a family of n sequences of average length l in a Divide and Conquer fashion to aligning two families of n sequences of approximately half that length. In this pape...
Aligning Alignments Exactly
- Proceedings of the Eighth Annual international Conference on Research in Computational Molecular Biology
, 2004
"... A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are ev ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are evaluated using linear gap-costs, we call this problem Aligning Alignments. While seemingly a straightforward extension of two-sequence alignment, we prove it is actually NP-complete. As explained in the paper, this provides the first demonstration that minimizing linear gap-costs, in the context of multiple sequence alignment, is inherently hard. We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NP-complete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve verylarge random instances with highly-gapped sequences.