Abstract. A phylogeny is the evolutionary history of a group of organisms; systematists (and other biologists) attempt to reconstruct this history from various forms of data about contemporary organisms. Phylogeny reconstruction is a crucial step in the understanding of evolution as well as an important tool in biological, pharmaceutical, and medical research. Phylogeny reconstruction from molecular data is very difficult: almost all optimization models give rise to NP-hard (and thus computationally intractable) problems. Yet approximations must be of very high quality in order to avoid outright biological nonsense. Thus many biologists have been willing to run farms of processors for many months in order to analyze just one dataset. High-performance algorithm engineering offers a battery of tools that can reduce, sometimes spectacularly, the running time of existing phylogenetic algorithms, as well as help designers produce better algorithms. We present an overview of algorithm engineering techniques, illustrating them with an application to the “breakpoint analysis” method of Sankoff et al., which resulted in the GRAPPA software suite. GRAPPA demonstrated a speedup in running time by over eight orders of magnitude over the original implementation on a variety of real and simulated datasets. We show how these algorithm engineering techniques are directly applicable to a large variety of challenging combinatorial problems in computational biology.