Results 1  10
of
472
Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
 Nucleic Acids Res.
, 1997
"... ABSTRACT The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantia ..."
Abstract

Cited by 8572 (88 self)
 Add to MetaCart
(Show Context)
ABSTRACT The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a positionspecific score matrix, and searching the database using this matrix. The resulting PositionSpecific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSIBLAST is used to uncover several new and interesting members of the BRCT superfamily.
Fast Folding and Comparison of RNA Secondary Structures (The Vienna RNA Package)
"... Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and bas ..."
Abstract

Cited by 809 (117 self)
 Add to MetaCart
Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are.
A Guided Tour to Approximate String Matching
 ACM COMPUTING SURVEYS
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 598 (36 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
Learnability in Optimality Theory
, 1995
"... In this article we show how Optimality Theory yields a highly general Constraint Demotion principle for grammar learning. The resulting learning procedure specifically exploits the grammatical structure of Optimality Theory, independent of the content of substantive constraints defining any given gr ..."
Abstract

Cited by 529 (35 self)
 Add to MetaCart
In this article we show how Optimality Theory yields a highly general Constraint Demotion principle for grammar learning. The resulting learning procedure specifically exploits the grammatical structure of Optimality Theory, independent of the content of substantive constraints defining any given grammatical module. We decompose the learning problem and present formal results for a central subproblem, deducing the constraint ranking particular to a target language, given structural descriptions of positive examples. The structure imposed on the space of possible grammars by Optimality Theory allows efficient convergence to a correct grammar. We discuss implications for learning from overt data only, as well as other learning issues. We argue that Optimality Theory promotes confluence of the demands of more effective learnability and deeper linguistic explanation.
Convolution Kernels on Discrete Structures
, 1999
"... We introduce a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. The method can be applied iteratively to build a kernel on an infinite set from kernels involving generators of the set. The family of kernels generated generalizes the fa ..."
Abstract

Cited by 506 (0 self)
 Add to MetaCart
(Show Context)
We introduce a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. The method can be applied iteratively to build a kernel on an infinite set from kernels involving generators of the set. The family of kernels generated generalizes the family of radial basis kernels. It can also be used to define kernels in the form of joint Gibbs probability distributions. Kernels can be built from hidden Markov random elds, generalized regular expressions, pairHMMs, or ANOVA decompositions. Uses of the method lead to open problems involving the theory of infinitely divisible positive definite functions. Fundamentals of this theory and the theory of reproducing kernel Hilbert spaces are reviewed and applied in establishing the validity of the method.
Fastmap: A fast algorithm for indexing, datamining and visualization of traditional and multimedia datasets
, 1995
"... A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in kd space, using k featureextraction functions, provided by a domain expert [Jag91]. Thus, we can subsequently use highly finetuned spatial access methods (SAMs), to answer several ..."
Abstract

Cited by 502 (22 self)
 Add to MetaCart
A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in kd space, using k featureextraction functions, provided by a domain expert [Jag91]. Thus, we can subsequently use highly finetuned spatial access methods (SAMs), to answer several types of queries, including the `Query By Example' type (which translates to a range query); the `all pairs' query (which translates to a spatial join [BKSS94]); the nearestneighbor or bestmatch query, etc. However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some kdimensional space (k is userdefined), such that the dissimilarities are preserved. There are two benefits from this mapping: ...
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
Abstract

Cited by 499 (4 self)
 Add to MetaCart
(Show Context)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Efficient Retrieval of Similar Time Sequences Under Time Warping
, 1997
"... Fast similarity searching in large timesequence databases has attracted a lot of research interest [1, 5, 2, 6, 3, 10]. All of them use the Euclidean distance (L 2), or some variation of L ..."
Abstract

Cited by 215 (5 self)
 Add to MetaCart
(Show Context)
Fast similarity searching in large timesequence databases has attracted a lot of research interest [1, 5, 2, 6, 3, 10]. All of them use the Euclidean distance (L 2), or some variation of L
An O(ND) Difference Algorithm and Its Variations
 Algorithmica
, 1986
"... The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a s ..."
Abstract

Cited by 212 (4 self)
 Add to MetaCart
(Show Context)
The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N +D expectedtime performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(NlgN +D ) time variation.
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
"... This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which a ..."
Abstract

Cited by 174 (21 self)
 Add to MetaCart
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...