Results 1 - 6 of 6
Fast and simple character classes and bounded gaps pattern matching, with application to protein searching
 Journal of Computational Biology, 2001
Abstract

Cited by 35 (3 self)
The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK]-x(2,3)-[DE]-x(2,3)-Y, where the brackets match any of the letters inside, and x(2,3) matches a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, an RE is more sophisticated than a CBG, and searching for it with an RE pattern matching algorithm complicates the search and makes it slow. For this reason, we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases it may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the faster of the two. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
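As a rough illustration of the baseline this abstract argues against (converting a CBG into a full regular expression), here is a minimal Python sketch. It is not the paper's algorithm. The function names `cbg_to_regex` and `find_cbg` are hypothetical, and the sketch assumes the usual hyphen-separated PROSITE notation with only letters, bracketed classes, and `x(a,b)` gaps.

```python
import re

def cbg_to_regex(cbg: str) -> str:
    """Translate a hyphen-separated CBG (assumed notation) into a Python regex."""
    parts = []
    for elem in cbg.split("-"):
        gap = re.fullmatch(r"x\((\d+)(?:,(\d+))?\)", elem)
        if gap:
            # x(a,b) is a gap of a..b arbitrary characters; x(a) is exactly a.
            lo = gap.group(1)
            hi = gap.group(2) or lo
            parts.append(".{%s,%s}" % (lo, hi))
        elif elem == "x":
            parts.append(".")   # a single arbitrary character
        else:
            parts.append(elem)  # a literal letter or a [class]
    return "".join(parts)

def find_cbg(cbg: str, text: str):
    """Return (start, matched_text) for every start position that matches."""
    # The lookahead lets matches that overlap each other all be reported.
    pattern = re.compile("(?=(%s))" % cbg_to_regex(cbg))
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]

print(cbg_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y"))
print(find_cbg("[RK]-x(2,3)-[DE]-x(2,3)-Y", "AAKAADAAYAA"))
```

A general regex engine handles this pattern through backtracking or automaton simulation, which is the overhead the abstract says specialised CBG algorithms avoid.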
A fast algorithm for approximate string matching on gene sequences
 in Symposium. 16th Annu. Combinatorial Pattern Matching, LNCS, Springer-Verlag, 2005
Abstract

Cited by 6 (0 self)
Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the approximate string matching problem, the k-mismatch problem, whose objective is to find all occurrences of a short pattern in a long text string with at most k mismatches. FAAST generalizes the well-known Tarhio-Ukkonen algorithm by requiring two or more matches when calculating shift distances, which makes the approximate string matching process significantly faster than the Tarhio-Ukkonen algorithm. Theoretically, we prove that FAAST on average skips more characters than the Tarhio-Ukkonen algorithm in a single shift, and makes fewer character comparisons in an entire matching process. Experiments on both simulated data sets and real gene sequences also demonstrate that FAAST runs several times faster than the Tarhio-Ukkonen algorithm in all the cases that we tested.
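To make the k-mismatch problem statement concrete, the following Python sketch checks every alignment by brute force. It is neither FAAST nor the Tarhio-Ukkonen algorithm, and it has none of their shift heuristics. It only defines which positions count as occurrences. The function name `k_mismatch_search` is hypothetical.

```python
def k_mismatch_search(pattern: str, text: str, k: int):
    """Return all start positions where pattern matches text with <= k mismatches."""
    m, n = len(pattern), len(text)
    hits = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment can no longer qualify
        if mismatches <= k:
            hits.append(i)
    return hits

print(k_mismatch_search("ACGT", "ACGTACCTAGGT", 1))
```

This baseline advances one position at a time and costs O(mn) comparisons in the worst case. Shift-based methods such as those named in the abstract aim to skip alignments instead.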
A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm
Abstract

Cited by 4 (4 self)
Mapping of next-generation sequencing data and other processor-intensive sequence comparison applications have motivated a continued search for high efficiency sequence alignment algorithms. In one approach, which exploits the inherent parallelism in computer logic calculations, individual cells in an alignment scoring matrix are represented as bits in a computer word and the calculation of scores is emulated by a series of bit operations comprised of AND, OR, XOR, complement, shift, and addition. Bit-parallelism has been successfully applied to the Longest Common Subsequence (LCS) and edit-distance problems, producing solutions which are significantly faster than standard implementations. But, the intensive mental effort required to produce these solutions, which are closely tied to special properties of the problems, has limited efforts to extend bit-parallelism to more general scoring schemes. In this paper, we give the first bit-parallel solution for general, integer-scoring global alignment. Integer-scoring schemes, which are widely used, assign integer weights for match, mismatch, and insertion/deletion or indel. Our method depends on structural properties of the relationship between adjacent scores in the scoring matrix. We utilize these properties to construct a class of efficient algorithms, each designed for a particular set of weights, and we introduce a standard for characterizing the efficiency in terms of the average number of bit-operations per cell of the original scoring matrix.
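The bit-parallel idea mentioned for LCS can be sketched with the well-known addition-based recurrence for the LCS length, shown here in Python. This is the classical LCS technique the abstract cites as prior work, not the integer-scoring algorithm the paper introduces. The function name `lcs_length` is hypothetical, and Python's unbounded integers stand in for a machine word.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, bit-parallel over a."""
    m = len(a)
    mask = (1 << m) - 1
    # match[ch] has bit i set when a[i] == ch.
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask  # all ones: no LCS increments recorded yet
    for ch in b:
        u = v & match.get(ch, 0)
        # One addition, one subtraction, and an OR update a whole matrix row.
        v = ((v + u) | (v - u)) & mask
    # Each zero bit in v marks a row where the LCS length grows by one.
    return m - bin(v).count("1")

print(lcs_length("AGCAT", "GAC"))
```

Each character of `b` is processed with a constant number of word operations, so the work per matrix cell falls well below one operation when `a` fits in a machine word.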
Text Searching: Theory and Practice
Abstract

Cited by 2 (0 self)
We present the state of the art of the main component of text retrieval systems: the search engine. We outline the main lines of research and issues involved. We survey the relevant techniques in use today for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.
Weighted Degenerated Approximate Pattern Matching
Abstract

Cited by 1 (0 self)
We present a bit-parallel approach to the degenerated approximate pattern matching problem, that is, the problem of finding approximate matches of a “special” pattern in a text of degenerate symbols. The special pattern is $P = s_1 \ast(a_1,b_1) \ldots s_\ell \ast(a_\ell,b_\ell)\, s_{\ell+1} \ast(a_{\ell+1},b_{\ell+1}) \ldots s_\omega$, where the symbol $\ast(a,b)$ is a sequence of at least $a$ but at most $b$ “don’t care” symbols, which match any symbol within the alphabet; in other words, $P$ is a sequence of subpatterns with gaps. The pattern is associated with integer weights in each subpattern $s_\ell$ for replacements, insertions, and deletions. The problem is to match the pattern such that the minimum sum of weights is achieved. The total time complexity is $(k(\log(k+2)+1)mn)/w$, where $m$ is the length of the pattern $P$, $n$ is the length of the text of degenerate symbols, $k$ is the maximum number of edit operations performed, and $w$ is the length of the computer word.
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btu507 Sequence analysis Advance Access publication July 29, 2014
, 2014
Abstract
BitPAl: a bit-parallel, general integer-scoring sequence alignment algorithm