Results 1-10 of 25
Fast and Flexible Word Searching on Compressed Text
, 2000
Cited by 102 (38 self)
... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique on inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for display purposes.
Compressed text databases with efficient query algorithms based on the compressed suffix array
 Proceedings of ISAAC'00, number 1969 in LNCS
, 2000
Cited by 68 (3 self)
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only $O(n)$ bits for a text of length $n$; however, it also uses the text itself, which occupies $O(n \log |\Sigma|)$ bits for the alphabet $\Sigma$. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text
The Smallest Grammar Problem
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
Cited by 62 (0 self)
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is $O(n^{1/2})$. We finish by presenting two novel algorithms with exponentially better ratios of $O(\log^3 n)$ and $O(\log(n/m^*))$, where $m^*$ is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
Approximate String Matching over Ziv-Lempel Compressed Text
, 2000
Cited by 52 (13 self)
We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family. Given a text of length $u$ compressed into length $n$, and a pattern of length $m$, we report all the $R$ occurrences of the pattern in the text allowing up to $k$ insertions, deletions and substitutions. On LZ78/LZW we need $O(mkn + R)$ time in the worst case and $O(k^2 n + \min(mkn, m^2 (m\sigma)^k) + R)$ on average, where $\sigma$ is the alphabet size. The experimental results show a practical speedup over the basic approach of up to 2X for moderate $m$ and small $k$. We extend the algorithms to more general compression formats and approximate matching models.
Speeding Up Pattern Matching By Text Compression
, 2000
Cited by 29 (10 self)
Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. By the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism. We have to reexamine existing compression methods or develop a new method in the light of the new criterion: efficiency of pattern matching in compressed text. Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is fast and requires little work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is slow and the compression ratio is not as good as...
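As a rough illustration of the byte pair encoding idea mentioned in this abstract, the sketch below repeatedly replaces the most frequent adjacent pair of symbols with a fresh symbol. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`bpe_compress`, `bpe_expand`, `bpe_decompress`) and the `max_rules` parameter are hypothetical, and new symbols are numbered from 256 upward rather than reusing unused byte values as a byte-oriented BPE coder would.

```python
from collections import Counter

def bpe_compress(data: bytes, max_rules: int = 256):
    """Replace the most frequent adjacent pair with a new symbol, repeatedly."""
    seq = list(data)      # symbols 0..255 are literal bytes
    rules = {}            # new symbol -> (left symbol, right symbol)
    next_symbol = 256     # assumption: fresh symbols start above the byte range
    while len(rules) < max_rules:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:     # no pair repeats, so nothing more to gain
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

def bpe_expand(symbol: int, rules: dict) -> bytes:
    """Expand one symbol back into the bytes it stands for."""
    if symbol < 256:
        return bytes([symbol])
    left, right = rules[symbol]
    return bpe_expand(left, rules) + bpe_expand(right, rules)

def bpe_decompress(seq, rules) -> bytes:
    """Expand every symbol; any slice of seq can be expanded on its own."""
    return b"".join(bpe_expand(s, rules) for s in seq)
```

Because each symbol expands independently of its neighbours, calling `bpe_decompress` on a slice of the compressed sequence recovers just that part of the text, which is the property the abstract refers to as decompressing an arbitrary part of the original.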
Multiple Pattern Matching in LZW Compressed Text
 In Proc. DCC'98
, 1998
Cited by 26 (10 self)
In this paper we address the problem of searching in LZW compressed text directly, and present a new algorithm for finding multiple patterns by simulating the move of the Aho-Corasick pattern matching machine. The new algorithm finds all occurrences of multiple patterns, whereas the algorithm proposed by Amir, Benson, and Farach finds only the first occurrence of a single pattern.
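For readers unfamiliar with the LZW format that this entry searches in, the following is a minimal sketch of textbook LZW coding, not the search algorithm from the paper. The function names `lzw_compress` and `lzw_decompress` are hypothetical, and the sketch assumes every input character has a code point below 256.

```python
def lzw_compress(text: str) -> list:
    """Emit one code per longest phrase already in the dictionary."""
    dictionary = {chr(i): i for i in range(256)}  # assumption: 8-bit characters
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                   # keep extending the phrase
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes: list) -> str:
    """Rebuild the same dictionary on the fly while decoding."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(out)
```

Each emitted code names a dictionary phrase, so a compressed-domain matcher can process one code at a time instead of one character at a time; how the paper drives the Aho-Corasick machine over those codes is not shown here.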
Approximate Matching of Run-Length Compressed Strings
 Algorithmica
, 2001
Cited by 24 (0 self)
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length $m$ and $n$, compressed to $m'$ and $n'$ runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving $O(m'n + n'm)$ complexity.
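To make the notion of a string "compressed to runs" concrete, here is a minimal run-length encoding sketch. The function names `rle_encode` and `rle_decode` are hypothetical, and the sketch only illustrates the input representation; it does not implement the matching algorithm from the paper.

```python
from itertools import groupby

def rle_encode(text: str) -> list:
    """Collapse each maximal block of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]

def rle_decode(runs: list) -> str:
    """Expand (char, length) pairs back into the original string."""
    return "".join(ch * length for ch, length in runs)

runs = rle_encode("aaabccdddd")
# A string of length 10 is represented here by 4 runs.
```

The number of runs, rather than the number of characters, is the quantity that bounds on run-length compressed inputs are stated in.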
Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models
, 2002
Cited by 17 (0 self)
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar...
A Boyer-Moore type algorithm for compressed pattern matching
, 2000
Cited by 17 (7 self)
Recently the compressed pattern matching problem has attracted special concern, where the goal is to find a pattern in a compressed text without decompression. In previous work, we proposed an Aho-Corasick (AC) type algorithm for searching in text files compressed by the so-called byte pair encoding (BPE). The searching time is reduced at the same rate as the compression ratio compared with AC. In this paper, we show a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed files. Experimental results show that the algorithm runs about 1.5 to 3.0 times faster than the exact match routines based on the BM algorithm in the software package Agrep, which is known as the fastest pattern matching tool.
Bit-Parallel Approach to Approximate String Matching in Compressed Texts
, 2000
Cited by 13 (5 self)
In this paper, we address the problem of approximate string matching on compressed text. We consider this problem for a text string described in terms of a collage system, which is a formal system proposed by Kida et al. (1999) that captures various dictionary-based compression methods. We present an algorithm that exploits bit-parallelism, assuming that our problem fits in a single machine word, e.g., $(m - k + 1)(k + 1) \le L$, where $m$ is the pattern length, $k$ is the number of allowed errors, and $L$ is the length in bits of the machine word. For a class of simple collage systems, the algorithm runs in $O(k^2(\|D\| + |S|) + km)$ time using $O(k^2 \|D\|)$ space, where $\|D\|$ is the size of dictionary $D$ and $|S|$ is the number of tokens in $S$. The LZ78 and the LZW compression methods are covered by this class. Since we can regard $n = \|D\| + |S|$ as the compressed length, the time and the space complexities are $O(k^2 n + km)$ and $O(k^2 n)$, respectively.