Application of Lempel-Ziv factorization to the approximation of grammar-based compression (2003)

by W Rytter
Results 1 - 10 of 30

Efficient memory representation of XML documents

by Giorgio Busatto, Markus Lohrey, Sebastian Maneth - In DBPL, 2005
Cited by 23 (7 self)
Abstract. Implementations that load XML documents and give access to them via, e.g., the DOM suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. Here a technique is presented that represents the tree structure of an XML document efficiently. The representation exploits the high regularity in XML documents by "compressing" their tree structure, that is, by detecting and removing repetitions of tree patterns. The functionality of basic tree operations, like traversal along edges, is preserved in the compressed representation. This allows queries (and in particular, bulk operations) to be executed directly, without prior decompression. For certain tasks like validation against an XML type or checking equality of documents, the representation allows for provably more efficient algorithms than those running on conventional representations.
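
As a rough illustration of "detecting and removing repetitions" in a tree, the sketch below shares identical subtrees so that each distinct subtree is stored once (a DAG). This is a simpler, swapped-in technique and not necessarily the paper's representation, which the abstract describes in terms of repeated tree patterns. The tree encoding, the function names, and the sample document are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumed encoding: a tree is (label, children), children being a tuple of trees.
Tree = Tuple[str, tuple]
Node = Tuple[str, Tuple[int, ...]]  # (label, ids of child nodes)


def share_subtrees(tree: Tree) -> Tuple[int, List[Node]]:
    """Return (root_id, nodes) where every distinct subtree appears once."""
    table: Dict[Node, int] = {}
    nodes: List[Node] = []

    def visit(t: Tree) -> int:
        label, children = t
        # Children are numbered first, so equal subtrees get equal keys.
        key = (label, tuple(visit(c) for c in children))
        if key not in table:
            table[key] = len(nodes)
            nodes.append(key)
        return table[key]

    return visit(tree), nodes


def count_nodes(tree: Tree) -> int:
    """Number of nodes in the uncompressed tree."""
    return 1 + sum(count_nodes(c) for c in tree[1])


# Hypothetical document: three structurally identical <book> records.
book = ("book", (("title", ()), ("author", ())))
doc = ("library", (book, book, book))
root, nodes = share_subtrees(doc)
print(count_nodes(doc), "tree nodes stored as", len(nodes), "shared nodes")
```

Traversal along edges still works on the shared form, since each node keeps the ids of its children in order.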

Processing compressed texts: a tractability border

by Yury Lifshits - Proc. CPM 2007, 2007
Cited by 11 (1 self)
Abstract. What kind of operations can we perform effectively (without full unpacking) with compressed texts? In this paper we consider three fundamental problems: (1) check the equality of two compressed texts, (2) check whether one compressed text is a substring of another compressed text, and (3) compute the number of different symbols (Hamming distance) between two compressed texts of the same length. We present an algorithm that solves the first problem in O(n^3) time and the second problem in O(n^2 m) time. Here n is the size of the compressed representation (we consider representations by straight-line programs) of the text and m is the size of the compressed representation of the pattern. Next, we prove that the third problem is actually #P-complete. Thus, we indicate a pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts. Our algorithmic technique used for problems (1) and (2) helps for computing minimal periods and covers of compressed texts.

A Fast and Compact Web Graph Representation

by Francisco Claude, Gonzalo Navarro
Cited by 11 (10 self)
Compressed graph representation has become an attractive research topic because of its applications in the manipulation of huge Web graphs in main memory. By far the best current result is the technique by Boldi and Vigna, which takes advantage of several particular properties of Web graphs. In this paper we show that the same properties can be exploited with a different and elegant technique, built on Re-Pair compression, which achieves about the same space but much faster navigation of the graph. Moreover, the technique has the potential of adapting well to secondary memory. In addition, we introduce an approximate Re-Pair version that works efficiently with limited main memory.
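
For readers unfamiliar with Re-Pair, here is a minimal sketch of the basic idea: repeatedly replace the most frequent adjacent pair of symbols with a fresh symbol and record the rule. It is a naive version that rescans the sequence on every round, not the paper's graph representation nor its approximate variant. The integer-sequence input, the function names, and the `first_new_symbol` parameter are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def repair(seq: List[int], first_new_symbol: int) -> Tuple[List[int], Dict[int, Tuple[int, int]]]:
    """Naive Re-Pair: returns the reduced sequence and the rules new -> (left, right)."""
    rules: Dict[int, Tuple[int, int]] = {}
    seq = list(seq)
    next_symbol = first_new_symbol  # must be larger than any input symbol
    while True:
        # Counts adjacent pairs; overlapping pairs in runs (e.g. a a a) are
        # over-counted, which costs compression but not correctness.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_symbol] = pair
        out: List[int] = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(symbol: int, rules: Dict[int, Tuple[int, int]]) -> List[int]:
    """Recursively expand a symbol back into original symbols."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


# Hypothetical input: node ids from several similar adjacency lists, concatenated.
original = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4]
compressed, rules = repair(original, first_new_symbol=100)
restored = [s for sym in compressed for s in expand(sym, rules)]
print(len(original), "->", len(compressed), "symbols plus", len(rules), "rules")
```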

A Fully Linear-Time Approximation Algorithm for Grammar-Based Compression

by Hiroshi Sakamoto
Cited by 8 (0 self)
A linear-time approximation algorithm for grammar-based compression, which is the optimization problem of minimizing the size of a context-free grammar deriving a given string, is presented. Given a string of length n, the algorithm guarantees an O(log n) approximation ratio and, using doubly-linked lists, hash tables, and priority queues as data structures, runs in O(n) time even if the size of the alphabet is unbounded.

Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions

by Yury Lifshits, Shay Mozes, Oren Weimann, Michal Ziv-Ukelson
Cited by 7 (4 self)
We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi's algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r / log r) using RLE, Ω(log n / k) using LZ78, Ω(r / k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.

Querying and Embedding Compressed Texts

by Yury Lifshits, Markus Lohrey, 2005
Cited by 7 (1 self)
Abstract. In this work the computational complexity of two simple string problems on compressed input strings is considered: the querying problem (What is the symbol at a given position in a given input string?) and the embedding problem (Can the first input string be embedded into the second input string?). Straight-line programs are used for text compression. It is shown that the querying problem becomes P-complete for compressed strings, while the embedding problem becomes hard for the complexity class Θ_2^p.
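
To make the querying problem concrete, the sketch below answers "what is the symbol at position i?" on a straight-line program without expanding it, by descending from the start nonterminal and using the expansion lengths to choose a side. It is a textbook-style descent under an assumed rule encoding (a rule is either a single character or a pair of rule names), not the construction from the paper; the names `make_slp`, `length`, and `char_at` are hypothetical.

```python
from functools import lru_cache
from typing import Callable, Dict, Tuple, Union

Rule = Union[str, Tuple[str, str]]  # terminal character, or (left, right) rule names


def make_slp(rules: Dict[str, Rule]) -> Tuple[Callable[[str], int], Callable[[str, int], str]]:
    """Return (length, char_at) helpers for the given straight-line program."""

    @lru_cache(maxsize=None)
    def length(name: str) -> int:
        rhs = rules[name]
        if isinstance(rhs, str):
            return 1
        return length(rhs[0]) + length(rhs[1])

    def char_at(name: str, i: int) -> str:
        if not 0 <= i < length(name):
            raise IndexError(i)
        while True:
            rhs = rules[name]
            if isinstance(rhs, str):
                return rhs
            left, right = rhs
            if i < length(left):
                name = left            # position falls in the left part
            else:
                i -= length(left)      # skip the left part, continue on the right
                name = right

    return length, char_at


# Example grammar: Fibonacci words, X1 = b, X2 = a, Xn = X(n-1) X(n-2).
rules: Dict[str, Rule] = {"X1": "b", "X2": "a"}
for n in range(3, 41):
    rules[f"X{n}"] = (f"X{n - 1}", f"X{n - 2}")

length, char_at = make_slp(rules)
print("".join(char_at("X7", i) for i in range(length("X7"))))
print(length("X40"), "symbols described by", len(rules), "rules")
```

Each query walks one root-to-leaf path, so its cost is bounded by the grammar's height rather than by the length of the expanded text.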

Self-indexed text compression using straight-line programs

by Francisco Claude, Gonzalo Navarro - In Proc. 34th MFCS, 2009
Cited by 6 (4 self)
Abstract. Straight-line programs (SLPs) offer powerful text compression by representing a text T[1, u] in terms of a restricted context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We present a grammar representation whose size is of the same order as that of a plain SLP representation, and which can answer other queries apart from expanding nonterminals. This can be of independent interest. We then extend it to achieve the first grammar representation capable of extracting text substrings, and of searching the text for patterns, in time o(n). We also give byproducts on representing binary relations.
1 Introduction and Related Work. Grammar-based compression has been a well-known technique since at least the seventies, and is still a very active area of research. From the different variants of the idea, we focus on the case where a given text T[1, u] is replaced by a context-free grammar (CFG) G that generates just the string T. Then one can store G instead

Random Access to Grammar-Compressed Strings

by Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, Oren Weimann, 2011
Cited by 6 (0 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy-paths in grammars.

Semi-local string comparison: Algorithmic techniques and applications

by Alexander Tiskin - Mathematics in Computer Science 1(4) (2008) 571–603. See also arXiv:0707.3619
Cited by 4 (0 self)
The longest common subsequence (LCS) problem is a classical problem in computer science. The semi-local LCS problem is a generalisation of the LCS problem, arising naturally in the context of string comparison. In this work, we present a number of algorithmic techniques related to the semi-local LCS problem, and give a number of algorithmic applications of these techniques. Summarising the presented results, we conclude that semi-local string comparison turns out to be a useful algorithmic plug-in, which unifies, and often improves on, a number of previous approaches to various substring- and subsequence-related problems.

Indexes for Highly Repetitive Document Collections

by Francisco Claude, Antonio Fariña (A Coruña, Spain), Gonzalo Navarro, Miguel A. Martínez-Prieto (Valladolid, Spain)
Cited by 4 (3 self)
We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also introduce compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
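
The following sketch illustrates what a "differential inverted list" is and why run-length coding can pay off on repetitive collections: consecutive document ids produce long runs of gap 1. It only shows the general idea; the paper's actual encodings are not given in the abstract, and the function names and the sample posting list are assumptions.

```python
from itertools import groupby
from typing import List, Tuple


def to_gaps(postings: List[int]) -> List[int]:
    """Differential list: each entry is the distance from the previous document id."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of equal gaps into (gap, count) pairs."""
    return [(gap, sum(1 for _ in group)) for gap, group in groupby(gaps)]


def from_runs(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original posting list from (gap, count) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: a term that occurs in many consecutive versions.
postings = [3, 4, 5, 6, 7, 20, 21, 22]
runs = run_length(to_gaps(postings))
print(runs)
```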