Results 1 - 6 of 6
Grammar-based tree compression
, 2004
Cited by 7 (1 self)
Grammar-based compression means to find a small grammar that generates a given object. Such a grammar reveals the structure of the object (according to the grammar formalism used); the main advantage of this compression method is that the resulting grammar can often be used in further computations without prior decompression. A linear-time bottom-up algorithm is presented which transforms a tree into a particular context-free tree grammar. For common XML documents the algorithm performs well, compressing the tree structure to about 5% of the original size. The validation of an XML document against an XML type can be done without decompression, in linear time w.r.t. the size of the grammar (for a fixed type). While the involved grammars can be double exponentially smaller than the represented trees, testing them for equivalence can be done in polynomial space w.r.t. the sum of their sizes.
Learning Schemas for Unordered XML
Cited by 5 (5 self)
We consider unordered XML, where the relative order among siblings is ignored, and we investigate the problem of learning schemas from examples given by the user. We focus on the schema formalisms proposed in [10]: disjunctive multiplicity schemas (DMS) and their restriction, disjunction-free multiplicity schemas (MS). A learning algorithm takes as input a set of XML documents which must satisfy the schema (i.e., positive examples) and a set of XML documents which must not satisfy the schema (i.e., negative examples), and returns a schema consistent with the examples. We investigate a learning framework inspired by Gold [18], where a learning algorithm should be sound, i.e., always return a schema consistent with the examples given by the user, and complete, i.e., able to produce every schema given a sufficiently rich set of examples. Additionally, the algorithm should be efficient, i.e., polynomial in the size of the input. We prove that the DMS are learnable from positive examples only, but they are not learnable when we also allow negative examples. Moreover, we show that the MS are learnable in the presence of positive examples only, and also in the presence of both positive and negative examples. Furthermore, for the learnable cases, the proposed learning algorithms return minimal schemas consistent with the examples.
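To make the idea of learning a disjunction-free multiplicity schema from positive examples more concrete, here is a minimal sketch. It rests on a simplified reading of MS that is an assumption, not something the abstract spells out: each parent label maps every child label to one of the multiplicities `1`, `?`, `+`, `*`, and sibling order is ignored. The function names (`learn_ms`, `multiplicity`) and the tuple-based tree encoding are hypothetical, and this is not claimed to be the algorithm from the paper.

```python
from collections import Counter, defaultdict

def multiplicity(lo, hi):
    """Smallest of '1', '?', '+', '*' that covers every count in [lo, hi]."""
    if lo >= 1:
        return "1" if hi == 1 else "+"
    return "?" if hi == 1 else "*"

def learn_ms(documents):
    """Infer a per-label multiplicity map from positive examples only.

    A document is a tree (label, [children]); sibling order is ignored.
    """
    bags = defaultdict(list)  # parent label -> one Counter of child labels per node

    def visit(node):
        label, children = node
        bags[label].append(Counter(child[0] for child in children))
        for child in children:
            visit(child)

    for doc in documents:
        visit(doc)

    schema = {}
    for label, observed in bags.items():
        child_labels = set().union(*observed)
        schema[label] = {
            c: multiplicity(min(b[c] for b in observed), max(b[c] for b in observed))
            for c in child_labels
        }
    return schema

# Two hypothetical positive examples.
doc1 = ("book", [("title", []), ("author", []), ("author", [])])
doc2 = ("book", [("title", []), ("author", []), ("year", [])])
schema = learn_ms([doc1, doc2])
print(schema["book"])
```

For these two documents, `title` occurs exactly once under every `book`, `author` occurs one or two times, and `year` is sometimes absent, so the inferred multiplicities are `1`, `+`, and `?` respectively. Choosing the tightest multiplicity per child label is what keeps the result minimal within this simplified setting.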
Tree compression with top trees.
 In Proc. ICALP
, 2013
Cited by 4 (0 self)
We introduce a new compression scheme for labeled trees based on top trees. Our compression scheme is the first to simultaneously take advantage of internal repeats in the tree (as opposed to the classical DAG compression that only exploits rooted subtree repeats) while also supporting fast navigational queries directly on the compressed representation. We show that the new compression scheme achieves close to optimal worst-case compression, can compress exponentially better than DAG compression, is never much worse than DAG compression, and supports navigational queries in logarithmic time.
Constructing Small Tree Grammars and Small Circuits for Formulas
, 2014
It is shown that every tree of size $n$ over a fixed set of $\sigma$ different ranked symbols can be decomposed into $O\left(\frac{n}{\log_\sigma n}\right) = O\left(\frac{n \log \sigma}{\log n}\right)$ many hierarchically defined pieces. Formally, such a hierarchical decomposition has the form of a straight-line linear context-free tree grammar of size $O\left(\frac{n}{\log_\sigma n}\right)$, which can be used as a compressed representation of the input tree. This generalizes an analogous result for strings. Previous grammar-based tree compressors were not analyzed for the worst-case size of the computed grammar, except for the top dag of Bille et al., for which only the weaker upper bound of $O\left(\frac{n}{\log^{0.19} n}\right)$ for unranked and unlabelled trees has been derived. The main result is used to show that every arithmetical formula of size $n$, in which only $m \le n$ different variables occur, can be transformed (in time $O(n \log n)$) into an arithmetical circuit of size $O\left(\frac{n \cdot \log m}{\log n}\right)$ and depth $O(\log n)$. This refines a classical result of Brent, according to which an arithmetical formula of size $n$ can be transformed into a logarithmic-depth circuit of size $O(n)$. Missing proofs can be found in the long version.
ACM Subject Classification: E.4 Data compaction and compression
Keywords and phrases: grammar-based compression, tree compression, arithmetical circuits
Introduction
Grammar-based compression has emerged as an active field in string compression during the past 20 years. The idea is to represent a given string $s$ by a small context-free grammar that generates only $s$; such a grammar is also called a straight-line program, briefly SLP. For instance, the word $(ab)^{1024}$ can be represented by the SLP with the productions $A_0 \to ab$ and $A_i \to A_{i-1} A_{i-1}$ for $1 \le i \le 10$ ($A_{10}$ is the start symbol). The size of this grammar is much smaller than the size (length) of the string $(ab)^{1024}$. In general, an SLP of size $n$ (the size of an SLP is usually defined as the total length of all right-hand sides of the productions) can produce a string of length $2^{\Omega(n)}$.
Hence, an SLP can indeed be seen as a succinct representation of the generated string. The goal of grammar-based string compression is to construct from a given input string $s$ a small SLP that produces $s$. Several algorithms for this have been proposed and analyzed. Prominent grammar-based string compressors are, for instance, LZ78, RePair, and BISECTION. To evaluate the compression performance of a grammar-based compressor $C$, two different approaches can be found in the literature. A first approach is to analyze the size of the SLP produced by $C$ for an input string $x$ compared to the size of a smallest SLP for $x$. This leads to the approximation ratio for $C$.
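The SLP example from the introduction above can be checked with a short sketch. The dictionary encoding of productions and the helper names (`power_slp`, `slp_size`, `derived_length`, `expand`) are assumptions made for illustration; only the productions themselves come from the text. The sketch also shows that the length of the generated string can be computed from the grammar alone, without decompressing it.

```python
from functools import lru_cache

def power_slp(k):
    """Productions A0 -> ab and Ai -> A(i-1) A(i-1) for 1 <= i <= k."""
    rules = {"A0": ["a", "b"]}
    for i in range(1, k + 1):
        rules[f"A{i}"] = [f"A{i-1}", f"A{i-1}"]
    return rules

def slp_size(rules):
    """Size of an SLP: total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())

def derived_length(rules, start):
    """Length of the generated string, computed without expanding it."""
    @lru_cache(maxsize=None)
    def length(symbol):
        if symbol not in rules:      # terminal symbol
            return 1
        return sum(length(s) for s in rules[symbol])
    return length(start)

def expand(rules, symbol):
    """Fully decompress the string derived from `symbol`."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])

rules = power_slp(10)
print(slp_size(rules))               # 22
print(derived_length(rules, "A10"))  # 2048
```

With eleven productions of two symbols each, the grammar has size 22, while the string it generates has 2048 characters. Memoizing `length` visits each nonterminal once, so the length query runs in time linear in the grammar size rather than in the string length.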
XML Compression via DAGs
, 2013
Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to the repetitive markup. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of the minimal dag versus the minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotic behavior.
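The two dag sizes compared above can be illustrated with a small sketch. It builds a minimal dag by sharing identical subtrees (hash-consing) and applies the same routine to the fcns encoding of the tree. The tuple encoding of trees, the use of `None` for a missing binary child, and the function names are assumptions for illustration; the hybrid dag from the abstract is not implemented here.

```python
def minimal_dag(tree):
    """Share identical subtrees; returns a list of (label, child_ids) nodes."""
    ids = {}     # (label, child ids) -> node id
    nodes = []   # node id -> (label, child ids)

    def visit(node):
        if node is None:             # missing child in a binary encoding
            return None
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:
            ids[key] = len(nodes)
            nodes.append(key)
        return ids[key]

    visit(tree)
    return nodes

def dag_edges(nodes):
    """Size of a dag: number of edges (missing children are not counted)."""
    return sum(1 for _, children in nodes for c in children if c is not None)

def fcns(siblings):
    """First child/next sibling encoding of a list of sibling trees."""
    if not siblings:
        return None
    label, children = siblings[0]
    return (label, [fcns(children), fcns(siblings[1:])])

# Hypothetical document: a root with four identical a(b, c) children.
item = ("a", [("b", []), ("c", [])])
doc = ("r", [item, item, item, item])

unranked = dag_edges(minimal_dag(doc))
binary = dag_edges(minimal_dag(fcns([doc])))
print(unranked, binary)  # 6 9
```

In this example the unranked dag stores the repeated `a(b, c)` subtree once, giving 6 edges. In the fcns encoding each `a` node also carries its remaining siblings, so the four `a` nodes are no longer identical and the binary dag needs 9 edges. Other inputs can favor the binary dag instead, which is why the abstract studies both directions.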
XPath Node Selection over Grammar-Compressed Trees
XML document markup is highly repetitive and therefore well compressible using grammar-based compression. Downward, navigational XPath can be executed over grammar-compressed trees in PTIME: the query is translated into an automaton which is executed in one pass over the grammar. This result is well-known and has been mentioned before. Here we present precise bounds on the time complexity of this problem, in terms of big-O notation. For a given grammar and XPath query, we consider three different tasks: (1) to count the number of nodes selected by the query, (2) to materialize the preorder numbers of the selected nodes, and (3) to serialize the subtrees at the selected nodes.