Results 1 - 10 of 14
Compressing and searching XML data via two zips
In Proceedings of the 15th World Wide Web Conference (WWW), 751–760, 2006
"... ..."
(Show Context)
Compressing and indexing labeled trees, with applications
2009
"... Consider an ordered, static tree T where each node has a label from alphabet �. Tree T may be of arbitrary degree and shape. Our goal is designing a compressed storage scheme of T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, ith child, or any chi ..."
Cited by 22 (1 self)
Abstract:
Consider an ordered, static tree T where each node has a label from an alphabet Σ. Tree T may be of arbitrary degree and shape. Our goal is to design a compressed storage scheme for T that supports basic navigational operations among the immediate neighbors of a node (i.e., parent, i-th child, or any child with a given label) as well as more sophisticated path-based search operations over its labeled structure. We present a novel approach to this problem by designing what we call the XBW-transform of the tree, in the spirit of the well-known Burrows-Wheeler transform for strings [1994]. The XBW-transform uses path-sorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBW-transform, our compressed indexes go beyond the information-theoretic lower bound, and support navigational and path-search operations over labeled trees within (near-)optimal time bounds and entropy-bounded space. Our XBW-transform is simple and likely to spur new results in the theory of tree compression and indexing, as well as interesting application contexts. As an example, we use the XBW-transform to design and implement a compressed index for XML documents whose compression ratio is significantly better than the one achievable by state-of-the-art tools, and whose query time performance is orders of magnitude faster.
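A rough sketch of the path-sorting idea behind the XBW-transform (an illustrative Python toy, not the authors' implementation): each node is keyed by the string of labels on its upward path to the root, all nodes are stably sorted by that key, and two coordinated arrays record last-child flags and node labels. The tree encoding and names below are assumptions for illustration.

def xbw(tree):
    # tree node = (label, list of children); illustrative encoding only
    rows = []  # (upward label path, last-child flag, label)

    def visit(node, upward, is_last):
        label, children = node
        rows.append((upward, is_last, label))
        for i, child in enumerate(children):
            visit(child, label + upward, i == len(children) - 1)

    visit(tree, "", True)                 # root conventionally marked "last"
    rows.sort(key=lambda r: r[0])         # stable sort keeps sibling order
    s_last = [int(r[1]) for r in rows]    # structure array
    s_alpha = [r[2] for r in rows]        # label array
    return s_last, s_alpha

# toy tree:  a(b(d), c)
t = ("a", [("b", [("d", [])]), ("c", [])])
print(xbw(t))   # ([1, 0, 1, 1], ['a', 'b', 'c', 'd'])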
Efficient external-memory bisimulation on DAGs
In SIGMOD, 2012
"... ABSTRACT In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance model ..."
Cited by 7 (2 self)
Abstract:
In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory. Our general algorithm has a worst-case I/O complexity of O(Sort(|N| + |E|)), where |N| and |E| are the numbers of nodes and edges, respectively, in the data graph and Sort(n) is the number of accesses to external memory needed to sort an input of size n. We also study specializations of this algorithm to common variations of bisimulation for tree-structured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source C++ implementation has been made freely available.
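The paper's contribution is an external-memory algorithm; the in-memory Python sketch below only illustrates the bisimilarity notion it computes: two DAG nodes fall into the same class when they carry the same label and their children's classes coincide. The toy DAG and its encoding are assumptions, not the paper's data model.

# toy DAG: node id -> (label, list of child ids); hypothetical example data
dag = {
    "n1": ("a", ["n2", "n3"]),
    "n2": ("b", ["n4"]),
    "n3": ("b", ["n4"]),
    "n4": ("c", []),
}

def bisim_classes(dag):
    class_of_sig = {}   # canonical signature -> class id
    node_class = {}     # node id -> class id

    def class_of(node):
        if node not in node_class:
            label, children = dag[node]
            # signature: own label plus the set of the children's classes
            sig = (label, frozenset(class_of(c) for c in children))
            node_class[node] = class_of_sig.setdefault(sig, len(class_of_sig))
        return node_class[node]

    for n in dag:
        class_of(n)
    return node_class

print(bisim_classes(dag))   # n2 and n3 share a class; n1 and n4 are singletons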
Extended XML Tree Pattern Matching: Theories and Algorithms
2010
"... As business and enterprises generate and exchange XML data more often, there is an increasing need for efficient processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core operation in XML query processing. Prior works demonstrate that holis ..."
Cited by 6 (2 self)
Abstract:
As businesses and enterprises generate and exchange XML data more often, there is an increasing need for efficient processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core operation in XML query processing. Prior work demonstrates that holistic twig pattern matching is an efficient technique for answering an XML tree pattern with parent-child (P-C) and ancestor-descendant (A-D) relationships, as it can effectively control the size of intermediate results during query processing. However, XML query languages (e.g., XPath, XQuery) define more axes and functions, such as the negation function, order-based axes and wildcards. In this article, we study a large class of XML tree patterns, called extended XML tree patterns, which may include P-C and A-D relationships, negation functions, wildcards and order restrictions. We establish a theoretical framework around "matching cross" which captures the intrinsic reason underlying the optimality proofs of holistic algorithms. Based on our theorems, we propose a set of novel algorithms to efficiently process three categories of extended XML tree patterns. A set of experimental results on both real-life and synthetic data sets demonstrates the effectiveness and efficiency of our proposed theories and algorithms.
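To make the matching task concrete, here is a deliberately naive Python sketch of matching a twig pattern with parent-child (/) and ancestor-descendant (//) edges against a labeled tree; the holistic algorithms studied in the paper avoid the redundant work such a direct matcher performs. The node and pattern encodings are illustrative assumptions.

# document tree node = (label, list of children)
doc = ("book", [
    ("chapter", [("title", []), ("section", [("title", [])])]),
    ("chapter", [("title", [])]),
])

# pattern node = (label or "*", axis, list of sub-patterns); the axis describes
# the edge from the parent pattern node: "/" = child, "//" = descendant
pattern = ("book", "/", [("chapter", "/", [("title", "//", [])])])

def descendants(node):
    for child in node[1]:
        yield child
        yield from descendants(child)

def matches(node, pat):
    label, _, subpats = pat
    if label != "*" and node[0] != label:
        return False
    for sp in subpats:
        candidates = node[1] if sp[1] == "/" else list(descendants(node))
        if not any(matches(c, sp) for c in candidates):
            return False
    return True

print(matches(doc, pattern))   # True: the book has a chapter with a title below it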
A resource efficient hybrid data structure for twig queries
In Database and XML Technologies: 4th International XML Database Symposium (XSym 2006), 2006
"... Abstract. Designing data structures for use in mobile devices requires attention on optimising data volumes with associated benefits for data transmission, storage space and battery use. For semistructured data, tree summarisation techniques can be used to reduce the volume of structured elements wh ..."
Cited by 1 (1 self)
Abstract:
Designing data structures for use in mobile devices requires attention to optimising data volumes, with associated benefits for data transmission, storage space and battery use. For semistructured data, tree summarisation techniques can be used to reduce the volume of structured elements, while dictionary compression can efficiently deal with value-based predicates. This paper introduces an integration of the two approaches using numbering schemes to connect the separate elements. The key strength of this hybrid technique is that both structural and value predicates can be resolved in one graph, while still allowing the resulting data structure to be compressed. Performance measures that show the advantages of using this hybrid structure are presented, together with an analysis of query resolution using a number of different index granularities. As the current trend is towards working with larger semi-structured data sets, this work allows such data sets to be utilised while reducing both the bandwidth and storage space necessary.
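One common kind of numbering scheme that can tie a structural summary to separately stored values is interval (pre/post) numbering, sketched below in Python; the paper's actual scheme and data layout are not reproduced here, so this encoding is an assumption for illustration.

def number_tree(node, counter=None, intervals=None):
    # node = (label, children); assigns (start, end) pairs in document order
    if counter is None:
        counter, intervals = [0], []
    label, children = node
    start = counter[0]
    counter[0] += 1
    for child in children:
        number_tree(child, counter, intervals)
    end = counter[0]
    counter[0] += 1
    intervals.append((label, start, end))
    return intervals

def is_ancestor(a, d):
    # a, d are (label, start, end) triples; containment encodes ancestor-descendant
    return a[1] < d[1] and d[2] < a[2]

tree = ("library", [("book", [("title", [])]), ("book", [])])
ivs = number_tree(tree)
lib = next(iv for iv in ivs if iv[0] == "library")
title = next(iv for iv in ivs if iv[0] == "title")
print(is_ancestor(lib, title))   # True
# A value hit found in a dictionary-compressed value store can carry its node's
# (start, end) pair, so structural predicates reduce to interval containment.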
Searching Web Data: an Entity Retrieval and High-Performance Indexing Model
"... More and more (semi) structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large scale sys ..."
Cited by 1 (0 self)
Abstract:
More and more (semi-)structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large-scale systems providing effective means of searching and retrieving this semi-structured information, with the ultimate goal of making it exploitable by humans and machines alike. This article examines the shift from the traditional web document model to a web data object (entity) model and studies the challenges faced in implementing a scalable and high-performance system for searching semi-structured data objects over a large, heterogeneous and decentralised infrastructure. Towards this goal, we define an entity retrieval model, develop novel methodologies for supporting this model and show how to achieve a high-performance entity retrieval system. We introduce an indexing methodology for semi-structured data which offers a good compromise between query expressiveness, query processing and index maintenance compared to other approaches. We address high performance through optimisation of the index data structure using appropriate compression techniques. Finally, we demonstrate that the resulting system can index billions of data objects and provides keyword-based as well as more advanced search interfaces for retrieving relevant data objects in sub-second time.
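As an illustration only (not the paper's actual index layout), the Python sketch below indexes entity descriptions by (attribute, term) pairs so that both plain keyword search and attribute-scoped lookup are possible over semi-structured records; the sample entities are made up.

from collections import defaultdict

class EntityIndex:
    # postings keyed by (attribute, term); attribute None means "any attribute"
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, entity_id, description):
        # description: dict mapping attribute name -> text value
        for attribute, value in description.items():
            for term in value.lower().split():
                self.postings[(attribute, term)].add(entity_id)
                self.postings[(None, term)].add(entity_id)

    def search(self, term, attribute=None):
        return self.postings.get((attribute, term.lower()), set())

idx = EntityIndex()
idx.add("e1", {"name": "Tim Berners-Lee", "affiliation": "W3C"})
idx.add("e2", {"name": "W3C", "type": "Organization"})
print(idx.search("w3c"))                     # both entities mention the term
print(idx.search("w3c", attribute="name"))   # only e2 has it in the name field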
Optimal Tree Node Ordering for
"... Abstract — There are many applications in which users interactively access huge tree data by repeating set-based navigations. In this paper, we focus on label-specific/wildcard children/descendant navigations. For efficient processing of these operations in huge data stored on a disk, we need a node ..."
Abstract:
There are many applications in which users interactively access huge tree data by repeating set-based navigations. In this paper, we focus on label-specific/wildcard children/descendant navigations. For efficient processing of these operations on huge data stored on disk, we need a node ordering scheme that clusters together the nodes accessed by these operations. In this paper, (1) we show there is no node order that is optimal for all these operations, (2) we propose two schemes, each of which is optimal only for some subset of them, and (3) we show that one of the proposed schemes can process all these operations with access to a constant-bounded number of regions on the disk without accessing irrelevant nodes.
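A small Python illustration of why the node order matters: in preorder (document order) the descendants of any node occupy one contiguous block, so a descendant navigation touches a single disk region, while nodes sharing a label may be scattered. This shows only one candidate ordering, not the schemes proposed in the paper.

def preorder(node, order=None):
    # node = (label, children); returns the labels in document (preorder) order
    if order is None:
        order = []
    order.append(node[0])
    for child in node[1]:
        preorder(child, order)
    return order

tree = ("r", [("a", [("x", []), ("y", [])]), ("b", [("x", [])])])
print(preorder(tree))   # ['r', 'a', 'x', 'y', 'b', 'x']
# The descendants of 'a' ('x', 'y') form one contiguous run right after 'a',
# while the two 'x' nodes are scattered; a label-clustered order would invert
# that trade-off, which is why no single order is optimal for every navigation.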
Hierarchical Indexing Approach to Support XPath Queries
"... Abstract — We study new hierarchical indexing approach to process XPATH queries. Here, a hierarchical index consists of index entries that are pairs of queries and their (full/partial) answers (called extents). With such an index, XPATH queries can be processed to extract the results if they match t ..."
Abstract:
We study a new hierarchical indexing approach to process XPath queries. Here, a hierarchical index consists of index entries that are pairs of queries and their (full/partial) answers (called extents). With such an index, XPath queries can be processed by extracting the results if they match the queries maintained in those index entries. Existing XML path indexing approaches support either the child axis (/) only, or additionally the descendant-or-self axis (//) but only at the query root. Different from them, we propose a novel indexing approach to process a large fragment of XPath queries, which may use /, //, and wildcards (*). The key issues are how to reduce the number of index entries and how to maintain non-overlapping extents among index entries. We show how to compress such an index and how to evaluate XPath queries on it. Experiments show the efficiency of our approaches.
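A minimal Python sketch of the (query, extent) idea: if an incoming path query matches a maintained entry, its answer is read straight from the stored extent. Entry selection, partial answers and extent maintenance in the paper are far richer; the queries and node ids below are made-up examples.

class HierarchicalIndex:
    def __init__(self):
        self.entries = {}   # query string -> extent (list of matching node ids)

    def add_entry(self, query, extent):
        self.entries[query] = extent

    def evaluate(self, query):
        # full match against a maintained entry: return its cached extent;
        # otherwise a real system would fall back to ordinary query evaluation
        return self.entries.get(query)

idx = HierarchicalIndex()
idx.add_entry("/site/people/person", [4, 9, 17])   # hypothetical node ids
idx.add_entry("//person/name", [5, 10, 18])
print(idx.evaluate("//person/name"))   # [5, 10, 18]
print(idx.evaluate("//item/price"))    # None: not covered by this index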