Scalable semantic web data management using vertical partitioning
In VLDB, 2007
Cited by 190 (6 self)
The dataset used for this benchmark is taken from the publicly available Barton Libraries dataset [1]. This data is provided by the Simile Project [3], which develops tools for library data management and interoperability. The data contains records that compose an RDF-formatted dump of the MIT Libraries Barton catalog, converted from raw data stored in an old library format standard called MARC (Machine Readable Catalog). Because of the multiple sources the data was derived from and the diverse nature of the data that is cataloged, the structure of the data is quite irregular. At the time of publication of this report, there are slightly more than 50 million triples in the dataset, with a total of 221 unique properties, of which the vast majority appear infrequently. Of these properties, 82 (37%) are multi-valued, meaning that they appear more than once for a given subject; however, these properties appear more often (77% of the triples have a multi-valued property). The dataset provides a good demonstration of the relatively unstructured nature of Semantic Web data.

2. LONGWELL OVERVIEW

Longwell [2] is a tool developed by the Simile Project, which provides a graphical user interface for generic RDF data exploration in a web browser. It begins by presenting the user with a list of the values the type property can take (such as Text or Notated Music in the library dataset). The user can click on the types of data he desires to further explore. Longwell shows the list of currently filtered resources (RDF subjects) in the main portion of the screen, and a list of filters in panels along the side. Each panel represents a property that is defined on the current filter, with popular object values for that property and their frequency also presented in this box. If the user selects an object value, this filters the working set of resources to those that have that property-object value defined,
Hexastore: Sextuple Indexing for Semantic Web Data Management
2008
Cited by 188 (11 self)
Despite the intense interest towards realizing the Semantic Web vision, most existing RDF data management schemes are constrained in terms of efficiency and scalability. Still, the growing popularity of the RDF format arguably calls for an effort to offset these drawbacks. Viewed from a relational database perspective, these constraints are derived from the very nature of the RDF data model, which is based on a triple format. Recent research has attempted to address these constraints using a vertical-partitioning approach, in which separate two-column tables are constructed for each property. However, as we show, this approach suffers from similar scalability drawbacks on queries that are not bound by RDF property value. In this paper, we propose an RDF storage scheme that uses the triple nature of RDF as an asset. This scheme enhances the vertical partitioning idea and takes it to its logical conclusion. RDF data is indexed in six possible ways, one for each possible ordering of the three RDF elements. Each instance of an RDF element is associated with two vectors; each such vector gathers elements of one of the other types, along with lists of the third-type resources attached to each vector element. Hence, a sextuple-indexing scheme emerges. This format allows for quick and scalable general-purpose query processing; it confers significant advantages (up to five orders of magnitude) compared to previous approaches for RDF data management, at the price of a worst-case five-fold increase in index space. We experimentally document the advantages of our approach on real-world and synthetic data sets with practical queries.
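The six-ordering idea in this abstract can be illustrated with a small in-memory sketch. This is not the paper's implementation: the class name `SextupleIndex`, its methods, and the sample triples are all hypothetical, and the nested-dictionary layout makes no attempt to reproduce the paper's vector structure or its space bound. It only shows why keeping every ordering of (subject, property, object) lets any pattern, including one with an unbound property, start from a direct lookup.

```python
from collections import defaultdict
from itertools import permutations


class SextupleIndex:
    """Toy index over all six orderings of (s, p, o). Hypothetical sketch."""

    # The six orderings: spo, sop, pso, pos, osp, ops.
    ORDERS = ["".join(perm) for perm in permutations("spo")]

    def __init__(self):
        # index[order][first][second] -> set of third elements
        self.index = {
            order: defaultdict(lambda: defaultdict(set)) for order in self.ORDERS
        }

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            a, b, c = (triple[k] for k in order)
            self.index[order][a][b].add(c)

    def match(self, s=None, p=None, o=None):
        """Return the set of triples matching a pattern; None is a wildcard."""
        pattern = {"s": s, "p": p, "o": o}
        # Choose the ordering that puts the bound positions first
        # (sorted is stable, so bound keys keep their s, p, o order).
        order = "".join(sorted("spo", key=lambda k: pattern[k] is None))
        k1, k2, k3 = order
        idx = self.index[order]
        firsts = [pattern[k1]] if pattern[k1] is not None else list(idx)
        out = set()
        for a in firsts:
            if a not in idx:
                continue
            seconds = [pattern[k2]] if pattern[k2] is not None else list(idx[a])
            for b in seconds:
                if b not in idx[a]:
                    continue
                for c in idx[a][b]:
                    if pattern[k3] is None or pattern[k3] == c:
                        t = {k1: a, k2: b, k3: c}
                        out.add((t["s"], t["p"], t["o"]))
        return out


store = SextupleIndex()
store.add("book1", "type", "Text")
store.add("book1", "title", "Example")
store.add("book2", "type", "Text")

# Property is unbound here; the "osp" ordering answers it by direct lookup.
by_object = store.match(o="Text")
```

A pattern bound only by object, such as `store.match(o="Text")`, is the kind of query the abstract says a per-property layout handles poorly, since no single property table can be selected up front.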
SW-Store: a vertically partitioned DBMS for Semantic Web data management
2009
Cited by 72 (1 self)
Efficient management of RDF data is an important prerequisite for realizing the Semantic Web vision. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is applied to real-world applications. In this paper, we examine the reasons why current data management solutions for RDF data scale poorly, and explore the fundamental scalability limitations of these approaches. We review the state of the art for improving performance of RDF databases and consider a recent suggestion, “property tables”. We then discuss practically and empirically why this solution has undesirable features. As an improvement, we propose an alternative solution: vertically partitioning the RDF data. We compare the performance of vertical partitioning with prior art on queries generated by a Web-based RDF browser over a large-scale (more than 50 million triples) catalog of library data. Our results show that a vertically partitioned schema achieves similar performance to the property table technique while being much simpler to design. Further, if a column-oriented DBMS (a database architected specially for the vertically partitioned case) is used instead of a row-oriented DBMS, another order of magnitude performance improvement is observed, with query times dropping from minutes to several seconds. Encouraged by these results, we describe the architecture of
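As a rough illustration of the vertically partitioned schema this abstract proposes, the sketch below stores each property in its own two-column (subject, object) table using Python's standard `sqlite3` module. SQLite is a row-oriented engine, so this shows only the schema idea, not the column-store performance the abstract reports. The sample triples, the `language` property, and the table naming are assumptions made for the example.

```python
import sqlite3

# Hypothetical sample data; only the (subject, property, object) shape matters.
triples = [
    ("book1", "type", "Text"),
    ("book1", "language", "fre"),
    ("book2", "type", "Text"),
    ("book2", "language", "eng"),
    ("score1", "type", "Notated Music"),
]

conn = sqlite3.connect(":memory:")

# Vertical partitioning: one two-column table per distinct property.
for prop in sorted({p for _, p, _ in triples}):
    conn.execute(f'CREATE TABLE "{prop}" (subject TEXT, object TEXT)')
for s, p, o in triples:
    conn.execute(f'INSERT INTO "{p}" VALUES (?, ?)', (s, o))

# A property-bound query touches only the tables it names:
# the language of every subject whose type is Text.
rows = conn.execute(
    'SELECT t.subject, l.object FROM "type" AS t '
    'JOIN "language" AS l ON l.subject = t.subject '
    "WHERE t.object = 'Text' ORDER BY t.subject"
).fetchall()
print(rows)
```

Each property named in a query becomes one table in the join, so properties the query never mentions are never read.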
4store: The Design and Implementation of a Clustered RDF Store
In: Scalable Semantic Web Knowledge Base Systems - SSWS2009, 2009
Cited by 53 (2 self)
This paper describes the design and implementation of the 4store RDF storage and SPARQL query system with respect to its cluster and query processing design. 4store was originally designed to meet the data needs of Garlik, a UK-based semantic web company. This paper describes the design and performance characteristics of 4store, as well as discussing some of the trade-offs and design decisions. These arose both from immediate business requirements and a desire to engineer a scalable system capable of reuse in a range of experimental contexts where we were looking to explore new business opportunities.
Towards effective partition management for large graphs
In SIGMOD, 2012
Cited by 29 (1 self)
Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks to de novo genome sequence assembly. Scalable processing of large graphs requires careful partitioning and distribution of graphs across clusters. In this paper, we investigate the problem of managing large-scale graphs in clusters and study access characteristics of local graph queries such as breadth-first search, random walk, and SPARQL queries, which are popular in real applications. These queries exhibit strong access locality, and therefore require specific data partitioning strategies. In this work, we propose a Self Evolving Distributed Graph Management Environment (Sedge), to minimize inter-machine communication during graph query processing in multiple machines. In order to improve query response time and throughput, Sedge introduces a two-level partition management architecture with complementary primary partitions and dynamic secondary partitions. These two kinds of partitions are able to adapt in real time to changes in query workload. Sedge also includes a set of workload analyzing algorithms whose time complexity is linear or sublinear in graph size. Empirical results show that it significantly improves distributed graph processing on today’s commodity clusters.
Query execution in column-oriented database systems
2008
Cited by 27 (2 self)
There are two obvious ways to map a two-dimensional relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by-row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there is a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature; their goal is to read through the data to gain new insight and use it to drive decision making and planning. In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging
Ultrawrap: SPARQL Execution on Relational Data
Cited by 19 (5 self)
The Semantic Web’s promise to achieve web-wide data integration requires the inclusion of legacy relational data as RDF, which, in turn, requires the execution of SPARQL queries on the legacy relational database. In this paper we explore a hypothesis: existing commercial relational databases already subsume the algorithms and optimizations needed to support effective SPARQL execution on existing relationally stored data. The experiment, embodied in a system called Ultrawrap, comprises encoding a logical representation of the database as a graph using SQL views and a simple syntactic translation of SPARQL queries to SQL queries on those views. Thus, in the course of executing a SPARQL query, the SQL optimizer both instantiates a mapping of relational data to RDF and optimizes its execution. Other approaches typically implement aspects of query optimization and execution outside the SQL environment. Ultrawrap is evaluated using two benchmarks across the three major relational database management systems. We identify two important optimizations: detection of unsatisfiable conditions and self-join elimination, such that, when applied, SPARQL queries execute at nearly the same speed as semantically equivalent native SQL queries, providing strong evidence of the validity of the hypothesis.
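The view-based encoding described in this abstract can be sketched with Python's standard `sqlite3` module. This is an illustration under assumptions, not Ultrawrap's actual view definitions or translation rules: the `product` table, the `triples` view, and the URI-like subject strings are hypothetical. It shows a relational table exposed as (s, p, o) rows through a SQL view, a two-pattern query written as a self-join on that view, and the native SQL query it is semantically equivalent to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT, price INTEGER);
INSERT INTO product VALUES (1, 'widget', 10), (2, 'gadget', 25);

-- Hypothetical triple view: one SELECT per column, unioned together.
CREATE VIEW triples AS
  SELECT 'product/' || id AS s, 'label' AS p, label AS o FROM product
  UNION ALL
  SELECT 'product/' || id, 'price', CAST(price AS TEXT) FROM product;
""")

# The pattern { ?x label ?l . ?x price ?c } translated syntactically:
# each triple pattern becomes one reference to the view, joined on ?x.
via_view = conn.execute("""
    SELECT t1.s, t1.o, t2.o
    FROM triples AS t1 JOIN triples AS t2 ON t1.s = t2.s
    WHERE t1.p = 'label' AND t2.p = 'price'
    ORDER BY t1.s
""").fetchall()

# The semantically equivalent native query, with no self-join.
native = conn.execute("""
    SELECT 'product/' || id, label, CAST(price AS TEXT)
    FROM product ORDER BY id
""").fetchall()

print(via_view == native)
```

The gap between the two queries is what the self-join elimination mentioned in the abstract targets; whether a given optimizer actually closes that gap depends on the database system.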
TripleBit: a Fast and Compact System for Large Scale RDF Data
Cited by 15 (2 self)
The volume of RDF data has continued to grow over the past decade, and many known RDF datasets have billions of triples. A grand challenge of managing this huge RDF data is how to access it efficiently. A popular approach to addressing the problem is to build a full set of permutations of (S, P, O) indexes. Although this approach has been shown to accelerate joins by orders of magnitude, the large space overhead limits the scalability of this approach and makes it heavyweight. In this paper, we present TripleBit, a fast and compact system for storing and accessing RDF data. The design of TripleBit has three salient features. First, the compact design of TripleBit reduces both the size of stored RDF data and the size of its indexes. Second, TripleBit introduces two auxiliary index structures, ID-Chunk bit matrix and ID-Predicate bit matrix, to minimize the cost of index selection during query evaluation. Third, its query processor dynamically generates an optimal execution ordering for join queries, leading to fast query execution and effective reduction in the size of intermediate results. Our experiments show that TripleBit outperforms RDF-3X, MonetDB, and BitMat on LUBM, UniProt and BTC 2012 benchmark queries and that it offers orders of magnitude performance improvement for some complex join queries.
RDFProv: A Relational RDF Store for Querying and Managing Scientific Workflow Provenance
2010
Cited by 11 (2 self)
Provenance metadata has become increasingly important to support scientific discovery reproducibility, result interpretation, and problem diagnosis in scientific workflow environments. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. Our approach to provenance management seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. In this paper, we elaborate on the design of a relational RDF store, called RDFProv, that is optimized for scientific workflow provenance querying and management. Specifically, we propose: i) two schema mapping algorithms to map an OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema, and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. Experimental results are presented to show that our algorithms are efficient and scalable. The comparison with two popular relational RDF stores, Jena and Sesame, and two commercial native RDF stores, AllegroGraph and BigOWLIM, showed that our optimizations result in improved performance and scalability for provenance metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed the production quality and capability of the RDFProv system. 
Although presented in the context of scientific workflow provenance management, many of our proposed techniques apply to general RDF data management as well.