Ad-hoc object retrieval in the web of data.
- In Proceedings of the 19th International Conference on World Wide Web
, 2010
Cited by 42 (8 self)
Semantic Search refers to a loose set of concepts, challenges and techniques having to do with harnessing the information of the growing Web of Data (WoD) for Web search. Here we propose a formal model of one specific semantic search task: ad-hoc object retrieval. We show that this task provides a solid framework to study some of the semantic search problems currently tackled by commercial Web search engines. We connect this task to traditional ad-hoc document retrieval and discuss appropriate evaluation metrics. Finally, we carry out a realistic evaluation of this task in the context of a Web search application.
A Distributed Graph Engine for Web Scale RDF Data
Cited by 35 (1 self)
Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to optimize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.
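As a rough illustration of why a native graph layout makes operations such as reachability straightforward, the sketch below indexes triples as adjacency lists and runs a breadth-first search over them. It is a minimal single-machine sketch under stated assumptions, not Trinity.RDF's actual storage design; the names `build_adjacency` and `reachable` and the sample triples are hypothetical.

```python
from collections import defaultdict, deque

def build_adjacency(triples):
    """Index (subject, predicate, object) triples as subject -> [(predicate, object)]."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))
    return adjacency

def reachable(adjacency, source, target):
    """Breadth-first search following outgoing edges, ignoring predicate labels."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        # .get avoids creating empty entries for nodes with no outgoing edges.
        for _, neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

# Hypothetical sample data, for illustration only.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("dave", "knows", "alice"),
]
graph = build_adjacency(triples)
```

With this layout, `reachable(graph, "alice", "carol")` walks the edges directly rather than issuing repeated self-joins over a triple table.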
Relational Processing of RDF Queries: A Survey
Cited by 23 (6 self)
The Resource Description Framework (RDF) is a flexible model for representing information about resources in the web. With the increasing amount of RDF data which is becoming available, efficient and scalable management of RDF data has become a fundamental challenge to achieve the Semantic Web vision. The RDF model has attracted the attention of the database community and many researchers have proposed different solutions to store and query RDF data efficiently. This survey focuses on using relational query processors to store and query RDF data. We provide an overview of the different approaches and classify them according to their storage and query evaluation strategies.
GraphX: Graph Processing in a Distributed Dataflow Framework
- USENIX ASSOCIATION 11TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI ’14)
, 2014
Cited by 23 (1 self)
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
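To make the "join, map, group-by" claim concrete, the sketch below expresses one neighbor-aggregation step (summing, for each vertex, the values of the vertices that point to it) using only those three operations over plain Python collections. It is an illustrative sketch, not GraphX or Spark code; the function name `aggregate_neighbors` and the sample graph are hypothetical.

```python
from collections import defaultdict

def aggregate_neighbors(vertices, edges):
    """Sum, for each vertex, the values of the vertices that point to it.

    vertices: dict of vertex id -> numeric value
    edges: list of (source id, destination id) pairs
    """
    # Join: attach each edge's source value to the edge.
    joined = [(src, dst, vertices[src]) for src, dst in edges if src in vertices]
    # Map: turn each joined row into a message keyed by destination.
    messages = [(dst, value) for _, dst, value in joined]
    # Group-by: combine messages per destination vertex.
    totals = defaultdict(float)
    for dst, value in messages:
        totals[dst] += value
    # Vertices with no incoming edges receive 0.0.
    return {v: totals.get(v, 0.0) for v in vertices}

# Hypothetical sample graph, for illustration only.
vertices = {"a": 1.0, "b": 2.0, "c": 3.0}
edges = [("a", "b"), ("a", "c"), ("b", "c")]
result = aggregate_neighbors(vertices, edges)
```

Iterative algorithms such as PageRank repeat a step of this shape; in a distributed setting each of the three stages would become a dataflow operator rather than an in-memory loop.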
A Minimal Deductive System for General Fuzzy RDF
Cited by 15 (3 self)
It is well known that crisp RDF is not suitable to represent vague information. Fuzzy RDF variants are emerging to overcome these limitations. In this work we provide, under a very general semantics, a deductive system for a salient fragment of fuzzy RDF. We then also show how we may compute the top-k answers of the union of conjunctive queries in which answers may be scored by means of a scoring function.
A Database Perspective on Consuming Linked Data on the Web
Cited by 15 (1 self)
During recent years an increasing number of data providers adopted the Linked Data principles for publishing and connecting structured data on the Web, thus creating a globally distributed dataspace – the Web of Data. While the execution of structured, SQL-like queries over this dataspace opens possibilities not conceivable before, query execution on the Web of Data poses novel challenges. These challenges provide great opportunities for the database community. In this article we introduce the concept of Linked Data and discuss different approaches to query the Web of Data. Our goal is to provide a general understanding of this new research area and of the challenges and open issues that must be addressed.
Heuristics-based Query Optimisation for SPARQL
Cited by 14 (0 self)
Query optimization in RDF Stores is a challenging problem as SPARQL queries typically contain many more joins than equivalent relational plans, and hence lead to a large join order search space. In such cases, cost-based query optimization often is not possible. One practical reason for this is that statistics are typically missing in web-scale settings such as the Linked Open Datasets (LOD). The more profound reason is that due to the absence of schematic structure in RDF, join-hit ratio estimation requires complicated forms of correlated join statistics, and currently there are no methods to identify the relevant correlations beforehand. For this reason, the use of good heuristics is essential in SPARQL query optimization, even when they are only partially combined with cost-based statistics (i.e., hybrid query optimization). In this paper we describe a set of useful heuristics for SPARQL query optimizers. We present these in the context of a new Heuristic SPARQL Planner (HSP) that is capable of exploiting the syntactic and the structural variations of the triple patterns in a SPARQL query in order to choose an execution plan without the need of any cost model. For this, we define the variable graph and we show a reduction of the SPARQL query optimization problem to the maximum weight independent set problem. We implemented our planner on top of the MonetDB open source column-store and evaluated its effectiveness against the state-of-the-art RDF-3X engine, and compared the plan quality with a relational (SQL) equivalent of the benchmarks.
Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools
, 2010
Cited by 12 (1 self)
Cloud computing is the newest paradigm in the IT world and hence the focus of new research. Companies hosting cloud computing services face the challenge of handling data intensive applications. Semantic web technologies can be an ideal candidate to be used together with cloud computing tools to provide a solution. These technologies have been standardized by the World Wide Web Consortium (W3C). One such standard is the Resource Description Framework (RDF). With the explosion of semantic web technologies, large RDF graphs are commonplace. Current frameworks do not scale for large RDF graphs. In this paper, we describe a framework that we built using Hadoop, a popular open source framework for Cloud Computing, to store and retrieve large numbers of RDF triples. We describe a scheme to store RDF data in Hadoop Distributed File System. We present an algorithm to generate the best possible query plan to answer a SPARQL Protocol and RDF Query Language (SPARQL) query based on a cost model. We use Hadoop's MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity class hardware. Furthermore, we show that our framework is scalable and efficient and can easily handle billions of RDF triples, unlike traditional approaches.
TriAD: A Distributed Shared-Nothing RDF Engine Based on Asynchronous Message Passing
- In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14
, 2014
Cited by 11 (0 self)
We investigate a new approach to the design of distributed, shared-nothing RDF engines. Our engine, coined “TriAD”, combines join-ahead pruning via a novel form of RDF graph summarization with a locality-based, horizontal partitioning of RDF triples into a grid-like, distributed index structure. The multi-threaded and distributed execution of joins in TriAD is facilitated by an asynchronous Message Passing protocol which allows us to run multiple join operators along a query plan in a fully parallel, asynchronous fashion. We believe that our architecture provides a so far unique approach to join-ahead pruning in a distributed environment, as the more classical form of sideways information passing would not permit executing distributed joins in an asynchronous way. Our experiments over the LUBM, BTC and WSDTS benchmarks demonstrate that TriAD consistently outperforms centralized RDF engines by up to two orders of magnitude, while gaining a factor of more than three compared to the currently fastest distributed engines. To our knowledge, we are thus able to report the so far fastest query response times for the above benchmarks using a mid-range server and regular Ethernet setup.
Diversified Stress Testing of RDF Data Management Systems
Cited by 11 (1 self)
The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF data continue to be published across heterogeneous domains and integrated at Web-scale such as in the Linked Open Data (LOD) cloud, RDF data management systems are being exposed to queries that are far more diverse and workloads that are far more varied. The first contribution of our work is an in-depth experimental analysis that shows existing SPARQL benchmarks are not suitable for testing systems for diverse queries and varied workloads. To address these shortcomings, our second contribution is the Waterloo SPARQL Diversity Test Suite (WatDiv) that provides stress testing tools for RDF data management systems. Using WatDiv, we have been able to reveal issues with existing systems that went unnoticed in evaluations using earlier benchmarks. Specifically, our experiments with five popular RDF data management systems show that they cannot deliver good performance uniformly across workloads. For some queries, there can be as much as five orders of magnitude difference between the query execution time of the fastest and the slowest system, while the fastest system on one query may unexpectedly time out on another query. By performing a detailed analysis, we pinpoint these problems to specific types of queries and workloads.