Results 1 - 10 of 71
GPS: A Graph Processing System
"... GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system [MAB+ 11], with some useful additional functionality described in ..."
Abstract - Cited by 68 (3 self)
GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system [MAB+11], with some useful additional functionality described in the paper. In distributed graph processing systems like GPS and Pregel, graph partitioning is the problem of deciding which vertices of the graph are assigned to which compute nodes. In addition to presenting the GPS system itself, we describe how we have used GPS to study the effects of different graph partitioning schemes. We present our experiments on the performance of GPS under different static partitioning schemes—assigning vertices to workers “intelligently” before the computation starts—and with GPS’s dynamic repartitioning feature, which reassigns vertices to different compute nodes during the computation by observing their message-sending patterns.
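The partitioning ideas in this abstract can be made concrete with a small sketch. The snippet below is not GPS code; it only illustrates, under assumed names (`static_partition`, `maybe_reassign`, a 0.8 reassignment threshold), what hash-based static assignment and a message-pattern-driven dynamic reassignment rule might look like.

```python
# Minimal sketch (not GPS's actual code): static hash partitioning of vertices
# to workers, plus a simple dynamic-repartitioning rule driven by observed
# message destinations, as described informally in the abstract.

from collections import Counter

NUM_WORKERS = 4

def static_partition(vertex_id, num_workers=NUM_WORKERS):
    """Default 'unintelligent' scheme: assign a vertex by hashing its id."""
    return hash(vertex_id) % num_workers

def maybe_reassign(vertex_id, current_worker, sent_messages, threshold=0.8):
    """Hypothetical dynamic-repartitioning rule: if a large fraction of a
    vertex's messages go to one remote worker, move the vertex there."""
    if not sent_messages:
        return current_worker
    # Approximate where each destination lives by its static assignment.
    counts = Counter(static_partition(dst) for dst in sent_messages)
    top_worker, top_count = counts.most_common(1)[0]
    if top_worker != current_worker and top_count / len(sent_messages) >= threshold:
        return top_worker
    return current_worker

# Example: vertex 42 mostly talks to vertices hosted on another worker.
neighbors = [7, 11, 15, 19, 23]          # hypothetical message destinations
print(maybe_reassign(42, static_partition(42), neighbors))
```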
WTF: The who to follow service at twitter
"... Wtf (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in build ..."
Abstract - Cited by 31 (4 self)
Wtf (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for Wtf. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.
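As a rough illustration of the random-walk half of the recommendation approach mentioned above, the sketch below scores candidates by how often short walks from a user end at them over an in-memory adjacency list. It is a generic sketch, not Cassovary's API or the paper's SALSA-based algorithm; the function name and parameters are assumptions.

```python
# Minimal sketch of recommendation by random walks over an in-memory
# adjacency-list graph; illustrative only, not the paper's actual algorithm.

import random
from collections import Counter, defaultdict

def random_walk_recommend(graph, user, walks=1000, length=3, seed=0):
    """Count how often short walks from `user` end at other vertices and
    recommend the most frequently visited vertices the user does not follow."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(walks):
        v = user
        for _ in range(length):
            nbrs = graph.get(v)
            if not nbrs:
                break
            v = rng.choice(nbrs)
        if v != user:
            visits[v] += 1
    already = set(graph.get(user, []))
    return [v for v, _ in visits.most_common() if v not in already]

# Toy follow graph: user -> list of users they follow.
graph = defaultdict(list, {1: [2, 3], 2: [4], 3: [4, 5], 4: [5], 5: [1]})
print(random_walk_recommend(graph, 1)[:3])
```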
Towards effective partition management for large graphs
In SIGMOD, 2012
"... Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks, to de novo genome sequence assembly. Scalable processing of large graphs requires careful partitioning and distribution of graphs across clusters. In this pap ..."
Abstract - Cited by 29 (1 self)
Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks to de novo genome sequence assembly. Scalable processing of large graphs requires careful partitioning and distribution of graphs across clusters. In this paper, we investigate the problem of managing large-scale graphs in clusters and study access characteristics of local graph queries such as breadth-first search, random walk, and SPARQL queries, which are popular in real applications. These queries exhibit strong access locality, and therefore require specific data partitioning strategies. In this work, we propose a Self Evolving Distributed Graph Management Environment (Sedge) to minimize inter-machine communication during graph query processing across multiple machines. In order to improve query response time and throughput, Sedge introduces a two-level partition management architecture with complementary primary partitions and dynamic secondary partitions. These two kinds of partitions are able to adapt in real time to changes in the query workload. Sedge also includes a set of workload-analyzing algorithms whose time complexity is linear or sublinear in the graph size. Empirical results show that it significantly improves distributed graph processing on today’s commodity clusters.
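A minimal sketch of the two-level idea, assuming hypothetical class and method names: every vertex has a fixed primary partition, and vertex groups that repeatedly cause cross-partition traffic are replicated into a dynamically created secondary partition. Sedge's actual interfaces and replication policy are not described at this level in the abstract.

```python
# Minimal, illustrative two-level partition manager (names are hypothetical).

class TwoLevelPartitions:
    def __init__(self, num_primaries):
        self.num_primaries = num_primaries
        self.secondary = {}          # vertex -> secondary partition id
        self.access_count = {}       # vertex -> number of cross-partition hits

    def primary_of(self, v):
        return hash(v) % self.num_primaries

    def route(self, v):
        """Prefer a secondary partition that already hosts v, else its primary."""
        return self.secondary.get(v, self.primary_of(v))

    def record_cross_partition_access(self, vertices, hot_threshold=100):
        """If a group of vertices keeps causing cross-partition traffic,
        replicate the group into a new secondary partition."""
        for v in vertices:
            self.access_count[v] = self.access_count.get(v, 0) + 1
        if all(self.access_count[v] >= hot_threshold for v in vertices):
            new_id = self.num_primaries + len(set(self.secondary.values()))
            for v in vertices:
                self.secondary[v] = new_id

parts = TwoLevelPartitions(num_primaries=8)
print(parts.route("v1"), parts.route("v2"))
parts.record_cross_partition_access(["v1", "v2"], hot_threshold=1)
print(parts.route("v1"))   # v1 and v2 now share a secondary partition
```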
From "Think Like a Vertex " to "Think Like a Graph"
"... To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a “think like a vertex ” programmin ..."
Abstract - Cited by 25 (0 self)
To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions and employ a “think like a vertex” programming model to support iterative graph computation. This vertex-centric model is easy to program and has proved useful for many graph algorithms. However, this model hides the partitioning information from the users and thus prevents many algorithm-specific optimizations. This often results in longer execution time due to excessive network messages (e.g. in Pregel) or heavy scheduling overhead to ensure data consistency (e.g. in GraphLab). To address this limitation, we propose a new “think like a graph” programming paradigm. Under this graph-centric model, the partition structure is opened up to the users and can be utilized so that communication within a partition can bypass the heavy message passing or scheduling machinery. We implemented this model in a new system, called Giraph++, based on Apache Giraph, an open-source implementation of Pregel. We explore the applicability of the graph-centric model to three categories of graph algorithms, and demonstrate its flexibility and superior performance, especially on well-partitioned data. For example, on a web graph with 118 million vertices and 855 million edges, the graph-centric version of the connected component detection algorithm runs 63X faster and uses 204X fewer network messages than its vertex-centric counterpart.
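The performance claim is easiest to see on connected components. The sketch below is an illustration of the graph-centric style rather than Giraph++ code: labels converge sequentially inside each partition, and only the labels on the single boundary edge are exchanged between supersteps.

```python
# Illustrative graph-centric connected components on a toy 6-vertex graph
# split into two partitions; not Giraph++ code.

def local_min_label(partition_edges, labels):
    """Run min-label propagation to convergence within one partition."""
    changed = True
    while changed:
        changed = False
        for u, v in partition_edges:
            low = min(labels[u], labels[v])
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low
                changed = True
    return labels

p1_edges = [(0, 1), (1, 2)]
p2_edges = [(3, 4), (4, 5)]
labels = {v: v for v in range(6)}
boundary = [(2, 3)]                      # the only cross-partition edge

# One "superstep": local convergence, then exchange boundary labels, repeat.
local_min_label(p1_edges, labels)
local_min_label(p2_edges, labels)
for u, v in boundary:
    labels[u] = labels[v] = min(labels[u], labels[v])
local_min_label(p1_edges, labels)
local_min_label(p2_edges, labels)
print(labels)   # all vertices end with label 0
```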
Managing large dynamic graphs efficiently
2012
"... There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, that can support fl ..."
Abstract - Cited by 21 (1 self)
There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, and that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it. The key challenge in …
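As a rough sketch of the kind of system the abstract outlines (the class and method names here are hypothetical, not the paper's design), an in-memory distributed graph can be modelled as vertex-hash-partitioned shards with node-centric lookups routed to the owning shard:

```python
# Illustrative in-memory, hash-partitioned graph store; not the paper's system.

class GraphShard:
    def __init__(self):
        self.adj = {}                       # vertex -> set of neighbours

    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)

    def neighbours(self, u):
        return self.adj.get(u, set())

class DistributedGraph:
    def __init__(self, num_shards):
        self.shards = [GraphShard() for _ in range(num_shards)]

    def _shard(self, u):
        return self.shards[hash(u) % len(self.shards)]

    def add_edge(self, u, v):
        # Each out-edge is stored with the shard that owns its source vertex.
        self._shard(u).add_edge(u, v)

    def neighbours(self, u):
        return self._shard(u).neighbours(u)

g = DistributedGraph(num_shards=4)
g.add_edge("a", "b"); g.add_edge("a", "c")
print(g.neighbours("a"))
```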
TripleBit: a Fast and Compact System for Large Scale RDF Data
"... The volume of RDF data continues to grow over the past decade and many known RDF datasets have billions of triples. A grant challenge of managing this huge RDF data is how to access this big RDF data efficiently. A popular approach to addressing the problem is to build a full set of permutations of ..."
Abstract - Cited by 15 (2 self)
The volume of RDF data has continued to grow over the past decade, and many known RDF datasets have billions of triples. A grand challenge of managing this huge RDF data is how to access it efficiently. A popular approach to addressing the problem is to build a full set of permutations of (S, P, O) indexes. Although this approach has been shown to accelerate joins by orders of magnitude, the large space overhead limits its scalability and makes it heavyweight. In this paper, we present TripleBit, a fast and compact system for storing and accessing RDF data. The design of TripleBit has three salient features. First, the compact design of TripleBit reduces both the size of stored RDF data and the size of its indexes. Second, TripleBit introduces two auxiliary index structures, the ID-Chunk bit matrix and the ID-Predicate bit matrix, to minimize the cost of index selection during query evaluation. Third, its query processor dynamically generates an optimal execution ordering for join queries, leading to fast query execution and effective reduction of the size of intermediate results. Our experiments show that TripleBit outperforms RDF-3X, MonetDB, and BitMat on LUBM, UniProt, and BTC 2012 benchmark queries, and it offers orders of magnitude performance improvement for some complex join queries.
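For context on the baseline the abstract refers to, the sketch below implements the "full set of permutations of (S, P, O) indexes" approach: six sorted orderings, each answering triple patterns whose bound components form a prefix. This is the space-heavy design TripleBit aims to slim down, not TripleBit's own layout.

```python
# Illustrative six-permutation triple indexing with prefix range scans.

from bisect import bisect_left

ORDERS = ["spo", "sop", "pso", "pos", "osp", "ops"]

def build_indexes(triples):
    """triples: iterable of (s, p, o). Returns one sorted list per ordering."""
    idx = {}
    for order in ORDERS:
        key = lambda t, order=order: tuple(t["spo".index(c)] for c in order)
        idx[order] = sorted((key(t), t) for t in triples)
    return idx

def scan(indexes, order, prefix):
    """Return all triples whose reordered key starts with `prefix`."""
    rows = indexes[order]
    keys = [k for k, _ in rows]
    i = bisect_left(keys, prefix)
    out = []
    while i < len(rows) and rows[i][0][:len(prefix)] == prefix:
        out.append(rows[i][1])
        i += 1
    return out

triples = [("alice", "knows", "bob"), ("alice", "likes", "rdf"),
           ("bob", "knows", "carol")]
idx = build_indexes(triples)
print(scan(idx, "pos", ("knows",)))      # all ?s knows ?o triples
```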
Making the Most of your Triple Store: Query Answering in OWL 2 Using an RL Reasoner
"... Triple stores implementing the RL profile of OWL 2 are becoming increasingly popular. In contrast to unrestricted OWL 2, the RL profile is known to enjoy favourable computational properties for query answering, and state-of-the-art RL reasoners such as OWLim and Oracle’s RDF Semantic Graph have prov ..."
Abstract - Cited by 9 (3 self)
Triple stores implementing the RL profile of OWL 2 are becoming increasingly popular. In contrast to unrestricted OWL 2, the RL profile is known to enjoy favourable computational properties for query answering, and state-of-the-art RL reasoners such as OWLim and Oracle’s RDF Semantic Graph have proved extremely successful in industry-scale applications. The expressive restrictions imposed by OWL 2 RL may, however, be problematic for some applications. In this paper, we propose novel techniques that allow us (in many cases) to compute exact query answers using an off-the-shelf RL reasoner, even when the ontology is outside the RL profile. Furthermore, in the cases where exact query answers cannot be computed, we can still compute both lower and upper bounds on the exact answers. These bounds allow us to estimate the degree of incompleteness of the RL reasoner on the given query, and to optimise the computation of exact answers using a fully-fledged OWL 2 reasoner. A preliminary evaluation using Oracle’s RDF Semantic Graph has shown very promising results with respect to both scalability and tightness of the bounds.
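Conceptually, the bounding scheme can be sketched as follows, with `rl_answer`, `rl_subset`, `rl_overapprox`, and `full_owl2_check` as hypothetical placeholders rather than any real reasoner API: a lower bound comes from the RL-expressible part of the ontology, an upper bound from an RL over-approximation, and only the gap between them is sent to a full OWL 2 reasoner.

```python
# Conceptual sketch of lower/upper-bound query answering; placeholders only.

def bounded_answers(query, ontology, data,
                    rl_answer, rl_subset, rl_overapprox, full_owl2_check):
    lower = rl_answer(query, rl_subset(ontology), data)
    upper = rl_answer(query, rl_overapprox(ontology), data)
    if lower == upper:
        return lower                      # exact answers from the RL reasoner alone
    # Only candidates in the gap need the expensive full reasoner.
    gap = upper - lower
    confirmed = {a for a in gap if full_owl2_check(query, ontology, data, a)}
    return lower | confirmed

# Toy usage with dummy placeholders (answer sets precomputed by hand):
lower_ans, upper_ans = {"a"}, {"a", "b"}
result = bounded_answers(
    "q", None, None,
    rl_answer=lambda q, o, d: lower_ans if o == "subset" else upper_ans,
    rl_subset=lambda o: "subset",
    rl_overapprox=lambda o: "overapprox",
    full_owl2_check=lambda q, o, d, a: a == "b",
)
print(result)   # {'a', 'b'} (order may vary)
```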
Rya: A Scalable RDF Triple Store for the Clouds
"... Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node ha ..."
Abstract - Cited by 8 (1 self)
Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern-day data. In this paper, we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.
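The abstract does not spell out Rya's exact encoding, but a common way to index triples over a sorted key-value store such as Accumulo is to keep several row-key orderings and pick the one whose bound components form a contiguous prefix. The sketch below assumes a simple three-index layout (SPO, POS, OSP) and a hypothetical key format purely for illustration.

```python
# Illustrative triple indexing over a sorted key-value store; the separator
# and key format are assumptions, not Rya's actual encoding.

SEP = "\x00"   # hypothetical separator between key components

def index_rows(s, p, o):
    """Row keys for one triple in three hypothetical index tables."""
    return {
        "spo": SEP.join((s, p, o)),
        "pos": SEP.join((p, o, s)),
        "osp": SEP.join((o, s, p)),
    }

def choose_index(s=None, p=None, o=None):
    """Pick the ordering whose bound components form a contiguous prefix."""
    if s is not None:
        if p is not None:
            return "spo", [s, p] + ([o] if o is not None else [])
        if o is not None:
            return "osp", [o, s]
        return "spo", [s]
    if p is not None:
        return "pos", [p] + ([o] if o is not None else [])
    if o is not None:
        return "osp", [o]
    return "spo", []

print(index_rows("alice", "knows", "bob"))
print(choose_index(p="knows"))            # ('pos', ['knows'])
```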
Efficient distributed query processing for autonomous RDF databases
In EDBT, 2012
"... The inherent flexibility of the RDF data model has led to its notable adoption in many domains, especially in the area of life-sciences. Some of these domains have an emerging need to access data integrated from various distributed sources of information. It is not always possible to implement this ..."
Abstract - Cited by 6 (0 self)
The inherent flexibility of the RDF data model has led to its notable adoption in many domains, especially in the area of life sciences. Some of these domains have an emerging need to access data integrated from various distributed sources of information. It is not always possible to implement this by simply loading all data into one central RDF store. For example, in the context of inter-institutional collaboration for drug development and clinical research, participants often want to maintain control over their local databases. Alternatively, distributed query processing techniques can be utilized to evaluate queries by accessing the remote data sources only on demand and in conformance with local authorization models. In this paper we present an efficient approach to distributed query processing for large autonomous RDF databases. The groundwork is laid by a comprehensive RDF-specific schema- and instance-level synopsis. We present an optimizer that is able to utilize this synopsis to generate compact execution plans by precisely determining, at compile time, those sources that are relevant to a query. Furthermore, we present a tightly integrated query engine that is able to further reduce the volume of intermediate results at run time. An extensive evaluation shows that our approach improves query execution times by up to two orders of magnitude and reduces transferred data volumes by up to three orders of magnitude compared to a naïve implementation.
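A deliberately simplified stand-in for the paper's schema- and instance-level synopsis: summarize each source by the set of predicates it holds and prune, at compile time, the sources that cannot match a triple pattern. The real synopsis is richer; the function and data below are assumptions for illustration.

```python
# Illustrative compile-time source selection using per-source predicate sets.

def relevant_sources(triple_pattern, synopses):
    """triple_pattern: (s, p, o) with None for variables.
    synopses: {source_name: set of predicates present at that source}."""
    _, p, _ = triple_pattern
    if p is None:
        return set(synopses)              # unbound predicate: cannot prune
    return {src for src, preds in synopses.items() if p in preds}

synopses = {
    "clinical_db": {"hasDiagnosis", "treatedWith"},
    "drug_db": {"interactsWith", "treatedWith"},
}
print(relevant_sources((None, "hasDiagnosis", None), synopses))   # {'clinical_db'}
print(relevant_sources((None, "treatedWith", None), synopses))    # both sources
```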
Cascading Map-Side Joins over HBase for Scalable Join Processing
"... Abstract. One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. ..."
Abstract - Cited by 6 (1 self)
One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques have become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join), which combines the scalable indexing capabilities of NoSQL data stores like HBase, which on their own lack an adequate distributed processing layer, with MapReduce, which in turn does not provide appropriate storage structures for efficient large-scale join processing. While retaining the flexibility of commonly used reduce-side joins, we leverage the effectiveness of map-side joins without any changes to the underlying framework. We demonstrate the significant benefits of MAPSIN joins for the processing of SPARQL basic graph patterns on large RDF datasets in an evaluation with the LUBM and SP²Bench benchmarks. For selective queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.
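A minimal sketch of the map-side index nested loop idea, using a mock in-memory index in place of HBase (the class and method names are assumptions, not the HBase API): each map call probes the index with the bindings of the first triple pattern, so the join completes without a reduce phase or shuffle.

```python
# Illustrative map-side index nested loop join with a mock index lookup.

class MockIndex:
    """Stand-in for an HBase-style lookup keyed by (subject, predicate)."""
    def __init__(self, triples):
        self.by_sp = {}
        for s, p, o in triples:
            self.by_sp.setdefault((s, p), []).append(o)

    def lookup(self, s, p):
        return self.by_sp.get((s, p), [])

def mapsin_map(binding, index, predicate2):
    """One map call: join a binding for (?x p1 ?y) with pattern (?y p2 ?z)."""
    x, y = binding
    return [(x, y, z) for z in index.lookup(y, predicate2)]

triples = [("bob", "worksAt", "acme"), ("acme", "locatedIn", "berlin")]
index = MockIndex(triples)
# Bindings for the first pattern (?x worksAt ?y), then a map-side probe for
# the second pattern (?y locatedIn ?z):
print(mapsin_map(("bob", "acme"), index, "locatedIn"))
# -> [('bob', 'acme', 'berlin')]
```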