Results 1–10 of 29
GPS: A Graph Processing System
Abstract

Cited by 68 (3 self)
GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system [MAB+11], with some useful additional functionality described in the paper. In distributed graph processing systems like GPS and Pregel, graph partitioning is the problem of deciding which vertices of the graph are assigned to which compute nodes. In addition to presenting the GPS system itself, we describe how we have used GPS to study the effects of different graph partitioning schemes. We present our experiments on the performance of GPS under different static partitioning schemes—assigning vertices to workers “intelligently” before the computation starts—and with GPS’s dynamic repartitioning feature, which reassigns vertices to different compute nodes during the computation by observing their message sending patterns.
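The static partitioning question the abstract raises can be made concrete with a small sketch. The function names here are illustrative, not GPS's actual API; Pregel-style systems default to hashing vertex IDs across workers, while an "intelligent" static scheme might instead assign contiguous ID ranges to exploit locality.

```python
def hash_partition(vertex_id: int, num_workers: int) -> int:
    """Default scheme: spread vertices uniformly by hashing the vertex ID."""
    return hash(vertex_id) % num_workers

def range_partition(vertex_id: int, num_vertices: int, num_workers: int) -> int:
    """A static alternative: contiguous ID ranges, which can preserve
    locality when IDs correlate with structure (e.g. crawl order)."""
    per_worker = -(-num_vertices // num_workers)  # ceiling division
    return vertex_id // per_worker

# 10 vertices on 3 workers: ranges give [0, 0, 0, 0, 1, 1, 1, 1, 2, 2].
parts = [range_partition(v, 10, 3) for v in range(10)]
```

Dynamic repartitioning, as described in the abstract, would then adjust such an initial assignment at runtime based on observed message traffic.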
A Distributed Graph Engine for Web Scale RDF Data
Abstract

Cited by 35 (1 self)
Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web-scale RDF data effectively. Furthermore, many useful and general-purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web-scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. Trinity.RDF achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real-life, web-scale RDF data to demonstrate the effectiveness of our approach.
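The "native graph form" the abstract advocates can be illustrated with a tiny sketch (hypothetical data, not Trinity.RDF's actual storage layout): each subject keeps an adjacency list of (predicate, object) pairs, so operations like reachability become plain traversals rather than self-joins over a triple table.

```python
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "acme"),
]

# Native graph form: each subject holds an adjacency list of
# (predicate, object) pairs instead of rows in a triple table.
adj = defaultdict(list)
for s, p, o in triples:
    adj[s].append((p, o))

def reachable(start):
    """Entities reachable from `start` by following any predicate."""
    seen, stack = set(), [start]
    while stack:
        for _, o in adj.get(stack.pop(), []):
            if o not in seen:
                seen.add(o)
                stack.append(o)
    return seen
```

With this layout, `reachable("alice")` walks edges directly instead of repeatedly joining a subject column against an object column.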
From "Think Like a Vertex" to "Think Like a Graph"
Abstract

Cited by 25 (0 self)
To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions and employ a “think like a vertex” programming model to support iterative graph computation. This vertex-centric model is easy to program and has proved useful for many graph algorithms. However, the model hides the partitioning information from the users and thus prevents many algorithm-specific optimizations. This often results in longer execution time due to excessive network messages (e.g. in Pregel) or heavy scheduling overhead to ensure data consistency (e.g. in GraphLab). To address this limitation, we propose a new “think like a graph” programming paradigm. Under this graph-centric model, the partition structure is opened up to the users and can be utilized so that communication within a partition can bypass the heavy message-passing or scheduling machinery. We implemented this model in a new system, called Giraph++, based on Apache Giraph, an open-source implementation of Pregel. We explore the applicability of the graph-centric model to three categories of graph algorithms, and demonstrate its flexibility and superior performance, especially on well-partitioned data. For example, on a web graph with 118 million vertices and 855 million edges, the graph-centric version of the connected component detection algorithm runs 63X faster and uses 204X fewer network messages than its vertex-centric counterpart.
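The vertex-centric model the abstract contrasts can be sketched for connected components via minimum-label propagation (a sequential toy under my own naming, not Giraph's API). In the graph-centric model, each partition would instead run this same loop to a local fixed point before exchanging boundary labels, which is where the message savings come from.

```python
def vertex_centric_cc(graph):
    """Superstep loop: every vertex repeatedly adopts the minimum label among
    itself and its neighbours; in a distributed run, each adoption that
    crosses a partition boundary costs a network message."""
    label = {v: v for v in graph}
    changed = True
    while changed:
        changed = False
        for v, nbrs in graph.items():
            m = min([label[v]] + [label[u] for u in nbrs])
            if m < label[v]:
                label[v], changed = m, True
    return label

# Two components: {1, 2, 3} and {4, 5}.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
```

Here `vertex_centric_cc(graph)` converges to labels `1` and `4` for the two components.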
Scalable Maximum Clique Computation Using MapReduce
Abstract

Cited by 9 (1 self)
We present a scalable and fault-tolerant solution for the maximum clique problem based on the MapReduce framework. The key contribution that enables us to effectively use MapReduce is a recursive partitioning method that partitions the graph into several subgraphs of similar size. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch-and-bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant.
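The per-partition computation the abstract mentions can be sketched as a plain branch-and-bound maximum-clique search (a sequential toy, not the paper's MapReduce implementation): each partition would run a search like this independently, and the largest result wins.

```python
def max_clique(adj):
    """Branch-and-bound maximum clique: prune any branch whose current clique
    plus remaining candidates cannot beat the best clique found so far."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) + len(candidates) <= len(best):
            return  # bound: this branch cannot improve on `best`
        if not candidates:
            best = clique[:]  # strictly longer, by the bound check above
            return
        for i, v in enumerate(candidates):
            # extend by v; keep only candidates adjacent to v
            expand(clique + [v],
                   [u for u in candidates[i + 1:] if u in adj[v]])

    expand([], sorted(adj))
    return best

# Maximum clique here is {a, b, c}.
adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c'},
       'c': {'a', 'b', 'd'}, 'd': {'a', 'c'}}
```

The bound is what makes similar-size partitions matter: one oversized partition dominates the running time of the whole job.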
Pregelix: Big(ger) Graph Analytics on A Dataflow Engine
Abstract

Cited by 6 (0 self)
There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message-passing designs that many graph processing systems follow. Pregelix is a new open-source distributed graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open-source systems (e.g., we have seen up to 15× speedup compared to Apache Giraph and up to 35× speedup compared to distributed GraphLab), and more effective use of available machine resources to support Big(ger) Graph Analytics.
High throughput indexing for large-scale semantic web data
 in Proc. 30th Annual ACM Symp. Applied Computing, 2015
Abstract

Cited by 2 (2 self)
Distributed RDF data management systems are becoming increasingly important with the growth of the Semantic Web. Several such systems have been proposed; however, their indexing methods hit performance bottlenecks in either data loading or querying when processing large amounts of data. In this work, we propose a high-throughput index to enable rapid analysis of large datasets. We adopt a hybrid structure that combines the loading speed of similar-size-based methods with the execution speed of graph-based approaches, using dynamic data repartitioning over query workloads. We introduce the design and detailed implementation of our method. Experimental results show that the proposed index can indeed vastly improve loading speeds while remaining competitive in terms of query performance. The method can therefore be considered a good choice for RDF analysis in large-scale distributed scenarios.
Balanced graph edge partition
 in KDD, 2014
Abstract

Cited by 2 (1 self)
Balanced edge partition has emerged as a new approach to partitioning input graph data for the purpose of scaling out parallel computations, and is of interest for several modern data analytics platforms, including platforms for iterative computations, machine learning problems, and graph databases. This new approach stands in stark contrast to the traditional approach of balanced vertex partition, where, for a given number of partitions, the problem is to minimize the number of edges cut subject to balancing the vertex cardinality of the partitions. In this paper, we first characterize the expected costs of vertex and edge partitions with and without aggregation of messages, for the commonly deployed policy of placing a vertex or an edge uniformly at random into one of the partitions. We then obtain the first approximation algorithms for the balanced edge-partition problem, which for the case of no aggregation match the best known approximation ratio for the balanced vertex-partition problem, and we show that this continues to hold for the case with aggregation up to a factor equal to the maximum in-degree of a vertex. We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition and demonstrates the efficiency of natural greedy online assignment for the balanced edge-partition problem with and without aggregation.
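A natural greedy online assignment of the kind the abstract evaluates can be sketched as follows (my paraphrase, not the paper's exact rule): place each arriving edge on the least-loaded partition among those already holding a replica of one of its endpoints, falling back to the globally least-loaded partition.

```python
from collections import defaultdict

def greedy_edge_partition(edges, k):
    """Greedy online edge placement: favour partitions that already hold a
    replica of an endpoint, breaking ties toward the lightest load. A real
    algorithm would also enforce a hard balance cap on `load`."""
    load = [0] * k                 # edges per partition
    replicas = defaultdict(set)    # vertex -> partitions holding a copy
    assignment = {}
    for u, v in edges:
        cand = replicas[u] | replicas[v]
        pool = cand if cand else range(k)
        p = min(pool, key=lambda i: load[i])
        assignment[(u, v)] = p
        load[p] += 1
        replicas[u].add(p)
        replicas[v].add(p)
    return assignment, load
```

On a path `(1,2),(2,3),(3,4)` plus an isolated edge `(5,6)` with `k=2`, the rule keeps the path on one partition (no vertex is replicated) and sends the isolated edge to the other.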
Scalable SPARQL Querying using Path Partitioning
Abstract

Cited by 2 (0 self)
The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster, since distributed joins over massive amounts of data then become necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that has to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced, low-redundancy data partitioning scheme that avoids or largely reduces the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.
xDGP: A Dynamic Graph Processing System with Adaptive Partitioning
 in arXiv, 2013
Abstract

Cited by 2 (0 self)
Many real-world systems, such as social networks, rely on efficiently mining large graphs with hundreds of millions of vertices and edges. This volume of information requires partitioning the graph across multiple nodes in a distributed system. Partitioning has a deep effect on performance, as traversing an edge cut between partitions incurs a significant penalty due to the cost of communication. Several systems in the literature have therefore attempted to improve computational performance by enhancing graph partitioning, but they do not support another characteristic of real-world graphs: graphs are inherently dynamic, their topology evolves continuously, and consequently the optimal partitioning also changes over time. In this work, we present the first system that dynamically repartitions massive graphs to adapt to structural changes. The system optimises graph partitioning to prevent performance degradation without using data replication. It adopts an iterative vertex migration algorithm that relies on local information only, making complex coordination unnecessary. We show how the improvement in graph partitioning reduces execution time by over 50%, while adapting the partitioning to a large number of changes to the graph in three real-world scenarios.
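The local-information migration rule the abstract describes can be sketched as follows (a simplification with hypothetical names, not the system's actual algorithm): in each round, a vertex moves to the partition holding a strict plurality of its neighbours, consulting nothing beyond its own neighbourhood.

```python
from collections import Counter

def migrate_step(part, graph):
    """One synchronous round of local vertex migration: move a vertex only
    when some partition holds strictly more of its neighbours than its
    current partition does."""
    new_part = {}
    for v, nbrs in graph.items():
        counts = Counter(part[u] for u in nbrs)
        target, votes = counts.most_common(1)[0]
        new_part[v] = target if votes > counts[part[v]] else part[v]
    return new_part

# Triangle {1,2,3} with a pendant vertex 4; vertex 3 starts misplaced.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
part = {1: 0, 2: 0, 3: 1, 4: 1}
```

In the first round vertex 3 migrates to partition 0 to join the triangle; vertex 4 would follow in a later round, illustrating the iterative convergence the abstract mentions.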
Systems for Big-Graphs
Abstract

Cited by 2 (0 self)
Graphs have become increasingly important for representing highly interconnected structures and schemaless data, including the World Wide Web, social networks, knowledge graphs, genome and scientific databases, and medical and government records. The massive scale of graph data easily overwhelms the main memory and computation resources of commodity servers. In these cases, achieving low latency and high throughput requires partitioning the graph and processing the graph data in parallel across a cluster of servers. However, the software and hardware advances that have worked well for developing parallel databases and scientific applications are not necessarily effective for big-graph problems. Graph processing poses interesting system challenges: graphs represent relationships, which are usually irregular and unstructured, and therefore the computation and data access patterns have poor locality. Hence, the last few years have seen an unprecedented interest in building systems for big graphs by various communities, including databases, systems, semantic web, machine learning, and operations research. In this tutorial, we discuss the design of the emerging systems for processing big graphs, key features of distributed graph algorithms, and graph partitioning and workload-balancing techniques. We emphasize the current challenges and highlight some future research directions.