### ITERATIVE GRAPH COMPUTATION IN THE BIG DATA ERA

2015

Iterative graph computation is a key component in many real-world applications, as the graph data model naturally captures complex relationships between entities. The big data era has brought several new challenges to this classic computation model. In this dissertation we describe three projects that address different aspects of these challenges. First, as data volumes grow, it is increasingly important to scale iterative graph computation to large graphs. We observe that an important class of graph applications, those performing little computation per vertex, scales poorly when running on multiple cores. These computationally light applications are limited by memory access rates and cannot fully exploit the benefits of multiple cores. We propose a new block-oriented computation model that creates two levels of iterative computation: on each processor, a small block of highly connected vertices is iterated locally, while the blocks themselves are updated iteratively at the global level. We show that block-oriented execution reduces the communication-to-computation ratio and significantly improves performance.
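The two-level scheme described above can be sketched in a few lines. The graph, the block assignment, and the use of PageRank as the per-vertex update are all illustrative assumptions; the dissertation's actual system is not shown in this abstract.

```python
def block_pagerank(edges, blocks, damping=0.85, global_iters=10, local_iters=5):
    """Toy two-level iteration: each block of highly connected vertices is
    iterated locally several times per global round, so most updates reuse
    block-local data instead of touching the whole graph.
    edges: {vertex: [out-neighbors]}; blocks: list of vertex sets."""
    vertices = set(edges)
    for nbrs in edges.values():
        vertices.update(nbrs)
    rank = {v: 1.0 / len(vertices) for v in vertices}
    # Precompute in-edges and out-degrees for the pull-style update.
    in_edges = {v: [] for v in vertices}
    for u, nbrs in edges.items():
        for v in nbrs:
            in_edges[v].append(u)
    out_deg = {v: len(edges.get(v, [])) for v in vertices}

    for _ in range(global_iters):           # global, inter-block iteration
        for block in blocks:                # one block per "processor"
            for _ in range(local_iters):    # cheap local iterations
                for v in block:
                    incoming = sum(rank[u] / out_deg[u]
                                   for u in in_edges[v] if out_deg[u])
                    rank[v] = (1 - damping) / len(vertices) + damping * incoming
    return rank
```

Because the inner loop touches only one block's vertices, the communication-to-computation ratio drops, which is the effect the dissertation measures.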

### Efficient Extraction of High Centrality Vertices in Distributed Graphs

Betweenness centrality (BC) is an important measure for identifying high-value or critical vertices in graphs, in a variety of domains such as communication networks, road networks, and social graphs. However, calculating betweenness values is prohibitively expensive and, more often than not, domain experts are interested only in the vertices with the highest centrality values. In this paper, we first propose a partition-centric algorithm (MS-BC) to calculate BC for a large distributed graph that optimizes resource utilization and improves overall performance. Further, we extend the notion of approximate BC by pruning the graph, removing a subset of edges and vertices that contribute the least to the betweenness values of other vertices (MSL-BC), which further improves the runtime performance. We evaluate the proposed algorithms using a mix of real-world and synthetic graphs on an HPC cluster and analyze their strengths and weaknesses. The experimental results show an improvement in performance of up to 12x for large sparse graphs compared to the state of the art, and at the same time highlight the need for better partitioning methods to enable a balanced workload across partitions for unbalanced graphs such as small-world or power-law graphs.
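For reference, the sequential baseline these distributed variants build on is Brandes' algorithm; a minimal unweighted version is sketched below. This is only the textbook computation, not the paper's MS-BC or MSL-BC.

```python
from collections import deque

def brandes_bc(adj):
    """Exact betweenness centrality (Brandes' algorithm) on an unweighted,
    undirected graph given as {vertex: [neighbors]}."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        stack, pred = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0); sigma[s] = 1
        dist = dict.fromkeys(adj, -1); dist[s] = 0
        q = deque([s])
        while q:
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; pred[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected shortest path is counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}
```

The outer loop over sources is what MS-BC distributes across partitions, and MSL-BC prunes `adj` before running it.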

### GRAPHiQL: A Graph Intuitive Query Language for Relational Databases

Graph analytics is becoming increasingly popular, driving many important business applications from social network analysis to machine learning. Since most graph data is collected in a relational database, it seems natural to attempt to perform graph analytics within the relational environment. However, SQL, the query language for relational databases, makes it difficult to express graph analytics operations. This is because SQL requires programmers to think in terms of tables and joins, rather than the more natural representation of graphs as collections of nodes and edges. As a result, even relatively simple graph operations can require very complex SQL queries. In this paper, we present GRAPHiQL, an intuitive query language for graph analytics, which allows developers to reason in terms of nodes and edges. GRAPHiQL provides key graph constructs such as looping, recursion, and neighborhood operations. At runtime, GRAPHiQL compiles graph programs into efficient SQL queries that can run on any relational database. We demonstrate the applicability of GRAPHiQL on several applications and compare the performance of GRAPHiQL queries with those of Apache Giraph (a popular 'vertex-centric' graph programming framework).

### Departamento de Informática e Estatística, Universidade Federal de Santa Catarina (UFSC)

Many large-scale computing problems can be modeled as graphs; example areas include the web, social networks, and biological systems. The increasing size of datasets has led to the creation of various distributed large-scale graph processing systems, e.g., Google's Pregel. Although these systems tolerate crash faults, the literature suggests they are vulnerable to a wider range of accidental arbitrary faults (also called Byzantine faults). In this paper we present an algorithm and a prototype of a distributed large-scale graph processing system that can tolerate arbitrary faults. The prototype is based on GPS, an open-source implementation of Pregel. Experimental results for the prototype on Amazon AWS are presented, showing that it uses only twice the resources of the original implementation, instead of the 3-4 times usual in Byzantine fault-tolerant systems. This cost may be acceptable for critical applications that require this level of fault tolerance.
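The classic way Byzantine faults are masked, and the reason for the usual 3-4x cost the paper improves on, is replication with majority voting: run the same step on several replicas and accept the value most of them agree on. A minimal sketch (not the paper's GPS-based scheme, which is cheaper):

```python
from collections import Counter

def vote(results):
    """Accept the majority value among replica results; with 2f+1 replicas
    this masks up to f arbitrarily faulty ones. Raises if no strict majority
    exists, i.e., too many replicas disagree."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority: too many faulty replicas")
    return value
```

Needing 2f+1 (or more) replicas per computation is what makes naive BFT cost 3-4x; the paper's contribution is getting the overhead down to roughly 2x.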

### Arabesque: A System for Distributed Graph Mining

Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms are not a good match for distributed graph mining problems, such as finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important in areas such as social networks, the semantic web, and bioinformatics. In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring
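The explore-and-extend pattern that graph mining automates can be shown with a naive, single-process enumerator of connected vertex-induced subgraphs; Arabesque distributes and heavily optimizes this pattern, none of which appears in the toy below.

```python
def connected_subgraphs(adj, max_size):
    """Enumerate connected vertex-induced subgraphs up to max_size by
    repeatedly extending smaller subgraphs with an adjacent vertex.
    adj: {vertex: [neighbors]} for an undirected graph."""
    frontier = {frozenset([v]) for v in adj}
    found = set(frontier)
    for _ in range(max_size - 1):
        nxt = set()
        for sub in frontier:
            # Extend by any vertex adjacent to the current subgraph.
            for v in sub:
                for w in adj[v]:
                    if w not in sub:
                        nxt.add(sub | {w})
        frontier = nxt - found
        found |= frontier
    return found
```

Even on small graphs the number of subgraphs explodes combinatorially, which is why a dedicated distributed platform is needed rather than MapReduce or Pregel.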

### Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstractions

Iterative computation on large graphs has challenged systems research from two aspects: (1) how to conduct high-performance parallel processing for both in-memory and out-of-core graphs; and (2) how to handle large graphs that exceed the resource bounds of traditional systems through resource-aware graph partitioning, so that it is feasible to run large-scale graph analysis on a single PC. This paper presents GraphLego, a resource-adaptive graph processing system with multi-level programmable graph parallel abstractions. GraphLego is novel in three aspects: (1) we argue that vertex-centric and edge-centric graph partitioning are ineffective for parallel processing of large graphs, and we introduce three alternative graph parallel abstractions that enable a large graph to be partitioned at the granularity of subgraphs through slice-, strip-, and dice-based partitioning; (2) we use a dice-based data placement algorithm to store a large graph on disk, minimizing non-sequential disk access and enabling more structured in-memory access; and (3) we dynamically determine the right level of graph parallel abstraction to maximize sequential access and minimize random access. GraphLego runs efficiently on computers with diverse resource capacities and adapts to the different memory requirements of real-world graphs of varying complexity. Extensive experiments show the competitiveness of GraphLego against existing representative graph processing systems such as GraphChi, GraphLab, and X-Stream.
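The flavor of dice-style partitioning, bucketing edges so that both endpoints of every edge in a bucket fall in bounded vertex ranges, can be sketched as a grid over (source, destination). GraphLego's real slices, strips, and dices are defined over a richer multi-dimensional layout; this toy (assuming integer vertex ids) only illustrates the grid idea.

```python
def dice_partition(edges, p):
    """Bucket each edge by (source id mod p, destination id mod p), so each
    bucket is a subgraph whose source and destination vertex sets are both
    restricted, enabling more structured, sequential access per bucket."""
    grid = {(i, j): [] for i in range(p) for j in range(p)}
    for u, v in edges:
        grid[(u % p, v % p)].append((u, v))
    return grid
```

Each bucket can then be streamed from disk independently, which is what minimizes non-sequential access.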

### Scaling Iterative Graph Computations with GraphMap

In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into memory, and most distributed approaches to iterative graph computation do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) it distinguishes data that is mutable during iterative computations from data that is read-only in all iterations, to maximize sequential access and minimize random access; (2) it employs a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement; and (3) it provides a suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
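The first feature, separating immutable topology from the small mutable per-vertex state, can be sketched with a CSR layout: the read-only arrays could live on secondary storage (e.g., memory-mapped), while only the value array changes each iteration. Names and layout here are illustrative, not GraphMap's actual format.

```python
import array

class SeparatedGraph:
    def __init__(self, n, edges):
        # Read-only in all iterations: CSR offsets + neighbor ids.
        counts = [0] * n
        for u, _ in edges:
            counts[u] += 1
        self.offsets = array.array("l", [0] * (n + 1))
        for i in range(n):
            self.offsets[i + 1] = self.offsets[i] + counts[i]
        self.nbrs = array.array("l", [0] * len(edges))
        fill = list(self.offsets[:n])
        for u, v in edges:
            self.nbrs[fill[u]] = v
            fill[u] += 1
        # Mutable across iterations: one value per vertex, kept in memory.
        self.value = array.array("d", [0.0] * n)

    def neighbors(self, u):
        return self.nbrs[self.offsets[u]:self.offsets[u + 1]]
```

Because the topology arrays are never written, scanning them is purely sequential I/O, while random writes are confined to the small `value` array.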

### Distributed Programming over Time-series Graphs

Graphs are a key form of big data, and performing scalable analytics over them is invaluable to many domains. There is an emerging class of interconnected data that accumulates or varies over time, and on which novel algorithms, operating both over the network structure and across the time-variant attribute values, are necessary. We formalize the notion of time-series graphs and propose a Temporally Iterative BSP programming abstraction for developing algorithms on such datasets using several design patterns. Our abstractions leverage a subgraph-centric programming model and extend it to the temporal dimension. We present three time-series graph algorithms based on these design patterns and abstractions, and analyze their performance using the GoFFish distributed platform on the Amazon AWS cloud. Our results demonstrate the efficacy of the abstractions in developing practical time-series graph algorithms and scaling them on commodity hardware.
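The core loop of a Temporally Iterative BSP abstraction, a BSP-style pass per snapshot with per-vertex state carried forward in time, can be sketched as below. The function names and state-passing convention are assumptions for illustration, not GoFFish's actual API, and the real model is subgraph-centric rather than vertex-centric.

```python
def temporally_iterative_bsp(snapshots, vertex_init, compute):
    """Run one simplified superstep per graph snapshot, in time order.
    snapshots: list of {vertex: [neighbors]}; vertex_init(v) gives initial
    state; compute(t, v, state, neighbor_states) returns the new state."""
    state = {}
    for t, adj in enumerate(snapshots):
        for v in adj:                      # vertices may appear over time
            state.setdefault(v, vertex_init(v))
        prev = dict(state)                 # every vertex sees pre-superstep state
        for v, nbrs in adj.items():
            state[v] = compute(t, v, prev[v], [prev[w] for w in nbrs if w in prev])
    return state
```

For example, with `compute` summing the number of neighbor messages, the final state is each vertex's cumulative degree across snapshots, a simple time-series metric in this style.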

### Deploying Large-Scale Data Sets on-Demand in the Cloud: Treats and Tricks on Data Distribution.

Public clouds have democratised access to analytics for virtually any institution in the world. Virtual machines (VMs) can be provisioned on demand and used to crunch data after it is uploaded into the VMs. While this task is trivial for a few tens of VMs, it becomes increasingly complex and time consuming when the scale grows to hundreds or thousands of VMs crunching tens or hundreds of TB. Moreover, the elapsed time comes at a price: the cost of provisioning VMs in the cloud and keeping them waiting to load the data. In this paper we present a big data provisioning service that incorporates hierarchical and peer-to-peer data distribution techniques to speed up data loading into the VMs used for data processing. The system dynamically mutates the sources of the data for the VMs to speed up data loading. We tested this solution with 1000 VMs and 100 TB of data, reducing time by at least 30% compared to current state-of-the-art techniques. This dynamic topology mechanism is tightly coupled with classic declarative machine configuration techniques: the system takes a single high-level declarative configuration file and configures both software and data loading. Together, these two techniques simplify the deployment of big data in the cloud for end users who may not be experts in infrastructure management. Index Terms: large-scale data transfer, flash crowd, big data, BitTorrent, p2p overlay, provisioning, big data distribution.
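A back-of-the-envelope model shows why hierarchical/p2p distribution helps: if every VM that already holds the data can forward one full copy per round, reaching n VMs takes about log2(n) rounds instead of n from a single source. This is only a simplified counting argument, not the paper's measured behaviour.

```python
import math

def distribution_rounds(n_vms):
    """Rounds to deliver a dataset to n_vms under two idealized models:
    one source sending one copy per round, vs. every holder re-seeding
    (the number of holders doubles each round)."""
    sequential = n_vms
    p2p = math.ceil(math.log2(n_vms)) if n_vms > 1 else 0
    return sequential, p2p
```

At the paper's scale of 1000 VMs, the idealized gap is 1000 rounds versus about 10, which is why mutating data sources dynamically pays off.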

### Spinner: Scalable Graph Partitioning for the Cloud

Several organizations, like social networks, store and routinely analyze large graphs as part of their daily operation. Such graphs are typically distributed across multiple servers, and graph partitioning is critical for efficient graph management. Existing partitioning algorithms focus on finding graph partitions with good locality, but they disregard the pragmatic challenges of integrating partitioning into large-scale graph management systems deployed on a cloud. In this paper, we aim at a solution that performs substantially better than the most practical solution currently used, hash partitioning, but is nearly as practical. We propose Spinner, a scalable and adaptive graph partitioning algorithm based on label propagation. Spinner scales to massive graphs, produces partitions with locality and balance comparable to the state of the art, and efficiently adapts the partitioning upon changes. We describe our fully decentralized algorithm and its implementation in the Pregel programming model, which makes it possible to partition billion-vertex graphs. We evaluate Spinner with a variety of synthetic and real graphs and show that it can compute partitions with quality comparable to the state of the art. In fact, by integrating Spinner into the Giraph graph analytics engine, we speed up different applications by a factor of 2 relative to standard hash partitioning.
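The label propagation idea at Spinner's core can be sketched in a few lines: each vertex repeatedly adopts the partition label most common among its neighbors. This toy version is sequential and unconstrained; Spinner additionally enforces balance across partitions and runs fully decentralized as a Pregel computation.

```python
import random
from collections import Counter

def label_propagation_partition(adj, k, iters=20, seed=0):
    """Assign each vertex one of k partition labels, then iterate: every
    vertex takes the most frequent label among its neighbors, so densely
    connected regions converge to the same partition (good locality)."""
    rng = random.Random(seed)
    label = {v: rng.randrange(k) for v in adj}
    for _ in range(iters):
        changed = False
        for v in adj:
            if not adj[v]:
                continue
            best = Counter(label[w] for w in adj[v]).most_common(1)[0][0]
            if best != label[v]:
                label[v] = best
                changed = True
        if not changed:
            break
    return label
```

Without a balance constraint this can collapse most vertices into one partition, which is precisely the problem Spinner's balanced, adaptive variant addresses.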