Results 11–20 of 120
HADI: Mining radii of large graphs
ACM Transactions on Knowledge Discovery from Data, 2010
Abstract

Cited by 33 (10 self)
Given large, multimillion-node graphs (e.g., Facebook, web crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers? In this paper we define the Radius plot of a graph and show how it can answer these questions. However, computing the Radius plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the radii and the diameter of massive graphs, which runs on top of the Hadoop/MapReduce system with excellent scale-up in the number of available machines; (b) We run HADI on several real-world datasets including YahooWeb (6B edges, 1/8 of a terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multimodal/bimodal shape of the Radius plot, and its palindrome motion over time.
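HADI itself estimates these quantities approximately, using probabilistic (Flajolet-Martin-style) counters distributed over Hadoop. For reference, the quantity it approximates can be computed exactly on a toy graph with plain BFS; the sketch below (helper names are mine, not HADI's) builds the Radius plot, i.e., the count of nodes per radius, where a node's radius is its largest shortest-path distance:

```python
from collections import deque

def eccentricity(adj, src):
    """Radius of src in the paper's terminology: the largest BFS
    distance from src to any reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

def radius_plot(adj):
    """Count of nodes per radius -- the distribution the paper plots."""
    counts = {}
    for node in adj:
        r = eccentricity(adj, node)
        counts[r] = counts.get(r, 0) + 1
    return counts

# path graph 0-1-2-3: endpoints have radius 3, inner nodes radius 2
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(radius_plot(path))  # -> {3: 2, 2: 2}
```

This exact version is O(nodes × edges); HADI's contribution is replacing the per-node BFS with shared approximate neighborhood counters so the whole plot costs a few MapReduce passes.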
C.: PrIter: a distributed framework for prioritized iterative computations
In: Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC ’11), 2011
Abstract

Cited by 32 (9 self)
Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale, and fast convergence of the iterative computation on such data is essential. In this paper, we explore the opportunity for accelerating iterative computations and propose a distributed computing framework, PrIter, which enables fast iterative computation by supporting prioritized iteration. Instead of performing computations on all data records without discrimination, PrIter prioritizes the computations that help convergence the most, so that the convergence speed of the iterative process is significantly improved. We evaluate PrIter on a local cluster of machines as well as on the Amazon EC2 cloud. The results show that PrIter achieves up to 50x speedup over Hadoop for a series of iterative algorithms.
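PrIter is a distributed framework, but the prioritization idea alone can be illustrated serially. The sketch below is my own caricature (not PrIter's API): a push-style PageRank in which the node holding the largest unpropagated residual is always processed first, so the mass that matters most for convergence moves earliest.

```python
import heapq

def prioritized_pagerank(adj, alpha=0.85, tol=1e-10):
    """Push-style PageRank that always processes the node with the
    largest residual first -- a serial caricature of prioritized
    iteration. Assumes every node has at least one out-link."""
    n = len(adj)
    rank = {u: 0.0 for u in adj}
    residual = {u: (1.0 - alpha) / n for u in adj}
    heap = [(-r, u) for u, r in residual.items()]
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        r = residual[u]
        if r < tol:                 # stale heap entry or converged node
            continue
        rank[u] += r                # absorb the residual into the rank
        residual[u] = 0.0
        share = alpha * r / len(adj[u])
        for v in adj[u]:            # push mass to out-neighbors
            residual[v] += share
            heapq.heappush(heap, (-residual[v], v))
    return rank

ranks = prioritized_pagerank({0: [1], 1: [2], 2: [0]})  # 3-cycle
```

On the symmetric 3-cycle every node converges to rank 1/3; on skewed real graphs the priority queue concentrates early work on the few nodes carrying most of the residual mass, which is the effect PrIter exploits at distributed scale.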
Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation
Abstract

Cited by 31 (13 self)
Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let alone for billion-scale ones. We address this problem with the proposed HEIGEN algorithm, which we carefully design to be accurate, efficient, and able to run on the highly scalable MAPREDUCE (HADOOP) environment. This enables HEIGEN to handle matrices more than 1000× larger than those which can be analyzed by existing algorithms. We implement HEIGEN and run it on the M45 cluster, one of the top 50 supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-world graphs, including a snapshot of the Twitter social network (38Gb, 2 billion edges) and the “YahooWeb” dataset, one of the largest publicly available graphs (120Gb, 1.4 billion nodes, 6.6 billion edges).
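HEIGEN's actual solver is a Lanczos-type method re-engineered for MAPREDUCE, but its computational backbone is the humble matrix-vector product. As a minimal serial stand-in (my sketch, finding only the top eigenpair rather than the first several), plain power iteration shows the primitive being distributed:

```python
import numpy as np

def power_iteration(A, iters=200):
    """Leading eigenvalue/eigenvector via repeated mat-vec products --
    the same primitive an eigensolver like HEIGEN distributes."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)        # renormalize each step
    return v @ A @ v, v                  # Rayleigh quotient, eigenvector

# adjacency matrix of a triangle: leading eigenvalue is 2
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
lam, v = power_iteration(A)
```

The "subtle problems" the abstract mentions are exactly why production solvers use Lanczos with reorthogonalization instead of this loop; the loop only conveys what each distributed pass computes.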
Radius Plots for Mining Terabyte Scale Graphs: Algorithms, Patterns, and Observations
Abstract

Cited by 22 (16 self)
Given large, multimillion-node graphs (e.g., Facebook, web crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the diameter of massive graphs, which runs on top of the HADOOP/MAPREDUCE system with excellent scale-up in the number of available machines; (b) We run HADI on several real-world datasets including YahooWeb (6B edges, 1/8 of a terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multimodal/bimodal shape of the Radius Plot, and its palindrome motion over time.
GBASE: A Scalable and General Graph Management System
Abstract

Cited by 22 (5 self)
Graphs appear in numerous applications, including cybersecurity, the Internet, social networks, protein networks, recommendation systems, and many more. Graphs with millions or even billions of nodes and edges are commonplace. How can we store such large graphs efficiently? What are the core operations/queries on those graphs? How can we answer the graph queries quickly? We propose GBASE, a scalable and general graph management and mining system. The key novelties lie in 1) our storage and compression scheme for a parallel setting and 2) the carefully chosen graph operations and their efficient implementation. We designed and implemented an instance of GBASE using MAPREDUCE/HADOOP. GBASE provides a parallel indexing mechanism for graph mining operations that both saves storage space and accelerates queries. We ran numerous experiments on real graphs spanning billions of nodes and edges, and we show that our proposed GBASE is indeed fast, scalable, and nimble, with significant savings in space and time.
Clustering Very Large Multidimensional Datasets with MapReduce
Abstract

Cited by 21 (0 self)
Given a very large moderate-to-high dimensionality dataset, how could one cluster its points? For datasets that don’t fit even on a single disk, parallelism is a first-class option. In this paper we explore MapReduce for clustering this kind of data. The main questions are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may be a bottleneck. Thus, we propose the Best of both Worlds (BoW) method, which automatically spots the bottleneck and chooses a good strategy. Our main contributions are: (1) We propose BoW and carefully derive its cost functions, which dynamically choose the best strategy; (2) We show that BoW has numerous desirable features: it can work with most serial clustering methods as a plugged-in clustering subroutine, it balances the cost of disk accesses and network accesses, achieving a very good tradeoff between the two, it uses no user-defined parameters (thanks to our reasonable defaults), it matches the clustering quality of the serial algorithm, and it has near-linear scale-up; and finally, (3) We report experiments on real and synthetic data with billions of points, using up to 1,024 cores in parallel. To the best of our knowledge, our Yahoo! web dataset is the largest real dataset ever reported in the database subspace-clustering literature. Spanning 0.2 TB of multidimensional data, it took only 8 minutes to be clustered, using 128 cores.
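The abstract's central move, deriving cost functions and letting them pick the strategy at runtime, can be caricatured in a few lines. Everything below is hypothetical: the two strategies, the formulas, the parameter names, and the 1% sample fraction are illustrative assumptions of mine, not BoW's actual cost model.

```python
def pick_strategy(data_gb, disk_gbps, net_gbps, sample_frac=0.01):
    """Toy cost-based dispatch in the spirit of BoW: estimate the time
    of two hypothetical plans and return the cheaper one.

    ship_all:     one disk read, then shuffle the full dataset.
    sample_first: an extra local disk pass builds a sample/summary,
                  so only sample_frac of the data crosses the network.
    """
    ship_all = data_gb / disk_gbps + data_gb / net_gbps
    sample_first = 2 * data_gb / disk_gbps + sample_frac * data_gb / net_gbps
    return "ship_all" if ship_all <= sample_first else "sample_first"

# fast network: shuffling everything wins; slow network: sampling wins
fast_net = pick_strategy(100, disk_gbps=1.0, net_gbps=10.0)
slow_net = pick_strategy(100, disk_gbps=1.0, net_gbps=0.05)
```

The point is only the shape of the decision: whichever resource is the bottleneck dominates one plan's estimate, and the dispatcher routes around it automatically, with no user-tuned knob.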
Beyond ‘Caveman Communities’: Hubs and Spokes for Graph Compression and Mining
Abstract

Cited by 21 (9 self)
Given a real-world graph, how should we lay out its edges? How can we compress it? These questions are closely related, and the typical approach so far is to find clique-like communities, like the ‘cavemen graph’, and compress them. We show that the block-diagonal mental image of the ‘cavemen graph’ is the wrong paradigm, in full agreement with earlier results that real-world graphs have no good cuts. Instead, we propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs, and so on, recursively. Based on this idea, we propose the SLASHBURN method (burn the hubs, and slash the remaining graph into smaller connected components). Our viewpoint has several advantages: (a) it avoids the ‘no good cuts’ problem, (b) it gives better compression, and (c) it leads to faster execution times for matrix-vector operations, which are the backbone of most graph processing tools. Experimental results show that our SLASHBURN method consistently outperforms other methods on all datasets, giving good compression and faster running time.
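The burn-the-hubs / slash-the-components loop described above is concrete enough to sketch serially. The following is a best-effort reading of the abstract; the exact ordering details (hubs to the front, slashed spokes to the back) are my assumption about how the resulting node ordering is assembled.

```python
from collections import deque

def connected_components(adj):
    """List of connected components of an undirected adjacency dict."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, q = [], deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def slashburn(adj, k=1):
    """SlashBurn-style node ordering: repeatedly 'burn' the k highest-
    degree hubs, then 'slash' off every component except the largest,
    recursing on the giant component that remains."""
    adj = {u: set(vs) for u, vs in adj.items()}  # mutable copy
    head, tail = [], []
    while adj:
        # burn: remove the k current highest-degree hubs
        hubs = sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]
        for h in hubs:
            for v in adj[h]:
                adj[v].discard(h)
            del adj[h]
        head.extend(hubs)
        # slash: peel off every component except the largest
        comps = sorted(connected_components(adj), key=len)
        for comp in comps[:-1]:
            for u in comp:          # neighbors are all inside comp
                del adj[u]
            tail.extend(comp)
    return head + tail[::-1]

order = slashburn({0: [1, 2, 3], 1: [0], 2: [0], 3: [0]})  # star graph
```

On a star graph the center is burned first and the leaves become spokes, which is the hub-and-spoke picture the abstract argues real graphs follow at every scale.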
Mizan: A system for dynamic load balancing in large-scale graph processing
In EuroSys ’13, 2013
Abstract

Cited by 19 (0 self)
Pregel [23] was recently introduced as a scalable graph mining system that can provide significant performance improvements over traditional MapReduce implementations. Existing implementations focus primarily on graph partitioning as a preprocessing step to balance computation across compute nodes. In this paper, we examine the runtime characteristics of a Pregel system. We show that graph partitioning alone is insufficient for minimizing end-to-end computation. Especially where data is very large or the runtime behavior of the algorithm is unknown, an adaptive approach is needed. To this end, we introduce Mizan, a Pregel system that achieves efficient load balancing to better adapt to changes in computing needs. Unlike known implementations of Pregel, Mizan does not assume any a priori knowledge of the structure of the graph or the behavior of the algorithm. Instead, it monitors the runtime characteristics of the system. Mizan then performs efficient fine-grained vertex migration to balance computation and communication. We have fully implemented Mizan; using extensive evaluation we show that, especially for highly dynamic workloads, Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.
A Flexible Open-Source Toolbox for Scalable Complex Graph Analysis
, 2011
Abstract

Cited by 19 (3 self)
The Knowledge Discovery Toolbox (KDT) enables domain experts to perform complex analyses of huge datasets on supercomputers using a high-level language, without grappling with the difficulties of writing parallel code, calling parallel libraries, or becoming a graph expert. KDT provides a flexible Python interface to a small set of high-level graph operations; composing a few of these operations is often sufficient for a specific analysis. Scalability and performance are delivered by linking to a state-of-the-art backend compute engine that scales from laptops to large HPC clusters. KDT delivers very competitive performance from a general-purpose, reusable library for graphs on the order of 10 billion edges and greater. We demonstrate speedups of 1 and 2 orders of magnitude over PBGL and Pegasus, respectively, on some tasks. Examples from simple use cases and key graph-analytic benchmarks illustrate the productivity and performance realized by KDT users. Semantic graph abstractions provide both flexibility and high performance for real-world use cases. Graph-algorithm researchers benefit from the ability to develop algorithms quickly using KDT’s graph and underlying matrix abstractions for distributed memory. KDT is available as open-source code to foster experimentation.
GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
Abstract

Cited by 18 (6 self)
Many data are modeled as tensors, or multidimensional arrays. Examples include the (subject, verb, object) predicates in knowledge bases, hyperlinks and anchor texts in Web graphs, sensor streams (time, location, and type), social networks over time, and DBLP conference-author-keyword relations. Tensor decomposition is an important data mining tool with various applications, including clustering, trend detection, and anomaly detection. However, current tensor decomposition algorithms do not scale to tensors whose mode sizes reach the billions and which contain hundreds of millions of nonzeros: the largest tensors in the literature have mode sizes in the thousands and hundreds of thousands of nonzeros. Consider a knowledge-base tensor consisting of about 26 million noun-phrases. The intermediate data explosion problem, associated with naive implementations of tensor decomposition algorithms, would require the materialization and storage of a matrix whose largest dimension would be ≈ 7×10^14; this amounts to ~10 petabytes, or equivalently a few data centers' worth of storage, thereby rendering the tensor analysis of this knowledge base, in the naive way, practically impossible. In this paper, we propose GIGATENSOR, a scalable distributed algorithm for large-scale tensor decomposition. GIGATENSOR exploits the sparseness of real-world tensors and avoids the intermediate data explosion problem by carefully redesigning the tensor decomposition algorithm. Extensive experiments show that our proposed GIGATENSOR solves 100× bigger problems than existing methods. Furthermore, we employ GIGATENSOR to analyze a very large real-world knowledge-base tensor and present our astounding findings, which include the discovery of potential synonyms among millions of noun-phrases (e.g., the noun ‘pollutant’ and the noun-phrase ‘greenhouse gases’).
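The "intermediate data explosion" the abstract describes arises in the MTTKRP step of CP decomposition: naively one materializes the Khatri-Rao product C ⊙ B, a matrix with J·K rows. The per-nonzero reformulation below illustrates the kind of rewriting that sidesteps it; this is a serial NumPy sketch of the general idea, not GIGATENSOR's actual MapReduce algorithm.

```python
import numpy as np

def sparse_mttkrp(entries, B, C, n_i):
    """M = X_(1) @ khatri_rao(C, B), computed one nonzero at a time so
    the (J*K) x R Khatri-Rao factor -- the intermediate whose size
    drives the data explosion -- is never materialized.

    entries: sparse 3-way tensor as {(i, j, k): value};
    B is J x R, C is K x R; returns the n_i x R matrix M.
    """
    M = np.zeros((n_i, B.shape[1]))
    for (i, j, k), val in entries.items():
        M[i] += val * B[j] * C[k]   # one rank-R row update per nonzero
    return M

# tiny example: 3 nonzeros in a 2 x 2 x 4 tensor, rank R = 3 factors
entries = {(0, 1, 2): 2.0, (1, 0, 3): -1.5, (0, 0, 0): 0.5}
rng = np.random.default_rng(0)
B = rng.standard_normal((2, 3))   # J x R factor matrix
C = rng.standard_normal((4, 3))   # K x R factor matrix
M = sparse_mttkrp(entries, B, C, n_i=2)
```

The cost is proportional to the number of nonzeros times R, independent of J·K, which is why sparsity-aware rewrites of this step are what make billion-sized modes tractable.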