Results 1  10
of
82
Sindice.com: A documentoriented lookup index for open linked data
 International Journal of Metadata, Semantics and Ontologies
"... Developers of Semantic Web applications face a challenge with respect to the decentralised publication model: how and where to find statements about encountered resources. The “linked data” approach mandates that resource URIs should be dereferenced to return resource metadata. But for data discove ..."
Abstract

Cited by 130 (12 self)
 Add to MetaCart
(Show Context)
Developers of Semantic Web applications face a challenge with respect to the decentralised publication model: how and where to find statements about encountered resources. The “linked data” approach mandates that resource URIs should be dereferenced to return resource metadata. But for data discovery linkage itself is not enough, and crawling and indexing of data is necessary. Existing Semantic Web search engines are focused on databaselike functionality, compromising on index size, query performance and live updates. We present Sindice, a lookup index over resources crawled on the Semantic Web. Our index allows applications to automatically locate documents containing information about a given resource. In addition, we allow resource retrieval through uniquely identifying inversefunctional properties, offer a fulltext search and index SPARQL endpoints. Finally we introduce an extension to the sitemap protocol which allows us to efficiently index large Semantic Web datasets with minimal impact on the data providers.
PEGASUS: A PetaScale Graph Mining System Implementation and Observations
 IEEE INTERNATIONAL CONFERENCE ON DATA MINING
, 2009
"... Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga, Tera or P ..."
Abstract

Cited by 128 (26 self)
 Add to MetaCart
(Show Context)
Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga, Tera or Petabytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on the top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrixvector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIMV (Generalized Iterated MatrixVector multiplication). GIMV is highly optimized, achieving (a) good scaleup on the number of available machines (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the nonoptimized version of GIMV. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with ≈ 6,7 billion edges. KeywordsPEGASUS; graph mining; hadoop I.
Distributed Aggregation for DataParallel Computing: Interfaces and Implementations
"... Dataintensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Suc ..."
Abstract

Cited by 58 (4 self)
 Add to MetaCart
(Show Context)
Dataintensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require nonstandard aggregations that are more sophisticated than traditional builtin database functions such as Sum and Max. As a result, the ease of programming userdefined aggregations, and the efficiency of their implementation, is of great current interest. This paper evaluates the interfaces and implementations for userdefined aggregation in several state of the art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between userdefined functions and the highlevel query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and that in order to get good performance on a variety of workloads a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worstperforming choices.
Doulion: Counting Triangles in Massive Graphs with a Coin
 PROCEEDINGS OF ACM KDD,
, 2009
"... Counting the number of triangles in a graph is a beautiful algorithmic problem which has gained importance over the last years due to its significant role in complex network analysis. Metrics frequently computed such as the clustering coefficient and the transitivity ratio involve the execution of a ..."
Abstract

Cited by 53 (16 self)
 Add to MetaCart
Counting the number of triangles in a graph is a beautiful algorithmic problem which has gained importance over the last years due to its significant role in complex network analysis. Metrics frequently computed such as the clustering coefficient and the transitivity ratio involve the execution of a triangle counting algorithm. Furthermore, several interesting graph mining applications rely on computing the number of triangles in the graph of interest. In this paper, we focus on the problem of counting triangles in a graph. We propose a practical method, out of which all triangle counting algorithms can potentially benefit. Using a straightforward triangle counting algorithm as a black box, we performed 166 experiments on realworld networks and on synthetic datasets as well, where we show that our method works with high accuracy, typically more than 99 % and gives significant speedups, resulting in even ≈ 130 times faster performance.
Spectral Analysis for BillionScale Graphs: Discoveries and Implementation
"... Abstract. Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there closeknit nearcliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvector ..."
Abstract

Cited by 31 (13 self)
 Add to MetaCart
(Show Context)
Abstract. Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there closeknit nearcliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let alone for billionscale ones. We address this problem with the proposed HEIGEN algorithm, which we carefully design to be accurate, efficient, and able to run on the highly scalable MAPREDUCE (HADOOP) environment. This enables HEIGEN to handle matrices more than 1000 × larger than those which can be analyzed by existing algorithms. We implement HEIGEN and run it on the M45 cluster, one of the top 50 supercomputers in the world. We report important discoveries about nearcliques and triangles on several realworld graphs, including a snapshot of the Twitter social network (38Gb, 2 billion edges) and the “YahooWeb ” dataset, one of the largest publicly available graphs (120Gb, 1.4 billion nodes, 6.6 billion edges). 1
HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop
, 2008
"... How can we quickly find the diameter of a petabytesized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks and many more. The size of graphs of interest has been increasing rapidly in recent years and with it al ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
How can we quickly find the diameter of a petabytesized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks and many more. The size of graphs of interest has been increasing rapidly in recent years and with it also the need for algorithms that can handle tera and petabyte graphs. A promising direction for coping with such sizes is the emerging map/reduce architecture and its opensource implementation, ’HADOOP’. Estimating the diameter of a graph, as well as the radius of each node, is a valuable operation that can help us spot outliers and anomalies. We propose HADI (HAdoop based DIameter estimator), a carefully designed algorithm to compute the diameters of petabytescale graphs. We run the algorithm to analyze the largest public web graph ever analyzed, with billions of nodes and edges. Additional contributions include the following: (a) We propose several performance optimizations (b) we achieve excellent scaleup, and (c) we report interesting observations including outliers and related patterns, on this real graph (116Gb), as well as several other real, smaller graphs. One of the observations is that the Albert et al. conjecture about the diameter of Networked systems are ubiquitous. The analysis of networks such as the World Wide Web, social, computer and biological networks has attracted much attention recently. Some of the typical measures to compute are
Optimizing MapReduce for Multicore Architectures
 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep
, 2010
"... MapReduce is a programming model for dataparallel programs originally intended for data centers. MapReduce simplifies parallel programming, hiding synchronization and task management. These properties make it a promising programming model for future processors with many cores, and existing MapReduc ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
(Show Context)
MapReduce is a programming model for dataparallel programs originally intended for data centers. MapReduce simplifies parallel programming, hiding synchronization and task management. These properties make it a promising programming model for future processors with many cores, and existing MapReduce libraries such as Phoenix have demonstrated that applications written with MapReduce perform competitively with those written with Pthreads [11]. This paper explores the design of the MapReduce data structures for grouping intermediate key/value pairs, which is often a performance bottleneck on multicore processors. The paper finds the best choice depends on workload characteristics, such as the number of keys used by the application, the degree of repetition of keys, etc. This paper also introduces a new MapReduce library, Metis, with a compromise data structure designed to perform well for most workloads. Experiments with the Phoenix benchmarks on a 16core AMDbased server show that Metis ’ data structure performs better than simpler alternatives, including Phoenix. 1
Clustering Very Large Multidimensional Datasets with
"... Given a very large moderatetohigh dimensionality dataset, how could one cluster its points? For datasets that don’t fit even on a single disk, parallelism is a first class option. In this paper we explore MapReduce for clustering this kind of data. The main questions are (a) how to minimize the I/ ..."
Abstract

Cited by 23 (1 self)
 Add to MetaCart
(Show Context)
Given a very large moderatetohigh dimensionality dataset, how could one cluster its points? For datasets that don’t fit even on a single disk, parallelism is a first class option. In this paper we explore MapReduce for clustering this kind of data. The main questions are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may be a bottleneck. Thus, we propose the Best of both Worlds – BoW method, that automatically spots the bottleneck and chooses a good strategy. Our main contributions are: (1) We propose BoW and carefully derive its cost functions, which dynamically choose the best strategy; (2) We show that BoW has numerous desirable features: it can work with most serial clustering methods as a pluggedin clustering subroutine, it balances the cost for disk accesses and network accesses, achieving a very good tradeoff between the two, it uses no userdefined parameters (thanks to our reasonable defaults), it matches the clustering quality of the serial algorithm, and it has nearlinear scaleup; and finally, (3) We report experiments on real and synthetic data with billions of points, using up to 1, 024 cores in parallel. To the best of our knowledge, our Yahoo! web is the largest real dataset ever reported in the database subspace clustering literature. Spanning 0.2 TB of multidimensional data, it took only 8 minutes to be clustered, using 128 cores. 1.
GigaTensor: Scaling Tensor Analysis Up By 100 Times Algorithms and Discoveries
"... Many data are modeled as tensors, or multi dimensional arrays. Examples include the predicates (subject, verb, object) in knowledge bases, hyperlinks and anchor texts in the Web graphs, sensor streams (time, location, and type), social networks over time, and DBLP conferenceauthorkeyword relations ..."
Abstract

Cited by 21 (6 self)
 Add to MetaCart
(Show Context)
Many data are modeled as tensors, or multi dimensional arrays. Examples include the predicates (subject, verb, object) in knowledge bases, hyperlinks and anchor texts in the Web graphs, sensor streams (time, location, and type), social networks over time, and DBLP conferenceauthorkeyword relations. Tensor decomposition is an important data mining tool with various applications including clustering, trend detection, and anomaly detection. However, current tensor decomposition algorithms are not scalable for large tensors with billions of sizes and hundreds millions of nonzeros: the largest tensor in the literature remains thousands of sizes and hundreds thousands of nonzeros. Consider a knowledge base tensor consisting of about 26 million nounphrases. The intermediate data explosion problem, associated with naive implementations of tensor decomposition algorithms, would require the materialization and the storage of a matrix whose largest dimension would be ≈ 7·10 14; this amounts to ∼ 10 Petabytes, or equivalently a few data centers worth of storage, thereby rendering the tensor analysis of this knowledge base, in the naive way, practically impossible. In this paper, we propose GIGATENSOR, a scalable distributed algorithm for large scale tensor decomposition. GIGATENSOR exploits the sparseness of the real world tensors, and avoids the intermediate data explosion problem by carefully redesigning the tensor decomposition algorithm. Extensive experiments show that our proposed GIGATENSOR solves 100 × bigger problems than existing methods. Furthermore, we employ GIGATENSOR in order to analyze a very large real world, knowledge base tensor and present our astounding findings which include discovery of potential synonyms among millions of nounphrases (e.g. the noun ‘pollutant ’ and the nounphrase ‘greenhouse gases’).
MapReduce for the Cell B.E. Architecture
"... MapReduce is a simple and flexible parallel programming model proposed by Google for large scale data processing in a distributed computing environment [4]. In this paper, we present a design and implementation of MapReduce for the Cell architecture. This model provides a simple machine abstraction ..."
Abstract

Cited by 21 (0 self)
 Add to MetaCart
MapReduce is a simple and flexible parallel programming model proposed by Google for large scale data processing in a distributed computing environment [4]. In this paper, we present a design and implementation of MapReduce for the Cell architecture. This model provides a simple machine abstraction to users, hiding parallelization and hardware primitives. Our runtime automatically manages parallelization, scheduling, partitioning and memory transfers. We study the basic characteristics of the model and evaluate our runtime’s performance, scalability, and efficiency for microbenchmarks and complete applications. We show that the model is well suited for many applications that map well to the Cell architecture, and that the runtime sustains high performance on these applications. For other applications, we analyze runtime performance and describe why performance is less impressive. Overall, we find that the simplicity of the model and the efficiency of our MapReduce implementation make it an attractive choice for the Cell platform specifically and more generally to distributed memory systems and softwareexposed memories.