Programming Parallel Algorithms
, 1996
Cited by 163 (7 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
Scans as Primitive Parallel Operations
- IEEE Transactions on Computers
, 1987
Cited by 143 (12 self)
In most parallel random-access machine (P-RAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can be executed in no more time than these parallel memory references. This paper outlines an extensive study of the effect of including such scan operations as unit-time primitives in the P-RAM models. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(lg n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. We therefore argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm and a mergi...
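To make the scan idea concrete, here is a minimal sequential sketch of how a radix sort can be built from plus-scans. It only simulates the primitives on one machine; the function names (`exclusive_scan`, `split`, `radix_sort`) are assumptions made for this illustration and are not taken from the paper.

```python
from itertools import accumulate

def exclusive_scan(values):
    """Exclusive plus-scan: out[i] is the sum of values[0..i-1]."""
    # accumulate(..., initial=0) yields n+1 running sums; drop the last one.
    return list(accumulate(values, initial=0))[:-1]

def split(keys, flags):
    """Stable partition: keys with flag 0 first, then keys with flag 1."""
    zeros = [1 - f for f in flags]
    down = exclusive_scan(zeros)                       # slots for flag-0 keys
    n_zeros = sum(zeros)
    up = [n_zeros + s for s in exclusive_scan(flags)]  # slots for flag-1 keys
    out = [None] * len(keys)
    for i, key in enumerate(keys):
        out[up[i] if flags[i] else down[i]] = key
    return out

def radix_sort(keys, bits):
    """Sort non-negative integers below 2**bits, one bit per split pass."""
    for b in range(bits):
        keys = split(keys, [(k >> b) & 1 for k in keys])
    return keys

print(radix_sort([5, 7, 3, 1, 4, 2, 7, 2], bits=3))
```

Each pass uses two scans and one permutation, so if a scan is counted as a unit-time primitive, the whole sort takes a number of steps proportional to the number of key bits.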
A Comparison of Data-Parallel Algorithms for Connected Components
- In Proc. 6th Ann. Symp. Parallel Algorithms and Architectures (SPAA-94)
, 1994
Cited by 30 (1 self)
This paper presents a pragmatic comparison of three parallel algorithms for finding connected components, together with optimizations on these algorithms. Those being compared are two similar algorithms by Awerbuch and Shiloach [2] and by Shiloach and Vishkin [19] and a randomized contraction algorithm by Blelloch [7], based on algorithms by Reif [18] and Phillips [17]. Major improvements are given for the first two, which significantly reduce the super-linear component of their work complexity. An improvement is also given for the randomized algorithm, and this algorithm is shown to be the fastest of those tested. These comparisons are presented with NESL data-parallel code as executed on a Connection Machine 2. This research was sponsored in part by the Defense Advanced Research Projects Agency, CSTO, under the title "The Fox Project: Advanced Development of Systems Software", ARPA Order No. 8313, issued by ESD/AVS under Contract No. F19628-91-C-0168, and in part by the ONR Graduate Fell...
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
- IEEE INTERNATIONAL CONFERENCE ON DATA MINING
, 2009
Cited by 30 (14 self)
In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with ≈ 6.7 billion edges. Keywords: PEGASUS; graph mining; hadoop
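The claim that connected components reduce to a repeated matrix-vector multiplication can be sketched as follows. This is a single-machine illustration, not the Hadoop implementation; the function and parameter names (`gim_v`, `combine2`, `combine_all`, `assign`) are assumptions made for this sketch.

```python
def gim_v(edges, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step over a sparse edge list.

    edges is an iterable of (i, j, m_ij) triples for the nonzero entries.
    """
    partial = {}
    for i, j, m in edges:
        partial.setdefault(i, []).append(combine2(m, vector[j]))
    # Rows with no nonzero entry keep their old value.
    return [assign(v, combine_all(partial[i])) if i in partial else v
            for i, v in enumerate(vector)]

def connected_components(n, undirected_edges):
    """Label every vertex with the smallest vertex id in its component."""
    edges = [(i, j, 1) for a, b in undirected_edges
             for i, j in ((a, b), (b, a))]
    labels = list(range(n))
    while True:
        new_labels = gim_v(edges, labels,
                           combine2=lambda m, v: m * v,  # m is always 1 here
                           combine_all=min,
                           assign=min)
        if new_labels == labels:      # fixed point: labels stopped changing
            return labels
        labels = new_labels

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Replacing multiply-and-sum with multiply-and-min turns the ordinary product into label propagation; swapping in other operator triples would give other iterated computations of the same shape.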
On Parallel Hashing and Integer Sorting
, 1991
Cited by 24 (9 self)
The problem of sorting n integers from a restricted range [1..m], where m is superpolynomial in n, is considered. An o(n log n) randomized algorithm is given. Our algorithm takes O(n log log m) expected time and O(n) space. (Thus, for m = n^{polylog(n)} we have an O(n log log n) algorithm.) The algorithm is parallelizable. The resulting parallel algorithm achieves optimal speedup. Some features of the algorithm make us believe that it is relevant for practical applications. A result of independent interest is a parallel hashing technique. The expected construction time is logarithmic using an optimal number of processors, and searching for a value takes O(1) time in the worst case. This technique enables drastic reduction of space requirements for the price of using randomness. Applicability of the technique is demonstrated for the parallel sorting algorithm, and for some parallel string matching algorithms. The parallel sorting algorithm is designed for a strong and non-standard mo...
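The parenthetical bound can be checked with a short calculation, assuming it is read as m being n raised to a polylogarithmic power, with c a fixed constant introduced here for illustration:

```latex
\[
m = n^{\log^{c} n}
\;\Longrightarrow\;
\log m = \log^{c+1} n
\;\Longrightarrow\;
\log\log m = (c+1)\,\log\log n ,
\]
\[
\text{so}\quad O(n \log\log m) = O(n \log\log n).
\]
```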
Connected Components on Distributed Memory Machines
- Parallel Algorithms: 3rd DIMACS Implementation Challenge October 17-19, 1994, volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 1994
Cited by 17 (1 self)
The efforts of the theory community to develop efficient PRAM algorithms often receive little attention from application programmers. Although there are PRAM algorithm implementations that perform reasonably on shared memory machines, they often perform poorly on distributed memory machines, where the cost of remote memory accesses is relatively high. We present a hybrid approach to solving the connected components problem, whereby a PRAM algorithm is merged with a sequential algorithm and then optimized to create an efficient distributed memory implementation. The sequential algorithm handles local work on each processor, and the PRAM algorithm handles interactions between processors. Our hybrid algorithm uses the Shiloach-Vishkin CRCW PRAM algorithm on a partition of the graph distributed over the processors and sequential breadth-first search within each local subgraph. The implementation uses the Split-C language developed at Berkeley, which provides a global address space and al...
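The two-phase structure can be sketched on a single machine as below. This is only an illustration of the local/global division of work: the global phase uses a sequential union-find with pointer jumping as a simplified stand-in for the Shiloach-Vishkin step, and all names (`local_bfs_labels`, `find`, `hybrid_components`, `parts`) are assumptions made for this sketch.

```python
from collections import deque

def local_bfs_labels(vertices, adj):
    """BFS restricted to one partition; maps each vertex to the smallest
    vertex id reachable through edges that stay inside the partition."""
    owned = set(vertices)
    label = {}
    for start in sorted(vertices):
        if start in label:
            continue
        label[start] = start
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w in owned and w not in label:
                    label[w] = start
                    queue.append(w)
    return label

def find(parent, v):
    """Follow parent pointers to the root, shortening the path on the way."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]   # pointer jumping
        v = parent[v]
    return v

def hybrid_components(n, edges, parts):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    parent = list(range(n))
    # Local phase: each simulated processor collapses its own subgraph.
    for part in parts:
        for v, rep in local_bfs_labels(part, adj).items():
            parent[v] = rep
    # Global phase: hook the larger root under the smaller one.
    # Edges already merged locally are no-ops here.
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return [find(parent, v) for v in range(n)]

print(hybrid_components(6, [(0, 1), (1, 2), (3, 4)], [[0, 1, 3], [2, 4, 5]]))
```

In a real distributed setting only the global phase would touch remote memory, which is why shrinking each local subgraph first reduces communication.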
Connected Components Algorithms For Mesh-Connected Parallel Computers
- Parallel Algorithms: 3rd DIMACS Implementation Challenge October 17-19, 1994, volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 1995
Cited by 12 (0 self)
We present a new CREW PRAM algorithm for finding connected components. For a graph G with n vertices and m edges, algorithm A_0 requires at most O(log n) parallel steps and performs O((n + m) log n) work in the worst case. The advantage our algorithm has over others in the literature is that it can be adapted to a 2-D mesh-connected communication model in which all CREW operations are replaced by O(log n) parallel row and column operations without increasing the time complexity. We present the mapping of A_0 to a mesh-connected computer and describe two implementations, A_1 and A_2. Algorithm A_1, which uses an adjacency matrix to represent the graph, performs O(n^2 log n) work. Hence, it only achieves work efficiency on dense graphs. The second implementation, A_2, uses a sparse representation of the adjacency matrix and again performs O(log n) row and column operations but reduces the work to O((m + n) log n) on all graphs. We report MasPar MP-1 performance figures for implementati...
A Comparison of Parallel Algorithms for Connected Components
- in the Symposium on Parallel Algorithms and Architectures
, 1994
Cited by 11 (1 self)
This paper presents a comparison of the pragmatic aspects of some parallel algorithms for finding connected components, together with optimizations on these algorithms. The algorithms being compared are two similar algorithms by Shiloach-Vishkin [22] and Awerbuch-Shiloach [2], a randomized contraction algorithm based on algorithms by Reif [21] and Phillips [20], and a hybrid algorithm [11]. Improvements to the first two raise their performance significantly, although without improving their asymptotic complexity. The hybrid combines features of the others and is generally the fastest of those tested. Timings were made using NESL [4] code as executed on a Connection Machine 2 and Cray Y-MP/C90.
1 Introduction
The complexity of various PRAM algorithms has received much attention, but there has been relatively little work on the implementation and pragmatic efficiency of many of these algorithms. Moreover, much of this work has been for algorithms having regular communication ...
Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines
- In Proceedings of Supercomputing '95
, 1996
Cited by 10 (3 self)
We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. We implement the algorithm in Split-C and measure performance on the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5 using a class of graphs derived from cluster dynamics methods in computational physics. On a 256 processor Cray T3D, the implementation outperforms all previous solutions by an order of magnitude. A characterization of graph parameters allows us to select graphs that highlight key performance features. We study the effects of these parameters and machine characteristics on the balance of time between the local and global phases of the algorithm and find that edge density, surface-to-volume ratio, and relative communication cost dominate perform...
Asynchronous Resource Discovery in Peer to Peer Networks
- In 21st Symp. on Reliable Distributed Systems, October 2002 Japan
, 2002
Cited by 9 (0 self)
The resource discovery problem arises in the context of peer to peer (P2P) networks, where at any point in time a peer may be placed at or removed from any location over a general purpose network (e.g., an Internet site). A vertex (peer) can communicate with another vertex directly if and only if it knows certain routing information for that other vertex. Hence, a critical task is for the peers to convey this routing information to each other. The problem was formalized by Harchol-Balter, Leighton and Lewin [13]. The routing information needed for a vertex to reach another peer is that peer’s identifier (e.g., IP address). A logical directed edge represents the fact that the peer at the tail of the edge knows the IP address of the one at its head. A number of algorithms were developed in [13] for this problem in the model of a synchronous network over a weakly connected directed graph. The best of these algorithms was randomized. Subsequently, a deterministic algorithm for the problem on synchronous networks with improved complexity was presented in [15]. The current paper extends the deterministic algorithm of [15] to the environment of asynchronous networks, maintaining similar complexities (translated to the asynchronous model). These are lower than the complexities that would be needed to synchronize the system. The main technical difficulty in a directed, weakly connected system is to ensure that vertices take consistent steps, even if their knowledge about each other is not symmetric, and even if there is no timeout mechanism (which does exist in synchronous systems) to assist in that. (In particular, as opposed to the case in synchronous systems, here an algorithm cannot first

