Results 1  10
of
62
Efficient External Memory Algorithms by Simulating CoarseGrained Parallel Algorithms
, 2003
"... External memory (EM) algorithms are designed for largescale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to ..."
Abstract

Cited by 45 (11 self)
 Add to MetaCart
External memory (EM) algorithms are designed for largescale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACMWorking Group on Storage I/O for LargeScale Computing.
Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs)
 Proc. 9th Int’l Conf. on High Performance Computing (HiPC 2002), volume 2552 of Lecture Notes in Computer Science
, 2002
"... The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the res ..."
Abstract

Cited by 26 (7 self)
 Add to MetaCart
(Show Context)
The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors and across the entire range of instance sizes tested. This linear speedup with the number of processors is one of the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is for evaluating arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representativeof a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of sharedmemory parallel algorithms.
CGMgraph/CGMlib: Implementing and Testing CGM Graph Algorithms on PC Clusters
 International Journal of High Performance Computing Applications
, 2003
"... In this paper, we present CGMgraph, the first integrated library of parallel graph methods for PCclu8(T9 based on CGM algo rithms. CGMgraph implements parallel methods for variou graph prob lems. Ou implementations of deterministic list ranking, Eu er tou con nected components, spanning forest, and ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
(Show Context)
In this paper, we present CGMgraph, the first integrated library of parallel graph methods for PCclu8(T9 based on CGM algo rithms. CGMgraph implements parallel methods for variou graph prob lems. Ou implementations of deterministic list ranking, Eu er tou con nected components, spanning forest, and bipartite graph detection are, to ou r knowledge, the first e#cient implementations for PC clu sters.Ou library also inclu des CGMlib, a library of basic CGM tools su ch as sort ing, prefix su m, one to all broadcast, all to one gather, h Relation, all to all broadcast, array balancing, and CGM partitioning. Both libraries are available for download at http://cgm.dehne.net. 1
Ultimate Parallel List Ranking?
 Journal of Parallel and Distributed Computing
, 2000
"... Two improved listranking algorithms are presented. The "peelingoff" algorithm leads to an optimal PRAM algorithm, but was designed with application on a real parallel machine in mind. It is simpler than earlier algorithms, and in a range of problem sizes, where previously several algor ..."
Abstract

Cited by 21 (3 self)
 Add to MetaCart
Two improved listranking algorithms are presented. The "peelingoff" algorithm leads to an optimal PRAM algorithm, but was designed with application on a real parallel machine in mind. It is simpler than earlier algorithms, and in a range of problem sizes, where previously several algorithms where required for the best performance, now this single algorithm suffices. If the problem size is much larger than the number of available processors, then the "sparserulingsets" algorithm is even better. In previous versions this algorithm had very restricted practical application because of the large number of communication rounds it was performing. This main weakness of this algorithm is overcome by adding two new ideas, each of which reduces the number of communication rounds by a factor of two. 1 Introduction A list is a basic data structure: it consists of nodes which are linked together, so that every node has precisely one predecessor and one successor, except for the initial n...
Practical Parallel Algorithms for Minimum Spanning Trees
 In Workshop on Advances in Parallel and Distributed Systems
, 1998
"... We study parallel algorithms for computing the minimum spanning tree of a weighted undirected graph G with n vertices and m edges. We consider an input graph G with m=n p, where p is the number of processors. For this case, we show that simple algorithms with dataindependent communication patterns ..."
Abstract

Cited by 21 (0 self)
 Add to MetaCart
(Show Context)
We study parallel algorithms for computing the minimum spanning tree of a weighted undirected graph G with n vertices and m edges. We consider an input graph G with m=n p, where p is the number of processors. For this case, we show that simple algorithms with dataindependent communication patterns are efficient, both in theory and in practice. The algorithms are evaluated theoretically using Valiant's BSP model of parallel computation and empirically through implementation results.
Bulk Synchronous Parallel Algorithms for the External Memory Model
, 2002
"... Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchr ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including threedimensional convex hulls (twodimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well known I/O complexity lower bounds for various problems, including sorting.
CommunicationOptimal Parallel Minimum Spanning Tree Algorithms
, 1998
"... Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first nontrivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first nontrivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of the input graph, and m the number of edges of the input graph. We show that in the worst case, a total of \Omega\Gamma \Delta min(m; pn)) bits need to be communicated in order to solve the MST problem, where is the number of bits required to represent a single edge weight. This implies that if each message communicates at most bits, any BSP algorithm for finding an MST requires communication time \Omega\Gamma g \Delta min(m=p; n)), where g is the gap parameter of the BSP model. In addition, we present two algorithms with communication requirements that match our lower bound in different situations. Both algorithms perform linear work for appropriate values of n, m and p, and use a numbe...
Reducing i/o complexity by simulating coarse grained parallel algorithms
 In Proc. IPPS/SPDP
, 1999
"... Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a deterministic simulation technique which transforms parallel algorithms into ( ..."
Abstract

Cited by 12 (5 self)
 Add to MetaCart
Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a deterministic simulation technique which transforms parallel algorithms into (parallel) external memory algorithms. Specifically, we present a deterministic simulation technique which transforms Coarse Grained Multicomputer (CGM) algorithms into external memory algorithms for the Parallel Disk Model. Our technique optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel external memory algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including 3D convex hulls (2D Voronoi diagrams), and various graph problems. All of the (parallel) external memory algorithms obtained via simulation are analyzed with respect to the computation time, communication time and the number of I/O’s. Our results answer to the challenge posed by the ACM working group on storage I/O for largescale computing [8]. 1
Emulations Between QSM, BSP and LogP: A Framework for GeneralPurpose Parallel Algorithm Design
, 1998
"... We present workpreserving emulations with small slowdown between LogP and two other parallel models: BSP and QSM. In conjunction with earlier workpreserving emulations between QSM and BSP these results establish a close correspondence between these three generalpurpose parallel models. Our result ..."
Abstract

Cited by 12 (2 self)
 Add to MetaCart
We present workpreserving emulations with small slowdown between LogP and two other parallel models: BSP and QSM. In conjunction with earlier workpreserving emulations between QSM and BSP these results establish a close correspondence between these three generalpurpose parallel models. Our results also correct and improve on results reported earlier on emulations between BSP and LogP. In particular we shed new light on the relative power of stalling and nonstalling LogP models. The QSM is a sharedmemory model with only two parameters –p, the number of processors, and g, a bandwidth parameter. These features of the QSM make it a convenient model for parallel algorithm design, and the simple workpreserving emulations of QSM on BSP and LogP show that algorithms designed on the QSM will map well on to these other models. This presents a strong case for the use of QSM as the model of choice for parallel algorithm design. We present QSM algorithms for three basic problems – prefix sums, sample sort and list ranking. Using appropriate cost measures, we analyze the performance of these algorithms and describe simulation results. These results suggest that QSM analysis will predict algorithm performance quite accurately for problem sizes that arise in practice.
PRO: a model for Parallel ResourceOptimal computation
 IN 16TH ANNUAL INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS. IEEE, THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS
, 2002
"... We present a new parallel computation model that enables the design of resourceoptimal scalable parallel algorithms and simplifies their analysis. The model rests on the novel idea of incorporating relative optimality as an integral part and measuring the quality of a parallel algorithm in terms of ..."
Abstract

Cited by 12 (5 self)
 Add to MetaCart
We present a new parallel computation model that enables the design of resourceoptimal scalable parallel algorithms and simplifies their analysis. The model rests on the novel idea of incorporating relative optimality as an integral part and measuring the quality of a parallel algorithm in terms of granularity.