Results 1-10 of 40
Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
BSP vs LogP
, 1996
Cited by 32 (4 self)
A quantitative comparison of the BSP and LogP models of parallel computation is developed. We concentrate on a variant of LogP that disallows the so-called stalling behavior, although issues surrounding the stalling phenomenon are also explored. Very efficient cross simulations between the two models are derived, showing their substantial equivalence for algorithmic design guided by asymptotic analysis. It is also shown that the two models can be implemented with similar performance on most point-to-point networks. In conclusion, within the limits of our analysis that is mainly of an asymptotic nature, BSP and (stall-free) LogP can be viewed as closely related variants within the bandwidth-latency framework for modeling parallel computation. BSP seems somewhat preferable due to its greater simplicity and portability, and slightly greater power. LogP lends itself more naturally to multiuser mode.
Accounting for memory bank contention and delay in high-bandwidth multiprocessors
 In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1997
Cited by 30 (4 self)
For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant’s bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
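The roles of the two added parameters can be made concrete with a small simulation. The Python below is a sketch under stated assumptions, not the cost formula from the paper: it spreads the memory requests of one superstep uniformly at random over x·p banks, then reports a processor-side term g·h (h is the number of requests each processor issues) and a bank-side term d·k (k is the largest number of requests landing on any one bank). The function name, the gap parameter g borrowed from the standard BSP model, the random bank mapping, and the choice to take the larger of the two terms as the estimate are all assumptions made for illustration.

```python
import random
from collections import Counter

def superstep_terms(p, x, d, g, requests_per_proc, seed=0):
    """Return (processor_term, bank_term) for one simulated superstep.

    Assumptions (not taken from the paper): every processor issues the same
    number of requests, x is an integer, and each request targets a bank
    chosen uniformly at random among the x * p banks.
    """
    rng = random.Random(seed)
    banks = x * p
    # Count how many requests land on each bank.
    load = Counter(rng.randrange(banks) for _ in range(p * requests_per_proc))
    h = requests_per_proc      # requests issued per processor
    k = max(load.values())     # requests received by the busiest bank
    return g * h, d * k

# Hypothetical machine: 8 processors, 4 banks per processor, bank delay 6.
proc_term, bank_term = superstep_terms(p=8, x=4, d=6, g=1, requests_per_proc=100)
estimate = max(proc_term, bank_term)  # assumed combination of the two terms
```

With these hypothetical values the bank-side term dominates: 800 requests over 32 banks put at least 25 requests on some bank, so d·k is at least 150 while g·h is 100. Raising x or lowering d shrinks the bank-side term, which is the tradeoff the two parameters are meant to expose.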
Cache-efficient Dynamic Programming Algorithms for Multicores
, 2008
Cited by 27 (4 self)
We present cache-efficient chip multiprocessor (CMP) algorithms with good speedup for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), Gaussian Elimination Paradigm (GEP), and the parenthesis problem. For each class of problems, we develop a generic CMP algorithm with an associated tiling sequence. We then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm. We present experimental results on an 8-core Opteron for two sequence alignment problems that are important examples of LDDP. Our experimental results show good speedups for simple versions of our algorithms.
Concurrent Threads and Optimal Parallel Minimum Spanning Trees Algorithm
 J. ACM
, 2001
Cited by 24 (2 self)
This paper resolves a long-standing open problem on whether the concurrent write capability of the parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in O(log n) time. Specifically, we present a new algorithm to solve these problems in O(log n) time using a linear number of processors on the exclusive-read exclusive-write PRAM. The logarithmic time bound is actually optimal since it is well known that even computing the "OR" of n bits
A randomized time-work optimal parallel algorithm for finding a minimum spanning forest
 SIAM J. COMPUT
, 1999
Cited by 20 (3 self)
We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r.t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures. We also give a simple, general processor allocation scheme for tree-like computations.
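To make the problem statement concrete, here is a minimal sequential sketch of what a minimum spanning forest is. It is Kruskal's algorithm with a union-find structure, not the randomized parallel algorithm of the paper. It only shows the expected output: one minimum spanning tree per connected component. The function name and the (weight, u, v) edge format are assumptions chosen for the example.

```python
def minimum_spanning_forest(n, edges):
    """Sequential Kruskal sketch.

    n: number of vertices, labelled 0..n-1.
    edges: iterable of (weight, u, v) tuples of an undirected graph.
    Returns the forest edges in the order they were accepted.
    """
    parent = list(range(n))

    def find(v):
        # Find the component representative, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:              # edge joins two different components
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

# Components: {0, 1, 2}, {3, 4}, and the isolated vertex 5.
edges = [(4, 0, 1), (1, 1, 2), (3, 0, 2), (2, 3, 4)]
forest = minimum_spanning_forest(6, edges)
```

The graph has six vertices in three components, so the forest has 6 - 3 = 3 edges. The heaviest edge of the triangle on vertices 0, 1, 2 is the one left out.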
Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture
 Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science
, 2001
Cited by 20 (11 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
A Randomized Linear Work EREW PRAM Algorithm to Find a Minimum Spanning Forest
, 1997
Cited by 13 (2 self)
We present a randomized EREW PRAM algorithm to find a minimum spanning forest in a weighted undirected graph. On an n-vertex graph the algorithm runs in o((log n)^(1+ε)) expected time for any ε > 0 and performs linear expected work. This is the first linear work, polylog time algorithm on the EREW PRAM for this problem. This also gives parallel algorithms that perform expected linear work on two more realistic models of parallel computation, the QSM and the BSP.
1 Introduction
The design of efficient algorithms to find a minimum spanning forest (MSF) in a weighted undirected graph is a fundamental problem that has received much attention. There have been many algorithms designed for the MSF problem that run in close to linear time (see, e.g., [CLR91]). Recently a randomized linear-time algorithm for this problem was presented in [KKT95]. Based on this work [CKT94] presented a randomized parallel algorithm on the CRCW PRAM which runs in O(2^(log* n) log n) expected time whil...
Modeling parallel bandwidth: Local vs. global restrictions
Cited by 13 (4 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LogP) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems
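The difference between the two kinds of restriction can be illustrated with a small calculation. The Python sketch below is a hedged illustration, not an analysis from the paper. It counts only sends, and it charges a local model g·h time where h is the number of messages from the busiest sender. For a global model it gives a lower bound on the number of steps, under the added assumption that a processor sends at most one message per step. The function names and the example values of p, g and m are hypothetical.

```python
from math import ceil

def local_time(sends, g):
    """Per-processor limit: h messages take g * h time, h = busiest sender."""
    return g * max(sends)

def global_time(sends, m):
    """Aggregate limit: at most m messages in total per step.

    Assumption: each processor also sends at most one message per step,
    so the busiest sender gives a second lower bound.
    """
    return max(ceil(sum(sends) / m), max(sends))

# p = 8 processors; g = 4 and m = 2 give the same aggregate bandwidth (p / g = m).
g, m = 4, 2
balanced = [4] * 8        # 32 messages, spread evenly
skewed = [32] + [0] * 7   # the same 32 messages, all from one processor

balanced_costs = (local_time(balanced, g), global_time(balanced, m))
skewed_costs = (local_time(skewed, g), global_time(skewed, m))
```

For the balanced pattern both estimates are 16. For the skewed pattern the local estimate grows to 128 while the global lower bound is 32, which suggests why the two modeling choices can have different algorithmic consequences when communication is unbalanced.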