Communication-Efficient Parallel Sorting
, 1996
Cited by 60 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is $O\left(\frac{n \log n}{p}\right)$ and a number of communication rounds that is $O\left(\frac{\log n}{\log(h+1)}\right)$ for $h = \Theta(n/p)$. The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when $p \le n^{1-1/c}$ for a constant $c \ge 1$. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first $O(n/h)$ of an arbitrary number of processors in a BSP computer requires $\Omega(\log n / \log(h$...
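The role of the h + 1 term in the round bound can be illustrated with a small sketch. This is not the paper's sorting algorithm or its lower-bound argument. It is a hypothetical fan-in reduction, assuming each processor starts with at most h bits and that in every round a group leader receives at most h partial results and combines them with its own. The function name `bsp_or_rounds` is an assumption made for illustration.

```python
def bsp_or_rounds(bits, h):
    """Compute the OR of `bits` with a fan-in (h + 1) reduction.

    Returns (result, rounds), where `rounds` counts communication rounds
    in which every processor receives at most h items.
    """
    if h < 1:
        raise ValueError("h must be at least 1")

    # Local phase (no communication): each processor ORs its own <= h bits.
    partial = [any(bits[i:i + h]) for i in range(0, len(bits), h)]

    rounds = 0
    while len(partial) > 1:
        # One round: each group leader receives up to h partial results
        # and combines them with its own, so the fan-in is h + 1.
        partial = [any(partial[i:i + h + 1])
                   for i in range(0, len(partial), h + 1)]
        rounds += 1

    result = partial[0] if partial else False
    return result, rounds
```

For example, `bsp_or_rounds([0] * 999 + [1], 9)` returns `(True, 3)`: 1000 bits leave 112 partial results, which shrink to 12, then 2, then 1. The round count grows roughly like log n / log(h + 1), which has the same form as the bound quoted in the abstract.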
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
, 1999
Cited by 41 (11 self)
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP, and on a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios
Parallel Sorting With Limited Bandwidth
- in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
Cited by 26 (5 self)
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the trade-off between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of $\Omega\left(\frac{n \log m}{m}\right)$ on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot...
Parallel Balanced Allocations
- In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1996
Cited by 17 (1 self)
We study the well-known problem of throwing m balls into n bins. If each ball in the sequential game is allowed to select more than one bin, the maximum load of the bins can be exponentially reduced compared to the `classical balls into bins' game. We consider a static and a dynamic variant of a randomized parallel allocation where each ball can choose a constant number of bins. All results hold with high probability. In the static case all m balls arrive at the same time. We analyze for m = n a very simple optimal class of protocols achieving maximum load $O\left(\sqrt[r]{\frac{\log n}{\log\log n}}\right)$ if r rounds of communication are allowed. This matches the lower bound of [ACMR95]. Furthermore, we generalize the protocols to the case of $m > n$ balls. An optimal load of $O(m/n)$ can be achieved using $\frac{\log\log n}{\log(m/n)}$ rounds of communication. Hence, for $m = n \frac{\log\log n}{\log\log\log n}$ balls this slackness allows the amount of communication to be hidden. In the `classical balls into bins' game this op...
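The gap between one choice and several choices in the sequential game can be seen in a short simulation. The sketch below covers only the sequential greedy rule (each ball goes to the least loaded of its candidate bins), not the parallel round-based protocols the paper analyzes. The function name `max_load` and its parameters are assumptions made for illustration.

```python
import random


def max_load(n_balls, n_bins, choices, rng):
    """Throw balls sequentially; each ball picks `choices` random bins
    and is placed in the least loaded of them.

    Returns (maximum load, list of all bin loads).
    """
    loads = [0] * n_bins
    for _ in range(n_balls):
        # Candidate bins are drawn independently and uniformly at random.
        candidates = [rng.randrange(n_bins) for _ in range(choices)]
        # Greedy rule: use the candidate with the smallest current load.
        best = min(candidates, key=loads.__getitem__)
        loads[best] += 1
    return max(loads), loads


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    n = 10_000
    one_choice, _ = max_load(n, n, 1, rng)
    two_choice, _ = max_load(n, n, 2, rng)
    print("max load with 1 choice: ", one_choice)
    print("max load with 2 choices:", two_choice)
```

With m = n and n in the thousands, the two-choice maximum load is typically noticeably smaller than the one-choice maximum load, which is the effect the abstract describes as an exponential reduction.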
Modeling parallel bandwidth: Local vs. global restrictions
Cited by 15 (4 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LogP) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems
New Coding Techniques for Improved Bandwidth Utilization
- In Proc. 37th IEEE Symp. on Foundations of Computer Science
, 1998
Cited by 13 (4 self)
In this paper, we introduce a new coding technique for transmitting the XOR of carefully selected patterns of bits to be communicated which greatly reduces bandwidth requirements in some settings. This technique has broader applications. For example, we demonstrate that the coding technique has a surprising application to a simple I/O (Input/Output) complexity problem related to finding the transpose of a matrix. Our main results are developed in the PRAM(m) model, a limited bandwidth PRAM model where p processors communicate through a small globally shared memory of m bits. We provide new algorithms for the problems of sorting and permutation routing. For the concurrent read PRAM(m), as p grows with m
A General Purpose Shared-Memory Model For Parallel Computation
, 1997
Cited by 5 (2 self)
We describe a general-purpose shared-memory model for parallel computation, called the QSM [21], which provides a high-level shared-memory abstraction for parallel algorithm design, as well as the ability to be emulated in an effective manner on the BSP, a lower-level, distributed-memory model. We present new emulation results that show that very little generality is lost by not having a `gap parameter' at memory.
Parallel Algorithms for Database Operations and a Database Operation for Parallel Algorithms
, 1995
Cited by 4 (0 self)
This paper establishes some significant links between two areas: (i) relational parallel database systems; and (ii) the design and analysis of parallel algorithms. The paper begins with a fundamental but very simple observation: implementing a Join operation in the context of relational parallel database systems is at least as expensive as implementing an arbitrary PRAM computation. Thus, the efficiency with which a given parallel computer can support a parallel relational database where Joins are fairly frequent is strongly related to the efficiency with which that computer can support the PRAM as one of its programmer's models. The main technical contribution is an efficient parallel algorithm for the Join operation on a model where, in order to use the available bandwidth effectively, communication has to be performed in large blocks.
1 Introduction
A key performance bottleneck for various database applications on serial computers has been high latency and low bandwidth while ...
What Good Are Shared-Memory Models?
- International Conference on Parallel Processing
, 1996
Cited by 3 (1 self)
Shared memory models have been criticized for years for failing to model essential realities of parallel machines. Given the current wave of popular message-passing and distributed memory models (e.g., BSP, LogP), it is natural to ask whether shared memory models have outlived any usefulness they may have had. In this invited position paper, we discuss the continuing importance of shared memory models in the design and analysis of parallel algorithms. We describe a new model, the Queuing Shared Memory (QSM) model, that accounts for limited communication bandwidth while still providing a shared memory abstraction, and provide evidence of its practicality. Finally, we discuss important areas for future models research. We argue that the compelling need for parallel computing in large scale data analysis (e.g., decision support, data mining) implies that the most important modeling issue going forward concerns how best to model disk I/O.
Compression using efficient multi-casting
- Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing
, 2000
Cited by 2 (0 self)
Many multiprocessor systems have the ability to broadcast and/or multicast information efficiently. However, this ability is often overlooked when designing algorithms for these systems. In this paper, we introduce a new compression technique that uses efficient multicasting to significantly reduce the amount of information communicated during parallel and distributed computation, resulting in significantly faster algorithms for Fast Fourier Transforms and sorting on shared memory parallel models with limited bandwidth. These algorithms demonstrate the importance of taking advantage of efficient multicasting. The compression technique uses a new, natural variant of Ramsey theory, which may be of independent interest.

