BSPlib: The BSP Programming Library
, 1998
Cited by 98 (8 self)
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.
Scientific Computing on Bulk Synchronous Parallel Architectures
Cited by 75 (16 self)
We theoretically and experimentally analyse the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures.
Communication-Efficient Parallel Sorting
, 1996
Cited by 74 (5 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is $O(\frac{n \log n}{p})$ and a number of communication rounds that is $O(\frac{\log n}{\log(h+1)})$ for $h = \Theta(n/p)$. The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when $p \le n^{1-1/c}$ for a constant $c \ge 1$. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first $O(n/h)$ of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
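The claim that the round count is constant when $p \le n^{1-1/c}$ can be checked with a short calculation. The sketch below is not taken from the paper; it assumes a hypothetical constant $a > 0$ hidden in $h = \Theta(n/p)$ and takes n large enough that the denominators are positive.

```latex
\begin{align*}
p \le n^{1-1/c} &\implies \frac{n}{p} \ge n^{1/c}, \\
h \ge a\,\frac{n}{p} &\implies \log(h+1) > \log h \ge \tfrac{1}{c}\log n + \log a, \\
\frac{\log n}{\log(h+1)} &< \frac{\log n}{\tfrac{1}{c}\log n + \log a}
  = \frac{c}{1 + c\,\log a/\log n} \xrightarrow[n\to\infty]{} c .
\end{align*}
```

So under these assumptions the number of rounds is $O(c)$, which is $O(1)$ for constant c.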
Vsched: Mixing batch and interactive virtual machines using periodic real-time scheduling
- In Proceedings of ACM/IEEE SC 2005 (Supercomputing)
, 2005
Cited by 72 (16 self)
We are developing Virtuoso, a system for distributed computing using virtual machines (VMs). Virtuoso must be able to mix batch and interactive VMs on the same physical hardware, while satisfying constraints on responsiveness and compute rates for each workload. VSched is the component of Virtuoso that provides this capability. VSched is an entirely user-level tool that interacts with the stock Linux kernel running below any type-II virtual machine monitor to schedule all VMs (indeed, any process) using a periodic real-time scheduling model. This abstraction allows compute rate and responsiveness constraints to be straightforwardly described using a period and a slice within the period, and it allows for fast and simple admission control. This paper makes the case for periodic real-time scheduling for VM-based computing environments, and then describes and evaluates VSched. It also applies VSched to scheduling parallel workloads, showing that it can help a BSP application maintain a fixed stable performance despite externally caused load imbalance.
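A period-and-slice description makes admission control a matter of simple arithmetic. The sketch below is illustrative only: the abstract does not say which test VSched applies, so this assumes earliest-deadline-first scheduling on a single CPU, where a set of periodic tasks is schedulable when total utilization (the sum of slice/period) does not exceed 1. The names `utilization` and `admit` are hypothetical, and times are assumed to be integers in a common unit such as milliseconds.

```python
from fractions import Fraction


def utilization(tasks):
    """Total CPU share of a list of (period, slice) pairs, as an exact fraction."""
    return sum((Fraction(slice_, period) for period, slice_ in tasks), Fraction(0))


def admit(existing, candidate, limit=Fraction(1)):
    """Return True if `candidate` can join `existing` without exceeding `limit`.

    Assumption: a single-CPU EDF utilization test, not necessarily VSched's own.
    """
    period, slice_ = candidate
    if not 0 < slice_ <= period:
        raise ValueError("slice must be positive and no longer than the period")
    return utilization(list(existing) + [candidate]) <= limit


if __name__ == "__main__":
    running = [(100, 50)]                 # 50 ms of CPU every 100 ms
    print(admit(running, (200, 100)))     # another 50% share fits exactly
    print(admit(running + [(200, 100)], (10, 1)))  # CPU already fully booked
```

Exact fractions avoid rounding surprises when the utilization lands exactly on the limit.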
Efficient parallel graph algorithms for coarse grained multicomputers and BSP (Extended Abstract)
- In Proc. 24th International Colloquium on Automata, Languages and Programming (ICALP'97)
, 1997
Cited by 62 (22 self)
In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well-known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1-7 require $O(\log p)$ communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios $n/p$, i.e., they are fully scalable, and for Problems 3-8 it is assumed that $n/p \ge p^{\epsilon}$, $\epsilon > 0$, which is true for all commercially
A Randomized Parallel 3D Convex Hull Algorithm For Coarse Grained Multicomputers
- In Proc. ACM Symp. on Parallel Algorithms and Architectures
, 1995
Cited by 50 (10 self)
We present a randomized parallel algorithm for constructing the 3D convex hull on a generic p-processor coarse grained multicomputer with arbitrary interconnection network and $n/p$ local memory per processor, where $n/p \ge p^{2+\epsilon}$ (for some arbitrarily small $\epsilon > 0$). For any given set of n points in 3-space, the algorithm computes the 3D convex hull, with high probability, in $O(\frac{n \log n}{p})$ local computation time and $O(1)$ communication phases with at most $O(n/p)$ data sent/received by each processor. That is, with high probability, the algorithm computes the 3D convex hull of an arbitrary point set in time $O(\frac{n \log n}{p} + \Gamma_{n,p})$, where $\Gamma_{n,p}$ denotes the time complexity of one communication phase. The assumption $n/p \ge p^{2+\epsilon}$ implies a coarse grained, limited parallelism, model which is applicable to most commercially available multiprocessors. In the terminology of the BSP model, our algorithm requires, with high probability, $O(1)$ supersteps, synchronization period L = \Th...
Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms
, 2003
Cited by 45 (11 self)
External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing.
Doubly Logarithmic Communication Algorithms for Optical Communication Parallel Computers
- In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1994
Cited by 41 (5 self)
In this paper we consider the problem of interprocessor communication on parallel computers that have optical communication networks. We consider the Completely Connected Optical Communication Parallel Computer (OCPC), which has a completely connected optical network, and also the Mesh of Optical Buses Parallel Computer (MOBPC), which has a mesh of optical buses as its communication network. The particular communication problem that we study is that of realizing an h-relation. In this problem, each processor has at most h messages to send and at most h messages to receive. It is clear that any 1-relation can be realized in one communication step on an OCPC. However, the best previously known p-processor OCPC algorithm for realizing an arbitrary h-relation for $h > 1$ requires $\Theta(h + \log p)$ expected communication steps. (This algorithm is due to Valiant and is based on earlier work of Anderson and Miller.) Valiant's algorithm is optimal only for $h = \Omega(\log p)$ and it is an op...
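The definition of an h-relation given above can be made concrete with a few lines of code. This is a minimal sketch, not code from the paper: the function name `relation_h` and the representation of a communication pattern as a list of (source, destination) processor pairs are assumptions made for illustration. It returns the smallest h for which the pattern is an h-relation, i.e. the larger of the maximum send count and the maximum receive count over all processors.

```python
from collections import Counter


def relation_h(messages):
    """Smallest h such that `messages` is an h-relation.

    `messages` is an iterable of (source, destination) pairs, one per message.
    """
    messages = list(messages)
    sent = Counter(src for src, _ in messages)       # messages sent per processor
    received = Counter(dst for _, dst in messages)   # messages received per processor
    return max(list(sent.values()) + list(received.values()), default=0)


if __name__ == "__main__":
    permutation = [(0, 1), (1, 2), (2, 0)]
    print(relation_h(permutation))   # every processor sends one and receives one

    hot_spot = [(0, 1), (0, 2), (1, 2), (3, 2)]
    print(relation_h(hot_spot))      # processor 2 receives three messages
```

A permutation of the processors yields h = 1, the case the abstract notes can be realized in a single OCPC step.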
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
, 1999
Cited by 40 (12 self)
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP and a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios
Towards Efficiency and Portability: Programming with the BSP Model
- In Proc. 8th ACM Symp. on Parallel Algorithms and Architectures
, 1996
Cited by 35 (3 self)
The Bulk-Synchronous Parallel (BSP) model was proposed by Valiant as a model for general-purpose parallel computation. The objective of the model is to allow the design of parallel programs that can be executed efficiently on a variety of architectures. While many theoretical arguments in support of the BSP model have been presented, the degree to which the model can be efficiently utilized on existing parallel machines remains unclear. To explore this question, we implemented a small library of BSP functions, called the Green BSP library, on several parallel platforms. We also created a number of parallel applications based on this library. Here, we report on the performance of six of these applications on three different parallel platforms. Our preliminary results suggest that the BSP model can be used to develop efficient and portable programs for a range of machines and applications.