Results 1 - 10 of 560
Programming Parallel Algorithms, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract
-
Cited by 237 (11 self)
- Add to MetaCart
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages ...
MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems
- Proc. PPoPP '99, 1999
"... Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have d ..."
Abstract
-
Cited by 172 (27 self)
- Add to MetaCart
(Show Context)
Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms is collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
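The cluster-aware structure behind this can be illustrated with plain MPI primitives. The sketch below is not MagPIe code; it assumes the root is world rank 0 and uses a placeholder my_cluster_id() mapping, and it shows how a broadcast can cross each wide-area link only once before fanning out locally.

    /* Sketch only (not MagPIe's implementation): a two-level broadcast built
     * from standard MPI calls. Assumes the root is world rank 0. */
    #include <mpi.h>

    /* Placeholder mapping: pretend every 8 consecutive ranks form one cluster.
     * A real system would take this from the actual machine topology. */
    static int my_cluster_id(int world_rank) { return world_rank / 8; }

    void wan_aware_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, local_rank;
        MPI_Comm local, coordinators;

        MPI_Comm_rank(comm, &rank);

        /* One communicator per cluster; its rank 0 becomes the coordinator. */
        MPI_Comm_split(comm, my_cluster_id(rank), rank, &local);
        MPI_Comm_rank(local, &local_rank);

        /* Communicator containing only the per-cluster coordinators. */
        MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &coordinators);

        /* Step 1: the payload crosses each wide-area link exactly once,
         * from the root's coordinator to every other coordinator. */
        if (coordinators != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, type, 0, coordinators);
            MPI_Comm_free(&coordinators);
        }

        /* Step 2: fan out inside each cluster over the fast local network. */
        MPI_Bcast(buf, count, type, 0, local);
        MPI_Comm_free(&local);
    }

With this structure each remote cluster receives one wide-area message, and only a single wide-area latency sits on the critical path, which is the property the abstract emphasizes.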
Models and Languages for Parallel Computation
- ACM Computing Surveys, 1998
"... We survey parallel programming models and languages using 6 criteria [:] should be easy to program, have a software development methodology, be architecture-independent, be easy to understand, guranatee performance, and provide info about the cost of programs. ... We consider programming models in ..."
Abstract
-
Cited by 168 (4 self)
- Add to MetaCart
(Show Context)
We survey parallel programming models and languages using six criteria: a model should be easy to program, have a software development methodology, be architecture-independent, be easy to understand, guarantee performance, and provide information about the cost of programs. ... We consider programming models in six categories, depending on the level of abstraction they provide.
Parallel Programming in Split-C
- In Proceedings of Supercomputing '93, 1993
"... We introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual ..."
Abstract
-
Cited by 164 (18 self)
- Add to MetaCart
(Show Context)
We introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual assignment operators. These are used as tools to reduce the frequency and cost of remote access. The language allows a mixture of shared memory, message passing, and data parallel programming styles while providing efficient access to the underlying machine. We demonstrate the basic language concepts using regular and irregular parallel programs and give performance results for various stages of program optimization. Split-C is a parallel extension of the C programming language that supports efficient access to a global address space on current distributed memory multiprocessors. It retains the "small language" character of C and supports careful engineering and optimization of ...
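For flavor, here is a small Split-C-style fragment using the global pointers and split-phase assignment the abstract alludes to. It is a reconstruction from memory of the language's constructs, not code from the paper; treat the exact syntax and the names MYPROC, PROCS, toglobal, and sync() as assumptions.

    /* Split-C-flavored sketch (a parallel extension of C, so not plain ANSI C). */
    int counter;                      /* one instance per processor (remote read target) */
    int scratch;

    int read_neighbor(void)
    {
        /* Global pointer to the 'counter' living on the next processor. */
        int *global remote = toglobal((MYPROC + 1) % PROCS, &counter);
        int value;

        value := *remote;             /* split-phase get: initiates the remote read        */
        scratch = MYPROC * MYPROC;    /* unrelated local work overlaps the remote access   */
        sync();                       /* completes all outstanding split-phase assignments */

        return value;
    }

The point of the := operator is the one the abstract makes: the remote access is issued early, local work hides its latency, and sync() is the only blocking point.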
Effects of communication latency, overhead, and bandwidth in a cluster architecture
- In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
"... This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on ..."
Abstract
-
Cited by 108 (6 self)
- Add to MetaCart
(Show Context)
This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.
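One way to read the "highly linear dependence" claim is with a back-of-the-envelope model (ours, not the paper's): if each processor sends and receives about m messages during the run, and every message pays the overhead once at the sender and once at the receiver, then raising the overhead from o_0 to o_0 + Δo adds roughly

    T(o_0 + \Delta o) \approx T(o_0) + 2 m \Delta o

so runtime grows linearly in the overhead increase, with a slope set by the application's message count, consistent with the slowdowns reported above.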
An Implementation and Analysis of the Virtual Interface Architecture
- In Proceedings of SC '98, 1998
"... Rapid developments in networking technology and a rise in clustered computing have driven research studies in high performance communication architectures. In an effort to standardize the work in this area, industry leaders have developed the Virtual Interface Architecture (VIA) specification. This ..."
Abstract
-
Cited by 108 (8 self)
- Add to MetaCart
Rapid developments in networking technology and a rise in clustered computing have driven research studies in high performance communication architectures. In an effort to standardize the work in this area, industry leaders have developed the Virtual Interface Architecture (VIA) specification. This architecture seeks to provide an operating system-independent infrastructure for high-performance user-level networking in a generic environment. This paper evaluates the inherent costs and performance potential of the Virtual Interface Architecture through a prototype implementation over Myrinet. The VIA prototype is compared against established research user-level networks using simple communication benchmarks on the same hardware. We consider extensions to the VI Architecture that improve its performance for certain types of communication traffic and outline further research areas in the VIA design space that merit investigation.
A Model of Computation for MapReduce
- Proc. ACM-SIAM SODA, 2010
"... In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for ..."
Abstract
-
Cited by 103 (7 self)
- Add to MetaCart
(Show Context)
In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine: our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.
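For concreteness, the resource bounds usually stated for this model (reconstructed from the MRC definition as commonly presented; treat the exact exponents as an assumption rather than a quotation from the paper) are, for an input of size n and a fixed ε > 0:

    \text{machines} = O(n^{1-\epsilon}), \qquad
    \text{memory per machine} = O(n^{1-\epsilon}), \qquad
    \text{rounds in } \mathcal{MRC}^i = O(\log^i n)

with each map or reduce instance allowed time polynomial in n per round. Sublinear machines and memory rule out trivially loading the whole input onto one node, while the polynomial per-round time is what lets a single machine do substantial sequential work between the parallel rounds.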
Efficient Algorithms for All-to-All Communications in Multi-Port Message-Passing Systems
- IEEE Transactions on Parallel and Distributed Systems, 1997
"... We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-toall personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected messagepassing system, in which the performance of any point-to- ..."
Abstract
-
Cited by 103 (0 self)
- Add to MetaCart
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tunability of our index algorithms on the IBM SP-1 parallel system. In the concatenation operation, among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors, and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
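The start-up versus data-transfer trade-off can be made concrete for the single-port case (k = 1). The figures below are the textbook endpoints for n processors exchanging blocks of B bytes, with t_s the per-message start-up time and t_b the per-byte transfer time; they are our reconstruction of the standard constructions, not the paper's own statement:

    T_{\text{round-minimal}} \approx \lceil \log_2 n \rceil \, t_s + \tfrac{n}{2} \lceil \log_2 n \rceil \, B \, t_b
    T_{\text{transfer-minimal}} \approx (n - 1) \, t_s + (n - 1) \, B \, t_b

A round-minimal index algorithm pays only about log_2 n start-ups but forwards each block several times, while a transfer-minimal algorithm sends the minimal n - 1 blocks per processor at the cost of n - 1 start-ups; the paper's class of algorithms interpolates between such endpoints.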
BSPlib: The BSP Programming Library, 1998
"... BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming ..."
Abstract
-
Cited by 98 (8 self)
- Add to MetaCart
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.
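A minimal usage sketch of the one-sided (direct remote memory access) style, assuming the commonly documented BSPlib C bindings (bsp_push_reg, bsp_put, bsp_sync); the paper itself is the definitive reference for the signatures.

    /* Every process writes its id into the next process's 'slot' with a
       one-sided bsp_put; the superstep boundary at bsp_sync() makes the
       write visible. */
    #include <stdio.h>
    #include "bsp.h"

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());           /* start the SPMD section on all processes */

        int p    = bsp_nprocs();
        int pid  = bsp_pid();
        int slot = -1;

        bsp_push_reg(&slot, sizeof(int));  /* make 'slot' remotely writable */
        bsp_sync();                        /* registration takes effect after this superstep */

        int next = (pid + 1) % p;
        bsp_put(next, &pid, &slot, 0, sizeof(int));  /* one-sided remote write */
        bsp_sync();                        /* the put is guaranteed delivered here */

        printf("process %d received %d\n", pid, slot);

        bsp_pop_reg(&slot);
        bsp_end();
        return 0;
    }

The same exchange could also be written in the library's bulk synchronous message-passing style (bsp_send and bsp_move); the put/get style is shown here because the abstract lists it first.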