LogP: Towards a Realistic Model of Parallel Computation
, 1993
"... A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding developme ..."
Abstract

Cited by 560 (15 self)
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or on overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
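The abstract names the four quantities only abstractly; in the published model they are the latency L, the per-message overhead o, the gap g (reciprocal of per-processor communication bandwidth), and the processor count P. A minimal sketch, with illustrative parameter values that are not taken from the paper, of how these parameters combine into simple cost estimates:

```python
# Sketch of LogP cost arithmetic; parameter values below are illustrative,
# in hypothetical time units, not measurements from any machine.

def logp_point_to_point(L, o):
    """Time for one small message: send overhead + latency + receive overhead."""
    return 2 * o + L

def logp_n_messages(L, o, g, n):
    """Time for a processor to send n small messages to one peer.
    Consecutive injections are spaced by max(g, o) (the network accepts
    a message at most every g units, and the sender is busy for o per
    send); the final message still pays latency and receive overhead."""
    return (n - 1) * max(g, o) + 2 * o + L

print(logp_point_to_point(L=6, o=2))        # 2*2 + 6 = 10
print(logp_n_messages(L=6, o=2, g=4, n=5))  # 4*4 + 2*2 + 6 = 26
```

The `max(g, o)` term reflects that whichever of the gap or the sender's overhead is larger limits the injection rate.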
LogGP: Incorporating Long Messages into the LogP Model - One step closer towards a realistic model for parallel computation
, 1995
"... We present a new model of parallel computationthe LogGP modeland use it to analyze a number of algorithms, most notably, the single node scatter (onetoall personalized broadcast). The LogGP model is an extension of the LogP model for parallel computation [CKP + 93] which abstracts the comm ..."
Abstract

Cited by 287 (1 self)
We present a new model of parallel computation, the LogGP model, and use it to analyze a number of algorithms, most notably, the single node scatter (one-to-all personalized broadcast). The LogGP model is an extension of the LogP model for parallel computation [CKP+93], which abstracts the communication of fixed-sized short messages through the use of four parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). As evidenced by experimental data, the LogP model can accurately predict communication performance when only short messages are sent (as on the CM-5) [CKP+93, CDMS94]. However, many existing parallel machines have special support for long messages and achieve a much higher bandwidth for long messages compared to short messages (e.g., IBM SP2, Paragon, Meiko CS-2, Ncube/2). We extend the basic LogP model with a linear model for long messages. This combination, which we call the LogGP model of parallel computation, has o...
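The "linear model for long messages" can be sketched concretely: LogGP adds a Gap per byte, G, so a k-byte message costs o + (k-1)G + L + o instead of paying the short-message cost for every fragment. The numeric values below are illustrative, not measurements from any of the machines named above:

```python
# Sketch of the LogGP long-message cost versus the LogP short-message cost.

def logp_short(L, o):
    """LogP cost of one fixed-size short message."""
    return o + L + o

def loggp_long(L, o, G, k):
    """LogGP cost of one k-byte message: the sender pays overhead o,
    each byte after the first is injected every G time units (the Gap
    per byte), and the last byte pays latency L plus receive overhead o."""
    return o + (k - 1) * G + L + o

# Illustrative values: a 1000-byte message under the long-message model.
print(logp_short(L=6, o=2))                  # 10
print(loggp_long(L=6, o=2, G=0.5, k=1000))   # 2 + 999*0.5 + 6 + 2 = 509.5
```

Under this model, long-message bandwidth is governed by 1/G rather than by the per-message gap g, which is what lets it capture machines whose long-message bandwidth greatly exceeds their short-message bandwidth.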
Programming Parallel Algorithms
, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract

Cited by 237 (11 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately, there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
Efficient Algorithms for All-to-All Communications in Multi-Port Message-Passing Systems
 IEEE Transactions on Parallel and Distributed Systems
, 1997
"... We present efficient algorithms for two alltoall communication operations in messagepassing systems: index (or alltoall personalized communication) and concatenation (or alltoall broadcast). We assume a model of a fully connected messagepassing system, in which the performance of any pointto ..."
Abstract

Cited by 103 (0 self)
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication startup time and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a tradeoff between the communication startup time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the startup time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tunability of our index algorithms on the IBM SP1 parallel system. In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
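The index operation described above is, in effect, a transpose of the n×n array of blocks. A small reference implementation of its semantics only (not of the paper's communication schedule, which is what the algorithms optimize):

```python
def index_operation(blocks):
    """Reference semantics of the index (all-to-all personalized) operation.

    blocks[j][i] is the i-th block initially held by processor j.
    Afterwards, processor i holds processor j's i-th block in position j,
    i.e. the block array is transposed."""
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]

before = [["a0", "a1", "a2"],
          ["b0", "b1", "b2"],
          ["c0", "c1", "c2"]]
print(index_operation(before))
# [['a0', 'b0', 'c0'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
```

Seen this way, the startup/bandwidth tradeoff in the paper's algorithm class is a choice of how many rounds to spend realizing this transpose and how much data each round carries.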
Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance
, 2000
"... The ecient implementation of collective communication operations has received much attention. Initial eorts modeled network communication and produced \optimal" trees based on those models. However, the models used by these initial eorts assumed equal pointtopoint latencies between any two pr ..."
Abstract

Cited by 89 (12 self)
The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced "optimal" trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area "computational grids", and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every lev...
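The two-layer view criticized above can be sketched as a two-phase broadcast: cross the slow inter-cluster channel once per cluster, then fan out locally. The paper's contribution generalizes this idea to trees with more than two levels; the cluster layout and channel names below are hypothetical:

```python
def hierarchical_broadcast(root, clusters):
    """Illustrative two-phase broadcast over a two-level cluster hierarchy.

    Phase 1: the root sends across the slow inter-cluster channel to one
    representative per cluster. Phase 2: each representative relays the
    message over the fast intra-cluster channel.
    Returns the list of (sender, receiver, channel) message events."""
    events = []
    for cluster in clusters:
        # The root serves as its own cluster's representative.
        rep = root if root in cluster else cluster[0]
        if rep != root:
            events.append((root, rep, "inter-cluster"))
        for node in cluster:
            if node != rep:
                events.append((rep, node, "intra-cluster"))
    return events

msgs = hierarchical_broadcast(0, [[0, 1, 2], [3, 4], [5, 6, 7]])
# Only two messages cross the slow inter-cluster channel:
print(sum(1 for _, _, ch in msgs if ch == "inter-cluster"))  # 2
```

A multilevel version would recurse: within each cluster, the representative would again pick sub-representatives for the next-faster layer, exploiting cost differences at every level rather than just two.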
Fault-local distributed mending
 In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing
, 1995
"... As communication networks grow, existing fault handling tools that involve global measures such as global timeouts or reset procedures become increasingly unaffordable, since their cost grows with the size of the network. Rather, for a fault handling mechanism to scale to large networks, its cost m ..."
Abstract

Cited by 69 (16 self)
As communication networks grow, existing fault handling tools that involve global measures such as global timeouts or reset procedures become increasingly unaffordable, since their cost grows with the size of the network. Rather, for a fault handling mechanism to scale to large networks, its cost must depend only on the number of failed nodes (which, thanks to today's technology, grows much more slowly than the networks). Moreover, it should allow the nonfaulty regions of the networks to continue their operation even during the recovery of the faulty parts. This paper introduces the concepts of fault locality and fault-locally mendable problems, which are problems for which there are correction algorithms (applied after faults) whose cost depends only on the (unknown) number of faults. We show that any input-output problem is fault-locally mendable. The solution involves a novel technique combining data structures and "local votes" among nodes, which may be of interest in itself. © 1999 Academic Press
CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers
 IEEE Transactions on Parallel and Distributed Systems
, 1995
"... AbstractA collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the a ..."
Abstract

Cited by 68 (8 self)
A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model. Index Terms: Collective communication algorithms, collective communication semantics, message-passing parallel systems, portable library, process group, tunable algorithms.
Message Multicasting In Heterogeneous Networks
, 1998
"... In heterogeneous networks, sending messages may incur different delays on different links, and each node may have a different switching time between messages. The well studied Telephone model is obtained when all link delays and switching times are equal to one unit. We investigate the problem of fi ..."
Abstract

Cited by 55 (0 self)
In heterogeneous networks, sending messages may incur different delays on different links, and each node may have a different switching time between messages. The well-studied Telephone model is obtained when all link delays and switching times are equal to one unit. We investigate the problem of finding the minimum time required to multicast a message from one source to a subset of the nodes of size k. The problem is NP-hard even in the basic Telephone model. We present a polynomial time algorithm that approximates the minimum multicast time within a factor of O(log k). Our algorithm improves on the best known approximation factor for the Telephone model by a factor of O(log n / log log k). No approximation algorithms were known for the general model considered in this paper.
New models and algorithms for future networks
 IEEE Transactions on Information Theory
, 1988
"... In future networks transmission and switching capacity will dominate processing capacity. In this paper we investigate the way in which distributed algorithms should be changed in order to operate efficiently in this new environment. We introduce a class of new models for distributed algorithms whic ..."
Abstract

Cited by 45 (20 self)
In future networks, transmission and switching capacity will dominate processing capacity. In this paper we investigate the way in which distributed algorithms should be changed in order to operate efficiently in this new environment. We introduce a class of new models for distributed algorithms which make explicit the difference between switching and processing. Based on these new models we define new message and time complexity measures which, we believe, capture the costs in many high speed networks more accurately than traditional measures. In order to explore the consequences of the new models, we examine three problems in distributed computation. For the problem of maintaining network topology we devise a broadcast algorithm which takes O(n) messages and O(log n) time in the new measure. For the problem of leader election we present a simple algorithm that uses O(n) messages and O(n) time. The third problem, distributed computation of a "globally sensitive" function, demonstrates some important features and tradeoffs in the new models and emphasizes the differences with the traditional network model.