Results 1 - 10 of 22
Models of Machines and Computation for Mapping in Multicomputers
1993
Cited by 76 (1 self)
It is now more than a quarter of a century since researchers started publishing papers on mapping strategies for distributing computation across the computation resource of multiprocessor systems. There exists a large body of literature on the subject, but there is no commonly-accepted framework whereby results in the field can be compared. Nor is it always easy to assess the relevance of a new result to a particular problem. Furthermore, changes in parallel computing technology have made some of the earlier work of less relevance to current multiprocessor systems. Versions of the mapping problem are classified, and research in the field is considered in terms of its relevance to the problem of programming currently available hardware in the form of a distributed memory multiple instruction stream multiple data stream computer: a multicomputer.
Performance Modeling of Distributed Memory Architectures
Journal of Parallel and Distributed Computing, 1991
Cited by 20 (7 self)
We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multi-dimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices, and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory and a set of registers. The analytic models are verified by measurements on the Connection Machine model CM-2.
1 Introduction
This paper addresses crucial issues in performance modeling of distributed memory architectures designed for s...
Minimizing the Communication Time for Matrix Multiplication on Multiprocessors
Parallel Computing, 1992
Cited by 20 (9 self)
We present one matrix multiplication algorithm for two-dimensional arrays of processing nodes, and one algorithm for three-dimensional nodal arrays. One-dimensional nodal arrays are treated as a degenerate case. The algorithms are designed to utilize fully the communications bandwidth in high degree networks in which the one-, two-, or three-dimensional arrays may be embedded. For binary n-cubes, our algorithms offer a speedup of the communication over previous algorithms for square matrices and square two-dimensional arrays by a factor of n/2. Configuring the N = 2^n processing nodes as a three-dimensional array may reduce the communication complexity by a factor of N^(1/6) compared to a two-dimensional nodal array. The three-dimensional algorithm requires temporary storage proportional to the length of the nodal array axis aligned with the axis shared between the multiplier and the multiplicand. The optimal two-dimensional nodal array shape with respect to communicati...
Index Transformation Algorithms in a Linear Algebra Framework
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 15 (3 self)
We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's.
Keywords and phrases: binary-complement/permute, binary hypercube, Connection Machine, Gray code, index transformation, multiprocessor communication, routing, shuffle.
Simultaneously appears as Lawrence Berkeley Laboratory technical report LBL-31841. Supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contract DE-AC03-76SF000...
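As a small illustration of treating an index transformation as linear algebra over GF(2), the sketch below encodes a binary index into the binary-reflected Gray code both with the usual shift-and-XOR trick and as a 0/1 matrix applied to the bit vector. The function names and the bit ordering (bit 0 is least significant) are assumptions made for this example; it is not taken from the paper.

```python
def gray_encode(b: int) -> int:
    """Binary-reflected Gray code: g_i = b_i XOR b_(i+1)."""
    return b ^ (b >> 1)


def gray_decode(g: int) -> int:
    """Inverse transform: b is the XOR of g shifted right by 0, 1, 2, ..."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def gray_matrix(q: int) -> list[list[int]]:
    """q x q matrix over GF(2); row i has ones in columns i and i+1."""
    return [[1 if j in (i, i + 1) else 0 for j in range(q)] for i in range(q)]


def apply_gf2(matrix: list[list[int]], x: int, q: int) -> int:
    """Multiply a 0/1 matrix by the q-bit vector of x, with arithmetic mod 2."""
    bits = [(x >> j) & 1 for j in range(q)]
    out = 0
    for i, row in enumerate(matrix):
        bit = sum(r & b for r, b in zip(row, bits)) % 2
        out |= bit << i
    return out
```

Under these assumptions, `apply_gf2(gray_matrix(q), x, q)` agrees with `gray_encode(x)` for every q-bit index x, which is the sense in which Gray encoding is a linear map on the index bits.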
Computing Global Combine Operations in the Multi-Port Postal Model
1996
Cited by 13 (0 self)
Consider a message-passing system of n processors, in which each processor holds one piece of data initially. The goal is to compute an associative and commutative reduction function on the n distributed pieces of data and to make the result known to all the n processors. This operation is frequently used in many message-passing systems and is typically referred to as global combine, census computation, or gossiping. This paper explores the problem of global combine in the multi-port postal model for message-passing systems. This model is characterized by three parameters: n, the number of processors; k, the number of ports per processor; and λ, the communication latency. In this model, in every round r, each processor can send k distinct messages to k other processors, and it can receive k messages that were sent out from k other processors λ - 1 rounds earlier. This paper provides an optimal algorithm for the global combine problem that requires the least number of comm...
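To make the global combine operation concrete, here is a minimal simulation using recursive doubling, a standard textbook technique. It assumes a single port per processor, a latency of one round, and n a power of two, so it is not the multi-port postal-model algorithm the paper describes. The function name `global_combine` is hypothetical.

```python
import operator


def global_combine(values, op=operator.add):
    """Simulate an all-to-all reduction by recursive doubling.

    Assumptions for this sketch: n is a power of two, each processor
    exchanges one message per round, and op is associative and commutative.
    Returns (value held by every processor, number of rounds used).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    current = list(values)
    rounds = 0
    distance = 1
    while distance < n:
        # Processor i exchanges its partial result with partner i XOR distance.
        current = [op(current[i], current[i ^ distance]) for i in range(n)]
        distance *= 2
        rounds += 1
    return current, rounds
```

With eight processors holding 1 through 8, every processor ends up with the sum 36 after three rounds, that is, after log2(n) exchanges.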
Minimizing Communication Overhead Using Pipelining for Multi-Dimensional FFT on Distributed Memory Machines
1993
Cited by 10 (3 self)
In this paper we have presented different algorithms to compute the bi-dimensional FFT. These methods overlap communication with computation and reduce the number of start-up costs. We have shown that the overlap is total using coarse-grain pipelining. The experiments corroborate this theoretical analysis well. Some other methods, using the SPMD-like programming paradigm, and other experiments are discussed in [4].
Scattering and Gathering Messages in Networks of Processors
1993
Cited by 9 (2 self)
The operations of scattering and gathering in a network of processors involve one processor of the network, call it P_0, communicating with all other processors. In scattering, P_0 sends (possibly) distinct messages to all other processors; in gathering, the other processors send (possibly) distinct messages to P_0. We consider networks that are trees of processors; we present algorithms for scattering messages from and gathering messages to the processor that resides at the root of the tree. The algorithms are:
- quite general, in that the messages transmitted can differ arbitrarily in length;
- quite strong, in that they send messages along noncolliding paths, hence do not require any buffering or queuing mechanisms in the processors;
- quite efficient: the algorithms for scattering in general trees are optimal, the algorithm for gathering in a path is optimal, and the algorithms for gathering in general trees are nearly optimal.
Our algorithms can easily be converte...
Generalized Shuffle Permutations on Boolean Cubes
J. Parallel and Distributed Computing, 1991
Cited by 9 (2 self)
In a generalized shuffle permutation an address (a_{q-1} a_{q-2} ... a_0) receives its content from an address obtained through a cyclic shift on a subset of the q dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift. We give an algorithm that requires K/2 + 2 exchanges for K elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions σ_r in the permutation. With no storage dimensions in the permutation our best algorithm requires (σ_r + 1)⌈K/(2σ_r)⌉ element exchanges. We also give an algorithm for σ_r = 2, or when the real shuffle consists of a number of cycles of length two, that requires K/2 + 1 element exchanges in sequence when there is no bit complement. The lower bound is K/2 for both real and mixed shuffles with no bit compl...
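The definition of a generalized shuffle can be sketched at the level of addresses, ignoring how data is routed between processors. In the example below, the shift direction, the representation of the dimension subset as an ordered list `dims`, and the helper names are all assumptions for illustration. It computes only which source address feeds each destination and does not implement the exchange algorithms whose costs are quoted above.

```python
def shuffle_source(addr: int, dims: list[int], complement_mask: int = 0) -> int:
    """Return the source address whose content moves to `addr`.

    The bits of `addr` at positions dims[0], dims[1], ... are cyclically
    shifted by one place within that subset (direction is an assumption),
    then the bits selected by complement_mask are flipped.
    """
    src = addr
    s = len(dims)
    for i in range(s):
        bit = (addr >> dims[i]) & 1
        target = dims[(i + 1) % s]
        src = (src & ~(1 << target)) | (bit << target)
    return src ^ complement_mask


def apply_shuffle(data: list, dims: list[int], complement_mask: int = 0) -> list:
    """Permute data so that position a receives data[shuffle_source(a)].

    Assumes len(data) is a power of two covering every dimension in dims.
    """
    return [data[shuffle_source(a, dims, complement_mask)] for a in range(len(data))]
```

With three address bits and `dims = [0, 1, 2]` the shift involves every dimension and the result is the familiar perfect shuffle; with `dims = [0, 2]` only two bits are swapped, so applying the permutation twice restores the original order.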
Matrix Transpose on Meshes: Theory and Practice
Computers and Artificial Intelligence, 1997
Cited by 5 (5 self)
Matrix transpose is a fundamental communication operation which is not dealt with optimally by general purpose routing schemes. For two-dimensional meshes, the first optimal routing schedule is given. The strategy is simple enough to be implemented, but details of the available hardware are not favorable. However, alternative algorithms, designed along the same lines, give an improvement on the Intel Paragon.
1 Introduction
Various models for parallel machines have been considered. Despite their large diameter, meshes are of great importance because of their simple structure and efficient layout. In a d-dimensional mesh, the processing units, PUs, form an array of size n × ... × n and are connected by a d-dimensional grid of communication links. A number of parallel computers with two-dimensional and three-dimensional mesh topology have been built, for instance the Intel Paragon, the CRAY T3E, and the J-Machine of MIT. Routing. In a routing problem pac...
A Vector Space Framework for Parallel Stable Permutations
In Second International Workshop on Formal Methods for Parallel Programming: Theory and Applications, 1995
Cited by 4 (4 self)
We establish a formal foundation for stable permutations in the domain of a parallel model of computation applicable to a customized set of complexity metrics. By means of vector spaces, we develop an algebrao-geometric representation that is expressive, flexible and simple to use, and present a taxonomy categorizing stable permutations into classes of index-digit, linear, translation, affine and polynomial permutations. For each class, we demonstrate its general behavioral properties and then analyze particular examples in each class, where we derive results about its inverse, fixed instances, number of instances local and nonlocal to a processor, as well as its compositional relationships to other permutations. Such examples are bit-reversal, radix-Q exchange, radix-Q shuffle and unshuffle within the index-digit class, radix-Q butterfly and 1's complement within the translation class, binary-to-Gray and Gray-to-binary within the linear class, and arithmetic add 1, arithm...
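The kind of per-permutation result mentioned here (inverse, fixed instances) can be checked by brute force for bit reversal, one of the index-digit examples. The sketch below assumes q-bit indices with bit 0 least significant; the function names are hypothetical, and the code only enumerates properties rather than reproducing the paper's vector-space derivation.

```python
def bit_reverse(i: int, q: int) -> int:
    """Reverse the q low-order bits of index i."""
    r = 0
    for _ in range(q):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fixed_points(q: int) -> list[int]:
    """Indices that bit reversal maps to themselves (bit palindromes)."""
    return [i for i in range(1 << q) if bit_reverse(i, q) == i]
```

Bit reversal is its own inverse, and its fixed instances are exactly the indices whose q-bit patterns read the same in both directions, so there are 2^ceil(q/2) of them.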

