Massively parallel computing holds the promise of extreme performance. Until compiler and run-time system technology matures to the level now common on most uniprocessor systems, the utility of these systems will depend heavily on the availability of libraries. Critical for performance are the ability to exploit locality of reference and the effective management of communication resources. We discuss some techniques for preserving locality of reference in distributed memory architectures. In particular, we discuss the benefits of multidimensional address spaces over conventional linearized address spaces, the partitioning of irregular grids, and the placement of partitions among nodes. Some of these techniques are supported as language directives, others as run-time system functions, and still others are part of the Connection Machine Scientific Software Library, CMSSL. We briefly discuss some of the unique design issues in this library for distributed memory architectures, and some of its novel ideas for managing data allocation and for automatically selecting algorithms based on performance. The CMSSL also includes a set of communication primitives that we have found very useful in implementing scientific and engineering applications on the Connection Machine systems. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200 Connection Machine systems.
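The locality benefit of a multidimensional address space over a linearized one can be illustrated with a small sketch. It is not taken from the paper or from CMSSL. It assumes a 64 x 64 grid with nearest-neighbor (4-point) coupling distributed over 16 nodes, and all function names (`linearized_owner`, `block2d_owner`, `cut_edges`) are hypothetical. The sketch counts neighbor pairs that land on different nodes, a rough proxy for communication volume, under two layouts: contiguous chunks of the row-major linearized index, and square blocks of the two-dimensional index space.

```python
# Illustrative sketch only: grid size, node count, and all names are assumptions.
ROWS, COLS, NODES = 64, 64, 16


def linearized_owner(i, j):
    """Owner when the grid is flattened row-major and cut into equal chunks."""
    chunk = (ROWS * COLS) // NODES
    return (i * COLS + j) // chunk


def block2d_owner(i, j):
    """Owner when the 2-D index space is cut into a 4 x 4 array of square blocks."""
    side = 4  # 4 * 4 = 16 nodes
    return (i // (ROWS // side), j // (COLS // side))


def cut_edges(owner):
    """Count nearest-neighbor pairs whose two points live on different nodes."""
    cut = 0
    for i in range(ROWS):
        for j in range(COLS):
            # Look only down and right so each pair is counted once.
            if i + 1 < ROWS and owner(i, j) != owner(i + 1, j):
                cut += 1
            if j + 1 < COLS and owner(i, j) != owner(i, j + 1):
                cut += 1
    return cut


if __name__ == "__main__":
    print("linearized layout:", cut_edges(linearized_owner))  # 960
    print("2-D block layout: ", cut_edges(block2d_owner))     # 384
```

Under these assumptions the linearized layout gives each node a strip of four full rows, so every strip boundary cuts 64 neighbor pairs, while the square blocks have a smaller perimeter for the same number of points. Real layouts, stencils, and network costs will differ, so the counts are indicative only.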