Download:
by Frederic T. Chong, Shamik D. Sharma, Eric A. Brewer, Joel Saltz
In Rajiv K. Kalia and Priya Vashishta, editors, Toward Teraflop Computing and New Grand Challenge Applications
ftp://ftp.cs.umd.edu/pub/hpsl/papers/papers-pdf/irreg-dags.pdf
Abstract:
We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of 34 for 128 processors, resulting in an absolute performance of over 33 million double-precision floating point operations per second. We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find that it is challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems. Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three over the standard CM-5 policies. We present a detailed study of runtime overheads and demonstrate that send and receive processor overheads still dominate these applications on the CM-5. We conclude that these applications would benefit greatly from architectural support for low-overhead communication.
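As a rough illustration of how a sparse triangular solve gives rise to an irregular DAG, the sketch below computes dependency levels for the rows of a small lower-triangular system and then solves it by forward substitution. It is not the authors' runtime system: the row-dictionary storage format, the function names `dag_levels` and `solve_lower`, and the 4x4 example matrix are all assumptions made for illustration.

```python
def dag_levels(rows):
    """rows[i] maps column j -> L[i][j] for the nonzeros of row i (j <= i).

    Row i depends on every row j < i that appears in it, so the nonzero
    pattern defines a DAG. A row's level is the length of the longest
    dependency chain leading to it; rows on the same level are independent.
    """
    levels = []
    for i, row in enumerate(rows):
        deps = [levels[j] for j in row if j < i]
        levels.append(1 + max(deps, default=-1))
    return levels


def solve_lower(rows, b):
    """Forward substitution for L x = b, visiting rows in index order."""
    x = [0.0] * len(rows)
    for i, row in enumerate(rows):
        partial = sum(v * x[j] for j, v in row.items() if j < i)
        x[i] = (b[i] - partial) / row[i]
    return x


# Hypothetical 4x4 lower-triangular matrix, one dict of nonzeros per row.
rows = [
    {0: 2.0},
    {1: 1.0},
    {0: 1.0, 2: 4.0},
    {1: 3.0, 2: 1.0, 3: 2.0},
]
b = [2.0, 3.0, 5.0, 12.0]

levels = dag_levels(rows)
x = solve_lower(rows, b)
```

Here rows 0 and 1 have no dependencies and could run concurrently, while rows 2 and 3 each wait on earlier results. Each node does only a few floating-point operations, which suggests why per-message overhead matters so much for this class of problem.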
Citations
|
237
|
Users' guide for the Harwell-Boeing sparse matrix collection (release I). Technical Report TR/PA/92/86, Research and Technology Division, Boeing Computer Services
– Duff, Grimes, et al.
- 1992
|
|
154
|
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
– Agarwal, Chaiken, et al.
- 1991
|
|
115
|
Run-Time Parallelization and Scheduling of Loops
– Saltz, Mirchandaney, et al.
- 1991
|
|
98
|
A Comparison of Clustering Heuristics for Scheduling DAGS on Multiprocessors
– Gerasoulis, Yang
- 1992
|
|
96
|
Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors
– Sarkar
- 1989
|
|
70
|
How to get good performance from the CM-5 data network
– Brewer, Kuszmaul
- 1994
|
|
48
|
Implementing an irregular application on a distributed memory multiprocessor
– Chakrabarti, Yelick
- 1993
|
|
39
|
Optimal parallel solution of sparse triangular systems
– Alvarado, Schreiber
- 1990
|
|
32
|
Assessing the benefits of fine-grained parallelism in dataflow programs
– Arvind, Maa
- 1988
|
|
30
|
A parallel solution method for large sparse systems of equations
– Lucas, Blank, et al.
- 1987
|
|
26
|
Active Messages: a Mechanism for Integrated Communication and Computation
– von Eicken, et al.
- 1992
|
|
26
|
Scheduling and Code Generation for Parallel Architectures
– Yang
- 1993
|
|
23
|
Performance of the iPSC/860 Node Architecture
– Moyer
- 1991
|
|
23
|
Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient
– Yeung, Agarwal
- 1993
|
|
21
|
The Network Architecture of the Connection Machine CM-5
– Leiserson, et al.
- 1992
|
|
16
|
The message-driven processor: A multicomputer processing node with efficient mechanisms
– Dally, et al.
- 1992
|
|
12
|
Distributed solution of sparse linear systems
– Heath, Raghavan
- 1993
|
|
9
|
Aggregation methods for solving sparse triangular systems on multiprocessors
– Saltz
- 1990
|
|
7
|
Distributed solution of sparse linear systems
– Heath, Raghavan
- 1993
|
|
7
|
T3D system architecture overview
– CRAY
- 1993
|
|
6
|
Data Flow Computing and the Conjugate Gradient Method
– Rubin
- 1992
|
|
5
|
Overview of the START(*T) multithreaded computer
– Beckerle
- 1993
|
|
4
|
Strata: A high-performance communications library
– Brewer, Blumofe
- 1994
|
|
3
|
Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient
– Yeung, Agarwal
- 1993
|
|
2
|
A fast reordering algorithm for parallel sparse triangular solution
– Pothen, Alvarado
- 1992
|
|
1
|
Assessing the benefits of fine-grained parallelism in dataflow programs
– Arvind, Maa
- 1988
|
|
1
|
Dataflow computing and the conjugate gradient method
– Rubin
- 1992
|