Abstract. In a serial computational environment, transportable efficiency is the essential motivation for developing blocking strategies and block-partitioned algorithms. An algorithmic blocking factor adjusts the granularity of the subtasks to maximize the efficiency of the hardware resources. In a distributed-memory environment, load balance is the essential motivation for distributing array entries over a collection of processes according to the block cyclic decomposition scheme. A distribution blocking factor is used to partition an array into blocks that are then mapped onto the processes. Optimal values of the algorithmic and distribution blocking factors often differ for a given algorithm and target architecture. Nevertheless, most of the parallel algorithms proposed in the literature assume the values of these blocking factors to be identical. This assumption limits the flexibility and ease of use of such algorithms. When these blocking factors differ, methods are necessary to redistribute some data into the appropriate algorithmic form. This paper presents and discusses such algorithmic redistribution methods for the block cyclic decomposition scheme. Algorithmic redistribution methods attempt to logically reorganize the computations and communications within an algorithmic context. In order to derive such methods, some properties of the block cyclic data distribution are first exhibited. Various algorithmic redistribution methods are
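
The block cyclic decomposition mentioned in the abstract can be illustrated with a minimal one-dimensional sketch. It assumes 0-based global indices, a distribution blocking factor `nb`, and `nprocs` processes, with the first block assigned to process 0. The function names `block_cyclic_owner` and `local_counts` are hypothetical and are not taken from the paper or from any library.

```python
def block_cyclic_owner(g, nb, nprocs):
    """Map a 0-based global index g to (process, local index).

    Assumes a 1-D block cyclic distribution with blocking factor nb
    over nprocs processes, where block 0 lives on process 0.
    """
    block = g // nb                # global block containing entry g
    proc = block % nprocs          # blocks are dealt out round-robin
    local_block = block // nprocs  # position of that block on its owner
    return proc, local_block * nb + g % nb


def local_counts(n, nb, nprocs):
    """Count how many of n entries each process owns (a load-balance check)."""
    counts = [0] * nprocs
    for g in range(n):
        proc, _ = block_cyclic_owner(g, nb, nprocs)
        counts[proc] += 1
    return counts


if __name__ == "__main__":
    # Ten entries, blocks of two, three processes.
    for g in range(10):
        print(g, block_cyclic_owner(g, nb=2, nprocs=3))
    print(local_counts(10, nb=2, nprocs=3))
```

Varying `nb` in `local_counts` shows how the distribution blocking factor affects load balance, which is why its best value need not coincide with the algorithmic blocking factor chosen for local computational efficiency.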