Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations depends heavily on memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2 to 2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms; we provide an algorithmic explanation of this phenomenon. We demonstrate that carrying the recursive layout down to the level of individual matrix elements is counterproductive, and that stopping the recursive layout at canonically ordered matrix tiles instead yields higher performance. We evaluate five recursive layouts with successively increasing complexity of address computation, and show that addressing overheads can be kept under control even for the most computationally demanding of these layouts. Finally, we provide a critique of the Cilk system that we used to parallelize our code.
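To make the idea of a recursive layout that stops at canonically ordered tiles concrete, the following is a minimal sketch of address computation for one possible such layout. It assumes a Z-Morton (bit-interleaved) ordering of tiles and row-major order inside each tile; the abstract does not name its five layouts, so this choice, along with the function names `interleave_bits` and `tiled_morton_offset` and the parameters `tile` and `bits`, is illustrative only and not taken from the paper.

```python
def interleave_bits(row: int, col: int, bits: int) -> int:
    """Z-Morton index: interleave the low `bits` bits of row and col.

    Assumption for this sketch: row bits take the odd positions and
    column bits take the even positions.
    """
    z = 0
    for k in range(bits):
        z |= ((row >> k) & 1) << (2 * k + 1)  # row bit -> odd position
        z |= ((col >> k) & 1) << (2 * k)      # column bit -> even position
    return z


def tiled_morton_offset(i: int, j: int, tile: int, bits: int) -> int:
    """Linear offset of element (i, j) in a tiled recursive layout.

    Tiles of size tile x tile are ordered by Z-Morton index (the
    recursive part); elements inside a tile are stored row-major
    (the canonically ordered part). `bits` is the number of bits
    needed to index tiles along one dimension.
    """
    tile_index = interleave_bits(i // tile, j // tile, bits)
    return tile_index * tile * tile + (i % tile) * tile + (j % tile)


if __name__ == "__main__":
    # 4x4 matrix, 2x2 tiles, so one bit per dimension selects a tile.
    for i in range(4):
        print([tiled_morton_offset(i, j, tile=2, bits=1) for j in range(4)])
```

In this sketch the bit interleaving is applied only to tile coordinates, so its cost is paid once per tile rather than once per element, which is one way the addressing overhead of a recursive layout can be kept small.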
|
676
|
A data locality optimizing algorithm
– Wolf, Lam
- 1991
|
|
617
|
A set of level 3 basic linear algebra subprograms
– Dongarra, Du Croz, et al.
- 1990
|
|
487
|
The cache performance and optimizations of blocked algorithms
– Lam, Rothberg, et al.
- 1991
|
|
479
|
Accuracy and stability of numerical algorithms
– Higham
- 2002
|
|
376
|
Parallel Computer Architecture: A Hardware/Software Approach
– Culler, Singh
- 1998
|
|
299
|
Cilk: An efficient multithreaded runtime system
– Blumofe, Joerg, et al.
- 1995
|
|
267
|
FFTW: An adaptive software architecture for the FFT
– Frigo, Johnson
- 1998
|
|
233
|
Automatically Tuned Linear Algebra Software
– Whaley, Dongarra
- 1998
|
|
230
|
Evaluating associativity in CPU caches
– Hill, Smith
- 1989
|
|
230
|
Gaussian elimination is not optimal
– Strassen
- 1969
|
|
188
|
Compiler optimizations for improving data locality
– Carr, McKinley, et al.
- 1994
|
|
173
|
Linear Clustering of Objects with Multiple Attributes
– Jagadish
|
|
173
|
More iteration space tiling
– Wolfe
- 1989
|
|
167
|
Space-filling Curves
– Sagan
- 1994
|
|
154
|
Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology
– Bilmes, Asanovic, et al.
- 1997
|
|
152
|
Unifying data and control transformations for distributed shared memory machines
– Cierniak, Li
- 1995
|
|
124
|
A parallel hashed oct-tree n-body algorithm
– Warren, Salmon
- 1993
|
|
116
|
Automatic Data Partitioning on Distributed Memory Multicomputers
– Gupta
- 1992
|
|
68
|
Auto-blocking matrix multiplication or tracking BLAS3 performance from source code
– Frens, Wise
- 1997
|
|
61
|
Sur une courbe qui remplit toute une aire plane
– Peano
- 1890
|
|
58
|
Optimal Evaluation of Array Expressions on Massively Parallel Machines
– Chatterjee, Gilbert, et al.
- 1995
|
|
56
|
Space-filling curves: their generation and their application to band reduction
– Bially
- 1969
|
|
52
|
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves
– Pilkington, Baden
- 1996
|
|
45
|
Memory storage patterns in parallel processing
– Mace
- 1987
|
|
42
|
Hierarchical Tiling for Improved Superscalar Performance
– Carter, Ferrante, et al.
- 1995
|
|
39
|
Tuning Strassen’s matrix multiplication for memory efficiency
– Thottethodi, Chatterjee, et al.
- 1998
|
|
34
|
High Performance Fortran for highly irregular problems
– Hu, Johnsson, et al.
- 1997
|
|
32
|
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
– Singh, Joe, et al.
- 1993
|
|
30
|
Balancing processor loads and exploiting data locality in N-body simulations
– Banicescu, Hummel
- 1995
|
|
30
|
Über stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459-460
– Hilbert
- 1891
|
|
26
|
The High Performance Fortran Handbook. Scientific and Engineering Computation
– Koelbel, Loveman, et al.
- 1994
|
|
25
|
Load Balancing and Data Locality via Fractiling: An Experimental Study
– Hummel, Banicescu, et al.
- 1995
|
|
24
|
Automatic data layout for distributed memory machines
– Kennedy, Kremer
- 1998
|
|
24
|
Data optimization: Allocation of arrays to reduce communication on SIMD machines
– Knobe, Lukas, Steele Jr.
- 1990
|
|
20
|
Optimizing raster storage: An examination of four alternatives
– Goodchild, Grandfield
- 1983
|
|
15
|
Digital Design
– Mano
- 1984
|
|
14
|
Efficient procedures for using matrix algorithms
– Fischer, Probert
- 1974
|
|
11
|
Graphical data bases built on Peano space-filling curves
– Laurini
- 1985
|
|
9
|
Analysis of the clustering properties of the Hilbert space-filling curve
– Moon, Jagadish, et al.
- 1996
|
|
3
|
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
– Gustavson
- 1997
|