  Recursive array layouts and fast parallel matrix multiplication (1999) [42 citations — 6 self]

by Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, Mithuna Thottethodi
In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
http://www.cs.duke.edu/~alvy/papers/spaa99.ps

Abstract:

Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2--2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms; we provide an algorithmic explanation of this phenomenon. We demonstrate that carrying the recursive layout down to the level of individual matrix elements is counterproductive, and that a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. We evaluate five recursive layouts with successively increasing complexity of address computation, and show that addressing overheads can be kept in control even for the most computationally demanding of these layouts. Finally, we provide a critique of the Cilk system that we used to parallelize our code.
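
To make the idea of "recursive layouts down to canonically ordered matrix tiles" concrete, the following is a minimal sketch, not the authors' code. It assumes Z-Morton ordering as one familiar example of a recursive layout (the abstract does not name the five layouts evaluated), row-major order inside each tile, and a square matrix whose side is the tile size times a power of two. The function names `interleave_bits` and `tiled_morton_offset` are hypothetical.

```python
def interleave_bits(row: int, col: int, bits: int) -> int:
    """Interleave the low `bits` bits of row and col into a Z-Morton index."""
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)      # column bit goes to an even position
        z |= ((row >> b) & 1) << (2 * b + 1)  # row bit goes to an odd position
    return z


def tiled_morton_offset(i: int, j: int, n: int, tile: int) -> int:
    """Linear offset of element (i, j) in an n x n matrix stored as
    Z-Morton-ordered tiles, each tile kept in row-major order.

    Assumption for this sketch: n == tile * 2**k for some k >= 0.
    """
    tiles_per_side = n // tile
    if (tiles_per_side < 1 or n % tile != 0
            or tiles_per_side & (tiles_per_side - 1) != 0):
        raise ValueError("n must equal tile times a power of two")
    bits = tiles_per_side.bit_length() - 1    # log2(tiles_per_side)
    # Recursive (Z-Morton) order selects the tile ...
    tile_index = interleave_bits(i // tile, j // tile, bits)
    # ... and canonical row-major order selects the element within it.
    return tile_index * tile * tile + (i % tile) * tile + (j % tile)
```

With `tile=1` the recursion is carried down to individual elements, the case the abstract reports as counterproductive; a larger `tile` pays for the bit interleaving once per tile instead of once per element. For instance, `tiled_morton_offset(2, 0, 4, 2)` evaluates to 8, the first element of the third tile.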

Citations

676 A data locality optimizing algorithm – Wolf, Lam - 1991
617 A set of level 3 basic linear algebra subprograms – Dongarra, Croz, et al. - 1990
487 The cache performance and optimizations of blocked algorithms – Lam, Rothberg, et al. - 1991
479 Accuracy and stability of numerical algorithms – Higham - 2002
376 Parallel Computer Architecture: A Hardware/Software Approach – Culler, Singh - 1998
299 Cilk: An efficient multithreaded runtime system – Blumofe, Joerg, et al. - 1995
267 FFTW: An adaptive software architecture for the FFT – Frigo, Johnson - 1998
233 Automatically Tuned Linear Algebra Software – Whaley, Dongarra - 1998
230 Evaluating associativity in CPU caches – Hill, Smith - 1989
230 Gaussian elimination is not optimal – Strassen - 1969
188 Compiler optimizations for improving data locality – Carr, McKinley, et al. - 1994
173 Linear Clustering of Objects with Multiple Attributes – Jagadish
173 More iteration space tiling – Wolfe - 1989
167 Space-filling Curves – Sagan - 1994
154 Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology – Bilmes, Asanovic, et al. - 1997
152 Unifying data and control transformations for distributed shared memory machines – Cierniak, Li - 1995
124 A parallel hashed oct-tree n-body algorithm – Warren, Salmon - 1993
116 Automatic Data Partitioning on Distributed Memory Multicomputers – Gupta - 1992
68 Auto-blocking matrix multiplication or tracking BLAS3 performance from source code – Frens, Wise - 1997
61 Sur une courbe, qui remplit toute une aire plane – Peano - 1890
58 Optimal Evaluation of Array Expressions on Massively Parallel Machines – Chatterjee, Gilbert, et al. - 1995
56 Space-filling curves: their generation and their application to band reduction – Bially - 1969
52 Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves – Pilkington, Baden - 1996
45 Memory storage patterns in parallel processing – Mace - 1987
42 Hierarchical Tiling for Improved Superscalar Performance – Carter, Ferrante, et al. - 1995
39 Tuning Strassen's matrix multiplication for memory efficiency – Thottethodi, Chatterjee, et al. - 1998
34 High Performance Fortran for highly irregular problems – Hu, Johnsson, et al. - 1997
32 An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors – Singh, Joe, et al. - 1993
30 Balancing processor loads and exploiting data locality in N-body simulations – Banicescu, Hummel - 1995
30 Über die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459-460 – Hilbert - 1891
26 The High Performance Fortran Handbook. Scientific and Engineering Computation – Koelbel, Loveman, et al. - 1994
25 Load Balancing and Data Locality via Fractiling: An Experimental Study – Hummel, Banicescu, et al. - 1995
24 Automatic data layout for distributed memory machines – Kennedy, Kremer - 1998
24 Data optimization: Allocation of arrays to reduce communication on SIMD machines – Knobe, Lukas, Steele Jr. - 1990
20 Optimizing raster storage: An examination of four alternatives – Goodchild, Grandfield - 1983
15 Digital Design – Mano - 1984
14 Efficient procedures for using matrix algorithms – Fischer, Probert - 1974
11 Graphical data bases built on Peano space-filling curves – Laurini - 1985
9 Analysis of the clustering properties of Hilbert space-filling curve – Moon, Jagadish, et al. - 1996
3 Recursion leads to automatic variable blocking for dense linear-algebra algorithms – Gustavson - 1997