Nonlinear Array Layouts for Hierarchical Memory Systems
1999
Cited by 67 (4 self)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
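The Morton layout named in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names `morton_index` and `to_morton_layout` are hypothetical, the choice to place the row bit above the column bit is one common convention assumed here, and the matrix side is assumed to be a power of two so that the layout has no gaps.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of row and col into one Morton (Z-order) offset.

    Assumed convention: column bits go to even positions, row bits to odd positions.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


def to_morton_layout(matrix):
    """Copy a square, power-of-two-sized list-of-lists matrix into a flat Morton-ordered list."""
    n = len(matrix)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two matrix side")
    bits = n.bit_length()
    flat = [None] * (n * n)
    for r in range(n):
        for c in range(n):
            flat[morton_index(r, c, bits)] = matrix[r][c]
    return flat
```

Under this convention each 2 x 2 quadrant of the matrix occupies a contiguous run of the flat list, which is the locality property that recursive, quad-tree style algorithms rely on.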
Graph Partitioning for High Performance Scientific Simulations
2000
Cited by 48 (5 self)
Contents
0.1 Introduction
0.2 Modeling Mesh-based Computations as Graphs
0.3 Static Graph Partitioning Techniques
0.3.1 Geometric Techniques
0.3.2 Combinatorial Techniques
0.3.3 Spectral Methods
0.3.4 Multilevel Schemes
0.3.5 Combined Schemes
0.3.6 Qualitative Comparison of Graph Partitioning Schemes
0.4 Load Balancing of Adaptive Computations
Recursive Array Layouts and Fast Parallel Matrix Multiplication
In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, 1999
Cited by 44 (3 self)
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms;...
Tuning Strassen's Matrix Multiplication for Memory Efficiency
In Proceedings of SC98 (CD-ROM), 1998
Cited by 34 (4 self)
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order that is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
Recursive Array Layouts and Fast Matrix Multiplication
1999
Cited by 31 (0 self)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
A unified algorithm for load-balancing adaptive scientific simulations
In Proceedings of the ACM/IEEE Symposium on Supercomputing (SC’00). IEEE Computer, 2000
Cited by 27 (2 self)
Adaptive scientific simulations require that periodic repartitioning occur dynamically throughout the course of the computation. The repartitionings should be computed so as to minimize both the inter-processor communications incurred during the iterative mesh-based computation and the data redistribution costs required to balance the load. Recently developed schemes for computing repartitionings give the user only limited control over the tradeoffs among these objectives. This paper describes a new Unified Repartitioning Algorithm that can trade off one objective against the other depending on a user-defined parameter describing the relative costs of these objectives. We show that the Unified Repartitioning Algorithm is able to reduce the precise overheads associated with repartitioning as well as or better than other repartitioning schemes for a variety of problems, regardless of the relative costs of performing inter-processor communication and data redistribution. Our experimental results show that this scheme is extremely fast and scalable to large problems.
Cache-Efficient Matrix Transposition
Cited by 22 (2 self)
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning “hacks”, such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
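As a rough illustration of the kind of in-place transposition the abstract discusses, the following sketch swaps elements tile by tile so that each pass touches a small block of rows and columns. It is a generic tiling scheme, not one of the paper's algorithms; the function name `transpose_in_place_tiled` and the default tile size of 32 are assumptions chosen for the example.

```python
def transpose_in_place_tiled(a, tile=32):
    """Transpose a square list-of-lists matrix in place, visiting it in tile x tile blocks.

    Only tiles on or above the diagonal are visited; each element above the
    diagonal is swapped exactly once with its mirror below the diagonal.
    """
    n = len(a)
    for ii in range(0, n, tile):
        for jj in range(ii, n, tile):
            for i in range(ii, min(ii + tile, n)):
                # In a diagonal tile, start just right of the diagonal.
                j_start = max(jj, i + 1)
                for j in range(j_start, min(jj + tile, n)):
                    a[i][j], a[j][i] = a[j][i], a[i][j]
```

In pure Python the cache effect is not observable, so the sketch only shows the traversal order; the tile size would need tuning per machine in a compiled setting.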
A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations
1995
Parallel Domain Decomposition and Load Balancing Using Space-Filling Curves
In Proceedings of the 4th IEEE Conference on High Performance Computing, 1997
Cited by 18 (1 self)
Partitioning techniques based on space-filling curves have received much recent attention due to their low running time and good load balance characteristics. The basic idea underlying these methods is to order the multidimensional data according to a space-filling curve and partition the resulting one-dimensional order. However, space-filling curves are defined for points that lie on a uniform grid of a particular resolution. It is typically assumed that the coordinates of the points are representable using a fixed number of bits, and the run-times of the algorithms depend upon the number of bits used. In this paper, we present a simple and efficient technique for ordering arbitrary and dynamic multidimensional data using space-filling curves and its application to parallel domain decomposition and load balancing. Our technique is based on a comparison routine that determines the relative position of two points in the order induced by a space-filling curve. The comparison routine could then be used...
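To make the idea of a comparison routine concrete, here is a minimal sketch for the Z-order (Morton) curve on non-negative integer coordinates. It uses the widely known XOR trick for finding the dimension whose coordinates differ in the highest bit, so no interleaved key is ever built. It may differ from the routine in the paper, which targets arbitrary data; the names `less_msb`, `z_order_less`, and `z_order_sort`, and the rule that dimension 0 is the most significant, are assumptions of this example.

```python
from functools import cmp_to_key


def less_msb(x: int, y: int) -> bool:
    """Return True if the most significant set bit of x is lower than that of y."""
    return x < y and x < (x ^ y)


def z_order_less(p, q) -> bool:
    """Return True if point p precedes point q along the Z-order curve.

    p and q are equal-length sequences of non-negative integers.
    Ties between dimensions are resolved in favour of the lower index.
    """
    msd = 0  # dimension whose coordinates differ in the highest bit
    for dim in range(1, len(p)):
        if less_msb(p[msd] ^ q[msd], p[dim] ^ q[dim]):
            msd = dim
    return p[msd] < q[msd]


def z_order_sort(points):
    """Sort points along the Z-order curve using only pairwise comparisons."""
    def cmp(p, q):
        if z_order_less(p, q):
            return -1
        if z_order_less(q, p):
            return 1
        return 0
    return sorted(points, key=cmp_to_key(cmp))
```

Once the points are sorted this way, a partition is obtained by cutting the resulting one-dimensional order into contiguous chunks of roughly equal size.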
Dynamic octree load balancing using space-filling curves
2003
Cited by 16 (6 self)
The Zoltan dynamic load balancing library provides applications with a reusable object-oriented interface to several load balancing techniques, including coordinate bisection, octree/space-filling curve methods, and multilevel graph partitioners. We describe enhancements to Zoltan’s octree load balancing procedure and its distributed structures that improve performance of the space-filling curve (SFC) traversals by

