Results 1-10 of 115
Data Transformations for Eliminating Conflict Misses
 In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
Abstract

Cited by 134 (12 self)
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occurring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes. Padding reduces c...
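Both padding schemes described above work by changing how addresses map to cache sets. A minimal sketch of that mapping, assuming a hypothetical direct-mapped 32 KB cache with 32-byte lines and a column-major array of 8-byte doubles (all parameters are illustrative, not from the paper):

```python
# Hypothetical direct-mapped cache: 32 KB, 32-byte lines -> 1024 sets.
LINE = 32
SETS = 32 * 1024 // LINE

def set_index(addr):
    """Cache set that a byte address maps to."""
    return (addr // LINE) % SETS

def column_sets(n_rows, elem=8, pad=0):
    """Set index of the first element of each of the first 8 columns
    of a column-major array with n_rows rows of 8-byte doubles,
    optionally padded by `pad` extra elements per column."""
    stride = (n_rows + pad) * elem
    return [set_index(c * stride) for c in range(8)]

# With 4096 rows the column stride equals the cache size, so every
# column start collides in set 0 -> a conflict miss on each iteration.
unpadded = column_sets(4096)
# Intra-variable padding by one cache line (4 doubles) spreads the
# column starts across distinct sets.
padded = column_sets(4096, pad=4)
```

The same arithmetic with two different base addresses models inter-variable padding.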
Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity
 In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
Abstract

Cited by 88 (1 self)
Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler or programmer-applied optimizations like data structure padding, matrix blocking, and other program transformations. Compiler optimization can be effective, but the lack of precise analysis and optimization frameworks makes it impossible to confidently make optimal, rather than heuristic-based, program transformations. Imprecision is most problematic in situations where hard-to-predict cache conflicts foil heuristic approaches. Furthermore, the lack of a general framework for compiler memory performance analysis makes it impossible to understand the combined effects of several program transformations. The Cache Miss Equation (CME) framework discussed in this paper addresses these issues. We express memory reference and cache conflict behavior in terms of sets of equations. The ...
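The CME framework derives miss counts analytically from equations; for intuition about the quantity it predicts, this sketch computes the same thing by brute force, simulating a small hypothetical direct-mapped cache over an address trace (all sizes are illustrative):

```python
def count_misses(addrs, line=32, sets=64):
    """Count misses for a byte-address trace on a direct-mapped cache."""
    tags = [None] * sets          # one resident tag per set
    misses = 0
    for a in addrs:
        s = (a // line) % sets    # set index
        t = a // (line * sets)    # tag
        if tags[s] != t:
            misses += 1
            tags[s] = t
    return misses

# Two arrays of doubles whose base addresses are exactly one cache
# size apart conflict on every interleaved access (same set, two tags).
cache_bytes = 32 * 64
trace = []
for i in range(16):
    trace.append(8 * i)                  # A[i]
    trace.append(cache_bytes + 8 * i)    # B[i], maps to the same set
```

Shifting B's base by one extra line (inter-variable padding) removes the ping-pong, leaving only the compulsory misses.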
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
Abstract

Cited by 78 (5 self)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by reordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
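A sketch of the Morton layout mentioned above: the offset of element (row, col) interleaves the bits of the two indices, so each aligned 2x2 (and, recursively, each larger power-of-two) block occupies a contiguous range of memory. This is the standard bit-interleaving formulation, not the paper's exact code:

```python
def morton(row, col, bits=16):
    """Z-order offset of (row, col) in a 2^bits x 2^bits array:
    column bits go to even positions, row bits to odd positions."""
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z
```

For example, the four elements of the block with rows 2-3 and columns 2-3 receive the consecutive offsets 12-15, which is what gives the layout its hierarchical locality.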
Tiling Optimizations for 3D Scientific Computations
, 2000
Abstract

Cited by 68 (4 self)
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of nonconflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
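As a toy illustration of the reuse the paper targets, here is a 3D Jacobi-style sweep with the j/i loops tiled so the planes reused along k have a smaller cache footprint. The 6-neighbor stencil and sizes are invented for illustration; because the sweep writes a separate output array, tiling cannot change the result:

```python
def sweep(a, n):
    """Untiled 6-neighbor averaging sweep over an n x n x n grid."""
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                out[k][j][i] = (a[k-1][j][i] + a[k+1][j][i] +
                                a[k][j-1][i] + a[k][j+1][i] +
                                a[k][j][i-1] + a[k][j][i+1]) / 6.0
    return out

def sweep_tiled(a, n, tj=4, ti=4):
    """Same sweep with the j/i loops blocked; k stays innermost of the
    tile so the reused planes along k are only tj x ti wide."""
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for jj in range(1, n - 1, tj):
        for ii in range(1, n - 1, ti):
            for k in range(1, n - 1):
                for j in range(jj, min(jj + tj, n - 1)):
                    for i in range(ii, min(ii + ti, n - 1)):
                        out[k][j][i] = (a[k-1][j][i] + a[k+1][j][i] +
                                        a[k][j-1][i] + a[k][j+1][i] +
                                        a[k][j][i-1] + a[k][j][i+1]) / 6.0
    return out
```

Tile sizes would in practice be chosen by the cost models the paper describes, possibly combined with padding.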
Synthesizing transformations for locality enhancement of imperfectlynested loop nests
 In Proceedings of the 2000 ACM International Conference on Supercomputing
, 2000
Abstract

Cited by 63 (3 self)
We present an approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like code sinking and loop fusion that are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques, however, our embeddings are chosen carefully to enhance locality. The product space is then transformed further to enhance locality, after which fully permutable loops are tiled, and code is generated. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks.

1. BACKGROUND AND PREVIOUS WORK

Sophisticated algorithms based on polyhedral algebra have been developed for determining good sequences of linear loop transformations (permutation, skewing, reversal and scaling) for enhancing locality in perfectly-nested loops. Highlights of this technology are the following. The iterations of the loop nest are modeled as points in an integer lattice, and linear loop transformations are modeled as nonsingular matrices mapping one lattice to another. A sequence of loop transformations is modeled by the product of matrices representing the individual transformations; since the set of nonsingular matrices is closed under matrix product, this means that a sequence of linear loop transformations can be represented by a nonsingular matrix. The problem of finding an optimal sequence of linear loop transformations is thus reduced to the problem of finding an integer matrix that satisfies some desired property, permitting the full machinery of matrix methods and lattice theory to ...
This work was supported by NSF grants CCR-9720211, EIA-9726388, ACI-9870687, and EIA-9972853.
A perfectly-nested loop is a set of loops in which all assignment statements are contained in the innermost loop.
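The embedding framework generalizes transformations such as loop fusion; as a reminder of what fusion itself buys, here is a minimal hypothetical example in which fusing two loops lets each intermediate value be consumed while it is still in cache (or a register), instead of writing out a full temporary array:

```python
def unfused(x):
    y = [2.0 * v for v in x]        # first loop: y[i] = 2 * x[i]
    z = [v + 1.0 for v in y]        # second loop: z[i] = y[i] + 1
    return z

def fused(x):
    # One pass: each y[i] is produced and consumed immediately,
    # so the temporary array y never touches memory.
    return [2.0 * v + 1.0 for v in x]
```

The two versions compute identical results; the embedding-based synthesis chooses such transformations systematically rather than by pattern matching.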
A Comparison of Compiler Tiling Algorithms
, 1999
Abstract

Cited by 56 (8 self)
Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low-associativity caches. We propose a new technique based on intra-variable padding and compare its performance with existing techniques. Results show padding improves performance of matrix multiply by over 100% in some cases over a range of matrix sizes. Comparing the efficacy of different tiling algorithms, we discover rectangular tiles are slightly more efficient than square tiles. Overall, tiling improves performance from 0-250%. Copying tiles at run time proves to be quite effective.
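Run-time tile copying, which the experiments above find effective, can be sketched as follows: each tile of B is copied into a small contiguous buffer before use, so its cache footprint no longer depends on the matrix leading dimension. This is an illustrative pure-Python rendering, not the evaluated implementation:

```python
def matmul_tiled_copy(A, B, n, t=4):
    """C = A * B for n x n matrices, tiling the k and j loops and
    copying each t x t tile of B into a contiguous buffer."""
    C = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, t):
        for jj in range(0, n, t):
            # Copy the current tile of B; in a real implementation the
            # buffer is contiguous memory, eliminating self-conflicts.
            buf = [[B[k][j] for j in range(jj, min(jj + t, n))]
                   for k in range(kk, min(kk + t, n))]
            for i in range(n):
                for k in range(kk, min(kk + t, n)):
                    aik = A[i][k]
                    for j in range(jj, min(jj + t, n)):
                        C[i][j] += aik * buf[k - kk][j - jj]
    return C
```

The copy cost is amortized over the n reuses of each tile, which is why copying pays off once tiles are reused often enough.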
Static Timing Analysis of Embedded Software
 In Proc. Design Automation Conf.
, 1997
Abstract

Cited by 48 (0 self)
This paper examines the problem of statically analyzing the performance of embedded software. This problem is motivated by the rapid growth of embedded systems and a lack of appropriate analysis tools. We study the different performance metrics that need to be considered in this context and examine a range of techniques that have been proposed for analysis. Very broadly, these can be classified into path analysis and system utilization analysis techniques. It is observed that these are interdependent, and thus need to be considered together in any analysis framework.
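Path analysis in the sense above bounds the most expensive path through the program's control-flow graph. A minimal sketch for an acyclic CFG with per-block cycle costs (the graph and the cycle counts are invented for illustration; real analyses must also handle loops and infeasible paths):

```python
def wcet(cfg, cost, entry, exit):
    """Longest entry->exit path cost in a DAG: cfg maps each basic
    block to its successors, cost gives cycles per block."""
    memo = {}
    def longest(b):
        if b == exit:
            return cost[b]
        if b not in memo:
            memo[b] = cost[b] + max(longest(s) for s in cfg[b])
        return memo[b]
    return longest(entry)

# A diamond: entry -> (then | else) -> join.  The worst case takes
# the 10-cycle "then" branch: 2 + 10 + 1 = 13 cycles.
cfg = {"entry": ["then", "else"], "then": ["join"],
       "else": ["join"], "join": []}
cost = {"entry": 2, "then": 10, "else": 3, "join": 1}
```

System utilization analysis would then combine such per-task bounds with scheduling effects, which is why the two are interdependent.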
Tuning Strassen's Matrix Multiplication for Memory Efficiency
 In Proceedings of SC98 (CD-ROM)
, 1998
Abstract

Cited by 43 (5 self)
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order that is based on a quadtree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
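Strassen's recursion with a truncation point, as discussed above, can be sketched as follows: below the cutoff the code falls back to the classical triple loop. This assumes power-of-two matrices and the standard seven-product Strassen formulas; it is not the paper's Morton-order implementation:

```python
def strassen(A, B, cutoff=2):
    """Multiply n x n matrices (n a power of two) by Strassen's
    recursion, switching to the naive algorithm at the cutoff."""
    n = len(A)
    if n <= cutoff:
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2
    def q(M, r, c): return [row[c:c+h] for row in M[r:r+h]]
    def add(X, Y): return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y): return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = q(A, 0, 0), q(A, 0, h), q(A, h, 0), q(A, h, h)
    B11, B12, B21, B22 = q(B, 0, 0), q(B, 0, h), q(B, h, 0), q(B, h, h)
    # The seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    return [C11[i] + C12[i] for i in range(h)] + \
           [C21[i] + C22[i] for i in range(h)]
```

In the paper the cutoff is tuned dynamically; here it is simply a parameter.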
Analytical modeling of set-associative cache behavior
 IEEE Transactions on Computers
, 1999
Modulo Scheduling for a Fully-Distributed Clustered . . .
, 2000
Abstract

Cited by 39 (3 self)
Clustering is an approach that many recent microprocessors are adopting in order to mitigate the increasing penalties of wire delays. In this work we propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory inter-cluster communications so that the final schedule results in a cluster assignment that favors cluster locality in cache references and register accesses. It has been evaluated for both 2- and 4-cluster configurations and for differing numbers and latencies of inter-cluster buses. The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers.

1. Introduction

Technology projections point to wire delays as being one of the main hurdles for improving instruction throughput of future microprocessors [23]. ...
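Modulo scheduling, as used above, chooses an initiation interval (II) no smaller than two classical lower bounds: one from resource usage and one from dependence recurrences. A sketch of those two bounds, with invented operation counts, unit counts, and latencies:

```python
import math

def res_mii(op_counts, unit_counts):
    """Resource-constrained minimum II: for each resource type,
    ceil(ops of that type / units of that type), maximized."""
    return max(math.ceil(op_counts[t] / unit_counts[t]) for t in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained minimum II: for each dependence cycle,
    ceil(total latency around the cycle / total iteration distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

# Hypothetical loop body: 6 memory ops on 2 ports, 4 ALU ops on 2 ALUs,
# plus one recurrence with latency 4 and dependence distance 1.
ops = {"mem": 6, "alu": 4}
units = {"mem": 2, "alu": 2}
II = max(res_mii(ops, units), rec_mii([(4, 1)]))
```

A cluster-aware scheduler like the one proposed must additionally count inter-cluster bus transfers as a resource when computing these bounds.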