Results 1 - 10 of 231
Improving Data Locality with Loop Transformations
- ACM Transactions on Programming Languages and Systems, 1996
Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior
- ACM Transactions on Programming Languages and Systems, 1999
Cited by 168 (1 self)
This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general difficult, we show that it is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding. The generality of CMEs also allows us to reason about interactions between transformations applied in concert. The article also gives examples of their use to determine array padding and offset amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that offers the generality and precision needed for detailed compiler optimizations.
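A brute-force counterpart to what CMEs capture analytically can make the idea concrete: enumerate the memory accesses of a small loop and count misses in a direct-mapped cache, including the conflict misses that padding eliminates. The cache geometry and the dot-product loop below are illustrative assumptions, not values from the article.

```python
LINE = 32   # bytes per cache line (illustrative)
SETS = 64   # number of sets (direct-mapped: one line per set)
ELEM = 8    # bytes per array element (a double)

def misses(accesses):
    """Count misses for a sequence of byte addresses in a direct-mapped cache."""
    tags = [None] * SETS
    n_miss = 0
    for addr in accesses:
        line = addr // LINE
        s, tag = line % SETS, line // SETS
        if tags[s] != tag:
            tags[s] = tag
            n_miss += 1
    return n_miss

def trace(base_a, base_b, n):
    """Addresses touched by: for i in range(n): c += a[i] * b[i]"""
    for i in range(n):
        yield base_a + i * ELEM
        yield base_b + i * ELEM

# When the two arrays sit exactly one cache-size apart they map to the same
# sets and every access conflict-misses; padding b by one line removes the
# conflicts, leaving only the cold misses.
print(misses(trace(0, SETS * LINE, 256)))         # 512: every access misses
print(misses(trace(0, SETS * LINE + LINE, 256)))  # 128: cold misses only
```

The CME framework reaches the same conclusions symbolically, without enumerating the trace, which is what makes it usable inside a compiler.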
Data-centric Multi-level Blocking
- 1997
Cited by 155 (10 self)
We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers.
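The blocked codes such frameworks generate follow a standard shape, sketched below as a tiled matrix multiply: the loops are restructured so each B x B data block is fully reused while it is resident in cache. The tile size B and the plain-Python setting are illustrative assumptions, not output of the paper's system.

```python
def matmul_tiled(A, Bm, C, n, B=32):
    """C += A * Bm for n x n matrices stored as lists of lists (row-major)."""
    for ii in range(0, n, B):
        for kk in range(0, n, B):
            for jj in range(0, n, B):
                # work on one B x B block of each operand at a time,
                # so the blocks stay resident in cache while reused
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + B, n)):
                            C[i][j] += a * Bm[k][j]
```

Blocking for multiple memory-hierarchy levels, as the paper advocates, nests this pattern again with a larger outer tile per level.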
Cache Miss Equations: An Analytical Representation of Cache Misses
- Proceedings of the 1997 ACM International Conference on Supercomputing
Cited by 115 (4 self)
With the widening performance gap between processors and main memory, efficient memory access behavior is necessary for good program performance. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve memory performance. Effective transformations require detailed knowledge about the frequency and causes of cache misses in the code.
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
- 2000
Cited by 108 (9 self)
Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction level parallelism, respectively. However, these transformations are not independent and each can adversely affect the goal of the other. Furthermore, the best combination will vary dramatically from one processor to the next. In this paper, we therefore address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution time. We evaluate several iterative strategies based on genetic algorithms, random sampling and simulated annealing. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of architectures.
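The iterative search loop itself (generate a candidate parameter setting, measure it, keep the best) can be sketched in a few lines. Here a parameterized Python summation kernel and random sampling stand in for the paper's compiled program versions and its richer strategies (genetic algorithms, simulated annealing); the kernel, candidate tile sizes, and trial counts are all made-up placeholders.

```python
import random
import timeit

def kernel(tile):
    """Stand-in workload whose running time depends on the tile parameter."""
    n = 128
    s = 0
    data = list(range(n * n))
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, ii + tile):
                for j in range(jj, jj + tile):
                    s += data[i * n + j]
    return s

def search(candidates, trials=20, seed=0):
    """Random-sampling iterative search: time each candidate, keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        tile = rng.choice(candidates)
        t = timeit.timeit(lambda: kernel(tile), number=3)
        if best is None or t < best[1]:
            best = (tile, t)
    return best

tile, t = search([2, 4, 8, 16, 32, 64, 128])
print("best tile:", tile)
```

Because the search only needs measured execution times, it adapts to whatever machine it runs on, which is the "architecturally adaptive" property the abstract emphasizes.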
A comparison of empirical and model-driven optimization
- ACM Symposium on Programming Language Design and Implementation (PLDI '03), 2003
Cited by 99 (12 self)
Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose these parameters. It is widely believed that empirical optimization is more effective than model-driven optimization, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the empirical optimization engine in ATLAS (a system for generating dense numerical linear algebra libraries) with a model-based optimization engine that used detailed models to estimate values for optimization parameters, and then measured the relative performance of the two systems on three different hardware platforms. Our experiments show that although model-based optimization can be surprisingly effective, useful models may have to consider not only hardware parameters but also the ability of back-end compilers to exploit hardware resources.
Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity
- Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998
Cited by 87 (1 self)
Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler- or programmer-applied optimizations like data structure padding, matrix blocking, and other program transformations. Compiler optimization can be effective, but the lack of precise analysis and optimization frameworks makes it impossible to confidently make optimal, rather than heuristic-based, program transformations. Imprecision is most problematic in situations where hard-to-predict cache conflicts foil heuristic approaches. Furthermore, the lack of a general framework for compiler memory performance analysis makes it impossible to understand the combined effects of several program transformations. The Cache Miss Equation (CME) framework discussed in this paper addresses these issues. We express memory reference and cache conflict behavior in terms of sets of equations.
Reuse distance as a metric for cache behavior
- Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems, 2001
Cited by 80 (11 self)
The widening gap between memory and processor speed causes more and more programs to shift from CPU-bound to memory-bound, even in the presence of multi-level caches. Powerful cache optimizations are needed to improve the cache behavior and increase the execution speed of these programs. Many optimizations have been proposed, and one may ask what new optimizations should focus on. To answer this question, the distribution of the conflict and capacity misses was measured in the execution of code generated by a state-of-the-art EPIC compiler. The results show that cache conflict misses are reduced, but only a small fraction of the large number of capacity misses are eliminated. Furthermore, it is observed that some program transformations to enhance the parallelism may counter the optimizations to reduce the capacity misses. In order to minimize the capacity misses, the effects of program transformations and hardware solutions are explored, and examples show that a directed approach can be very effective.
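The reuse-distance metric named in the title is simple to state: for each access, count the distinct addresses touched since the previous access to the same address (a first access has infinite distance). A minimal sketch, using a naive O(n^2) scan where production tools use tree-based algorithms:

```python
def reuse_distances(trace):
    """Reuse distance of each access in a list of addresses (None = cold)."""
    last_seen = {}   # address -> index of its previous access
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses touched strictly between the two accesses
            dists.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            dists.append(None)
        last_seen[addr] = i
    return dists

print(reuse_distances(list("abcab")))  # [None, None, None, 2, 2]
```

An access hits in a fully associative LRU cache of capacity C exactly when its reuse distance is below C, which is what makes the metric a natural handle on the capacity misses the paper targets.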
Nonlinear Array Layouts for Hierarchical Memory Systems
- 1999
Cited by 76 (5 self)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2–5% of total running time) and high performance benefits (reducing execution time by factors of 1.1–2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
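The Morton (Z-order) layout the paper evaluates replaces the usual row-major offset `i * n + j` with an offset that interleaves the bits of the two indices, so elements close in 2D stay close in memory at every scale. A minimal sketch, with an assumed fixed bit width rather than anything from the paper:

```python
def morton(i, j, bits=16):
    """Z-order linear offset of 2D index (i, j): interleave the index bits."""
    off = 0
    for b in range(bits):
        off |= ((i >> b) & 1) << (2 * b + 1)   # bits of i fill odd positions
        off |= ((j >> b) & 1) << (2 * b)       # bits of j fill even positions
    return off

# The four elements of each 2x2 block are contiguous in memory:
print([morton(i, j) for i in range(2) for j in range(2)])  # [0, 1, 2, 3]
```

The recursive 2x2-of-2x2 structure is why the paper notes that recursion-based control structures pair naturally with this layout.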
A memory model for scientific algorithms on graphics processors
- Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), 2006
Cited by 74 (4 self)
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C's model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications: sorting, fast Fourier transform, and dense matrix multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30–50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve 2–5× performance improvement.
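The 3C's analysis mentioned above classifies each miss as cold (first touch of a line), capacity (a fully associative LRU cache of the same total size would also miss), or conflict (it would hit). A minimal sketch for a direct-mapped cache; the geometry is an illustrative assumption, not the GPU configuration from the paper.

```python
from collections import OrderedDict

LINE, SETS = 64, 8   # direct-mapped cache with SETS lines (illustrative)

def classify(addrs):
    """3C's classification of each access in a list of byte addresses."""
    dm = [None] * SETS       # direct-mapped tags, one per set
    lru = OrderedDict()      # fully associative LRU cache of the same size
    seen = set()             # lines ever touched
    counts = {"hit": 0, "cold": 0, "capacity": 0, "conflict": 0}
    for a in addrs:
        line = a // LINE
        s, tag = line % SETS, line // SETS
        if dm[s] == tag:
            counts["hit"] += 1
        elif line not in seen:
            counts["cold"] += 1        # first touch: compulsory miss
        elif line in lru:
            counts["conflict"] += 1    # LRU would hit: mapping conflict
        else:
            counts["capacity"] += 1    # even LRU would miss: working set too big
        # update both caches on every access, hit or miss
        dm[s] = tag
        seen.add(line)
        lru[line] = True
        lru.move_to_end(line)
        if len(lru) > SETS:
            lru.popitem(last=False)
    return counts

# Lines 0 and 8 collide in set 0, so the return to line 0 is a conflict miss:
print(classify([l * LINE for l in [0, 8, 0]]))
```

The same bookkeeping, applied to a GPU-style blocked 2D access stream, is how one would separate the misses that layout changes can fix (conflict) from those that need smaller working sets (capacity).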