Results 1 - 10 of 176
The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool
, 2000
"... In this paper, we present the design and use of a comprehensive framework, SimplePower, for evaluating the effect of high-level algorithmic, architectural, and compilation tradeoffs on energy. An execution-driven, cycle-accurate RT level energy estimation tool that uses transition sensitive energy m ..."
Abstract
-
Cited by 198 (11 self)
- Add to MetaCart
In this paper, we present the design and use of a comprehensive framework, SimplePower, for evaluating the effect of high-level algorithmic, architectural, and compilation tradeoffs on energy. An execution-driven, cycle-accurate RT-level energy estimation tool that uses transition-sensitive energy models forms the cornerstone of this framework. SimplePower also estimates the energy consumed in the memory system and on-chip buses using analytical energy models.
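As a rough, hypothetical illustration of what a transition-sensitive model computes, the sketch below charges energy each cycle in proportion to the Hamming distance between consecutive bus values; the capacitance and voltage constants are invented placeholders, not SimplePower's calibrated per-module tables.

```c
/* Minimal sketch of a transition-sensitive energy estimate: dynamic energy
 * is proportional to the number of bits that switch on a bus each cycle.
 * C_BIT and VDD are illustrative assumptions, not values from the paper. */
#include <stdint.h>
#include <stdio.h>

/* Hamming distance between consecutive bus values = bits switched. */
static int bit_flips(uint32_t prev, uint32_t cur) {
    return __builtin_popcount(prev ^ cur);
}

int main(void) {
    const double C_BIT = 1.0e-13;   /* assumed per-bit capacitance (F) */
    const double VDD   = 1.8;       /* assumed supply voltage (V) */
    uint32_t bus_trace[] = { 0x0000FFFF, 0x0000FF00, 0xFFFF0000, 0x00000000 };
    double energy = 0.0;
    /* E = sum over cycles of 0.5 * C * Vdd^2 * (bits switched) */
    for (int i = 1; i < 4; i++)
        energy += 0.5 * C_BIT * VDD * VDD * bit_flips(bus_trace[i-1], bus_trace[i]);
    printf("estimated switching energy: %g J\n", energy);
    return 0;
}
```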
Data and Computation Transformations for Multiprocessors
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
"... Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We havedeveloped the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimiza ..."
Abstract
-
Cited by 177 (15 self)
- Add to MetaCart
(Show Context)
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
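A minimal sketch of the flavor of such a data transformation, under invented sizes and ownership: if each of P processors is assigned every P-th column of a row-major array, copying each processor's columns into one contiguous block turns a strided, falsely-shared access pattern into sequential private traversals. The function and layout below are illustrative, not the paper's framework.

```c
/* Hypothetical layout change: gather the columns owned by each processor
 * into a contiguous block so each processor reads sequential memory. */
#define N 1024
#define P 4                                   /* number of processors */

/* Original layout: row-major a[i][j]; processor p owns columns j % P == p.
 * New layout b: block p holds p's N/P columns, each stored contiguously. */
void strided_to_blocked(const double *a, double *b) {
    for (int p = 0; p < P; p++)
        for (int j = p; j < N; j += P)        /* columns owned by p */
            for (int i = 0; i < N; i++)
                b[p * (N / P) * N + ((j - p) / P) * N + i] = a[i * N + j];
}
```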
Data-centric Multi-level Blocking
, 1997
"... We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the fl ..."
Abstract
-
Cited by 155 (10 self)
- Add to MetaCart
(Show Context)
We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers.
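For readers unfamiliar with blocking, the conventional tiled matrix multiply below shows the kind of code such frameworks produce; the paper's point is to derive it by reasoning about data blocks rather than loop control. The tile size B is an assumed, machine-dependent parameter, not a value from the paper.

```c
/* Tiled (blocked) matrix multiply: each B x B tile of C is computed from
 * B x B tiles of A and B that fit in cache, so data is reused before
 * eviction. B must divide N in this simplified sketch. */
#define N 512
#define B 64                                  /* assumed cache-fitting tile */

void matmul_blocked(double A[N][N], double Bm[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)      /* one tile of C */
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}
```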
Data Transformations for Eliminating Conflict Misses
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
"... Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses ..."
Abstract
-
Cited by 134 (12 self)
- Add to MetaCart
(Show Context)
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occurring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes.
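A small illustration of the two transformations, with invented sizes: on a direct-mapped cache whose size divides a power-of-two row length, walking down a column hits the same cache sets on every iteration. Intra-variable padding enlarges the leading dimension to break that alignment, and inter-variable padding shifts one array's base address relative to another's.

```c
/* Illustrative padding to avoid conflict misses; PAD = 8 is an assumed
 * value, and real compilers pick pads from a cost model. */
#define N   1024
#define PAD 8

double a[N][N + PAD];      /* intra-variable padding on the column size */
double gap[PAD];           /* inter-variable padding: offsets b's base
                              (a sketch; linkers may reorder globals)   */
double b[N][N + PAD];

double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)   /* column walk no longer maps every
                                         access to the same cache set    */
            s += a[i][j] + b[i][j];
    return s;
}
```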
Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time
- In Proceedings of the SIGPLAN ’99 Conference on Programming Language Design and Implementation
, 1999
"... With the rapid improvement of processor speed, performance of the memory hierarchy has become the principal bottleneck for most applications. A number of compiler transformations have been developed to improve data reuse in cache and registers, thus reducing the total number of direct memory accesse ..."
Abstract
-
Cited by 109 (20 self)
- Add to MetaCart
With the rapid improvement of processor speed, performance of the memory hierarchy has become the principal bottleneck for most applications. A number of compiler transformations have been developed to improve data reuse in cache and registers, thus reducing the total number of direct memory accesses in a program. Until now, however, most data reuse transformations have been static, applied only at compile time. As a result, these transformations cannot be used to optimize irregular and dynamic applications, in which the data layout and data access patterns remain unknown until run time and may even change during the computation. In this paper, we explore ways to achieve better data reuse in irregular and dynamic applications by building on the inspector-executor method used by Saltz for run-time parallelization. In particular, we present and evaluate a dynamic approach for improving both computation and data locality in irregular programs.
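A minimal inspector-executor sketch in the spirit of the approach (all names invented): the inspector records the order in which an irregular index array first touches each element, and the data is then permuted into that order so later traversals walk memory sequentially.

```c
/* Hypothetical run-time data reorganization: permute data[] into the
 * first-touch order observed in idx[], then rewrite idx[] to match. */
#include <stdlib.h>
#include <string.h>

void reorder_by_first_touch(double *data, int n, int *idx, int m) {
    int    *newpos = malloc(n * sizeof *newpos);  /* old index -> new index */
    double *tmp    = malloc(n * sizeof *tmp);
    memset(newpos, -1, n * sizeof *newpos);
    int next = 0;

    for (int k = 0; k < m; k++)        /* inspector: record first-touch order */
        if (newpos[idx[k]] < 0)
            newpos[idx[k]] = next++;
    for (int i = 0; i < n; i++)        /* untouched elements go at the end */
        if (newpos[i] < 0)
            newpos[i] = next++;

    for (int i = 0; i < n; i++)        /* move data into the new layout */
        tmp[newpos[i]] = data[i];
    memcpy(data, tmp, n * sizeof *data);
    for (int k = 0; k < m; k++)        /* executor now uses rewritten indices */
        idx[k] = newpos[idx[k]];

    free(tmp);
    free(newpos);
}
```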
Parallel Data Mining for Association Rules on Shared-Memory Multiprocessors
- In Proc. Supercomputing ’96
, 1996
"... Abstract. In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a signific ..."
Abstract
-
Cited by 91 (20 self)
- Add to MetaCart
In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm. Many data-mining tasks (e.g. association rules, sequential patterns) use complex pointer-based data structures (e.g. hash trees) that typically suffer from suboptimal data locality. In the multiprocessor case, shared access to these data structures may also result in false sharing. For these tasks it is commonly observed that the recursive data structure is built once and accessed multiple times during each iteration. Furthermore, the access patterns after the build phase are highly ordered. In such cases, locality- and false-sharing-sensitive memory placement of these structures can enhance performance significantly. We evaluate a set of placement policies for parallel association discovery, and show that simple placement schemes can improve execution time by more than a factor of two. More complex schemes yield additional gains.
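As a small illustration of false-sharing-sensitive placement (a simpler stand-in for the paper's placement policies), per-thread counters such as candidate frequencies can be aligned to cache-line boundaries so concurrent updates never invalidate each other's lines; the 64-byte line size and the names are assumptions.

```c
/* Hypothetical false-sharing avoidance: give each thread's counter its
 * own cache line via C11 alignment, so updates stay thread-local. */
#include <stdalign.h>

#define CACHE_LINE 64                 /* assumed line size */
#define NTHREADS   8

struct padded_count {
    alignas(CACHE_LINE) long count;   /* struct padded/aligned to one line */
};

static struct padded_count freq[NTHREADS];

void tally(int tid) {
    freq[tid].count++;                /* no false sharing across threads */
}
```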
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
- In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
"... An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. ``Proof of concept'' is demonstrated by racing ..."
Abstract
-
Cited by 86 (6 self)
- Add to MetaCart
(Show Context)
An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. "Proof of concept" is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
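A hedged sketch of such a recursive multiply, halving the largest dimension until a small base case that the compiler can unroll; BASE and the fixed array sizes are illustrative, and the paper's version additionally forks the top-level halves into parallel block-oriented processes.

```c
/* Recursive matrix multiply with implicit blocking: every level of the
 * recursion is a "tile" that fits some level of the memory hierarchy. */
#define N    512
#define BASE 16                            /* assumed base-case size */

static double A[N][N], B[N][N], C[N][N];

/* C[i0..][j0..] += A[i0..][k0..] * B[k0..][j0..], over an ni x nj x nk box */
static void rmm(int i0, int j0, int k0, int ni, int nj, int nk) {
    if (ni <= BASE && nj <= BASE && nk <= BASE) {
        for (int i = i0; i < i0 + ni; i++)
            for (int k = k0; k < k0 + nk; k++)
                for (int j = j0; j < j0 + nj; j++)
                    C[i][j] += A[i][k] * B[k][j];
    } else if (ni >= nj && ni >= nk) {     /* split largest dimension: i */
        rmm(i0, j0, k0, ni / 2, nj, nk);
        rmm(i0 + ni / 2, j0, k0, ni - ni / 2, nj, nk);
    } else if (nj >= nk) {                 /* split j */
        rmm(i0, j0, k0, ni, nj / 2, nk);
        rmm(i0, j0 + nj / 2, k0, ni, nj - nj / 2, nk);
    } else {                               /* split k */
        rmm(i0, j0, k0, ni, nj, nk / 2);
        rmm(i0, j0, k0 + nk / 2, ni, nj, nk - nk / 2);
    }
}
/* usage: rmm(0, 0, 0, N, N, N); */
```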
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. ..."
Abstract
-
Cited by 76 (5 self)
- Add to MetaCart
(Show Context)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
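A minimal sketch of the Morton (Z-order) layout the paper evaluates: the storage index interleaves the bits of the row and column indices, so nearby 2D neighbors stay nearby in memory at every scale. This is the standard bit-spreading trick, not code from the paper.

```c
/* Morton-order addressing for an n x n array, n a power of two:
 * a[morton(i, j)] replaces the linear a[i * n + j]. */
#include <stdint.h>

static uint32_t spread_bits(uint32_t x) {   /* ..dcba -> .d.c.b.a */
    x &= 0xFFFF;                            /* supports indices < 2^16 */
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

static uint32_t morton(uint32_t i, uint32_t j) {
    return (spread_bits(i) << 1) | spread_bits(j);  /* interleave i and j */
}
```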
Tiling Optimizations for 3D Scientific Computations
, 2000
"... Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cann ..."
Abstract
-
Cited by 69 (4 self)
- Add to MetaCart
(Show Context)
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show that iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
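A small sketch of the transformation in question: blocking the j and k loops of a 3D stencil sweep so the planes reused along the i dimension stay resident in cache. Tile sizes are illustrative; the paper's contribution is the cost models for choosing non-conflicting tiles and pads.

```c
/* Tiled 7-point-style 3D stencil sweep: jj/kk tiles keep the reused
 * planes in cache as i advances. TJ and TK are assumed tile sizes. */
#define N  128
#define TJ 32
#define TK 32

static double u[N][N][N], v[N][N][N];

void sweep(void) {
    for (int jj = 1; jj < N - 1; jj += TJ)
        for (int kk = 1; kk < N - 1; kk += TK)
            for (int i = 1; i < N - 1; i++)             /* reuse along i */
                for (int j = jj; j < jj + TJ && j < N - 1; j++)
                    for (int k = kk; k < kk + TK && k < N - 1; k++)
                        v[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                                      u[i][j-1][k] + u[i][j+1][k] +
                                      u[i][j][k-1] + u[i][j][k+1]) / 6.0;
}
```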