Delivered performance on modern processors with deep memory hierarchies depends closely on the performance of the memory subsystem. Compiler optimizations aimed at improving cache locality are critical to realizing the performance potential of powerful processors. For scientific applications, several loop transformations have been shown to improve both temporal and spatial locality. Recently, there has been some work on data layout optimizations, i.e., changing the memory layouts of multi-dimensional arrays from the language-defined default, such as column-major storage in Fortran. Such memory layout decisions affect the spatial locality characteristics of loop nests. While data layout transformations are not constrained by data dependences, they have no effect on temporal locality. Loop transformations, on the other hand, are constrained by data dependences and are not readily applicable to imperfect loop nests. More importantly, a loop transformation changes the memory access patterns of all the arrays accessed in a loop nest, so the locality characteristics of some of those arrays may worsen. This paper presents a technique based on integer linear programming (ILP) that attempts to derive the best combination of loop and data layout transformations. Prior attempts to unify loop and data
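The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's ILP formulation. It is a brute-force enumeration over a hypothetical two-deep loop nest that reads `A(i,j)` and `B(j,i)`, and it assumes loop interchange is legal (no dependence forbids it). The names `innermost_stride`, `loop_only_costs`, `best_combination`, `REFS` and `SHAPE` are invented for this illustration. A reference counts as having good spatial locality when consecutive innermost iterations touch adjacent elements (stride 1).

```python
from itertools import product

# Hypothetical loop nest: each array maps to the loop index used in each dimension.
REFS = {"A": ("i", "j"), "B": ("j", "i")}
SHAPE = (64, 64)  # assumed extents of both arrays


def innermost_stride(subscripts, layout, inner_loop, shape):
    """Element distance between accesses on consecutive innermost iterations."""
    n0, n1 = shape
    # Row-major: last dimension is contiguous. Column-major: first dimension is.
    dim_strides = (n1, 1) if layout == "row" else (1, n0)
    return sum(s for sub, s in zip(subscripts, dim_strides) if sub == inner_loop)


def loop_only_costs(refs, shape, layout="col"):
    """Count non-unit-stride references per loop order, with one fixed layout."""
    return {
        inner: sum(innermost_stride(refs[a], layout, inner, shape) != 1 for a in refs)
        for inner in ("i", "j")
    }


def best_combination(refs, shape):
    """Exhaustively search loop order and per-array layout together."""
    best = None
    for inner in ("i", "j"):
        for layouts in product(("row", "col"), repeat=len(refs)):
            assign = dict(zip(refs, layouts))
            cost = sum(
                innermost_stride(refs[a], assign[a], inner, shape) != 1 for a in refs
            )
            if best is None or cost < best[0]:
                best = (cost, inner, assign)
    return best


if __name__ == "__main__":
    print("loop order only (column-major):", loop_only_costs(REFS, SHAPE))
    print("loop order + layouts:", best_combination(REFS, SHAPE))
```

With the default column-major layout fixed, either loop order leaves one of the two references with a large stride, which is the sense in which a loop transformation alone can worsen locality for some arrays. Searching loop order and per-array layout together finds an assignment in which both references have unit stride. Exhaustive search is only feasible for toy cases like this one.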