Results 11–20 of 64
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
 In Proc. of the Intl. Conf. on High Performance Computing, 2001
"... The goal of our project is the development of a program synthesis system to facilitate the development of highperformance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contract ..."
Cited by 34 (26 self)

Abstract
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contractions and arise in electronic structure calculations. This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. We focus on an approach to performing data locality optimization in this context. Preliminary experimental results on an SGI Origin 2000 are encouraging and demonstrate that the approach is effective.
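As a rough sketch of the kind of computation the paper targets (the array names, sizes, and the particular contraction below are ours, not the paper's), a tensor contraction is a multi-dimensional summation over products of arrays; factoring it through an intermediate array trades memory for a large reduction in arithmetic operations:

```python
# Illustrative tensor contraction (names A, B, D and size N are ours):
#   C[i,j] = sum_{k,l} A[i,k] * B[k,l] * D[l,j]
import random

N = 6
rnd = random.Random(0)
A = [[rnd.random() for _ in range(N)] for _ in range(N)]
B = [[rnd.random() for _ in range(N)] for _ in range(N)]
D = [[rnd.random() for _ in range(N)] for _ in range(N)]

# Direct evaluation: O(N^4) multiply-adds.
def contract_direct(A, B, D):
    return [[sum(A[i][k] * B[k][l] * D[l][j]
                 for k in range(N) for l in range(N))
             for j in range(N)] for i in range(N)]

# Factored evaluation: store the intermediate T[k,j] = sum_l B[k,l]*D[l,j],
# cutting the cost to O(N^3) at the price of O(N^2) extra memory; this
# space-time trade-off is what a synthesis system must navigate.
def contract_factored(A, B, D):
    T = [[sum(B[k][l] * D[l][j] for l in range(N)) for j in range(N)]
         for k in range(N)]
    return [[sum(A[i][k] * T[k][j] for k in range(N))
             for j in range(N)] for i in range(N)]
```

Both evaluations compute the same result; the factored form is where data locality optimization of the intermediate array becomes critical.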
PLUTO: A practical and fully automatic polyhedral program optimization system
, 2008
"... We present the design and implementation of a fully automatic polyhedral sourcetosource transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytica ..."
Cited by 29 (7 self)

Abstract
We present the design and implementation of a fully automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model, far beyond what is possible by current production compilers. Unlike previous work, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. We also address generation of tiled code for multiple statement domains of arbitrary dimensionalities under (statement-wise) affine transformations, an issue that has not been addressed previously. Experimental results from the implemented system show very high speedups for local and parallel execution on multicores over state-of-the-art compiler frameworks from the research community as well as the best native compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
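The core transformation PLUTO automates can be sketched by hand (this toy, and its sizes, are our example, not PLUTO's output): rectangular tiling splits each loop into a tile loop and an intra-tile loop so that a cache-sized block of data is reused before being evicted:

```python
# Minimal loop-tiling sketch on a matrix product (illustrative sizes).
N, TS = 8, 4          # problem size and tile size; TS divides N here
A = [[(i * N + j) % 7 for j in range(N)] for i in range(N)]
B = [[(i + 2 * j) % 5 for j in range(N)] for i in range(N)]

def matmul_naive(A, B):
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B):
    C = [[0] * N for _ in range(N)]
    # Each loop is split into a tile loop (ii, jj, kk) and an intra-tile
    # loop; the reordering is legal because the updates commute.
    for ii in range(0, N, TS):
        for jj in range(0, N, TS):
            for kk in range(0, N, TS):
                for i in range(ii, ii + TS):
                    for j in range(jj, jj + TS):
                        for k in range(kk, kk + TS):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

What the paper automates is far harder than this hand-tiled kernel: choosing legal tiling hyperplanes for arbitrary imperfectly nested sequences, and generating the tiled code.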
Increasing temporal locality with skewing and recursive blocking
 In Proc. SC2001, 2001
"... We present a strategy, called recursive prismatic time skewing, that increase temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions iteration space of multiple loops into skewed prisms with both ..."
Cited by 26 (2 self)

Abstract
We present a strategy, called recursive prismatic time skewing, that increases temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions the iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include: multidimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bidirectional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works interprocedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to original codes and to methods described previously in the literature. With an interprocedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.
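A much simpler relative of prismatic skewing illustrates the basic idea (the recurrence and sizes are our toy example): a loop nest whose dependences block both loops can be skewed so that all points on a wavefront w = i + j are mutually independent and can be visited, or parallelized, together:

```python
# Loop-skewing sketch: A[i][j] = A[i-1][j] + A[i][j-1] carries
# dependences (1,0) and (0,1), so neither original loop is parallel.
N = 8

def solve_naive():
    A = [[1] * N for _ in range(N)]
    for i in range(1, N):
        for j in range(1, N):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A

def solve_skewed():
    A = [[1] * N for _ in range(N)]
    # Both sources of a point on wavefront w lie on wavefront w - 1,
    # so visiting wavefronts in increasing order respects all dependences.
    for w in range(2, 2 * N - 1):               # wavefront index w = i + j
        for i in range(max(1, w - N + 1), min(w, N)):
            j = w - i                           # points with i + j == w
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A
```

The paper's contribution goes well beyond this: skewing in multiple dimensions, over the temporal (convergence) loop of iterative methods, recursively blocked, and applied interprocedurally.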
Sparse Tiling for Stationary Iterative Methods
 INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2004
"... In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a runtime reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applicati ..."
Cited by 25 (8 self)

Abstract
In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a run-time reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparse-tiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
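For reference, the untransformed kernel that full sparse tiling reorders is a Gauss–Seidel sweep over a sparse matrix; a minimal sketch over compressed sparse row (CSR) storage looks like this (the 3×3 system is our toy example; the paper's transformation additionally permutes rows and tiles across sweeps):

```python
# Gauss-Seidel sweeps over CSR storage of A = [[4,1,0],[1,4,1],[0,1,4]].
vals = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
cols = [0, 1, 0, 1, 2, 1, 2]
rowptr = [0, 2, 5, 7]
b = [6.0, 6.0, 6.0]

def gauss_seidel(x, sweeps):
    n = len(x)
    for _ in range(sweeps):
        for i in range(n):
            s, diag = b[i], 0.0
            for p in range(rowptr[i], rowptr[i + 1]):
                j = cols[p]
                if j == i:
                    diag = vals[p]
                else:
                    s -= vals[p] * x[j]   # uses already-updated x[j] for j < i
            x[i] = s / diag
    return x
```

The irregular, data-dependent access pattern of `x[cols[p]]` is exactly why compile-time tiling does not apply and a run-time reordering transformation is needed.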
Loop Optimizations for a Class of Memory-Constrained Computations
, 2001
"... Computeintensive multidimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different spacetime tradeoffs, are possible. By computing and storing some int ..."
Cited by 25 (19 self)

Abstract
Compute-intensive multidimensional summations that involve products of several arrays arise in the modeling of the electronic structure of materials. Sometimes several alternative formulations of a computation, representing different space-time trade-offs, are possible. By computing and storing some intermediate arrays, reduction of the number of arithmetic operations is possible, but the size of intermediate temporary arrays may be prohibitively large. Loop fusion can be applied to reduce memory requirements, but that could impede effective tiling to minimize memory access costs. This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays. An algorithm is presented that addresses the selection of tile sizes and the choice of loops for fusion, with the objective of minimizing cache misses while keeping the total memory usage within a given limit. Experimental results are reported that demonstrate the effectiveness of the combined loop tiling and fusion transformations performed using the developed framework.
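The fusion trade-off the paper formalizes can be sketched concretely (the arrays and the particular contraction are our example): fusing the loop that produces an intermediate array with the loop that consumes it shrinks the live intermediate from a full array to a single column:

```python
# C[i,j] = sum_k A[i,k] * T[k,j], where T[k,j] = sum_l B[k,l] * D[l,j].
import random

N = 5
rnd = random.Random(1)
A = [[rnd.randint(0, 9) for _ in range(N)] for _ in range(N)]
B = [[rnd.randint(0, 9) for _ in range(N)] for _ in range(N)]
D = [[rnd.randint(0, 9) for _ in range(N)] for _ in range(N)]

def unfused():
    # Materializes the full N x N intermediate T: O(N^2) temporary memory.
    T = [[sum(B[k][l] * D[l][j] for l in range(N)) for j in range(N)]
         for k in range(N)]
    return [[sum(A[i][k] * T[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def fused():
    # Fusing producer and consumer over j keeps only one O(N) column of T
    # live at a time, at the cost of constraining how the loops can be tiled.
    C = [[0] * N for _ in range(N)]
    for j in range(N):
        t = [sum(B[k][l] * D[l][j] for l in range(N)) for k in range(N)]
        for i in range(N):
            C[i][j] = sum(A[i][k] * t[k] for k in range(N))
    return C
```

The paper's algorithm searches jointly over which loops to fuse and what tile sizes to use, rather than committing to either extreme shown here.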
Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
"... Abstract. Many compute intensive applications spend a significant fraction of their time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execut ..."
Cited by 22 (9 self)

Abstract
Many compute-intensive applications spend a significant fraction of their time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant amount of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization along with locality optimization for the general case of arbitrarily nested loop sequences remains a challenging problem. In this paper, we propose an automatic transformation framework to optimize arbitrarily nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as locality optimization. The framework enables the minimization of inter-tile communication volume in the processor space, and the minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multidimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops, and pipelined parallelism at various levels can be detected. Preliminary results from the implemented framework show promising performance and scalability with input size.
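The legality condition behind tiling hyperplanes can be sketched in a few lines (the dependence vectors below, for a 1-D stencil over time, are our example, and the check is a simplification of the paper's ILP-based search): a loop band can be tiled rectangularly when every dependence has non-negative components in the transformed space:

```python
# Dependence vectors (time, space) of a 1-D three-point stencil.
deps = [(1, -1), (1, 0), (1, 1)]

def transform(T, d):
    # Apply the (here linear) transformation matrix T to dependence vector d.
    return tuple(sum(T[r][c] * d[c] for c in range(len(d)))
                 for r in range(len(T)))

identity = [[1, 0], [0, 1]]   # hyperplanes t and i: (1,-1) forbids tiling
skew = [[1, 0], [1, 1]]       # hyperplanes t and t + i

def tileable(T):
    return all(all(c >= 0 for c in transform(T, d)) for d in deps)
```

Under `identity` the component -1 of (1, -1) makes rectangular tiling illegal; the skewing hyperplane t + i maps the dependences to (1, 0), (1, 1), (1, 2), all non-negative, so the skewed band is tileable. The paper's framework chooses among all such legal hyperplanes by minimizing a communication cost function.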
Transforming complex loop nests for locality
 THE JOURNAL OF SUPERCOMPUTING
"... Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficientl ..."
Cited by 19 (9 self)

Abstract
Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement: to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [21, 27, 1, 23, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as …
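For context, the unimodular baseline the paper compares itself against can be sketched on a toy nest (our example, not the paper's): loop interchange is one unimodular transformation, legal when no dependence direction is reversed; here the only dependence is (0, 1), carried by j, which interchange maps to (1, 0), still lexicographically positive:

```python
# Loop interchange as a unimodular transformation (toy example).
N = 6

def naive():
    A = [[1] * N for _ in range(N)]
    for i in range(N):
        for j in range(1, N):
            A[i][j] = A[i][j - 1] + 1
    return A

def interchanged():
    A = [[1] * N for _ in range(N)]
    for j in range(1, N):       # loops swapped: j is now outermost
        for i in range(N):
            A[i][j] = A[i][j - 1] + 1
    return A
```

Unimodular techniques like this apply only to perfect nests; dependence hoisting, per the abstract, achieves comparable cost on arbitrarily nested loops such as the factorization kernels listed above.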
Automatic Tiling of Iterative Stencil Loops
 ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS, 2004
"... ... This paper presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving the cache performance. The paper first presents a technique which allows loop tiling to satisfy data dependences in spite of the di#culty created by imperfectlynested inner ..."
Cited by 15 (0 self)

Abstract
… This paper presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving cache performance. The paper first presents a technique which allows loop tiling to satisfy data dependences in spite of the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory cost analysis, the paper shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A graph-theoretical algorithm, which takes polynomial time, is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size which minimizes capacity misses. Given the tile size, an efficient and general array-padding scheme is applied to remove conflict misses. Experiments are conducted on sixteen test programs, and preliminary results show an average speedup of 1.58 and a maximum speedup of 5.06 across those test programs.
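A minimal sketch of the skew-then-tile idea for a stencil (the sizes, tile shape, and the simplification of storing every time level are ours; the paper works in place and derives the minimal skew factor): skewing space by the time step makes all dependences non-negative, so rectangular tiles over (t, i + t) can be executed in lexicographic order:

```python
# Time skewing and tiling of a 1-D three-point Jacobi stencil.
T, N, B = 6, 16, 4   # time steps, grid points, tile size (illustrative)

def jacobi_naive(a, steps):
    a = a[:]
    for _ in range(steps):
        nxt = a[:]
        for i in range(1, N - 1):
            nxt[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        a = nxt
    return a

def jacobi_skewed_tiled(a):
    grid = [[0.0] * N for _ in range(T + 1)]   # all time levels kept
    grid[0] = a[:]
    for tt in range(0, T, B):
        for jj in range(0, N + T, B):          # j = i + t, the skewed axis
            for t in range(tt, min(tt + B, T)):
                for j in range(jj, min(jj + B, N + T)):
                    i = j - t
                    if i < 0 or i >= N:
                        continue
                    if i == 0 or i == N - 1:   # boundary values carried over
                        grid[t + 1][i] = grid[t][i]
                    else:
                        grid[t + 1][i] = (grid[t][i - 1] + grid[t][i]
                                          + grid[t][i + 1]) / 3.0
    return grid[T]

start = [float(i % 5) for i in range(N)]
```

After skewing, the stencil dependences (1, -1), (1, 0), (1, 1) become (1, 0), (1, 1), (1, 2), so visiting tiles in lexicographic (tt, jj) order is legal; minimizing the skew factor, as the paper shows, is what keeps tiles cache-friendly.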
A Framework for Sparse Matrix Code Synthesis from High-Level Specifications
, 2000
"... We present compiler technology for synthesizing sparse matrix code from (i) dense matrix code, and (ii) a description of the index structure of a sparse matrix. Our approach is to embed statement instances into a Cartesian product of statement iteration and data spaces, and to produce efficient spar ..."
Cited by 14 (1 self)

Abstract
We present compiler technology for synthesizing sparse matrix code from (i) dense matrix code, and (ii) a description of the index structure of a sparse matrix. Our approach is to embed statement instances into a Cartesian product of statement iteration and data spaces, and to produce efficient sparse code by identifying common enumerations for multiple references to sparse matrices. The approach works for imperfectly nested codes with dependences, and produces sparse code competitive with hand-written library code for the Basic Linear Algebra Subroutines (BLAS).
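The kind of output such a synthesis system produces can be sketched by hand (the matrix and names below are ours): starting from the dense loop nest `for i: for j: y[i] += A[i][j] * x[j]` and a compressed sparse row (CSR) description of the index structure, the synthesized code enumerates only the stored nonzeros of each row:

```python
# Dense reference kernel and its CSR counterpart (toy 3x3 matrix).
dense = [[4.0, 0.0, 1.0],
         [0.0, 3.0, 0.0],
         [2.0, 0.0, 5.0]]

# CSR index structure: nonzero values, their column indices, row offsets.
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
cols = [0, 2, 1, 0, 2]
rowptr = [0, 2, 3, 5]

def spmv_dense(x):
    # The dense specification the compiler starts from.
    return [sum(dense[i][j] * x[j] for j in range(3)) for i in range(3)]

def spmv_csr(x):
    # Sparse code: the j loop becomes an enumeration of row i's nonzeros.
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for p in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[p] * x[cols[p]]
    return y
```

The hard part the paper addresses is doing this rewriting automatically, for imperfectly nested codes with dependences and with several sparse references sharing a common enumeration.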