Results 1  10
of
266
Code generation in the polyhedral model is easier than you think
 In IEEE Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT’04
, 2004
"... Many advances in automatic parallelization and optimization have been achieved through the polyhedral model. It has been extensively shown that this computational model provides convenient abstractions to reason about and apply program transformations. Nevertheless, the complexity of code generation ..."
Abstract

Cited by 167 (16 self)
 Add to MetaCart
(Show Context)
Many advances in automatic parallelization and optimization have been achieved through the polyhedral model. It has been extensively shown that this computational model provides convenient abstractions to reason about and apply program transformations. Nevertheless, the complexity of code generation has long been a deterrent for using polyhedral representation in optimizing compilers. First, code generators have a hard time coping with generated code size and control overhead that may spoil theoretical benefits achieved by the transformations. Second, this step is usually time consuming, hampering the integration of the polyhedral framework in production compilers or feedbackdirected, iterative optimization schemes. Moreover, current code generation algorithms only cover a restrictive set of possible transformation functions. This paper discusses a general transformation framework able to deal with nonunimodular, noninvertible, nonintegral or even nonuniform functions. It presents several improvements to a stateoftheart code generation algorithm. Two directions are explored: generated code size and code generator efficiency. Experimental evidence proves the ability of the improved method to handle reallife problems. 1.
Maximizing Parallelism and Minimizing Synchronization with Affine Transforms
 Parallel Computing
, 1997
"... This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data depende ..."
Abstract

Cited by 148 (6 self)
 Add to MetaCart
This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data dependence abstractions such as data dependence vectors. The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering. 1 Introduction As multiprocessors become popular, it is important to develop compilers that can automatically translate sequential programs into efficient parallel code. Getting high performance on a multiprocessor requires not only finding parallelism in the program but also minimizing the synchronization overhead. Synchronization is expensive on a multiprocessor. The cost of synchronization goes far beyond just the operations that manipul...
A practical automatic polyhedral parallelizer and locality optimizer
 In PLDI ’08: Proceedings of the ACM SIGPLAN 2008 conference on Programming language design and implementation
, 2008
"... We present the design and implementation of an automatic polyhedral sourcetosource transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical mod ..."
Abstract

Cited by 122 (8 self)
 Add to MetaCart
(Show Context)
We present the design and implementation of an automatic polyhedral sourcetosource transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical modeldriven automatic transformation in the polyhedral model.Unlike previous polyhedral frameworks, our approach is an endtoend fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multicores, when compared with stateoftheart compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
Semiautomatic composition of loop transformations for deep parallelism and memory hierarchies.
 Int. J. Parallel Program.,
, 2006
"... Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinational task is ..."
Abstract

Cited by 82 (24 self)
 Add to MetaCart
(Show Context)
Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinational task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semiautomatic optimization approach to demonstrate that current compilers offen suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architectureaware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operation research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. This representation allows to extend the scalability of polyhedral dependence analysis, and to delay the (automatic) legality checks until the end of a transformation sequence. Our work leverages on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler.
Fuzzy Array Dataflow Analysis
, 1997
"... This paper deals with more general control structures, such as ifs and while loops, together with unrestricted array subscripting. Notice that we assume that unstructrured programs are preprocessed, and that for instance gotos are first converted in whiles. However, with such unpredictable, dynamic ..."
Abstract

Cited by 78 (18 self)
 Add to MetaCart
This paper deals with more general control structures, such as ifs and while loops, together with unrestricted array subscripting. Notice that we assume that unstructrured programs are preprocessed, and that for instance gotos are first converted in whiles. However, with such unpredictable, dynamic control structures, no exact information can be hoped for in the general case. However, the aim of this paper is threefold. First, we aim at showing that even partial information can be automatically gathered thanks to Fuzzy Array Dataflow Analysis (FADA). This paper extends our previous work [4] on FADA to general, nonaffine array subscripts. The second purpose of this paper is to formalize and generalize these previous formalisms and to prove general results. Third, we will show that the precise, classical ADA is a special case of FADA.
A Framework for Unifying Reordering Transformations
, 1993
"... We present a framework for unifying iteration reordering transformations such as loop interchange, loop distribution, skewing, tiling, index set splitting and statement reordering. The framework is based on the idea that a transformation can be represented as a schedule that maps the original iterat ..."
Abstract

Cited by 76 (10 self)
 Add to MetaCart
We present a framework for unifying iteration reordering transformations such as loop interchange, loop distribution, skewing, tiling, index set splitting and statement reordering. The framework is based on the idea that a transformation can be represented as a schedule that maps the original iteration space to a new iteration space. The framework is designed to provide a uniform way to represent and reason about transformations. As part of the framework, we provide algorithms to assist in the building and use of schedules. In particular, we provide algorithms to test the legality of schedules, to align schedules and to generate optimized code for schedules. This work is supported by an NSF PYI grant CCR9157384 and by a Packard Fellowship. 1 Introduction Optimizing compilers reorder iterations of statements to improve instruction scheduling, register use, and cache utilization, and to expose parallelism. Many different reordering transformations have been developed and studied, su...
Code Generation for Multiple Mappings
 IN FRONTIERS '95: THE 5TH SYMPOSIUM ON THE FRONTIERS OF MASSIVELY PARALLEL COMPUTATION
, 1994
"... There has been a great amount of recent work toward unifying iteration reordering transformations. Many of these approaches represent transformations as affine mappings from the original iteration space to a new iteration space. These approaches show a great deal of promise, but they all rely on the ..."
Abstract

Cited by 75 (2 self)
 Add to MetaCart
There has been a great amount of recent work toward unifying iteration reordering transformations. Many of these approaches represent transformations as affine mappings from the original iteration space to a new iteration space. These approaches show a great deal of promise, but they all rely on the ability to generate code that iterates over the points in these new iteration spaces in the appropriate order. This problem has been fairly wellstudied in the case where all statements use the same mapping. We have developed an algorithm for the less wellstudied case where each statement uses a potentially different mapping. Unlike many other approaches, our algorithm can also generate code from mappings corresponding to loop blocking. We address the important tradeoff between reducing control overhead and duplicating code.
Is Search Really Necessary to Generate HighPerformance BLAS?
, 2005
"... Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of p ..."
Abstract

Cited by 64 (13 self)
 Add to MetaCart
Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional modeldriven optimization cannot compete with searchbased empirical optimization because tractable analytical models cannot capture all the complexities of modern highperformance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a modeldriven optimization engine, and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that modeldriven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search. Index Terms — program optimization, empirical optimization, modeldriven optimization, compilers, library generators, BLAS, highperformance computing
Synthesizing transformations for locality enhancement of imperfectlynested loop nests
 In Proceedings of the 2000 ACM International Conference on Supercomputing
, 2000
"... We present an approach for synthesizing transformations to enhance locality in imperfectlynested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectlynested loop ..."
Abstract

Cited by 64 (3 self)
 Add to MetaCart
We present an approach for synthesizing transformations to enhance locality in imperfectlynested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectlynested loop nest, so embedding generalizes techniques like code sinking and loop fusion that are used in ad hoc ways in current compilers to produce perfectlynested loops from imperfectlynested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space is then transformed further to enhance locality, after which fully permutable loops are tiled, and code is generated. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. 1. BACKGROUND AND PREVIOUSWORK Sophisticated algorithms based on polyhedral algebra have been developed for determining good sequences of linear loop transformations (permutation, skewing, reversal and scaling) for enhancing locality in perfectlynested loops 1. Highlights of this technology are the following. The iterations of the loop nest are modeled as points in an integer lattice, and linear loop transformations are modeled as nonsingular matrices mapping one lattice to another. A sequence of loop transformations is modeled by the product of matrices representing the individual transformations; since the set of nonsingular matrices is closed under matrix product, this means that a sequence of linear loop transformations can be represented by a nonsingular matrix. The problem of finding an optimal sequence of linear loop transformations is thus reduced to the problem of finding an integer matrix that satisfies some desired property, permitting the full machinery of matrix methods and lattice theory to ¢ This work was supported by NSF grants CCR9720211, EIA9726388, ACI9870687,EIA9972853. £ A perfectlynested loop is a set of loops in which all assignment statements are contained in the innermost loop.
Quantifying the MultiLevel Nature of Tiling Interactions
 INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING
, 1997
"... Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a levelbylevel basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these onelevel cost f ..."
Abstract

Cited by 64 (7 self)
 Add to MetaCart
Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a levelbylevel basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these onelevel cost functions decreases as architectures trend towards a hierarchy of memory and parallelism. We look at three common architectural scenarios. For each, we quantify the improvement a single tiling choice could realize by using information from multiple levels in concert. To do so, we derive multilevel cost functions which guide the optimal choice of tile size and shape. We give both analyses and simulation results to support our points.