Results 1 - 10 of 37
Quantifying the Multi-Level Nature of Tiling Interactions
 INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 1997
"... Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a levelbylevel basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these onelevel cost f ..."
Abstract

Cited by 65 (7 self)
Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a level-by-level basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these one-level cost functions decreases as architectures trend towards a hierarchy of memory and parallelism. We look at three common architectural scenarios. For each, we quantify the improvement a single tiling choice could realize by using information from multiple levels in concert. To do so, we derive multi-level cost functions which guide the optimal choice of tile size and shape. We give both analyses and simulation results to support our points.
Selecting Tile Shape for Minimal Execution Time
, 1999
"... Many computationallyintensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiplynested loops which have a regular stencil of data dependences. Tiling is a wellknown optimization that improve ..."
Abstract

Cited by 28 (2 self)
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops which have a regular stencil of data dependences. Tiling is a well-known optimization that improves performance on such loops, particularly for computers with a multi-level hierarchy of parallelism and memory. Most previous work on tiling restricts the tile shape to be rectangular. Our previous work and its extension by Desprez, Dongarra, Rastello and Robert showed that for doubly nested loops, using parallelograms can improve parallel execution time by decreasing the idle time, the time that a processor spends waiting for data or synchronization. In this paper, we extend that work to more deeply nested loops, as well as to more complex loop bounds. We introduce a model which allows us to demonstrate the equivalence in complexity of linear programming and determining the execution tim...
Loop Optimizations for a Class of Memory-Constrained Computations
, 2001
"... Computeintensive multidimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different spacetime tradeoffs, are possible. By computing and storing some int ..."
Abstract

Cited by 25 (19 self)
Compute-intensive multidimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different space-time tradeoffs, are possible. By computing and storing some intermediate arrays, reduction of the number of arithmetic operations is possible, but the size of intermediate temporary arrays may be prohibitively large. Loop fusion can be applied to reduce memory requirements, but that could impede effective tiling to minimize memory access costs. This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays. An algorithm is presented that addresses the selection of tile sizes and choice of loops for fusion, with the objective of minimizing cache misses while keeping the total memory usage within a given limit. Experimental results are reported that demonstrate the effectiveness of the combined loop tiling and fusion transformations performed by using the developed framework.
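The memory-reduction effect of loop fusion that the abstract mentions can be illustrated with a minimal sketch (a hypothetical producer/consumer pair, not the paper's tensor contractions): fusing the producer loop into the consumer eliminates the intermediate temporary array.

```python
def unfused(x):
    """Producer loop materializes a full intermediate array t,
    then a consumer loop reads it: memory grows with len(x)."""
    t = [v * v for v in x]
    return sum(t[i] + x[i] for i in range(len(x)))


def fused(x):
    """Fusing the two loops removes the intermediate array entirely;
    each t-value lives only in a scalar between produce and consume."""
    total = 0
    for v in x:
        t = v * v   # produced and immediately consumed
        total += t + v
    return total
```

As the abstract notes, fusion is not free: collapsing loops this way can also constrain which tilings remain legal, which is why the paper treats tiling and fusion in one integrated model.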
Algorithmic Issues on Heterogeneous Computing Platforms
, 1998
"... This paper discusses some algorithmic issues when computing with a heterogeneous network of workstations (the typical poor man's parallel computer). Dealing with processors of different speeds requires to use more involved strategies than blockcyclic data distributions. Dynamic data distributi ..."
Abstract

Cited by 21 (10 self)
This paper discusses some algorithmic issues that arise when computing with a heterogeneous network of workstations (the typical poor man's parallel computer). Dealing with processors of different speeds requires more involved strategies than block-cyclic data distributions. Dynamic data distribution is a first possibility but may prove impractical and not scalable due to communication and control overhead. Static data distributions tuned to balance execution times constitute another possibility but may prove inefficient due to variations in the processor speeds (e.g. because of different workloads during the computation). We introduce a static distribution strategy that can be refined on the fly, and we show that it is well-suited to parallelizing scientific computing applications such as finite-difference stencils or LU decomposition.
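A speed-proportional static distribution of the kind the abstract contrasts with block-cyclic can be sketched as follows (a crude stand-in under assumed known relative speeds, not the paper's refined strategy):

```python
def static_blocks(n_blocks, speeds):
    """Assign n_blocks contiguous blocks to processors in proportion to
    their relative speeds (assumed known and fixed here; the paper's
    point is precisely that such speeds drift, motivating on-the-fly
    refinement)."""
    total = sum(speeds)
    counts = [n_blocks * s // total for s in speeds]
    # Hand leftover blocks (from integer rounding) to the fastest first.
    leftover = n_blocks - sum(counts)
    for idx in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        counts[idx] += 1
    return counts
```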
Determining the idle time of a tiling: new results
 Journal of Information Science and Engineering, 1997
"... In the framework of fully permutable loops, tiling has been studied extensively as a sourcetosource program transformation. We build upon recent results by Hogsted, Carter, and Ferrante [12], who aim at determining the cumulated idle time spent by all processors while executing the partitioned (til ..."
Abstract

Cited by 20 (5 self)
In the framework of fully permutable loops, tiling has been studied extensively as a source-to-source program transformation. We build upon recent results by Högstedt, Carter, and Ferrante [12], who aim at determining the cumulated idle time spent by all processors while executing the partitioned (tiled) computation domain. We propose new, much shorter proofs of all their results and extend these in several important directions. More precisely, we provide an accurate solution for all values of the rise parameter that relates the shape of the iteration space to that of the tiles, and for all possible distributions of the tiles to processors. In contrast, the authors in [12] deal only with a limited number of cases and provide upper bounds rather than exact formulas.
Static Tiling for Heterogeneous Computing Platforms
, 1999
"... In the framework of fully permutable loops, tiling has been extensively studied as a sourceto source program transformation. However, little work has been devoted to the mapping and scheduling of the tiles on physical processors. Moreover, targeting heterogeneous computing platforms has, to the bes ..."
Abstract

Cited by 17 (7 self)
In the framework of fully permutable loops, tiling has been extensively studied as a source-to-source program transformation. However, little work has been devoted to the mapping and scheduling of the tiles on physical processors. Moreover, targeting heterogeneous computing platforms has, to the best of our knowledge, never been considered. In this paper we extend static tiling techniques to the context of limited computational resources with different-speed processors. In particular, we present efficient scheduling and mapping strategies that are asymptotically optimal. The practical usefulness of these strategies is fully demonstrated by MPI experiments on a heterogeneous network of workstations. Key words: tiling, communication-computation overlap, mapping, limited resources, different-speed processors, heterogeneous networks. Corresponding author: Yves Robert, LIP, École Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France. Phone: +33 4 72 72 80 37, Fax: +33 4 72 72 80 80. Email: Y...
An Efficient Code Generation Technique for Tiled Iteration Spaces
 IEEE Transactions on Parallel and Distributed Systems, 2003
"... This paper presents a novel approach for the problem of generating tiled code for nested forloops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops onto para ..."
Abstract

Cited by 13 (6 self)
This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multi-level memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work, especially when non-rectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and second sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a non-unimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a non-unimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
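The two subproblems the abstract names (enumerate the tiles, then sweep the points inside each tile) can be shown for the simple rectangular case; this sketch makes no attempt at the parallelepiped tiles or HNF machinery that are the paper's actual contribution.

```python
def tiled_points(n0, n1, t0, t1):
    """Visit every point of an n0 x n1 rectangular iteration space,
    tile by tile, using rectangular t0 x t1 tiles."""
    points = []
    # Subproblem 1: enumerate the tile origins covering the space.
    for i0 in range(0, n0, t0):
        for j0 in range(0, n1, t1):
            # Subproblem 2: sweep the points inside one tile,
            # clipping at the iteration-space boundary.
            for i in range(i0, min(i0 + t0, n0)):
                for j in range(j0, min(j0 + t1, n1)):
                    points.append((i, j))
    return points
```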
Parameterized Tiled Loops for Free
, 2007
"... Parameterized tiled loops—where the tile sizes are not fixed at compile time, but remain symbolic parameters until later—are quite useful for iterative compilers and “autotuners” that produce highly optimized libraries and codes. Tile size parameterization could also enable optimizations such as re ..."
Abstract

Cited by 12 (4 self)
Parameterized tiled loops—where the tile sizes are not fixed at compile time, but remain symbolic parameters until later—are quite useful for iterative compilers and “autotuners” that produce highly optimized libraries and codes. Tile size parameterization could also enable optimizations such as register tiling to become dynamic optimizations. Although it is easy to generate such loops for (hyper)rectangular iteration spaces tiled with (hyper)rectangular tiles, many important computations do not fall into this restricted domain. Parameterized tile code generation for the general case of convex iteration spaces being tiled by (hyper)rectangular tiles has in the past been solved with bounding box approaches or symbolic Fourier-Motzkin approaches. However, both approaches have less than ideal code generation efficiency and resulting code quality. We present the theoretical foundations, implementation, and experimental validation of a simple, unified technique for generating parameterized tiled code. Our code generation efficiency is comparable to all existing code generation techniques including those for fixed tile sizes, and the resulting code is as efficient as, if not more than, all previous techniques. Thus the technique provides parameterized tiled loops for free! Our “one-size-fits-all” solution, which is available as open source software, can be adapted for use in production compilers.
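What "parameterized" buys can be seen in a minimal sketch (a hypothetical tiled loop, not output of the paper's generator): the tile size stays a runtime variable, so an autotuner can search over it without regenerating or recompiling the loop.

```python
def tiled_saxpy(x, y, a, tile):
    """Tiled a*x + y over 1-D arrays.  `tile` remains a runtime
    parameter rather than a compile-time constant, which is the point
    of parameterized tiling: one generated loop serves every tile size
    an autotuner might want to try."""
    n = len(x)
    out = [0.0] * n
    for t in range(0, n, tile):
        for i in range(t, min(t + tile, n)):
            out[i] = a * x[i] + y[i]
    return out
```

An autotuner would simply time this same function for several `tile` values and keep the fastest, with no recompilation between trials.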
Towards optimal multi-level tiling for stencil computations
 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007
"... Stencil computations form the performancecritical core of many applications. Tiling and parallelization are two important optimizations to speed up stencil computations. Many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on t ..."
Abstract

Cited by 11 (0 self)
Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations to speed up stencil computations. Many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on the combination of the two techniques, but also on many parameters: tile and loop sizes in each dimension; computation-communication balance of the code; processor architecture; message startup costs; etc. The best choices can only be determined through design-space exploration, which is extremely tedious and error-prone to do via exhaustive experimentation. We characterize the space of multi-level tilings and parallelizations for 2D/3D Gauss-Seidel stencil computations. A systematic exploration of a part of this space enabled us to derive a design which is up to a factor of two faster than the standard implementation.
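For reference, the kind of kernel being tiled here is the Gauss-Seidel sweep, which updates a grid in place using already-updated neighbors (an untiled 2D sketch; the paper's subject is how to tile and parallelize such sweeps, which this does not attempt):

```python
def gauss_seidel_sweep(u):
    """One in-place Gauss-Seidel sweep over the interior of a 2-D grid.
    Because u is updated in place, the (i-1, j) and (i, j-1) reads see
    already-updated values -- the dependence pattern that makes tiling
    and parallelizing this stencil nontrivial."""
    n, m = len(u), len(u[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
    return u
```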
Affine transformation for communication minimal parallelization and locality optimization of arbitrarily nested loop sequences
, 2007
"... A long running program often spends most of its time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of executionreordering loop transformations ..."
Abstract

Cited by 10 (7 self)
A long-running program often spends most of its time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that improve performance by parallelization as well as better locality. Although a significant amount of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization along with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem: most frameworks do not treat parallelization and locality optimization in an integrated manner, and/or do not optimize across a sequence of producer-consumer loops. In this paper, we develop an approach to communication minimization and locality optimization in tiling of arbitrarily nested loop sequences with affine dependences. We address the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. The approach can also fuse across a long sequence of loop nests that have a producer/consumer relationship. Programs requiring one-dimensional versus multidimensional time schedules are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism, and inner parallel loops can be detected. Examples are provided that demonstrate the power of the framework. The algorithm has been incorporated into a tool chain to generate transformations from C/Fortran code in a fully automatic fashion.