Results 1–10 of 19
Selecting Tile Shape for Minimal Execution Time
, 1999
Abstract

Cited by 28 (2 self)
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops which have a regular stencil of data dependences. Tiling is a well-known optimization that improves performance on such loops, particularly for computers with a multi-levelled hierarchy of parallelism and memory. Most previous work on tiling restricts the tile shape to be rectangular. Our previous work and its extension by Desprez, Dongarra, Rastello and Robert showed that for doubly nested loops, using parallelograms can improve parallel execution time by decreasing the idle time, the time that a processor spends waiting for data or synchronization. In this paper, we extend that work to more deeply nested loops, as well as to more complex loop bounds. We introduce a model which allows us to demonstrate the equivalence in complexity of linear programming and determining the execution tim...
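As a minimal illustration of the kind of loop nest this abstract describes (not code from the paper), the sketch below tiles a doubly nested loop whose dependence vectors are (1,0) and (0,1). The function names `untiled` and `tiled` are invented for the example; lexicographic tile order is legal here because both dependence components are non-negative.

```python
def untiled(n):
    # 2D recurrence with the regular stencil of dependences (1,0) and (0,1)
    a = [[1] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

def tiled(n, t):
    # The same computation swept tile by tile with rectangular t x t tiles;
    # visiting tiles in lexicographic order respects both dependences.
    a = [[1] * n for _ in range(n)]
    for ii in range(1, n, t):
        for jj in range(1, n, t):
            for i in range(ii, min(ii + t, n)):
                for j in range(jj, min(jj + t, n)):
                    a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a
```

The tiled version computes exactly the same values; the benefit in practice comes from cache reuse inside each tile and, on parallel machines, from scheduling whole tiles instead of single iterations.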
An Efficient Code Generation Technique for Tiled Iteration Spaces
 IEEE Transactions on Parallel and Distributed Systems
, 2003
Abstract

Cited by 13 (6 self)
This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multi-level memory hierarchies, as well as to efficiently execute loops on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when non-rectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and second sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a non-unimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a non-unimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
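The HNF machinery mentioned above can be sketched for the 2x2 case. This is an illustrative, hand-rolled triangularisation, not the paper's implementation; `hnf_2x2` and `ext_gcd` are invented names, and the input is assumed nonsingular. The columns of the matrix are the tile edge vectors, and the unimodular column operations leave the lattice they span unchanged.

```python
def ext_gcd(a, b):
    # Extended Euclid: returns (g, x, y) with a*x + b*y == g == gcd(a, b)
    if b == 0:
        return (abs(a), 1 if a >= 0 else -1, 0)
    g, x, y = ext_gcd(b, a % b)
    return (g, y, x - (a // b) * y)

def hnf_2x2(m):
    # Lower-triangular column Hermite Normal Form of a nonsingular 2x2
    # integer matrix, computed with unimodular column operations only.
    (a, b), (c, d) = m
    g, x, y = ext_gcd(a, b)
    u, v = -(b // g), a // g      # second column: zeroes out the (0,1) entry
    h = [[a * x + b * y, a * u + b * v],
         [c * x + d * y, c * u + d * v]]
    if h[1][1] < 0:               # normalise the sign of the diagonal
        h[0][1], h[1][1] = -h[0][1], -h[1][1]
    h[1][0] -= (h[1][0] // h[1][1]) * h[1][1]  # reduce the off-diagonal
    return h
```

The absolute determinant, i.e. the number of lattice points per tile, is preserved: it equals the product of the diagonal entries of the HNF.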
Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers
 ACM Trans. Programming Languages and Systems
, 2002
Abstract

Cited by 13 (0 self)
On shared memory parallel computers (SMPCs) it is natural to focus on decomposing the computation (mainly by distributing the iterations of the nested Do-loops). In contrast, on distributed memory parallel computers (DMPCs) the decomposition of computation and the distribution of data must both be handled in order to balance the computation load and to minimize the migration of data. We propose and validate experimentally a method for handling computations and data synergistically to optimize the overall execution time. The method relies on a number of novel techniques, also presented in this paper. The core idea is to rank the "importance" of data arrays in a program and designate some of them as dominant. The intuition is that the dominant arrays are the ones whose migration would be the most expensive. Using the correspondence between iteration space mapping vectors and distributed dimensions of the dominant data array in each nested Do-loop, we are able to design algorithms for determin...
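A toy version of the dominance ranking described above might look as follows. The cost model (element count times reference count) and the name `rank_dominant` are assumptions made for illustration, not the paper's actual metric.

```python
def rank_dominant(sizes, refs):
    # sizes: array name -> element count
    # refs:  array name -> number of loop nests that reference it
    # Estimated migration cost of an array = its size times how often it is
    # touched; the most expensive-to-migrate arrays are ranked first.
    cost = {name: sizes[name] * refs.get(name, 0) for name in sizes}
    return sorted(cost, key=cost.get, reverse=True)
```

The arrays at the head of the ranking are the "dominant" ones: the decomposition keeps them in place and aligns the other arrays and the iteration mapping to them.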
A multilevel parallelization framework for high-order stencil computations
 in Euro-Par
, 2009
Abstract

Cited by 11 (1 self)
Abstract. Stencil-based computation on structured grids is a common kernel in a broad range of scientific applications. The order of stencils increases with the required precision, and it is a challenge to optimize such high-order stencils on multicore architectures. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a 6th-order stencil-based seismic wave propagation code on a suite of multicore architectures. Strong-scaling scalability tests exhibit superlinear speedup due to increasing cache capacity on Intel Harpertown and AMD Barcelona based clusters, whereas weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve a 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.
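The spatial-decomposition level of such a framework can be sketched in a few lines: split a 1D domain into subdomains, pad each with ghost cells of the stencil radius, and check against the sequential sweep. The radius-3 stencil and the uniform weights below are placeholders, not the paper's 6th-order finite-difference coefficients.

```python
R = 3                            # halo radius of a radius-3 (7-point) stencil
W = [1, 1, 1, 1, 1, 1, 1]        # 2*R + 1 illustrative weights

def stencil_seq(a):
    # One sequential sweep over the valid interior [R, n - R)
    n = len(a)
    return [sum(W[k + R] * a[i + k] for k in range(-R, R + 1))
            for i in range(R, n - R)]

def stencil_decomposed(a, chunks):
    # Spatial decomposition: each subdomain works on a local window padded
    # with R ghost cells on each side, mimicking an inter-node halo exchange.
    n = len(a)
    out = []
    lo, hi = R, n - R
    step = (hi - lo + chunks - 1) // chunks
    for s in range(lo, hi, step):
        e = min(s + step, hi)
        local = a[s - R : e + R]             # subdomain plus ghost cells
        out.extend(sum(W[k + R] * local[i - s + R + k]
                       for k in range(-R, R + 1))
                   for i in range(s, e))
    return out
```

In the real framework each subdomain would run on a different node (with the ghost cells exchanged by messages) and the innermost sum would be vectorised with SIMD; here the subdomains are simply processed in turn to keep the sketch checkable.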
Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces
 In Proceedings of the 2002 ACM/IEEE conference on Supercomputing (SC2002)
, 2002
Abstract

Cited by 6 (3 self)
In this paper we propose several alternative methods for the compile-time scheduling of Tiled Nested Loops onto a fixed-size parallel architecture. We investigate the distribution of tiles among processors, provided that we have chosen either a non-overlapping communication mode, which involves successive computation and communication steps, or an overlapping communication mode, which supposes a pipelined, concurrent execution of communication and computations. In order to utilize the available processors as efficiently as possible, we can either adopt a cyclic assignment schedule, or assign neighboring tiles to the same CPU, or adapt the size and shape of tiles, so that the required number of processors is exactly equal to the number of the available ones. We theoretically and experimentally compare the proposed schedules, so as to design one which achieves the minimum total execution time, depending on the cluster configuration (i.e., number and type of nodes, interconnect bandwidth, etc.), the internal characteristics of the underlying architecture (i.e., NIC and DMA latencies, etc.) and the iteration space size and shape.
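The difference between the two communication modes can be captured by an idealised cost model; this is an illustration, not the paper's formulas. With non-overlapping communication every step pays compute plus communication, while in the pipelined, overlapping mode each transfer is hidden behind the next step's computation, so in steady state only the slower of the two phases counts.

```python
def nonoverlapping_time(steps, t_comp, t_comm):
    # Successive computation and communication phases: nothing is hidden.
    return steps * (t_comp + t_comm)

def overlapping_time(steps, t_comp, t_comm):
    # Pipelined execution: after the first computation fills the pipe,
    # each step costs max(t_comp, t_comm); the last transfer drains it.
    return t_comp + (steps - 1) * max(t_comp, t_comm) + t_comm
```

Whenever t_comp and t_comm are both positive, the overlapping schedule is at least as fast, which is the motivation for DMA-capable NICs that move data while the CPU computes.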
Message-Passing Code Generation for Non-rectangular Tiling Transformations
 Parallel Computing
, 2006
Abstract

Cited by 5 (2 self)
Tiling is a well-known loop transformation used to reduce communication overhead in distributed memory machines. Although a lot of theoretical research has been done concerning the selection of proper tile shapes that reduce processor idle times, there is no complete approach to automatically parallelize non-rectangularly tiled iteration spaces, and consequently there are no actual experimental results to verify previous theoretical work on the effect of the tile shape on the overall completion time of a tiled algorithm. This paper presents a complete end-to-end framework to generate automatic message-passing code for tiled iteration spaces. It considers general parallelepiped tiling transformations and convex iteration spaces. We aim to address all problems concerning data-parallel code generation efficiently by transforming the initial non-rectangular tile into a rectangular one. In this way, data distribution and the respective communication pattern become simple and straightforward. We have implemented our parallelizing techniques in a tool which automatically generates MPI code, and we have run several benchmarks on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations, and verify previous theoretical work on scheduling-optimal, non-rectangular tile shapes.
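One sub-problem such a framework must solve is deriving the tile-level communication pattern. The sketch below, with the invented name `tile_send_pattern`, computes which neighbouring tiles an interior tile must send data to, under the simplifying assumption of rectangular tiles with every edge longer than every dependence component.

```python
from itertools import product

def tile_send_pattern(dep_vectors):
    # A non-negative iteration-space dependence d that crosses a tile face
    # induces tile-level dependences whose components are 0 or 1: near a
    # tile corner a diagonal dependence reaches the corner neighbour, while
    # near a face it reaches only the face neighbour.
    sends = set()
    for d in dep_vectors:
        choices = [(0, 1) if dk > 0 else (0,) for dk in d]
        for t in product(*choices):
            if any(t):
                sends.add(t)
    return sends
```

For non-rectangular tiles this pattern is harder to derive directly, which is exactly why the paper first maps the parallelepiped tile to a rectangular one.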
and W. Cai. Time-minimal Tiling when Rise is Larger than Zero
 Parallel Computing
, 2002
Abstract

Cited by 5 (0 self)
Abstract. This paper presents a solution to the open problem of finding the optimal tile size to minimise the execution time of a parallelogram-shaped iteration space on a distributed memory machine when the rise of the tiled iteration space is larger than zero. Based on a new communication cost model, which accounts for computation and communication overlap in tiled programs, the problem is formulated as a discrete nonlinear optimisation problem and the closed-form optimal tile size is derived. Our experimental results show that the execution times when optimal tile sizes are used are close to the experimental best. The proposed technique can be used for hand-tuning parallel codes and in optimising compilers.
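The structure of such a derivation can be illustrated with a generic pipelined cost model (the model, the names, and the parameter values below are all assumptions for the sketch, not the paper's): total time as a function of tile size s trades per-tile startup cost against pipeline fill, and setting the derivative to zero yields a closed-form optimum that a brute-force search over integer sizes confirms.

```python
from math import sqrt

def total_time(s, n, p, t_c, t_s):
    # Idealised pipelined model: n/s tiles per processor plus p-1 pipeline
    # fill steps, each step costing t_c*s compute plus a fixed startup t_s.
    return (n / s + p - 1) * (t_c * s + t_s)

def best_tile_size(n, p, t_c, t_s):
    # Closed form from dT/ds = -n*t_s/s**2 + (p-1)*t_c = 0
    closed = sqrt(n * t_s / ((p - 1) * t_c))
    # Discrete optimum by exhaustive search over integer tile sizes
    brute = min(range(1, n + 1), key=lambda s: total_time(s, n, p, t_c, t_s))
    return closed, brute
```

The discrete optimum is one of the two integers adjacent to the continuous one, which is why rounding the closed-form result is usually good enough in practice.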
Parallelization of the Numerical Lyapunov Calculation for the FermiPastaUlam Chain
, 2001
Abstract

Cited by 1 (1 self)
In this paper, we present an efficient and simple solution to the parallelization of discrete integration programs for ordinary differential equations (ODEs). The main technique used is known as loop tiling. To avoid the overhead due to code complexity and border effects, we introduce redundant tasks and we use non-parallelepiped tiles. Thanks both to cache reuse (a factor of 4.3) and coarse granularity (a factor of 24.5), the speedup using 25 processors over the non-tiled sequential implementation is larger than 106. We also present ...
Efficient tiling for an ODE discrete integration program: redundant tasks
Abstract

Cited by 1 (0 self)
In this paper, we present an efficient and simple solution to the parallelization of discrete integration programs for ordinary differential equations (ODEs). The main technique used is known as loop tiling. To avoid the overhead due to code complexity and border effects, we introduce redundant tasks and we use non-parallelepiped tiles. Thanks both to cache reuse (a factor of 4.3) and coarse granularity (a factor of 24.5), the speedup using 25 processors over the non-tiled sequential implementation is larger than 106.
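The redundant-task idea, recomputing a tile's border region locally instead of communicating it between time steps, can be sketched on a 1D three-point stencil. This is an illustration under simplified assumptions (integer sums, fixed domain boundaries), not the paper's ODE integrator, and the function names are invented.

```python
def steps_seq(a, b):
    # b sweeps of a 3-point sum stencil; the domain boundary cells stay fixed
    a = a[:]
    for _ in range(b):
        new = a[:]
        for i in range(1, len(a) - 1):
            new[i] = a[i - 1] + a[i] + a[i + 1]
        a = new
    return a

def steps_overlapped(a, b, chunk):
    # Each chunk of the interior is advanced b steps at once from a local
    # window padded by b cells on each side; the padding is recomputed
    # redundantly, so no data is exchanged between the b sweeps.
    n = len(a)
    out = a[:]
    for s in range(1, n - 1, chunk):
        e = min(s + chunk, n - 1)
        lo, hi = max(0, s - b), min(n, e + b)
        local = steps_seq(a[lo:hi], b)
        out[s:e] = local[s - lo : e - lo]
    return out
```

After k sweeps only cells at distance at least k from a window edge are still exact, so a padding of b cells guarantees the chunk itself is correct after b sweeps; the redundant work on the padding is the price paid for coarser granularity.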
Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs
Abstract

Cited by 1 (0 self)
Abstract. This paper proposes a novel approach for the parallel execution of tiled iteration spaces onto a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory-mapped PCI-SCI Network Interface Card. We apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. In this way, intra-node (intra-group) communication is eliminated. Groups are atomically executed inside each node. Nodes exchange data between successive group computations. We schedule groups much more efficiently by exploiting the inherent overlap between the communication and computation phases of successive atomic group executions. The applied non-blocking schedule resembles a pipelined datapath, where group computation phases are overlapped with communication ones, instead of being interleaved with them. Our experimental results illustrate that the proposed method outperforms previous approaches involving blocking communication or conventional grouping schemes.
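A minimal sketch of hyperplane grouping, assuming non-negative tile-level dependence vectors such as (1,0) and (0,1): tiles whose coordinates lie on the same hyperplane ii + jj = k are mutually independent, so each group can execute concurrently on the CPUs of one SMP node while the groups themselves are processed in increasing k. The function name is invented for the example.

```python
def group_by_hyperplane(tiles):
    # Tiles on the same hyperplane ii + jj = k have no mutual dependence
    # when all dependence vectors are non-negative, so each returned group
    # is a set of tiles that may run concurrently; groups are ordered by k.
    groups = {}
    for t in tiles:
        groups.setdefault(sum(t), []).append(t)
    return [groups[k] for k in sorted(groups)]
```

The grouping used in the paper aggregates neighbouring independent tiles into node-sized groups along such hyperplanes; this sketch only shows the independence structure the grouping exploits.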