N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, 1997. 2.1.1
....arranging the memory and ALU operations in the respective units. This model can be found in DSP processors, embedded system and shared-memory multiprocessors computer systems. For example, separated ALU and memory units exist in TMS320C64x. In the past, a lot of work has been done on loop fusion [4] and loop tiling [8, 2]. Loop fusion is used to fuse the consecutive loops into a single loop to exploit the data locality and reduce the additional synchronization. However, the memory reference may be too much to be hidden efficiently even after the loops are fused. Therefore, we use multiple loop ....
Naraig Manjikian and Tarek S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2), Feb 1997.
....usage is not reduced). Their focus is in recognizing the opportunity in a scalar loop nest, while ours is in enabling the opportunity in an array language compiler via statement fusion. Many techniques for improving locality by loop transformations have appeared in the literature [CMT94, KM92, MA97, WL91]. Much of this work addresses the issue of managing the conflicting goals of improving locality without sacrificing parallelism. This is a far less important issue in an array language compiler, for the compiler can assume that only the loops that it generates need to be parallelized; user ....
Naraig Manjikian and Tarek S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE transactions on parallel and distributed systems, 8(2):193--209, February 1997.
....is a combination of skewing and subsequent permutation. It seeks to improve temporal locality in loop nestings by reducing the iteration distance between subsequent accesses to the same array element [8, 3]. Moreover, loop fusion allows to exploit locality of reference across single loop nestings [9]. Often, superior cache performance can be achieved if both the iteration order as well as the memory layout are subject to compiler transformations. Examples are the combination of array transposition with loop permutation [2] or that of array padding with tiling in order to increase tile sizes ....
N. Manjikian and T.S. Abdelrahman. Fusion of Loops for Parallelism and Locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193-209, 1997.
....dependences by peeling off some iterations of the first loop and then applying fusion on the remaining parts [13]. While Porterfield considered only a pair of loops, Manjikian and Abdelrahman later extended peel and jam to find the minimal peeling factor for a group of fusible loops [11]. Also enabled by peel and jam, Song and Li developed a new tiling method that blocks multiple loops within a time step loop [16]. However, these methods are not a complete global strategy because they did not address the cases where not all loops in a program are fusible. In addition, peel and jam ....
....mixed results. On SGI Octane, the former was improved by 10 but the latter interacted poorly with the SGI compiler [15]. The previous work on loop fusion did not combine it with data transformations with two exceptions. Manjikian and Abdelrahman, who applied padding to reduce cache conflicts [11]. Array padding is less effective than inter array spatial reuse because the latter eliminates cache conflicts by placing simultaneously used data into the same cache block. In fact, SGI compiler has padding as a part of its optimization but still causes serious fusion overhead. In a recent study, ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8, 1997.
....fusion-preventing dependencies case and no full parallelism is guaranteed. Al-Mouhamed's [3] method of loop fusion is based on vertical and horizontal fusion, with fusion not performed if fusion-preventing dependencies exist or if the fused loop prevents parallelization. Manjikian and Abdelrahman [17, 18] suggest a shift-and-peel loop transformation to fuse loops and allow parallel execution. The shifting part of the transformation may fuse loops in the presence of fusion-preventing dependencies. However, when the number of peeled iterations exceeds the number of iterations per processor, this ....
....of the multidimensional retiming technique on VLSI systems also has been discussed in [24] and shown to be less complex than other techniques. In this paper, we focus on the loop fusion problem, concentrating on a comparison of the proposed solution with similar techniques recently published [3, 7, 13, 17, 18]. To the authors' knowledge, most work on loop fusion has not addressed problems characterized by nested loop (multi-level) fusion-preventing carried dependencies. In this paper, these problems are solved using the idea of multi-dimensional retiming [15, 20, 25, 26]. By using the multi-dimensional ....
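The excerpts above mention shifting as a way to fuse loops despite a fusion-preventing dependence. The following is a minimal Python sketch of the shifting step only, under stated assumptions: the two loops, the array names `a`, `b`, `c`, and the shift distance of one iteration are hypothetical illustrations and are not taken from the cited paper. The peeling step that the excerpts associate with parallel execution is not shown.

```python
def unfused(b):
    """Two separate loops: the second reads a[i + 1], produced by the first."""
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = 2 * b[i]
    for i in range(n - 1):
        c[i] = a[i + 1] + a[i]
    return a, c


def fused_with_shift(b):
    """Fused loop with the second body shifted by one iteration.

    Fusing the bodies directly would read a[i + 1] before it is written.
    Delaying the second body by one iteration (it computes c[i - 1] in
    iteration i) makes every value it reads already available.
    """
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = 2 * b[i]
        if i >= 1:  # shifted body: original iteration i - 1 of the second loop
            c[i - 1] = a[i] + a[i - 1]
    return a, c
```

Both functions are expected to return the same arrays for any input list, which is the property that makes the shifted fusion a legal reordering in this small example.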
N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
....L2 group reuse. These techniques easily generalize to three or more cache levels. 4 Loop Fusion Loop fusion is a transformation where adjacent loops are fused into a single loop containing both loop bodies. It can be used to improve locality directly by bringing together memory references [14, 18, 25], or to enable additional locality optimizations such as loop permutation [19] and array contraction [9]. We observe improvements in temporal locality after fusing the loop nests of Figure 2 at the innermost level, obtaining the nest shown in Figure 6. Assuming array sizes exceed the L2 cache ....
....performance. In this paper we extend our padding algorithms to consider multi level caches. A number of researchers have examined techniques related to this paper. Manjikian and Abdelrahman propose a new loop fusion algorithm called shift and peel which expands the applicability of loop fusion [18]. They also propose cache partitioning, a version of MAXPAD which does not take severe conflict misses into account. Singhai and McKinley present a parameterized loop fusion algorithm which considers parallelism and register pressure in addition to reuse [25]. In comparison, our fusion algorithm ....
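The definition of loop fusion quoted above (adjacent loops merged into a single loop containing both bodies) can be illustrated with a small Python sketch. The function and array names are hypothetical, and Python lists do not model cache behavior; the sketch only shows that the fused form touches `y[i]` for reading right after it is written, instead of in a second pass over the array.

```python
def separate_loops(x):
    """Two adjacent loops over the same index range."""
    n = len(x)
    y = [0.0] * n
    z = [0.0] * n
    for i in range(n):
        y[i] = x[i] + 1.0
    for i in range(n):
        z[i] = y[i] * 2.0  # y[i] is re-read in a second pass
    return y, z


def fused_loop(x):
    """The same computation with both bodies in one loop."""
    n = len(x)
    y = [0.0] * n
    z = [0.0] * n
    for i in range(n):
        y[i] = x[i] + 1.0
        z[i] = y[i] * 2.0  # y[i] is reused immediately after it is produced
    return y, z
```

Fusion is legal here because iteration `i` of the second loop depends only on iteration `i` of the first.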
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
....(i.e. memory usage is not reduced). Their focus is in recognizing the opportunity in a scalar loop nest, while ours is in enabling the opportunity in an array language compiler via statement fusion. Many techniques for improving locality by loop transformations have appeared in the literature [5, 16, 19, 25]. Much of this work addresses the issue of managing the conflicting goals of improving locality without sacrificing parallelism. This is a far less important issue in an array language compiler, for the compiler can assume that only the loops that it generates need to be parallelized; user loops ....
Naraig Manjikian and Tarek S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE transactions on parallel and distributed systems, 8(2):193--209, February 1997.
....objectives at once [16]. In addition to tiling, researchers working on locality optimizations have considered both computation reordering transformations such as loop permutation [9, 17, 25] and loop fission/fusion [15, 17]. Scalar replacement replaces array references with scalars, reducing the number of memory references if the compiler later puts the scalar in a register [3]. Many cache models have been designed for estimating cache misses to help guide data locality optimizations [8, 9, 17, 25]. Earlier models ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
....loop carried dependences and loop fusion can be applied profitably. In Figures 1c and 1d, parallel code is shown for a block distribution of a and b, where lb and ub denote the boundaries of the array region owned by the local processor. Enabling parallelization of fused loops is discussed in [MA97] and they provide a solution which only applies to programs with constant dependence distances. Tolerating message latency In order to hide message latency, compilers try to overlap computation with communication. Consider a loop body (possibly containing inner loops) which may contain ....
....Tip95, Kri] we are unaware of any work describing the concept of computing an iteration space slice. However, our work does not address many of the important issues that others have studied, such as complicated or irregular control flow and interprocedural slicing. Manjikian and Abdelrahman [MA97] discussed the problem of loop fusion creating loop carried dependences and preventing parallelism. Their solution [MA97] works only in the presence of constant dependence distances and generates code that requires a barrier and will be inefficient if message latency is high. Since their method ....
[Article contains additional citation context not shown here]
Naraig Manjikian and Tarek S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, February 1997.
....and Sarkar give an integer programming solution for the weighted fusion problem [17] which produces optimal solutions but does not model parallelism or register constraints, neither is it parameterizable. Manjikian and Abdelrahman consider fusion for parallelism and locality but not together [20, 21]. Moreover, they do not get an optimal solution. They utilize loop shifting and peeling to parallelize loop nests. 7. FUTURE WORK This algorithm does not fully capture cache effects. It is not clear how fusion interacts with other optimizations. There are several loop transformations like ....
Manjikian, N. and Abdelrahman, T. S. (1997 Feb) Fusion of Loops for Parallelism and Locality. IEEE Transactions on Parallel and Distributed Systems, 8.
....between references to the same array [9]. Most compiler researchers have concentrated on computation reordering transformations. Loop permutation and tiling are the primary optimization techniques [9, 17, 23], though loop fission (distribution) and loop fusion have also been found to be helpful [15, 17]. Coleman and McKinley show how to select tile sizes which avoid conflict misses using the Euclidean algorithm [7]. McKinley and Temam perform a study of loop nest oriented cache behavior for scientific programs and conclude that conflict misses cause half of all cache misses and most intra nest ....
....and Kandemir et al. [14] investigate array transpose as a technique for improving data locality in uniprocessors. Manjikian and Abdelrahman perform cache partitioning, spacing out variables as far as possible in a cache, in order to reduce conflict misses in parallel programs after loop fusion [15]. McFarling shows compiler transformations of code sequences can help eliminate conflict misses in the instruction cache [16]. Many researchers have also examined the problem of deriving estimates of cache misses in order to help guide data locality optimizations [8, 9, 23]. These models typically ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
....shown to reduce conflict misses in the SPEC benchmarks [15]. Researchers working on compile time data locality optimizations have concentrated on computation reordering such as loop permutation [7, 17, 23] and tiling [5, 14, 23], though loop fission and fusion have also been found to be helpful [16, 17]. Many cache models have been designed for estimating cache misses to help guide data locality optimizations [6, 7, 17, 23]. These models typically can predict only capacity misses because they assume a fully associative cache. Temam et al. present a method for detecting and counting the number of ....
....contiguous [1] and combining array transpose with loop permutation to improve parallelism and locality [4, 12, 19]. Manjikian and Abdelrahman perform cache partitioning, spacing out variables as far as possible in a cache, in order to reduce conflict misses in parallel programs after loop fusion [16]. We improve on their algorithm in this paper. In previous research, we developed padding techniques for eliminating severe cache conflicts in stencils and linear algebra computations [20]. In this paper we improve our heuristic to preserve group reuse across outer loop iterations, as well as ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
No context found.
N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, 1997. 2.1.1
No context found.
Naraig Manjikian and Tarek Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
No context found.
N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, 1997. 2.1.1
No context found.
N. Manjikian and T. S. Abdelrahman. Fusion of Loops for Parallelism and Locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
....approaches consider basically unit execution time for each iteration and zero communication for each communication step (UET model). Previous attempts of scheduling uniform dependence loops include the free scheduling method introduced in [10], tiling transformation [9, 17, 20] and loop fusion [12, 5]. The problem of scheduling uniform dependence loops is a very special case of scheduling Directed Acyclic Graphs (DAGs). The general DAG scheduling problem is known to be NP complete (see Ullman [19]), so many researchers have tackled special cases of the above problem [8, 10] hoping to come up ....
N. Manjikian and Tarek S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, 1997.
....of partitions such that the loops within each partition can be fused, possibly enabled by loop shifting, and the fused loop remains parallel. Mainly because of different objective functions, his problem and ours yield completely different complexity. Manjikian and Abdelrahman present shift and peel [18]. They shift the loops in order to enable fusion. None of the works listed above address the issue of minimizing memory requirement for a collection of loops and their techniques are very different from ours. 7. CONCLUSION In this paper, we propose to enhance data locality via a memory reduction ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193-209, February 1997.
....remote memory on another processor node. Many compiler techniques exist for improving locality in both sequential and parallel scientific programs. Loop transformations (e.g. loop permutation, fusion, tiling) for sequential dense matrix codes with regular memory access patterns has proven useful [19, 27, 48, 49, 57, 55, 71, 76, 87, 88]. Data layout optimizations (e.g. transpose, padding) also help [2, 3, 13, 18, 39, 69, 70] even for irregular [1, 22, 58] and pointer based programs [8, 17]. Despite the major advances made in providing software support for improving locality for both sequential and parallel programs, more work ....
....but tiling for iterative PDE solvers is a relatively new area. One reason is that when loops are permuted to exploit temporal and spatial reuse [57, 87], group temporal and spatial reuse in 2D stencils can usually be obtained without tiling, though padding may be necessary to preserve group reuse [55, 70]. As we have already seen, this is no longer the case for 3D stencil codes. Tiling can be used to exploit reuse across outer time step loops [76], but this is not possible in multigrid codes because each time step contains sweeps over a sequence of grids of different sizes. We recently ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
....remote memory on another processor node. Many compiler techniques exist for improving locality in both sequential and parallel scientific programs. Loop transformations (e.g. loop permutation, fusion, tiling) for sequential dense matrix codes with regular memory access patterns has proven useful [16, 25, 42, 43, 52, 51, 63, 67, 76, 77]. Data layout optimizations (e.g. transpose, padding) also help [3, 5, 12, 15, 35, 61, 62] even for irregular programs [1, 20, 53]. Despite the major advances made in providing software support for improving locality for both sequential and parallel programs, more work remains. In the following ....
....but tiling for iterative PDE solvers is a relatively new area. One reason is that when loops are permuted to exploit temporal and spatial reuse [52, 76], group temporal and spatial reuse in 2D stencils can usually be obtained without tiling, though padding may be necessary to preserve group reuse [51, 62]. As we have already seen, this is no longer the case for 3D stencil codes. Tiling can be used to exploit reuse across outer time step loops [67], but this is not possible in multigrid codes because each time step contains sweeps over a sequence of grids of different sizes. We believe tiling can ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
No context found.
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8, 1997.
No context found.
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8, 1997.