N. Manjikian and T. S. Abdelrahman, "Fusion of Loops for Parallelism and Locality", IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 2, pp. 193--209, Feb. 1997.
.... traditionally done by compilers using the technique of common subexpression elimination [4]. Chatterjee et al. consider the optimal alignment of arrays in evaluating array expressions on massively parallel machines [2, 3]. Much work has been done on improving locality and parallelism by loop fusion [8, 14, 16]. However, this paper considers a different use of loop fusion, which is to reduce array sizes and memory usage of automatically synthesized code containing nested loop structures. Traditional compiler research does not address this use of loop fusion because this problem does not arise with ....
N. Manjikian and T. S. Abdelrahman, Fusion of Loops for Parallelism and Locality, International Conference on Parallel Processing, pp. II:19--28, Oconomowoc, WI, August 1995.
....increase the efficiency of the cache. Traditionally, the issue of loop transformations has been addressed either with a similar, power-consumption-related goal [7] or with a different one, such as restructuring of a possibly sequential program to improve execution efficiency on parallel machines [9]. In [10], a technique that involves global data transformations and local loop transformations in order to minimize overhead on distributed shared machines is introduced. [Footnote 1: The formalism allows, as worst case, linear dependence of the iterators, $i_j = \sum_{k=0}^{j-1} \alpha_k i_k$, $j = 2, \dim(I)$.] ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In International Conference on Parallel Processing, Vol.2: Software, pages 19--28, Boca Raton, USA, Aug. 1995. CRC Press.
....can modify the access patterns to attain high levels of performance. The techniques that focused on access pattern modifications generally target loop nest structures where most of the execution time is spent. Along these lines, techniques such as loop interchange, distribution, fusion and tiling [21, 16, 22, 14, 11, 12, 4, 2, 15] have found their way into commercial compiler products. A common characteristic of these approaches is that they change the execution order of loop iterations by applying some kind ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In Proc. the 24th International Conference on Parallel Processing (ICPP'95), Oconomowoc, Wisconsin, August 1995.
.... traditionally done by compilers using the technique of common subexpression elimination [4]. Chatterjee et al. consider the optimal alignment of arrays in evaluating array expressions on massively parallel machines [2, 3]. Much work has been done on improving locality and parallelism by loop fusion [8, 14, 16]. However, this paper considers a different use of loop fusion, which is to reduce array sizes and memory usage of automatically synthesized code containing nested loop structures. Traditional compiler research does not address this use of loop fusion because this problem does not arise with ....
N. Manjikian and T. S. Abdelrahman, Fusion of Loops for Parallelism and Locality, International Conference on Parallel Processing, pp. II:19--28, Oconomowoc, WI, August 1995.
....the TLB misses, NSTEP should be large and SLOPE should be small, subject to Properties 1 and 2. Property 1 can be preserved by imposing the working set constraint, i.e. the amount of data accessed within a single tile should not exceed the cache size and the TLB size. Furthermore, array padding [7] is performed such that the accessed data within a tile are evenly allocated within the cache to minimize the set conflicts. In order to accommodate inaccuracy in array padding and the fact that instructions and data compete for the unified (L2) cache, we follow the industrial practice of ....
....set associativity, this may not be true because, within a tile, the references from different arrays may map to the same cache locations, which may cause one portion of the cache to be underutilized and the other portion oversubscribed. We expect that data transformation techniques such as padding [7, 11] can relieve this problem in reality. 5 For directly mapped caches, if the TLB is fully utilized, the tile size should also be Gamma SLOPE, otherwise, it is simply . For all other set associative caches, the tile size should be Gamma SLOPE. [Figure 6: Orientation ....]
[Article contains additional citation context not shown here]
Naraig Manjikian and Tarek Abdelrahman. Fusion of loops for parallelism and locality. In IEEE Transactions on Parallel and Distributed Systems, volume 8, Feb 1997.
....fusion-preventing dependencies case and no full parallelism is guaranteed. Al-Mouhamed's [3] method of loop fusion is based on vertical and horizontal fusion, with fusion not performed if fusion-preventing dependencies exist or if the fused loop prevents parallelization. Manjikian and Abdelrahman [17, 18] suggest a shift-and-peel loop transformation to fuse loops and allow parallel execution. The shifting part of the transformation may fuse loops in the presence of fusion-preventing dependencies. However, when the number of peeled iterations exceeds the number of iterations per processor, this ....
....of the multidimensional retiming technique on VLSI systems also has been discussed in [24] and shown to be less complex than other techniques. In this paper, we focus on the loop fusion problem, concentrating on a comparison of the proposed solution with similar techniques recently published [3, 7, 13, 17, 18]. To the authors' knowledge, most work on loop fusion has not addressed problems characterized by nested-loop (multi-level) fusion-preventing carried dependencies. In this paper, these problems are solved using the idea of multi-dimensional retiming [15, 20, 25, 26]. By using the multi-dimensional ....
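The shift-and-peel idea mentioned in the excerpt above can be illustrated with a minimal one-dimensional sketch. It is not the authors' algorithm: the arrays `a`, `b`, `c`, the loop bodies, the block partitioning, and the `block_order` parameter are all assumptions made for illustration. The second loop reads `b[j - 1]` and `b[j + 1]`, so fusing it directly with the first loop would read `b[j + 1]` too early; shifting the second statement by one iteration removes that problem, and peeling the first two iterations of that statement from each later block removes the dependence on the previous block.

```python
def reference(a):
    """Original, unfused loops (hypothetical bodies chosen for illustration)."""
    n = len(a)
    b = [2 * x for x in a]              # loop 1: produce b
    c = [0] * n
    for j in range(1, n - 1):           # loop 2: consume b[j-1] and b[j+1]
        c[j] = b[j - 1] + b[j + 1]
    return b, c


def shift_and_peel(a, num_blocks, block_order=None):
    """Fused loop with the second statement shifted by one iteration.

    Iterations are split into contiguous blocks, standing in for processors.
    The first two iterations of the shifted statement in every block except
    the first need b values owned by the previous block, so they are peeled
    and run only after all blocks have finished.
    """
    n = len(a)
    b = [0] * n
    c = [0] * n
    size = -(-n // num_blocks)          # ceiling division
    blocks = [(lo, min(lo + size, n)) for lo in range(0, n, size)]
    order = block_order if block_order is not None else range(len(blocks))

    # Block phase: blocks may run in any order (simulating parallel execution).
    for idx in order:
        lo, hi = blocks[idx]
        for j in range(lo, hi):
            b[j] = 2 * a[j]
            # Shifted statement: computes c[j-1]; run it here only when
            # b[j-2] was produced inside this same block.
            if j >= 2 and j - 2 >= lo:
                c[j - 1] = b[j - 2] + b[j]

    # Peeled phase: iterations removed from the start of each later block.
    for lo, hi in blocks:
        if lo == 0:
            continue
        for j in range(lo, min(lo + 2, hi)):
            if j >= 2:
                c[j - 1] = b[j - 2] + b[j]
    return b, c
```

Running the blocks in a different order, for example `shift_and_peel(a, 3, block_order=[2, 0, 1])`, gives the same result as `reference(a)`, which is the property that would allow the blocks to run concurrently in this sketch.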
[Article contains additional citation context not shown here]
N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. In Proc. of the 24th International Conference on Parallel Processing, volume II, pages 19--28, 1995.
....they require all loops to be conformable (having exactly the same header) and because they do not allow reordering of loop nests. Also, they do not try to optimize both for uniprocessors and multiprocessors. Manjikian and Abdelrahman consider fusion for parallelism and locality but not together [MA95]. Moreover, they do not get an optimal solution. 9 Future Work Our fusion algorithm can be extended to handle the general problem of reordering statements within a program for various optimizations. Instruction scheduling is an important example. However, as the size of the nodes in the graph grows, ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In Proceedings of the 24th International Conference on Parallel Processing, pages II:19--28, Oconomowoc, WI, August 1995.
....and Sarkar give an integer programming solution for the weighted fusion problem [17], which produces optimal solutions but does not model parallelism or register constraints, nor is it parameterizable. Manjikian and Abdelrahman consider fusion for parallelism and locality but not together [20, 21]. Moreover, they do not get an optimal solution. They utilize loop shifting and peeling to parallelize loop nests. 7. FUTURE WORK This algorithm does not fully capture cache effects. It is not clear how fusion interacts with other optimizations. There are several loop transformations like ....
Manjikian, N. and Abdelrahman, T. (1995 Aug.) Fusion of Loops for Parallelism and Locality. In Proceedings of the 24th International Conference on Parallel Processing, II:19--28. Oconomowoc, WI.
....bounds; and there must be no data dependence between statements belonging to different loops [225, pp. 89-94]. Originally, loop fusion was suggested as a means of reducing loop overhead (as early as the mid-1960s [237]); however, the main benefit for today's computers is to improve data reuse [36, 125, 151]. 2.2.2.3 Dependence Breaking Techniques Scalar expansion is a transformation which enables the parallelisation of a loop whose loop-carried dependences are only data anti-dependences or output dependences [242, pp. 225-229]. By definition, these dependences imply the existence of variables ....
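As a small illustration of the scalar expansion described in the excerpt above, the following sketch replaces a scalar temporary with one array element per iteration. The function names, the loop body, and the `order` parameter are hypothetical and chosen only to show the idea. In the first version every iteration overwrites the same scalar `t`, which creates output and anti-dependences between iterations; in the expanded version each iteration owns `t[i]`, so iterations no longer interfere.

```python
def square_sums_scalar(x, y):
    """Original loop: the scalar t is rewritten on every iteration."""
    out = [0] * len(x)
    t = 0
    for i in range(len(x)):
        t = x[i] + y[i]         # output dependence on t across iterations
        out[i] = t * t          # anti-dependence with the next write to t
    return out


def square_sums_expanded(x, y, order=None):
    """After scalar expansion: t becomes an array with one slot per iteration.

    The optional order argument (an assumption for this sketch) lets the
    iterations run in any order, simulating parallel execution.
    """
    n = len(x)
    t = [0] * n
    out = [0] * n
    for i in (order if order is not None else range(n)):
        t[i] = x[i] + y[i]
        out[i] = t[i] * t[i]
    return out
```

Because no two iterations touch the same element of `t`, the expanded loop produces the same result for any iteration order, at the cost of extra storage for the expanded temporary.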
N. Manjikian, T. S. Abdelrahman, "Fusion of Loops for Parallelism and Locality", Technical Report CSRI-315, Computer Systems Research Institute, University of Toronto, Feb. 1995; a shorter version is also available in Proceedings of the International Conference on Parallel Processing (Aug. 1995).
....[2, 11]. Kennedy and McKinley [4] perform loop fusion to combine a collection of loops and use loop distribution to improve parallelism. However, they do not address the case when fusion-preventing dependences exist and when the iteration spaces of candidate loops are not identical. Manjikian and Abdelrahman [6] suggest a shift-and-peel loop transformation to fuse loops and allow parallel execution. The shifting part of the transformation may fuse loops in the presence of fusion-preventing dependences. However, when the number of peeled iterations exceeds the number of iterations per processor, this ....
....candidate loops, if the calculation of S2 depends on the result from S1, but after loop fusion the execution of S2 becomes earlier than S1, this kind of loop fusion is illegal [6, 11]. Figure 4(b) shows the code after an illegal loop fusion. In this example, c[i][j] depends on b[i][j+2], but b[i][j+2] has not been calculated yet. Figure 5 shows the iteration space after the illegal fusion: iteration (0,0) depends on the results from iterations (0,1) and (0,2); however .... [Fig. 5: iteration space after illegal loop fusion]
N. Manjikian and T. S. Abdelrahman, "Fusion of Loops for Parallelism and Locality", in 1995 International Conference on Parallel Processing, 1995, pp. II-19-II-28.
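The illegal fusion described in the excerpt above can be reproduced with a short sketch. The array names `b` and `c` follow the excerpt; the input array `a`, the loop bodies, and the reading of the garbled subscript as `j + 2` are assumptions for illustration. The unfused version fills all of `b` before computing `c`, while the naively fused version reads `b[i][j + 2]` two iterations before it is written and therefore sees stale values.

```python
def unfused(a):
    """Two separate loop nests: b is fully computed before c reads it."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    c = [[0] * (m - 2) for _ in range(n)]
    for i in range(n):
        for j in range(m):
            b[i][j] = 2 * a[i][j]           # statement S1
    for i in range(n):
        for j in range(m - 2):
            c[i][j] = b[i][j + 2] + 1       # statement S2
    return c


def naively_fused(a):
    """Illegal fusion: S2 at iteration j reads b[i][j + 2] before S1 writes it."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    c = [[0] * (m - 2) for _ in range(n)]
    for i in range(n):
        for j in range(m):
            b[i][j] = 2 * a[i][j]           # S1
            if j < m - 2:
                c[i][j] = b[i][j + 2] + 1   # S2 sees the initial 0, not S1's value
    return c
```

For a nonzero input the two functions return different results, which is the observable symptom of the violated dependence.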
No context found.
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In Proc. 1995.
....[Figure 16: Cache misses for tiled Jacobi; panel (b): Avg. miss latency for 16 processors] Fusion exploits the reuse between the two inner loop nests in addition to enabling tiling. Dependences between the inner two loop nests require the application of shift-and-peel [9] to enable legal fusion. Once a single loop nest is obtained with fusion, loop skewing is required just as for the SOR loop nest to obtain a fully permutable loop nest. The application of shift-and-peel to enable fusion results in dependences which require skewing the inner loops by two iterations ....
....are normalized with respect to time obtained with parallel execution of the original code to facilitate comparison. The normalized execution time for fusion of the inner loops without tiling is also shown, since parallel execution of the fused loops is enabled by the shift-and-peel transformation [9]. Once again, the results for tiling are similar to those obtained for SOR. The dramatic increase in execution time for dynamic self-scheduling correlates with the .... [Figure residue: normalized execution time for Orig, Fused, Blk, Cyc, Dyn; untiled vs. tiled]
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In Proc. 1995 Intl. Conf. on Parallel Processing, pages II19--II28, August 1995.
....loop fusion is straightforward. Candidate loops for fusion are identified, and compatibility in the access patterns across the loop nests is enforced with appropriate code and data transformations. The loop nests are then fused, with any necessary adjustments to ensure that the fusion is legal [11]. Finally, the arrays referenced in the fused loop nest are identified, and the algorithm presented in Figure 4 is used to derive the memory layout for those arrays. In some cases, it is not possible to fuse all candidate loop nests which use the same set of arrays into a single loop nest due to ....
....the code and data for the loop nest sequences. To enable fusion and tiling, subroutine inlining and loop interchange are used where necessary to collect all loop nests together and ensure compatibility. Fusion-preventing dependences are overcome using the technique of shifting iteration spaces [11]. Legal tiling following fusion is enabled by loop skewing [18]. All code is instrumented to measure execution time and cache misses. The array sizes are large, and hence, the entire data set cannot be contained in the cache. Cache-partitioned memory layouts are obtained using the techniques ....
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In Proc. 1995 Intl. Conf. on Parallel Processing, August 1995. To appear.
No context found.
N. Manjikian and T. S. Abdelrahman, "Fusion of Loops for Parallelism and Locality", IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 2, pp. 193--209, Feb. 1997.
No context found.
Naraig Manjikian and Tarek S. Abdelrahman. Fusion of loops for parallelism and locality. Proc. of the 24th International Conference on Parallel Processing, Aug. 1995.
No context found.
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. In Proc. International Conference on Parallel Processing, pp. II:19--28, Oconomowoc, WI, August 1995.