S. Carr. Combining optimization for cache and instruction-level parallelism. In Proceedings of the 1996.
....possible optimizations may find a good sequence of transformations to parallelize a given program [16]. None of these works seeks to unify guidance for multiple levels of the memory hierarchy. Multi-level unification: Unroll and jam can guide locality and instruction-level parallelism in concert [6]. Loop fusion and distribution affect both parallelism and locality [21]. These two works do not directly address tiling or the multi-level nature of the interactions. For matrix multiply, [4] tiles an arbitrary number of memory levels, one level at a time. It finds the optimal execution time by ....
Steve Carr. Combining optimization for cache and instruction-level parallelism. In PACT '96, pages 238-247, 1996.
....Wolfe and Lam [WL91, p. 34] define reuse as a feature inherent to a loop: A data item is reused if the same data is used in multiple iterations in a loop nest. Thus reuse is a measure that is inherent in the computation. They define locality as utilized reuse. Several other researchers ([Car96]) follow this classification. Temam and McKinley [MT98, p. 5] exchange the meaning of the two terms. They say: "References with locality thus have the potential for reuse in the cache. Reuse is simply a hit in the cache that achieves locality." I.e., they define locality as a feature inherent to the ....
Steve Carr. Combining optimization for cache and instruction-level parallelism. In The
....after loop unrolling, we do not want data in the data cache that will be reused to be overwritten before the reuse, thus degrading performance. This can happen, e.g., when software pipelining is applied. Steve Carr describes an algorithm for loop scheduling to optimize ILP and cache behavior in [27]. While instruction scheduling can optimize cache behavior, analyzing cache behavior can enhance the instruction scheduler. Cache reuse information can be used to find a balance between the all-cache-hits and all-cache-misses assumptions that most schedulers (except the balanced list scheduler, see ....
Carr, S. Combining optimization for cache and instruction-level parallelism. In Proceedings of the
....with a prefetch distance insufficient to overlap their latency fully, or for unprefetched misses. Instruction overhead. Read miss clustering through unroll and jam can exploit scalar replacement, which replaces redundant memory references with register operations and reduces the instruction count [2, 5, 7]. If the redundant references tended to hit in the cache (as seen in previous work [22]), scalar replacement can reduce both the unnecessary prefetches resulting from these references and their address generation overhead. Legality limitations of clustering. Unlike clustering, software ....
....to maintain the bandwidth benefits of tiling while also improving prefetching effectiveness. Two works on unroll and jam have particular relevance. Carr has considered prefetches and cache misses while calculating the heuristics used when applying unroll and jam for scalar replacement or locality [5]. However, that work did not seek to improve prefetching, but instead assumed that prefetching was effective given enough hardware resources. Carr et al. have used unroll and jam to improve software pipelining, without considering cache misses [6]. That study would reduce floating point stalls in ....
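The scalar replacement mentioned in these excerpts (redundant memory references replaced with register operations) can be illustrated with a minimal sketch. The function names and the pairwise-sum loop are hypothetical and are not taken from any of the cited papers. Python has no registers, so the local variables `prev` and `nxt` only stand in for values a compiler could keep in registers.

```python
def pairwise_sum(a):
    """Baseline: every iteration reads both a[i] and a[i + 1] from the array."""
    out = [0] * max(len(a) - 1, 0)
    for i in range(len(a) - 1):
        out[i] = a[i] + a[i + 1]
    return out


def pairwise_sum_scalar_replaced(a):
    """Same result, but the value read as a[i + 1] is reused as a[i] next time."""
    out = [0] * max(len(a) - 1, 0)
    if len(a) < 2:
        return out
    prev = a[0]                # stand-in for a register holding a[i]
    for i in range(len(a) - 1):
        nxt = a[i + 1]         # the only array read in this iteration
        out[i] = prev + nxt
        prev = nxt             # carry the value into the next iteration
    return out
```

In this sketch the baseline issues two array reads per iteration and the rewritten loop issues one, which is the kind of instruction-count reduction the excerpts attribute to scalar replacement.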
S. Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proc. of the Conf. on Parallel Architectures and Compilation Techniques, pages 238--247, Oct. 1996.
....unrolls an outer loop and fuses (jams) the resulting inner loop copies into a single inner loop. Previous work has used unroll and jam for scalar replacement (replacing array memory operations with register accesses), better floating point pipelining, or cache locality [AC72, CCK88, CK94, Nic87, Car96]. Using unroll and jam for read miss clustering requires different heuristics, and may help even when the previously studied benefits are unavailable. We prefer to use unroll and jam instead of strip mine and interchange for two reasons. First, unroll and jam allows us to exploit additional ....
....increased parallelism among both prefetches and unprefetched misses. Instruction overhead. Read miss clustering through unroll and jam can exploit scalar replacement, by which redundant memory references are replaced with register operations and the total number of instructions is reduced [AC72, Car96, CK94]. If the redundant references tended to hit in the cache (as seen in Chapter 4), scalar replacement can reduce both the unnecessary prefetches resulting from these references and their address generation overhead. Other limitations. The memory parallelism provided by read miss clustering ....
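The transformation described in these excerpts (unroll an outer loop, then fuse the resulting inner loop copies into one inner loop) can be sketched as follows. The loop nest, the unroll factor of 2, and all names are assumptions chosen for illustration and do not come from the cited papers.

```python
def nest_original(a, b, c):
    """a[i] += b[j] * c[i][j] for all i, j; a is updated in place."""
    for i in range(len(a)):
        for j in range(len(b)):
            a[i] += b[j] * c[i][j]


def nest_unroll_and_jam(a, b, c):
    """Outer loop unrolled by 2, with the two inner loop copies jammed together."""
    n, m = len(a), len(b)
    i = 0
    while i + 1 < n:                     # main loop handles two rows at a time
        for j in range(m):
            bj = b[j]                    # one read of b[j] now serves both rows
            a[i] += bj * c[i][j]
            a[i + 1] += bj * c[i + 1][j]
        i += 2
    for r in range(i, n):                # epilogue for a leftover row when n is odd
        for j in range(m):
            a[r] += b[j] * c[r][j]
```

Each `a[i]` still accumulates its terms in the same `j` order, so both versions produce identical results. The jammed body shares the read of `b[j]` between two rows and exposes two independent multiply-adds per inner iteration.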
[Article contains additional citation context not shown here]
Steve Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proceedings of the IFIP WG 10.3 Working Conference on Parallel Architectures and Compilation Techniques, PACT '96, pages 238--247, October 1996.
....of transformations before any profiling takes place. Only the transformations with highest ranking are profiled. The transformation that gives rise to the shortest execution time is chosen. We restrict attention to two well known program transformations: loop tiling [5, 8] and unroll and jam [4]. Both transformations are targeted towards cache exploitation. Unroll and jam, moreover, duplicates the loop body to expose more instructions to the hardware that can be executed in parallel. These two transformations, therefore, are highly interdependent and their compound result gives rise to a ....
....rise to a highly irregular optimization space [3]. Since the dominant effect of the transformations is their effect on cache behavior, static cache models are the prime models of interest. Recently, there have been approaches where the compiler searches the transformation space using static models [4, 13]. These approaches, however, do not use profile information. Also, there have been several approaches to feedback directed optimization, in which run time information is exploited to alter a program [10]. We can distinguish between on-line and off-line approaches. On-line approaches optimize at ....
S. Carr. Combining optimization for cache and instruction level parallelism. In Proc. PACT, pages 238-247, 1996.
....exploitation by dividing loops in small tiles such that the working set of one tile can be kept in the cache. Loop unrolling duplicates the loop body, thereby exposing more instructions to the hardware that can be executed in parallel. Moreover, loop unrolling (and in particular unroll and jam [4] that is considered in the present paper) affects the memory reference pattern of the loop and thereby affects the locality of the loop too. These two transformations, therefore, are highly interdependent and the compound result of applying them in terms of execution time gives rise to a very ....
....6 Related Work Over the past years, many authors have considered limited search techniques for optimization purposes. In particular, for tiling and unrolling, Coleman and McKinley [7] and Lam, Rothberg and Wolf [12] employ a restricted search for tile sizes based on a simple cache model. Carr [4] computes unroll factors in order to minimize the difference in machine and loop balance using a compile time search based on a static model. In contrast to these approaches, the present approach uses actual execution times and moreover considers both loop tiling and unrolling at the same time. ....
S. Carr. Combining optimization for cache and instruction level parallelism. In Proc. PACT'96, pages 238--247, 1996.
....set of each tile fits in the cache thereby exploiting the available locality. It is also important to fully utilize the internal parallelism within modern processors which are capable of issuing several instructions per cycle [8]. Loop unrolling is an important transformation for this purpose [3], as it increases the size of the loop body, exposing more instructions for Instruction Level Parallelism (ILP). As we require effective utilization of the memory hierarchy and internal parallelism, we need to combine both of these transformations. In this paper, we address the problem of ....
....purposes. In particular, for tiling and unrolling, Coleman and McKinley [7] and Lam, Rothberg and Wolf [14] employ a restricted search for tile sizes based on a simple cache model. In [4] an improved tile size selection algorithm is presented that also uses a static searching technique. Carr [3] computes several unroll factors and chooses the best in order to minimize the difference in machine and loop balance. In contrast to these approaches, the present approach uses actual execution times and moreover considers both loop tiling and unrolling at the same time. Whaley and Dongarra ....
S. Carr. Combining optimization for cache and instruction level parallelism. In Proc. PACT'96, pages 238--247, 1996.
....Unimodular transformations provide a means to guide loop transformations for parallelism [42] and locality [41]. However, this work seeks to unify only improvement enabling transformations such as skewing, interchange, and reversal, and does not consider locality and parallelism in concert. Carr [8] studies criteria to guide locality and instruction level parallelism in concert via unroll and jam. Kennedy and McKinley [26] study the extent to which loop fusion and distribution affect both parallelism and locality. However, these latter two works do not directly address tiling or the nature ....
Steve Carr. Combining optimization for cache and instruction-level parallelism. Technical Report TR 95-06, Michigan Technological University, Department of Computer Science, 1995.
....2(a) This transformation unrolls an outer loop and fuses (jams) the resulting inner loop copies into a single inner loop. Previous work has used unroll and jam for scalar replacement (replacing array memory operations with register accesses), better floating point pipelining, or cache locality [3, 4, 5, 6, 10]. Using unroll and jam for read miss clustering requires different heuristics, and may help even when the previously studied benefits are unavailable. We prefer to use unroll and jam instead of strip mine and interchange for two reasons. First, unroll and jam allows us to exploit additional ....
S. Carr, "Combining Optimization for Cache and Instruction-Level Parallelism," in Proceedings of the IFIP WG 10.3 Working Conference on Parallel Architectures and Compilation Techniques, PACT '96, pp. 238--247, October 1996.
....2(a) This transformation unrolls an outer loop and fuses (jams) the resulting inner loop copies into a single inner loop. Previous work has used unroll and jam for scalar replacement (replacing array memory operations with register accesses), better floating point pipelining, or cache locality [1, 2, 3, 4, 14]. Using unroll and jam for read miss clustering requires different heuristics, and may help even when the previously studied benefits are unavailable. We prefer to use unroll and jam instead of strip mine and interchange for two reasons. First, unroll and jam allows us to exploit benefits from ....
S. Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proceedings of the IFIP WG 10.3 Working Conference on Parallel Architectures and Compilation Techniques, PACT '96, pages 238--247, October 1996.
....tries to give an analytical expression for this minimum. In [7, 21] we have studied the characteristics of optimization spaces in detail for a variety of benchmarks and platforms and showed that different [Footnote 1: The variation on loop unrolling that we consider in this paper is unroll and jam [1, 9, 8], whereby an outer loop is unrolled and the inner loops are fused. Epilogue code is not shown here for simplicity.] [Figure: original and transformed loop nests.] Original: DO I = 1,N DO J = 1,N DO K = 1,N A[I,J] = A[I,J] + B[I,K] * C[K,J]. Transformed: DO JJ = 1,N,TJ DO KK = 1,N,TK DO I = 1,N,U DO J = JJ,MIN(JJ+TJ-1,N) DO K = ....
....of Iterations 8 Related Work There are many papers dealing with tile size selection [12, 16, 23, 27, 30]. All these selection algorithms use static analysis and models to compute tile sizes, in contrast to the present approach that uses dynamic profiling information. Carr and Kennedy [9] and Carr [8] compute unroll and jam factors in order to minimise the difference in machine and loop balance. Carr computes how much benefit the unroll and jam of a loop has for a range of unroll factors based on static models and searches at compile time to decide which unroll factor has the most benefit. In ....
[Article contains additional citation context not shown here]
S. Carr. Combining optimization for cache and instruction level parallelism. In Proc. PACT'96, pages 238--247, 1996.
....2(a) This transformation unrolls an outer loop and fuses (jams) the resulting inner loop copies into a single inner loop. Previous work has used unroll and jam for scalar replacement (replacing array memory operations with register accesses), better floating point pipelining, or cache locality [1, 2, 3, 4, 11]. Using unroll and jam for read miss clustering requires different heuristics, and may help even when the previously studied benefits are unavailable. We prefer to use unroll and jam instead of strip mine and interchange for two reasons. First, unroll and jam allows us to exploit benefits from ....
S. Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proc. of the Conf. on Parallel Architectures and Compilation Techniques, 1996.
....7 Related Work Over the past years, many authors have considered limited search techniques for optimization purposes. In particular, for tiling and unrolling, Coleman and McKinley [7] and Lam, Rothberg and Wolf [16] employ a restricted search for tile sizes based on a simple cache model. Carr [4] computes unroll factors in order to minimize the difference in machine and loop balance. Carr computes how much benefit the unroll and jam of a loop has for a range of unroll factors based on static models and searches at compile time to decide which unroll factor has the most benefit. In contrast ....
S. Carr. Combining optimization for cache and instruction level parallelism. In Proc. PACT'96, pages 238-247, 1996.
....profiles to sharpen constant propagation [2]. Our work is unique in that it uses information at the instruction level, and integrates it into a scheduler. Previous work on using instruction level parallelism (ILP) to hide latencies for nonblocking caches has two major differences from this work [4, 6, 8, 10, 12]. First, previous work uses static locality analysis which works very well for regular array accesses. Secondly, these schedulers only differentiate between a hit or a miss. Since we use performance counters, we can improve the schedules of pointer based codes that compilers have difficulty ....
S. Carr. Combining optimization for cache and instruction-level parallelism. In The 1996 International Conference on Parallel Architectures and Compilation Techniques, Boston, MA, October 1996.
....2(a) This transformation unrolls an outer loop and fuses (jams) the resulting inner loop copies into a single inner loop. Previous work has used unroll and jam for scalar replacement (replacing array memory operations with register accesses), better floating point pipelining, or cache locality [1, 2, 3, 4, 11]. Using unroll and jam for read miss clustering requires different heuristics, and may help even when the previously studied benefits are unavailable. We prefer to use unroll and jam instead of strip mine and interchange for two reasons. First, unroll and jam allows us to exploit benefits from ....
S. Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proc. of the Conf. on Parallel Architectures and Compilation Techniques, 1996.
....and tiling with the linear loop and data transformations better. Even in a sophisticated commercial compiler like MIPSpro we have found that sometimes loop unrolling and tiling could not improve the performance over linear loop transformations. Therefore, the works such as the one done by Carr [4] for combining optimizations for cache and instruction-level parallelism are very important. 5 Related Work Significant work related to optimizing cache locality has been done by several research groups. We discuss the most related of this in three categories. Loop transformations Exploiting the ....
S. Carr. Combining optimization for cache and instruction-level parallelism. In Proc. the 1996 International Conference on Parallel Architectures and Compiler Techniques (PACT'96), Boston MA, Oct 1996.
....unreachable code elimination, and describe how to reason about the properties of the resulting framework. In particular, they discuss how the framework for describing optimizations can indicate whether combining the optimizations will be profitable. Other efforts include combining cache and ILP [9] and combining loop transformations [25]. There have been several efforts toward unifying transformations into a single mechanism, and applying search techniques to the transformation space. In particular, one framework for unifying loop transformations is based on unimodular matrix theory [41, ....
Steve Carr. Combining optimization for cache and instruction-level parallelism. In Parallel Architectures and Compilation Techniques (PACT), 1996.
....scheduling with II = 3 is not possible due to dependence constraints, we cannot improve the performance by reducing II further and other techniques should be devised. Unroll and jam [3] is a technique which may be used to increase the size of inner loop bodies, thus reducing the loop overhead. In [4], a quantitative approach to the application of this technique is described showing improvements in most cases when applied to the Perfect Benchmarks. Unrolling the j loop once in this example and fusing (or jamming) the resulting two inner loops allows the new inner loop body to be scheduled ....
S. Carr. Combining Optimizations for Cache and Instruction-Level Parallelism. Proceedings of PACT'96.
....possible optimizations may find a good sequence of transformations to parallelize a given program [15]. None of these works seeks to unify guidance for multiple levels of the memory hierarchy. Multi-level unification: Unroll and jam can guide locality and instruction-level parallelism in concert [4]. Loop fusion and distribution affect both parallelism and locality [20]. These two works do not directly address tiling or the multi-level nature of the interactions. Rather than use multi-level cost functions, [30] performs a pruned search on the space of possible combinations of minimizations ....
Steve Carr. Combining optimization for cache and instruction-level parallelism. In PACT '96, pages 238--247, 1996.
No context found.
S. Carr. Combining optimization for cache and instruction-level parallelism. In Proceedings of the 1996.
....of the inner loop. This can be accomplished either by hand or by a compiler. Carr and Kennedy first developed a technique of fully automatic unroll and jam in a compiler with the imprecise assumption of perfect cache performance [7]. Later Carr improved this technique by adding cache effects [8]. He predicted cache effects by using a data reuse model based on the data dependence graph (DDG). That data reuse model is complicated and imprecise. This thesis work overcomes these shortcomings by using a linear-algebra-based data reuse model, which has been developed by Wolf and Lam [14] to ....
....2.2 Unroll And Jam with A Data Reuse Model In order to improve the previous decision procedure which assumes a perfect cache model, Carr uses the data dependence graph to compute the potential cache misses under certain unroll amounts. This is expressed as a prefetch bandwidth requirement P L [8]. By adding the prefetch bandwidth requirement and the machine characteristics on prefetching, Carr improved the loop balance formulation. Since it takes into account cache effects, the improved loop balance formulation is more precise than the previous one. Given an architecture that has ....
[Article contains additional citation context not shown here]
Carr, S., Combining Optimization for Cache and Instruction-Level Parallelism. Technical Report TR 95-06, Department of Computer Science, Michigan Technological University, August 1995.
....resources effectively, but also with ensuring that the resulting code has a high degree of cache locality. One compiler transformation that is essential for a compiler to meet the above objectives is unroll and jam, or outer loop unrolling. Previous work either has used a dependence based model [7] to compute unroll amounts, significantly increasing the size of the dependence graph, or has applied a more brute force technique [16]. In this paper, we present an algorithm that uses a linear-algebra-based technique to compute unroll amounts. This technique results in an 84% reduction over ....
.... performance problems through compiler optimization is to match the ratio of memory operations to floating-point operations in a program loop (loop balance) to the optimum such ratio handled by a target machine (machine balance) with a transformation called unroll and jam (outer loop unrolling) [8, 7]. Unroll and jam has been shown to be effective at lowering the difference between loop balance and machine balance. Speedups on the order of 20 are possible on nested loops while speedups on the order of 2 are frequent [7]. Previous work with unroll and jam has used the dependence graph to compute ....
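The idea of matching loop balance to machine balance can be sketched with a toy model. Everything below is an assumption made for illustration and is not the formulation from the cited papers. The loop is a hypothetical nest `a[i] += b[j] * c[i][j]` whose outer loop is unrolled by `u`, the `a` values are assumed to be held in registers across the inner loop, and the register limit is a simple count of `u` accumulators plus one register for `b[j]`.

```python
from fractions import Fraction


def loop_balance(u):
    """Memory operations per floating-point operation in the jammed loop body."""
    mem_ops = u + 1      # u loads of c[i + k][j] plus one shared load of b[j]
    flops = 2 * u        # one multiply and one add per unrolled row
    return Fraction(mem_ops, flops)


def choose_unroll_factor(machine_balance, max_registers):
    """Pick the unroll factor whose loop balance is closest to machine balance."""
    candidates = range(1, max_registers)   # u accumulators + 1 register for b[j]
    if not candidates:
        raise ValueError("need at least two registers in this toy model")
    # min() keeps the first minimum, so ties go to the smallest unroll factor.
    return min(candidates, key=lambda u: abs(loop_balance(u) - machine_balance))
```

For example, with a hypothetical machine balance of 3/4, `choose_unroll_factor(Fraction(3, 4), 8)` selects `u = 2`, since `loop_balance(2)` is exactly 3/4 under these assumptions. This mirrors the compile-time search over a range of unroll factors that the excerpts describe, but with a much simpler cost model.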
[Article contains additional citation context not shown here]
S. Carr. Combining optimization for cache and instruction-level parallelism. In Proceedings of the 1996 Conference on Parallel Architectures and Compiler Techniques, pages 238--247, Boston, MA, October 1996.
No context found.
S. Carr. Combining optimization for cache and instruction-level parallelism. In Proc. the 1996.
No context found.
S. Carr, "Combining optimization for cache and instruction-level parallelism," in Proc. of PACT, (Boston, MA, US), pp. 238--247, IEEE, October 1996.