N. Warter, G. Haab, and J. Bockhaus. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pages 170--179, Portland, OR, December 1-4 1992.
....(p18) fnmpy.d.s0 f39=f34,f37 nop.i 0 ; mfi (p19) ldfd f40=[r34] (p18) fmpy.d.s0 f41=f33,f37 nop.i 0 ; mfi nop.m 0 (p17) fcmp.gt.unc.s0 p21,p20=f38,f6 Figure 2: Instructions with Predicates. An expansion of Spiral Graph is proposed for Enhanced Modulo Scheduling [7], which considers conditional branches in software pipelining. Although there is no description of Spiral Graph for architectures where predication is available [8], its expansion method focuses on the connectivity between the kernels generated by Enhanced Modulo Scheduling. BACKGROUND Predicated Spiral ....
Nancy J. Warter, Grant E. Haab and John W. Bockhaus: "Enhanced Modulo Scheduling for Loops with Conditional Branches," Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pp. 170--179, 1992.
.... software pipelined schedule for a loop [1]. Modulo scheduling is a class of software pipelining algorithms that was proposed at the beginning of the last decade [23] and has been incorporated into some product compilers (e.g. [21, 7]). Besides, many research papers have recently appeared on this topic [11, 14, 25, 13, 28, 12, 26, 22, 29, 17]. The modulo scheduling framework relies on generating a schedule for an iteration of the loop such that when this same schedule is repeated at regular intervals, no dependence is violated and no resource usage conflict arises. The interval between the successive iterations is termed Initiation ....
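The validity condition described in the excerpt above (one iteration's schedule, repeated every II cycles, must violate no dependence and cause no resource conflict) can be sketched as a small check. This is a minimal illustration and not code from any of the cited papers: the function name `is_valid_modulo_schedule`, the operation names, the latencies, and the assumption that each operation occupies its resource for exactly one cycle are all hypothetical.

```python
from collections import Counter

def is_valid_modulo_schedule(start, resource, capacity, deps, ii):
    """Check a one-iteration schedule that is re-issued every `ii` cycles.

    start:    op -> start cycle within one iteration
    resource: op -> resource name (assumed busy for one cycle only)
    capacity: resource name -> number of units available per cycle
    deps:     list of (src, dst, latency, iteration_distance)
    """
    # Resource check: operations whose start times are congruent modulo ii
    # land in the same cycle of the steady state, so they compete for units.
    usage = Counter((resource[op], t % ii) for op, t in start.items())
    if any(n > capacity[res] for (res, _slot), n in usage.items()):
        return False
    # Dependence check: the consumer, `dist` iterations later, must not
    # start before the producer's start time plus its latency.
    return all(start[dst] + dist * ii >= start[src] + lat
               for src, dst, lat, dist in deps)

# Hypothetical three-operation loop body sharing one memory port.
start = {"load": 0, "mul": 2, "store": 5}
resource = {"load": "mem", "mul": "alu", "store": "mem"}
capacity = {"mem": 1, "alu": 1}
deps = [("load", "mul", 2, 0), ("mul", "store", 3, 0)]

for ii in (1, 2, 5):
    print(ii, is_valid_modulo_schedule(start, resource, capacity, deps, ii))
```

In this made-up example the schedule is accepted for II = 2 but rejected for II = 1 and II = 5, because in both of those cases `load` and `store` fall into the same slot modulo II and need the single memory port at once.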
N.J. Warter, G.E. Haab, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proc. International Symposium on Microarchitecture, pages 170--179, December 1992.
....is usually assumed that memory accesses hit in the cache. Hence a cache miss, which has a long latency, stalls the processor, even if there are other instructions that could be executed. Irregular control flow due to (non-looping) conditional branches also poses difficulties. While solutions exist [36, 32], they either rely on special-purpose hardware, or are prone to code explosion. In our approach, called Explicit Dynamic Scheduling (EDS), instruction scheduling is performed in hardware at run time. This allows the instruction schedule to adapt to situations which are difficult to predict at ....
....resulting in a smaller W max for a fixed register file size, and many dynamic scheduling architectures limit the number of look-ahead instructions in the dynamic instruction stream [1], not the number of overlapped loop iterations. For the static scheduling experiments, modulo scheduling [30, 36] was used to generate a static schedule. Initially, a static schedule was generated using the cache hit latency cache for the load 6 The simulations were driven by operation level dependence graphs, not generated object where register allocation for a particular machine has already been ....
Nancy J. Warter, Grant E. Haab, Krishna Subramanian, and John W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pages 170--179, December 1992.
....exploiting instruction level parallelism and a significant body of research has sought more effective scheduling algorithms. Several new directions have been explored: schedulers may not schedule operations in cycle order, focusing initially on operations along critical paths [27, 30, 43, 49, 77, 95], they may backtrack to reverse poor scheduling decisions [27, 49, 59, 77, 83, 94], and they may hide long latencies by speculating operations across branches and basic blocks [12, 20, 27, 30, 59, 67]. High performance compilers have also used precisely detailed machine models ....
....to be used by one iteration no more than once within each set of times that are congruent modulo II. The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [55] or IF and Reverse IF conversion [95]. Modulo scheduling has also been extended to a large variety of loops with early exits, such as while loops [87, 89]. Furthermore, the code expansion due to modulo scheduling can be eliminated by using special hardware, e.g. support for rotating register files and predicated execution [82] ....
N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....into data dependences so that branch constructs have been eliminated. As a result, the converted loop body contains only a single iteration path, which is amenable to local software pipelining 8 . This method has been employed to handle loops with conditional branches in [DHB89] and [WHSB92]. [DHB89] has a special hardware feature, i.e. predicated execution [RYYT89, PS91], which supports the execution of predicated operations by providing boolean registers for their predicated operands. However, there is no such hardware support in [WHSB92]. Thereby, after local software pipelining ....
....with conditional branches in [DHB89] and [WHSB92]. [DHB89] has a special hardware feature, i.e. predicated execution [RYYT89, PS91], which supports the execution of predicated operations by providing boolean registers for their predicated operands. However, there is no such hardware support in [WHSB92]. Thereby, after local software pipelining has been applied to the if-converted loop body, Reverse if-conversion [WMHR93] has to convert predicated operations back to explicit conditional structure. Another strategy proposed in [Lam88], [SW91] and [WEJS94] is to regard the entire ....
N.J. Warter, G.E. Haab, K. Subramanian, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....construction of an overlapping schedule since there are multiple possible execution paths. To address this issue, one dimension is to develop techniques based on the modulo scheduling approach, such as Predicated Modulo Scheduling [DHB89], Hierarchical Reduction [Lam88], Enhanced Modulo Scheduling [WHSB92], GPMB [TCZ 93], GURPR [SW91] and GDESP [WEJS94], etc. Another dimension is to develop multiple II software pipelining techniques for loops containing conditional branches, such as Enhanced Pipeline Percolation Scheduling [Ebc87, EN89], Modulo Scheduling with Multiple IIs [WPP95], GURPR [SDWX87] ....
....scheduling technique addresses this problem through some different approaches developed over the past years. One technique is to convert loops with conditional branches into straight line structure and then perform modulo scheduling. Predicated Modulo Scheduling [DHB89], Enhanced Modulo Scheduling [WHSB92] and GPMB [TCZ 93] fall in this category. The other three techniques, Hierarchical Reduction [Lam88], GURPR [SW91] and GDESP [WEJS94], treat the conditional branch in a loop as an integral part so that modulo scheduling is applicable to this loop. The detail of each technique is discussed in turn ....
N.J. Warter, G.E. Haab, K. Subramanian, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....branches lead to multiple iteration paths, each of which has its own set of operations to execute. One strategy is to convert loops with conditional branches into straight line structures and then perform modulo scheduling, such as guarded modulo scheduling [DHB89], enhanced modulo scheduling [WHSB92] and GPMB [TCZ 93]. Another strategy is to simply treat a branch construct in a loop as an integral part so that modulo scheduling is applicable to the loop. Hierarchical reduction [Lam88], GURPR [SW91] and GDESP [WEJS94] fall in this category. Another alternative is to expose multiple II ....
....to find a pipeline schedule that can support all possible combinations of overlapping iteration paths. One strategy is to convert loops with conditional branches into straight line structures and then perform modulo scheduling, such as guarded modulo scheduling [DHB89], enhanced modulo scheduling [WHSB92] and GPMB [TCZ 93]. Another strategy is to simply treat a branch construct in a loop as an integral part so that modulo scheduling is applicable to the loop. Hierarchical reduction [Lam88], GURPR [SW91] and GDESP [WEJS94] fall into this category. Another alternative is to expose multiple II ....
N.J. Warter, G.E. Haab, K. Subramanian, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....II. We will search for the solution in the context of software pipelining with variable II, meaning that II may differ from one execution path to another. 1.2 Overview of the Related Work. Techniques that schedule loops with conditions can produce single or multiple II [2, 6, 7, 9, 10, 11, 12, 13]. Dealing with multiple II is more complex and many techniques produce a single fixed II [10, 11, 12]. Although the essentiality of the notion of paths in loops with conditions has long been understood, there have been no attempts to formally define this concept in a way that can be used ....
....that II may differ from one execution path to another. 1.2 Overview of the Related Work. Techniques that schedule loops with conditions can produce single or multiple II [2, 6, 7, 9, 10, 11, 12, 13]. Dealing with multiple II is more complex and many techniques produce a single fixed II [10, 11, 12]. Although the essentiality of the notion of paths in loops with conditions has long been understood, there have been no attempts to formally define this concept in a way that can be used in scheduling. Moreover, most of the techniques use the term path only to refer to one path of control in ....
Warter N. J., Bockhaus J. W., Haab G. E., Subramanian K., "Enhanced Modulo Scheduling for Loops with Conditional Branches," Proc. 25th Annl Intl Symp. on Microarchitecture (MICRO-25), 1992
.... for a loop [1]. Modulo scheduling is a class of software pipelining algorithms that was proposed at the beginning of the last decade [24] and has been incorporated into some product compilers (e.g. [22, 7]). Besides, many research papers have recently appeared on this topic [11, 14, 26, 13, 29, 12, 27, 23, 30, 18]. The modulo scheduling framework relies on generating a schedule for an iteration of the loop such that when this same schedule is repeated at regular intervals, no dependence is violated and no resource usage conflict arises. The interval between the successive ....
N.J. Warter, G.E. Haab, and J.W. Bockhaus, "Enhanced Modulo Scheduling for Loops with Conditional Branches," Proc. 25th Int'l Symp. Microarchitecture, pp. 170-179, Dec. 1992.
....remain clustered as in the previous schedule using a smaller II. Inefficiencies in the code are also introduced by scheduling strongly connected components separately. The problems of hierarchical scheduling (originally proposed by [61]) are addressed in the Enhanced Modulo Scheduling algorithm [57]. 2.2 Path Algebra. Path algebra is an attempt to formulate the software pipelining problem in rigorous mathematical terms [62]. In Section 1.6.3, path algebra was used to determine a viable II using the matrix M. This same matrix can also be used to determine a modulo schedule for software ....
....advantages of other techniques discussed in this section, but represents an improvement of known defects. It is an excellent technique that has been implemented in commercial compilers. Many researchers have embraced modulo scheduling for architectures with hardware support for modulo scheduling [15, 28, 34, 44, 43, 42, 45, 53, 57] and have modified the resulting code to work on architectures without hardware support [58]. The Cydra 5 work is described in [15, 16]. We use the term Predicated Modulo Scheduling to represent this general category of algorithms. In all but [28], the precise method for scheduling operations is ....
N.J. Warter, G.E. Haab, and J.W. Bockhaus. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pages 170--179, Portland, OR, December 1-4 1992. IEEE Computer Society Press.
....to consider the pipelining of iterations between different paths as well as within the same path, while fully supporting speculative execution of operations ahead of branches. A simple solution is to force all the paths to have the same, fixed II by eliminating control flow within the loop [2, 3]. This approach, in effect, forces the schedule to take the worst case II, limiting the overall performance. There is a more advanced approach that achieves a variable II by overlapping each individual path tightly with itself and by merging them together [4, 5]. Although it is simple to generate ....
....smallest II. To deal with multi-path loops, modulo scheduling is extended in two ways. Hierarchical reduction [3] pre-schedules the entire if-then-else structure and replaces this schedule by a single pseudo superoperation with a complex resource requirement. Modulo scheduling with if-conversion [2] removes the conditional branches by predication such that an operation below a branch is guarded by a predicate and it commits only when the predicate is true. Since there is no control flow after the transformation, straight line modulo scheduling is applicable. Consequently, the pipelined ....
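The if-conversion idea in the excerpt above (each operation below a branch is guarded by a predicate and commits only when that predicate is true) can be illustrated with a toy interpreter. This is only a sketch under stated assumptions: the loop body, the register names `p`, `q`, `y`, `z`, and the tuple encoding of predicated operations are invented for illustration and do not come from the cited papers.

```python
def branchy(x):
    """Original loop body with an explicit if-then-else."""
    if x > 0:
        y = x * 2
    else:
        y = -x
    return y + 1

def if_converted(x):
    """Same body as a straight-line list of guarded operations."""
    regs = {"x": x}
    # Each entry is (guard, destination, function); a guard of None means
    # the operation is unconditional. The names are hypothetical.
    ops = [
        (None, "p", lambda r: r["x"] > 0),   # predicate for the then-path
        (None, "q", lambda r: not r["p"]),   # complementary predicate
        ("p",  "y", lambda r: r["x"] * 2),   # then-path operation
        ("q",  "y", lambda r: -r["x"]),      # else-path operation
        (None, "z", lambda r: r["y"] + 1),   # join point
    ]
    for guard, dest, fn in ops:
        # Models predicated execution: every operation is issued, but its
        # result is written back only when its guard holds.
        if guard is None or regs[guard]:
            regs[dest] = fn(regs)
    return regs["z"]

for x in (-3, 0, 4):
    print(x, branchy(x), if_converted(x))
```

Because the converted body is a single straight-line sequence, a scheduler for branch-free loops can be applied to it; the price is that operations from both paths occupy issue slots in every iteration.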
Nancy J. Warter, Grant E. Haab, and John W. Bockhaus. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), Dec. 1992.
....#17 st r4, (r3) ld r2, (r1) sub r3, r3, #4 beqz r3, L1 Epilogue mul r4, r2, #17 st r4, (r3) sub r3, r3, #4 st r4, (r3) Figure 2: Software pipeline with II = 2, R press = 4. A drawback of modulo scheduling is that it can only handle single basic block loops. To overcome this problem, if-conversion [12] can be used when predicated execution is supported by the processor. 3.2 Register Pressure. As already discussed in the introduction, software pipelining increases the register pressure of a loop: the life ranges of variables defined in an iteration can overlap with life ranges defined in ....
N.J. Warter, G.E. Haab, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170 -- 179, Portland, Oregon, December 1992.
....initiating execution of successive loop iterations. Once that minimum initiation interval is determined, instruction scheduling attempts to match that minimum schedule while respecting resource and dependence constraints. Lam's hierarchical reduction is a modulo scheduling method, as is Warter's [17, 18] enhanced modulo scheduling, which uses IF conversion to produce a single superblock to represent a loop. Rau [14] provides a detailed discussion of an implementation of modulo scheduling. 2.1 Modulo Scheduling. Our software pipelining implementation is based upon Iterative Modulo Scheduling and ....
Warter, N., Haab, G., and Bockhaus, J. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25) (Portland, OR, December 1-4 1992), pp. 170--179.
....be extended to and, in fact, are more appropriate for loops with conditionals. For example, under trace scheduling or superblocks [5, 11] it may be more beneficial to use different versions of instruction classes for the frequently taken and not taken paths of the trace. Techniques discussed in [23, 24] can be incorporated in our software pipelining method to handle conditionals. The details of these extensions are beyond the scope of this paper and we plan to investigate them in future. 7 Experimental Results 1. How useful is Theorem 4.2 in obtaining latency sequences that result in Max Init ....
....place an operation closer to either its Estart (earliest start) or Lstart (latest start) time. Lastly, as discussed in Section 6, MS pipeline theory and the proposed software pipelining method are applicable and appropriate for loops with conditionals. They can make use of the methods discussed in [23, 24] to software pipeline loops having conditionals. Our work complements the FSA based methods [16, 18, 1] in that we focus on software pipelining while their methods are applicable to general instruction scheduling. Our MSstate diagram considers all possible initiation sequences and can choose the ....
N. J. Warter, G. E. Haab, J. W. Bockhaus, and K. Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, pages 170--179, Portland, OR, Dec. 1--4, 1992.
....prediction accuracy [9]. Also, in VLIW architectures, the code size could expand exponentially since a single threaded code needs to include all possible combinations of branch outcomes. This problem is especially serious when a compiler attempts to pipeline a loop with many conditional branches [21]. In superscalar architectures, the processor needs to perform run time dependence checking for both register and memory accesses. The hardware overhead for such dependence checking is very high and can grow quadratically as the size of the instruction window increases [9]. In VLIW architectures, ....
Nancy J. Warter, Grant E. Haab, John W. Bockhaus, and Krishna Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1--4, 1992.
....the exact achievable lower bound of II is unknown. However, scheduling while unrolling can naturally form a software pipeline with a non integer II. The handling of branches is undoubtedly the greatest weakness of modulo scheduling. Though Hierarchical Reduction[25] and Enhanced Modulo Scheduling[44] can deal with branches without special hardware support, the amount of code expansion involved can be exponential. The major strength of modulo scheduling over scheduling while unrolling techniques is that modulo scheduling does not need to determine the degree of unrolling or search for a ....
....iterations. The following discussion of the recurrence constraint considers data dependences only for a straight line loop body. For a description of how control constructs can be handled in software pipelining and how control dependences are converted to data dependences, the reader can refer to [44, 45]. Recurrence Initiation Interval: The recurrence initiation interval (RecII) of a loop is defined as the lowest II that satisfies all recurrences of the loop. In order to describe the recurrences of a loop, we need to introduce an important program representation form, namely the ....
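The definition of RecII quoted above (the lowest II that satisfies all recurrences) can be made concrete with a short sketch. It relies on the standard observation that an II satisfies a recurrence when the sum of `latency - II * distance` around the dependence cycle is not positive, and it tests this with a Bellman-Ford style longest-path relaxation. The function names, the edge encoding, and the assumption that every dependence cycle has a total iteration distance of at least one are illustrative choices, not details taken from the cited work.

```python
def has_positive_cycle(nodes, edges, ii):
    """True if some dependence cycle has positive total (lat - ii*dist)."""
    longest = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for src, dst, lat, dist in edges:
            weight = lat - ii * dist
            if longest[src] + weight > longest[dst]:
                longest[dst] = longest[src] + weight
                changed = True
        if not changed:
            return False   # relaxation converged: no positive cycle
    return True            # still improving after |V| passes

def rec_ii(nodes, edges):
    """Smallest II >= 1 for which every recurrence is satisfied."""
    # With II equal to the sum of all latencies, any cycle that carries
    # at least one iteration of distance has non-positive weight.
    upper = max(1, sum(lat for _s, _d, lat, _dist in edges))
    for ii in range(1, upper + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("a dependence cycle has zero iteration distance")

# Hypothetical recurrence a -> b -> a with total latency 5.
nodes = ["a", "b"]
print(rec_ii(nodes, [("a", "b", 3, 0), ("b", "a", 2, 1)]))  # distance 1
print(rec_ii(nodes, [("a", "b", 3, 0), ("b", "a", 2, 2)]))  # distance 2
```

For a single cycle this agrees with the closed form of total latency divided by total distance, rounded up; the linear search over II keeps the sketch short, and a binary search would serve equally well for larger graphs.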
N.J. Warter, G.E. Haab, and J.W. Bockhaus. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pages 170--179, Portland, OR, December 1-4 1992.
....initiating execution of successive loop iterations. Once that minimum initiation interval is determined, instruction scheduling attempts to match that minimum schedule while respecting resource and dependence constraints. Lam's hierarchical reduction is a modulo scheduling method, as is Warter's [13, 4] enhanced modulo scheduling, which uses IF conversion to produce a single superblock to represent a loop. Rau [3] provides a detailed discussion of an implementation of modulo scheduling. Since Rocket's software pipelining, patterned after Warter's enhanced modulo scheduling [13], uses a modulo ....
....as is Warter's [13, 4] enhanced modulo scheduling, which uses IF conversion to produce a single superblock to represent a loop. Rau [3] provides a detailed discussion of an implementation of modulo scheduling. Since Rocket's software pipelining, patterned after Warter's enhanced modulo scheduling [13], uses a modulo scheduling algorithm, we shall investigate modulo scheduling in a bit more detail. While Warter's method provides a general framework for our software pipelining, the actual modulo scheduling technique implemented in Rocket closely follows Rau [3]. Modulo scheduling assumes that a ....
N. Warter, G. Haab, and J. Bockhaus, "Enhanced Modulo Scheduling for Loops with Conditional Branches," in Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), (Portland, OR), pp. 170--179, December 1-4 1992.
....dramatically. Not only do compilers need to be concerned with finding ILP to utilize machine resources effectively, but they also need to be concerned with ensuring that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops [3, 6, 7, 14, 16, 17] or on improving cache performance [9, 15, 18]. This paper presents a performance metric that can be used to guide the optimization of nested loops considering the combined effects of ILP, data reuse and latency hiding techniques. We have implemented the technique in a source to source ....
....run on two different architectures) 1. Introduction. Modern microprocessors have increased computational power through faster cycle times and multiple instruction issuing. Unfortunately, compilers have not been able to fully utilize these advances. First, techniques like software pipelining [14, 16, 17] may not be able to take full advantage of a target architecture due to inner loop recurrences or mismatches between the resource requirements of a loop and the Copyright 1996 IEEE. Published in Proceedings of PACT 96, October 20-23, 1996, Boston, MA. Personal use of this material is ....
N. Warter, G. Haab, and J. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th International Symposium on Microarchitecture (MICRO-25), pages 170--179, Portland, OR, December 1992.
....Also, in VLIW architectures, the code size will expand exponentially because a single threaded code that includes all possible combinations of branch conditions needs to be generated. This problem is especially serious when a compiler attempts to software pipeline a loop with conditional branches [18]. In superscalar architectures, the processor needs to perform run time dependence checking for both register and memory accesses. The hardware overhead for such dependence checking is very high and can grow quadratically as the size of the instruction window increases [9]. In VLIW architectures, ....
Nancy J. Warter, Grant E. Haab, John W. Bockhaus, and Krishna Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1--4, 1992.
....initiating execution of successive loop iterations. Once that minimum initiation interval is determined, instruction scheduling attempts to match that minimum schedule while respecting resource and dependence constraints. Lam's hierarchical reduction is a modulo scheduling method, as is Warter's [21, 22] enhanced modulo scheduling, which uses IF conversion to produce a single superblock to represent a loop. Rau [6] provides a detailed discussion of an implementation of modulo scheduling. 2.5 Modulo Scheduling. The software pipelining method used in this paper is based upon Iterative Modulo ....
N. Warter, G. Haab, and J. Bockhaus, "Enhanced Modulo Scheduling for Loops with Conditional Branches," in Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), (Portland, OR), pp. 170--179, December 1-4 1992.
....Software pipelining overlaps operations from different loop iterations in an attempt to fully exploit instruction level parallelism. To be successful on real machines, the function unit constraints of those machines must be taken into account. A variety of software pipelining algorithms [22, 11, 16, 1, 2, 7, 27, 17, 29, 13, 28, 18, 5, 20] have been proposed which operate under resource constraints. An excellent survey of these algorithms can be found in [21]. More recently, integer linear programming (ILP) based approaches for finding an optimal periodic schedule on machines with simple resource usage have been proposed [9, 8, 6] ....
....they model simple or complex resource usage, 3) whether the algorithm is one pass, iterative, incremental, or exhaustive search based. In particular, in terms of resource constraints, the work reported in [1, 27, 18, 20] does not consider any resource constraint, while the methods reported in [22, 11, 2, 7, 17, 29, 28] deal with function unit constraints but with simple resource usage. Both function unit and register resource constraints are considered in [13, 5]. Software pipelining methods for complex usage patterns with limited function units were dealt with by [3, 16, 19]. The methods proposed in these works ....
Nancy J. Warter, John W. Bockhaus, Grant E. Haab, and Krishna Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, pages 170--179, Portland, Ore., Dec. 1--4, 1992. ACM SIGMICRO and IEEE-CS TC-MICRO.
....overestimates resource requirements, and third, preserving the control structure of the program restricts possible code motions. A second proposal for integrating modulo scheduling with conditional tests is to use if conversion [AKPW83] before modulo scheduling and reverse if conversion [WHB92, WMHR93] after modulo scheduling. When a loop is if converted, the expression of control flow is changed from explicit jumps to guarded operations, where each operation of the original loop is guarded by the predicates of the conditionals that control its execution. In this way, all non trivial ....
N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th International Symposium and Workshop on Microarchitecture (MICRO-25), December 1992.
.... The target machine is a hypothetical VLIW processor similar to Cydrome's Cydra 5 [20, 2], including architectural support for overlapping loops without using code duplication [5]. Nevertheless, the scheduling techniques shown in this paper can be directly applied to conventional RISC machines [14, 23], albeit at the expense of code expansion [19]. 2.1 Functional Units. Functional unit latencies are given in Table 1. The compiler assumes the responsibility for honoring these latencies, scheduling no-ops wherever necessary. All functional units are fully pipelined; except for the divider, which ....
....code generation after modulo scheduling a loop; see [19] for details. To address the problem of modulo scheduling loops with branches on machines without predicated execution, two extensions to modulo scheduling have been developed; namely, hierarchical reduction [9] and enhanced modulo scheduling [23]. In essence, each approach reduces the problem to scheduling branch-free loop bodies, at the cost of code expansion. 2.3 Rotating Register Files. When modulo scheduling a loop, it is quite common for an operation's result to be live for more than II cycles, thus preventing the operation from ....
N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, Dec. 1992.
....among predicate values. This information is used to refine dataflow analysis, optimization, scheduling, and allocation in the presence of hyperblocks. Additionally, this information is used to conditionally reserve functional units when modulo scheduling under the Reverse IF Conversion scheme [14]. A complementary approach is taken with the Gated Single Assignment (GSA) form [15], where precise predicate information is embedded in the dataflow graph. This approach is used in the Polaris parallelizing compiler to refine data and memory dependence analysis and to aid loop parallelization ....
N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. MICRO, pages 170--179, Dec. 1992.
....initiating execution of successive loop iterations. Once that minimum initiation interval is determined, instruction scheduling attempts to match that minimum schedule while respecting resource and dependence constraints. Lam's hierarchical reduction is a modulo scheduling method, as is Warter's [23, 24] enhanced modulo scheduling, which uses IF conversion to produce a single superblock to represent a loop. Rau [17] provides a detailed discussion of an implementation of modulo scheduling, while Allan et al. [3] provide a thorough survey of software pipelining methods. 2.3 Modulo Scheduling. Since ....
....to represent a loop. Rau [17] provides a detailed discussion of an implementation of modulo scheduling, while Allan et al. [3] provide a thorough survey of software pipelining methods. 2.3 Modulo Scheduling. Since Rocket's software pipelining, patterned after Warter's enhanced modulo scheduling [23], uses a modulo scheduling algorithm, we shall investigate modulo scheduling in a bit more detail. While Warter's method provides a general framework for our software pipelining, the actual modulo scheduling technique implemented in Rocket closely follows Rau [17]. Modulo scheduling assumes that a ....
Warter, N., Haab, G., and Bockhaus, J. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25) (Portland, OR, December 1-4 1992), pp. 170--179.
....in efficiently exploiting instruction level parallelism and a significant body of research has sought more effective scheduling algorithms. Several new directions have been explored: schedulers may not schedule operations in cycle order, focusing initially on operations along critical paths [1, 2, 3, 4, 5, 6], they may backtrack to reverse poor scheduling decisions [1, 2, 3, 4, 7], and they may hide long latencies by speculating operations across branches and basic blocks [1, 6, 7, 8, 9, 10]. High performance compilers have also used precisely detailed machine models [1, 3, 7, 11, 12, 13] ....
....body of research has sought more effective scheduling algorithms. Several new directions have been explored: schedulers may not schedule operations in cycle order, focusing initially on operations along critical paths [1, 2, 3, 4, 5, 6], they may backtrack to reverse poor scheduling decisions [1, 2, 3, 4, 7], and they may hide long latencies by speculating operations across branches and basic blocks [1, 6, 7, 8, 9, 10]. High performance compilers have also used precisely detailed machine models [1, 3, 7, 11, 12, 13] to better utilize the machine resources of current processors with ....
N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....Software pipelining overlaps operations from different loop iterations in an attempt to fully exploit instruction level parallelism. To be successful on real machines, the function unit constraints of those machines must be taken into account. A variety of software pipelining algorithms [1, 5, 7, 11, 13, 16, 17, 18, 20, 24, 27, 26] have been proposed which operate under resource constraints. An excellent survey of these algorithms can be found in [19]. This work was supported by research grants from NSERC (Canada) and MICRONET Network Centers of Excellence (Canada). Integer Linear Programming (ILP) based and other ....
....encounters them. In fact, the ILP computation could be run in the background, so that the user may get non-optimal code the first time his/her code is compiled, but on later compilations the desired schedule would be in the database. 7 Related Work Software pipelining has been extensively studied [1, 4, 5, 7, 13, 16, 17, 18, 20, 21, 24, 26, 27]. Rau and Fisher provide a comprehensive survey of these works in [19]. As stated in [19], software pipelining methods vary in several aspects: (1) whether or not they consider finite resources, (2) whether they model simple 4 Because of the way our compiler performed memory disambiguation, some ....
N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, pages 170--179, Portland, Ore., Dec. 1--4, 1992.
....include multiple control flow paths, loops that are not based on a loop counter, and multiple exits. Several techniques have been developed to allow modulo scheduling of loops with intra-iteration control flow, such as hierarchical reduction [11], predicated execution [5], and reverse if-conversion [21]. The above work has assumed that all of the paths through the loop body are included for scheduling. Including all of the paths can be detrimental to overall loop performance. The presence of unimportant paths with high resource usage or long dependence chains can result in a schedule that ....
N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....constraint, each resource may be used by one iteration at most once within each set of times that are congruent modulo II. The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [4] or IF conversion [5]. Modulo scheduling has also been extended to a large variety of loops with early exits, such as while loops [6][7]. Furthermore, the code expansion due to modulo scheduling can be eliminated by using special hardware, such as rotating register files and support for predicated execution [8]. As ....
N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. MICRO, pages 170--179, Dec. 1992.
....requires that all usages of any particular resource by a single iteration must be scheduled at distinct times modulo II. The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [4] or IF conversion [5][6]. Loops with conditional exits can also be modulo scheduled [7]. Furthermore, the code expansion due to modulo scheduling can be eliminated when using special hardware such as rotating register files and predicated execution [8]. As modulo scheduling achieves higher throughput by overlapping ....
N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1992.
....than once at times that are congruent modulo II. As a result, searching for a schedule with a given II is greatly simplified. The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [4] or if conversion [5][6]. Loops with conditional exits can also be modulo scheduled [7]. Furthermore, the code expansion due to modulo scheduling can be eliminated when using special hardware, such as rotating register files and predicated execution [8]. Since modulo scheduling achieves a high throughput by ....
N. J. Warter, G. E. Haab, and J. W. Bockhaus, "Enhanced modulo scheduling for loops with conditional branches", Proceedings of the 25th Annual International Symposium on Microarchitecture, December 1992.
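The resource rule quoted in the contexts above (a single iteration may not use the same resource at two times that are congruent modulo II) can be illustrated with a short check. This is only a sketch under stated assumptions: the function name `fits_modulo_schedule`, the `(resource, time)` pair representation, and the assumption of exactly one instance of each resource are hypothetical and are not taken from any of the cited papers.

```python
def fits_modulo_schedule(uses, ii):
    """Return True if no resource is used twice at times congruent modulo ii.

    uses: list of (resource, time) pairs for ONE loop iteration (hypothetical
          representation; each resource is assumed to have a single instance).
    ii:   candidate initiation interval, a positive integer.
    """
    if ii <= 0:
        raise ValueError("initiation interval must be positive")
    occupied = set()  # (resource, time mod ii) slots already claimed
    for resource, time in uses:
        slot = (resource, time % ii)
        if slot in occupied:
            # Two uses of the same resource collide once iterations overlap.
            return False
        occupied.add(slot)
    return True


# One iteration uses the ALU at cycles 0 and 3 and the memory port at cycle 1.
uses = [("alu", 0), ("mem", 1), ("alu", 3)]
print(fits_modulo_schedule(uses, 2))  # ALU slots are 0 and 1 modulo 2
print(fits_modulo_schedule(uses, 3))  # ALU slots are both 0 modulo 3
```

With II = 2 the two ALU uses fall in different slots, so the check passes; with II = 3 both fall in slot 0, so a new iteration started every three cycles would contend for the ALU.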
....of common subexpression elimination and loop-invariant code motion. Granlund and Kenner [20] discuss eliminating conditional branches through the use of a super optimizer, a program which finds optimal instruction sequences for small pieces of code through exhaustive search. Warter et al. [45] describe enhanced modulo scheduling to improve software pipelining in loops with conditional branches. The technique uses if conversion and code replication to transform loops with conditional branches into straight line code before scheduling. After scheduling is performed, however, ....
Nancy J. Warter, Grant E. Haab, and Krishna Subramanian. Enhanced modulo scheduling for loops with conditional branches. MICRO-25 Conference Proceedings, December 1992.
....and forces a resource to be used by one iteration no more than once within each set of times that are congruent modulo II. The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [4] or IF conversion [5]. Modulo scheduling has also been extended to a large variety of loops with early exits, such as while loops [6][7]. Furthermore, the code expansion due to modulo scheduling can be eliminated by using special hardware, such as rotating register files and predicated execution [8]. Since modulo ....
N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, December 1992.
....register [31, 93]. Conditional branches can then be substituted with instructions that set the appropriate flag register. Instructions from both branches of an if statement are issued, but only those from one of the branches are actually enabled for execution. This process is called if conversion [112]. Since all instructions are issued in each iteration, generating a static schedule that overlaps multiple iterations is as easy as for a loop without conditional branches. Trace scheduling assumes that one direction of a conditional branch is taken much more frequently than the other [38]. A ....
....branch directions are more frequently executed, which can be obtained from a trace or a profile. A number of purely software techniques also exist for software pipelining both directions of conditional branches. These include hierarchical reduction, GURPR and enhanced modulo scheduling (EMS) [112]. These techniques use if conversion to transform the body of the loop into straight line code for which a static schedule can be easily generated. This static schedule is then regenerated into executable code by replicating instructions and inserting appropriate conditional branches. A ....
Nancy J. Warter, Grant E. Haab, Krishna Subramanian, and John W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pages 170--179, December 1992.
....in efficiently exploiting instruction level parallelism and a significant body of research has sought more effective scheduling algorithms. Several new directions have been explored: schedulers may not schedule operations in cycle order, focusing initially on operations along critical paths [1][2][3][4][5][6]; they may backtrack to reverse poor scheduling decisions [1][2][3][4][7]; and they may hide long latencies by speculating operations across branches and basic blocks [1][6][7][8][9][10]. High performance compilers have also used precisely detailed machine models [1][3][7][11][12][13] ....
....body of research has sought more effective scheduling algorithms. Several new directions have been explored: schedulers may not schedule operations in cycle order, focusing initially on operations along critical paths [1][2][3][4][5][6]; they may backtrack to reverse poor scheduling decisions [1][2][3][4][7]; and they may hide long latencies by speculating operations across branches and basic blocks [1][6][7][8][9][10]. High performance compilers have also used precisely detailed machine models [1][3][7][11][12][13] to better utilize the machine resources of current processors with ....
N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proc. of the 25th Annual International Symposium on Microarchitecture, pages 170--179, Dec. 1992.
....quickly [16]. Also, in VLIW architectures, the code size will expand exponentially because a single threaded code needs to include all possible combinations of branch conditions. This problem is especially serious when a compiler attempts to software pipeline a loop with many conditional branches [50]. In superscalar architectures, the processor needs to perform run time dependence checking for both register and memory accesses. The hardware overhead for such dependence checking is very high and can grow quadratically as the size of the instruction window increases [16]. In VLIW architectures, ....
Nancy J. Warter, Grant E. Haab, John W. Bockhaus, and Krishna Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, December 1--4, 1992.
....along with modifications to the original algorithm which are designed to improve its performance. Chapter 4 presents the experimental results of the original GURPR algorithm along with three modified versions and compares them with the results of one other method, Enhanced Modulo Scheduling [9]. Chapter 5 provides a conclusion. 4 2. RELATED WORK The idea of software pipelining has generated many viable algorithms. Each is slightly different in the way it handles conditional branches, recurrence relations, and resource constraints. Several of the techniques also require additional ....
....technique is that it requires the resource usage of the conditional construct to be the sum of the resource usages of each path, rather than using the union operator as with hierarchical reduction. A technique that combines the best of the previous two methods is called Enhanced Modulo Scheduling [9]. Enhanced Modulo Scheduling uses if conversion to transform the loop into straight line code but then converts it back into multiple paths after software pipelining, thus removing the need for predicated execution hardware support. Since the technique converts the code back into multiple paths, ....
N. Warter, J. Bockhaus, G. Haab, and K. Subramanian, "Enhanced modulo scheduling for loops with conditional branches," Tech. Rep., Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL, 1992.
....these proposed features cannot be used since they require changing the instruction set. For such architectures, techniques are needed that do not assume any special hardware support. 56 CHAPTER 4 ENHANCED MODULO SCHEDULING This chapter presents the Enhanced Modulo Scheduling (EMS) technique [61]. As Figure 4.1 shows, EMS is Modulo Scheduling with ICTs. This approach to Modulo Scheduling is referred to as enhanced for loops with conditional branches because EMS (1) allows operations to be scheduled independently, (2) uses the most constrained resource along any path to form the lower ....
N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramanian, "Enhanced Modulo Scheduling for loops with conditional branches," in Proceedings of the 25th International Symposium on Microarchitecture, pp. 170--179, November 1992.
....in the same manner as those without. However, without RIC, Modulo Scheduling for processors without PE support has relied on techniques such as Hierarchical Reduction, which apply prescheduling to remove conditional branches [16]. Prescheduling limits the effectiveness of Modulo Scheduling [17][18]. Furthermore, Hierarchical Reduction can only be applied to structured loop bodies whereas the ICTs can be applied to any acyclic loop body. In this section we present the Enhanced Modulo Scheduling technique (EMS) which uses the ICTs to simplify scheduling. For further details about EMS, ....
....Furthermore, Hierarchical Reduction can only be applied to structured loop bodies whereas the ICTs can be applied to any acyclic loop body. In this section we present the Enhanced Modulo Scheduling technique (EMS) which uses the ICTs to simplify scheduling. For further details about EMS, refer to [17]. Figure 8 shows the hyperblock after software pipelining. The target machine is a VLIW processor with two operation slots. Effectively, two iterations of the hyperblock (the schedule from Figure 6(b)) have been overlapped. Note that the operations within each instruction have been reordered to ....
N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramanian, "Enhanced Modulo Scheduling for loops with conditional branches," in Proceedings of the 25th International Symposium on Microarchitecture, pp. 170--179, November 1992.
....occurs. Thus, explicit branch instructions other than A are not shown, but their corresponding control flow arcs are shown. 3 In this simple example, there are only two stages in the software pipeline until the steady state execution is reached. Typically, the number of stages is higher (in [15] the average number of stages is five). Thus, it can be costly to recover from a mispredicted branch using a software resolution technique. It would be ideal if the execution could jump out of the pipeline, execute the taken path code, and jump back into the pipeline. In order to do so, the ....
....technique that schedules loops whose body consists of a simple basic block [10]. In order to schedule loops with conditional constructs, the conditional constructs must be converted into straight-line code. If conversion can be used to convert conditional constructs into straight-line code [21][12][15]. In this paper we discuss modulo scheduling with if conversion assuming predicated hardware support [9][12]. As discussed in Section 3.2, predicated hardware support allows for efficient hazard resolution. When the branch is taken (mispredicted) the taken path code is executed and control ....
N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus, "Enhanced modulo scheduling for loops with conditional branches," in Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 170--179, December 1992.
N. Warter, G. Haab, and J. Bockhaus. Enhanced Modulo Scheduling for Loops with Conditional Branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), pages 170--179, Portland, OR, December 1-4 1992.
N.J. Warter, G.E. Haab, K. Subramanian, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proc. 25th Intl. Symp. on Microarchitecture, pp. 170-179, Dec. 1992.
N. Warter, G. Haab, K. Subramanian, and J. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170--179, 1992.
N. Warter, et al., "Enhanced Modulo Scheduling for Loop with Conditional Branches", Proc. 25th Ann. Int. Symp. Microarchitecture, pp. 170-179, 1992.
N.J. Warter, J.W. Bockhaus, G.E. Haab, and K. Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, 170-179, December 1992.
CiteSeer.IST - Copyright Penn State and NEC