J.A. Fisher, "Global code generation for instruction-level parallelism: Trace scheduling-2," Technical Report HPL-93-43, Hewlett Packard Laboratories, June 1993.
....to inline is difficult to answer in a machine independent way. The SEA runtime is an extension of the architecture, so it is an appropriate level at which to make these decisions. Finally, several studies have shown benefits to scheduling code across multiple basic blocks (e.g. trace scheduling [Fis93, Wal91]). This is a data dependent code transformation [Fis81] and so seems appropriate for an SEA, where profile information can be provided at runtime, and the transformations can be aggressive without worrying about penalizing cases where the program behavior might be different. To make a ....
....cant issue restrictions on some instruction types and instruction pairs. The static scheduling of the 21164 shows off the code scheduling algorithms of the SSMT runtime, but we believe that the SSMT runtime would also be beneficial (perhaps more so) for dynamically scheduled processors. Studies [Fis93] have shown that software techniques which increase the mix of non dependent instructions in the hardware's limited instruction window improve processor utilization. Benchmark Time Sec 16 int reg SEA mandel 38.6 0.008 3.6 compress 24.9 2.7 16.1 swim 46.6 6.0 2.8 alvinn 25.8 9.9 ....
Joseph A. Fisher. Global code generation for instruction-level parallelism: trace scheduling-2. In HP Laboratories Technical Report HPL-93-43, June 1993.
....loops [7]. There has been some prior work in alleviating the effects of control dependences. The use of speculative execution is one such technique. Speculative execution identifies operations whose side effects are reversible and moves these operations above branches on which they depend [8 12]. Speculative execution can significantly accelerate program performance but does not address the problem of parallel execution of branches and other non speculative operations with problematic control dependences. Researchers have also considered moving branches across branches [13, 14]. However, ....
J. A. Fisher. Global code generation for instruction-level parallelism: trace scheduling-2. Technical Report HPL-93-43, Hewlett-Packard Laboratories, Palo Alto CA, 1993.
....Ten years after J. Fisher introduced TS, he extended his algorithm to include nonlinear code motions. These are motions that the original TS algorithm misses because a trace contains only one direction of if-then-else structures. The new algorithm, named Trace Scheduling 2 [38] (TS2), enables the motion of code before a conditional jump from both directions at the same time and is more considerate of code coming from less likely executed paths. Code motions from a block to one of its dominators, albeit across conditional branches, are the most important ones becoming ....
....code placed on edges in the CFG, instead of placing it in basic blocks. Once the profiles have been generated, we can ask ourselves how well program executions are statically predictable. In other words, in the context of instruction scheduling, how useful are profiling data for scheduling. In [38], Fisher discusses this topic based on trace scheduling 2 experiments in the Hewlett Packard research laboratories. Some notions have to be defined first: inherent predictability. A program is said to be inherently unpredictable if the best possible a priori prediction of the direction jumps ....
Fisher, J. Global code generation for instruction-level parallelism: Trace scheduling-2. Tech. Rep. HPL-93-43, Hewlett-Packard, June 1993.
....of an operation be the sum over all superblock exits, E, of the product of the profiled weight of E and the height of the operation with respect to E. Though the Elcor compiler supports several priority functions, all the evaluations reported here are based on the weighted height priority function [5]. 3.2 Conventional Cycle Scheduler Before we describe the main scheduling loop of the Cycle scheduler, we present some concepts and data structures. The CurrentCycle is the cycle in which operations are being scheduled currently by the scheduler. The CurrentCycle is initially set to 0, and ....
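The weighted height described in this excerpt could be computed along the following lines. This is a minimal sketch and not the Elcor implementation: the function name `weighted_height`, the dictionary inputs, and the convention that an operation contributes height 0 for an exit it does not precede are all assumptions made for illustration.

```python
def weighted_height(exit_weights, heights):
    """Weighted height of one operation in a superblock.

    exit_weights: maps each superblock exit E to its profiled weight
                  (hypothetical input format).
    heights:      maps each exit E to the height of the operation with
                  respect to E; exits missing from this mapping are
                  assumed to contribute a height of 0.
    """
    # Sum over all exits E of weight(E) * height(op, E).
    return sum(w * heights.get(e, 0) for e, w in exit_weights.items())


# Illustrative values only: two exits with profiled weights 0.3 and 0.7.
example = weighted_height({"E1": 0.3, "E2": 0.7}, {"E1": 2, "E2": 5})
```

A list scheduler using this priority would pick, among ready operations, the one with the largest weighted height.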
....within a basic block is limited and not sufficient for modern EPIC processors. Global schedulers use a larger scheduling region and perform code motion between basic blocks. The trace scheduler constructs a trace consisting of a linear chain of basic blocks with multiple entries and exits [5, 12]. Global schedulers restrict the scheduling regions to reduce the complexity of inserting compensation code in the side entries/exits due to code motion [13, 14]. The schedulers described in this paper use the superblock [15] and hyperblock [16] as the global scheduling region. Wavefront ....
J.A. Fisher, "Global code generation for instruction-level parallelism: Trace scheduling-2," Technical Report HPL-93-43, Hewlett Packard Laboratories, June 1993.
....entry point. The advantage of scheduling per extended block is that operations can be moved over basic block boundaries and that interblock latencies are taken into account. Combining global code layout with extended basic block scheduling results in an approach very similar to trace scheduling [8], [9]. H. Value Profiling. It sometimes happens that important optimizations are not applicable because some (necessary or sufficient) conditions are not fulfilled. The reason might be that the conditions do not always hold, that the implemented analyses are not precise enough to detect that they in ....
J.A. Fisher, "Global code generation for instruction-level parallelism: Trace scheduling-2," Tech. Rep. HPL-93-43, Hewlett-Packard, June 1993.
....was developed with the help of modified percolation scheduling [13]. Tree traversal scheduling targets wider issue ILP architectures such as EPIC [21] style machines. Tree traversal scheduling also has many similarities with other popular global scheduling techniques, such as trace scheduling [4,5], superblock scheduling [3] and hyperblock scheduling [9,10]. As mentioned previously, the heuristics used for the candidate operation selection in the TTS algorithm were motivated by the profile variation studies on superblocks [6]. In addition, superblock scheduling also uses tail duplication as ....
J. A. Fisher, "Global code generation for instruction level parallelism: Trace Scheduling-2," Tech. Rep. HPL-93-43, Hewlett-Packard Laboratories, June 1993.
....multiple branches in one clock cycle. The initial work in VLIW architectures was based on a single path scheduling algorithm called trace scheduling [2]. The Trace Scheduling 2 algorithm is an extension of the original trace scheduling algorithm that schedules along multiple paths simultaneously [10]. Hyperblock scheduling also schedules multiple paths in parallel [3] by removing branches from the instruction stream entirely through if conversion. 5 Concluding remarks and acknowledgements There are issues related to treegions that merit further research. The use of if conversion and tail ....
J. A. Fisher, "Global code generation for instruction-level parallelism: Trace Scheduling-2," Tech. Rep. HPL-93-43, Hewlett-Packard Laboratories, June 1993.
....much parallelism is available in typical applications. Machines providing a high degree of multiple issue would be of little use if applications did not display that much parallelism. The available parallelism depends strongly on how hard we are willing to work to find it. Recent studies [4, 5, 6, 13, 14, 15, 16, 17] have led to a growing consensus that high levels of parallelism are available only by doing speculative execution, in which we can issue an instruction whose data dependencies are satisfied even though its control dependencies are not. That is, we issue a potential future instruction early even ....
....Figure 1 shows the harmonic mean of the success rates of three predictors for twelve SPEC92 benchmarks, as the predictor size varies. The two-bit counter predictor does best for predictor sizes up to 512 bits. A predictor built by combining a counter predictor .... [Footnote 1: This approach to a missed prediction ignores the possibility that code could be moved from after the point where the paths rejoin to a position before the paths split apart. Recognizing such opportunities is difficult in hardware but feasible in a software scheduler [4, 14].]
Joseph A. Fisher. Global code generation for instruction-level parallelism: trace scheduling-2. Technical Report HPL-93-43, Hewlett-Packard Laboratories, June 1993.
....features judiciously. Instruction scheduling plays a major role in the usage of such features to extract ILP. Several instruction scheduling techniques have been described in the literature to perform scheduling across basic block boundaries. Trace scheduling [2], [3], superblock and hyperblock scheduling [4], [5] operate on regions known as traces, which consist of a contiguous set of basic blocks. In contrast, treegion scheduling [6] uses a decision tree subgraph of a program's control flow graph as a scheduling region. Ebcioglu's VLIW scheduling [7] arranges ....
J. A. Fisher, "Global code generation for instruction-level parallelism: Trace scheduling-2," Technical Report HPL-93-43, Hewlett-Packard Laboratories, June 1993.
....to inline is difficult to answer in a machine independent way. The SEA runtime is an extension of the architecture, so it is an appropriate level at which to make these decisions. Finally, several studies have shown benefits to scheduling code across multiple basic blocks (e.g. trace scheduling [Fis93, Wal91]). This is a data dependent code transformation [Fis81] and so seems appropriate for an SEA, where profile information can be provided at runtime, and the transformations can be aggressive without worrying about penalizing cases where the program behavior might be different. To make a ....
....issue restrictions on some instruction types and instruction pairs. The static scheduling of the 21164 shows off the code scheduling algorithms of the SSMT runtime, but we believe that the SSMT runtime would also be beneficial (perhaps more so) for dynamically scheduled processors. Studies [Fis93] have shown that software techniques which increase the mix of non dependent instructions in the hardware's limited instruction window improve processor utilization. Benchmark Time Sec 16 int reg SEA mandel 38.6 0.008 3.6 compress 24.9 2.7 16.1 ....
Joseph A. Fisher. Global code generation for instruction-level parallelism: trace scheduling-2. In HP Laboratories Technical Report HPL-93-43, June 1993.
....For the best performance one does not want to restrict to single control flow path scheduling units without join points. However, from an engineering complexity point of view, join points are hard to handle since complex bookkeeping is required when operations are scheduled above a join point [11, 12]. It is not clear whether including join points is worth the effort that could otherwise be spent on other aspects of the scheduler. For the TriMedia scheduler we have chosen decision trees as the scheduling unit mainly because it is a good trade off between performance and engineering ....
J. A. Fisher, "Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2," Tech. Rep. HPL-93-43, Hewlett Packard Computer Systems Laboratory, Palo Alto, CA, June 1993.
....have different probabilities of being taken, the advantages or disadvantages associated with speculating instructions above branches are not equal. Fisher proposed a heuristic called speculative yield that can be used to help assign instruction priorities in the presence of conditional branches [37]. This heuristic uses branch probabilities to determine the benefit of speculating an instruction from either path of a conditional branch. Fisher's use of this heuristic assumes the scheduling is being performed as the trace is being generated. In contrast, superblock formation and ....
J. A. Fisher, "Global code generation for instruction-level parallelism: Trace scheduling-2," Tech. Rep. HPL-93-43, Hewlett Packard Computer Research Center, June 1993.
[Figure 17: Speculation and register renaming.] ....execution. In contrast, branches (and merges) are treated differently since the corresponding overheads can be prohibitive [12]. Consequently, scheduling technology frequently limits the relative reordering of these instructions completely. We will retain this restriction in our subsequent discussion. 5.1.2 Feasible global acyclic schedule. Let us now make precise the notion of a global acyclic schedule. Consider an ....
Joseph Fisher. Global code generation for instruction-level parallelism: trace scheduling-2. Technical report, HP Labs, 1991.
....see that when fanout is followed by good branch prediction, the fanout does not buy us much. Without branch prediction, on the other hand, even modest amounts of fanout are quite rewarding: adding fanout across 4 branches to the Fair model is about as good as adding Fair branch prediction. Fisher [Fis91] has proposed using fanout in conjunction with profiled branch prediction. In this scheme they are both under software control: the profile gives us information that helps us to decide whether to explore a given branch using the fanout capability or using a prediction. This is possible because a ....
J. A. Fisher. Global code generation for instruction-level parallelism: trace scheduling-2. Technical Report #HPL-93-43, Hewlett-Packard Laboratories, Palo Alto, California, 1993.
....important execution paths can have their run time minimized. However, with limited execution resources, situations arise where one path will execute faster only if another path gets delayed. Fisher proposed the use of speculative yield to determine the profitability of speculating an instruction [8]. The speculative yield is an expected value function which is defined between basic blocks. It is the probability that an operation scheduled in basic block i produces useful work (meaning that its original basic block j executes when basic block i executes). Its use with dependence height has ....
....with exit 1 would be one greater than the values shown in the figure. 2.2. Previous Work and Other Heuristics. DEPENDENCE HEIGHT AND SPECULATIVE YIELD. Fisher suggests that the multiplication of dependence height by speculative yield is a good candidate for the scheduling priority function [8]. The appeal is that dependence height is commonly used as the priority function, and speculative yield allows it to take into account the probability of taken branches. Bringmann utilized this concept in a list scheduler for superblocks [2]. In his method, a static heuristic utilizes the exit ....
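The priority function described in these two excerpts (dependence height multiplied by speculative yield) can be illustrated as follows. This is a minimal sketch under stated assumptions, not Fisher's or Bringmann's implementation: it assumes a linear chain of basic blocks, and the names `speculative_yield`, `priority`, and the `stay_probs` list of profiled fall-through probabilities are hypothetical.

```python
from math import prod


def speculative_yield(stay_probs, i, j):
    """Probability that block j executes given that block i executes.

    stay_probs[k] is the assumed profiled probability of continuing from
    block k to block k + 1 along a linear chain of blocks (hypothetical
    input format). Block i is where the operation is scheduled and block j
    is the operation's original block.
    """
    if j < i:
        raise ValueError("block j must not precede block i")
    # Multiply the probabilities of staying on the chain from i up to j.
    # For i == j the product is empty and the yield is 1.
    return prod(stay_probs[i:j])


def priority(dependence_height, stay_probs, i, j):
    """Scheduling priority: dependence height scaled by speculative yield."""
    return dependence_height * speculative_yield(stay_probs, i, j)


# Illustrative values only: an operation of height 4 hoisted from block 2
# into block 0, with fall-through probabilities 0.9 and 0.5 on the way.
example = priority(4, [0.9, 0.5, 0.8], 0, 2)
```

Under this sketch, an operation hoisted above unlikely branches keeps less of its dependence height as priority than one that stays in its home block.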
J. A. Fisher. Global code generation for instruction-level parallelism: Trace scheduling-2. Technical Report HPL-93-43, Hewlett-Packard Laboratory, 1501 Page Mill Road, Palo Alto, CA 94304, June 1993.
....the use of loop splitting to propagate operand types and hence aid optimization in the context of a dynamically typed language. Code replication is fundamental in producing the long instruction sequences useful for compiling efficiently to VLIW architectures [2, 10, 11, 18]. Trace scheduling [11, 12] and super blocks [7] are two techniques which rely on some amount of growth in code size to improve performance in the common case. In both algorithms, basic blocks along frequently executed paths are merged at compile time to improve instruction scheduling. Patch up code is added to the ....
Joseph A. Fisher. Global code generation for instruction-level parallelism: Trace scheduling-2. Technical Report HPL-93-43, Hewlett-Packard Laboratories, June 1993.
....if conversion, eliminates the branches via hardware support for predicated execution, allowing instructions to be moved outside of their basic blocks, see [4] for instance. The other approach does not require hardware support, and involves the formation of larger code regions such as traces [7] and super blocks [14]. A super block consists of a sequence of basic blocks strung together, with conditional exits at the branch points that separate the basic blocks. Super blocks are typically formed as follows. Given is a code region with branch probabilities available at each branch in the ....
....information. For instance, super blocks are typically scheduled with heuristics that are oblivious to the profile information, although there are techniques that use profile information as an addendum to classical scheduling techniques, e.g. the speculative yield technique of [3] following [7]. Example 1 examines several of these techniques. As region formation algorithms become more sophisticated and produce non linear code regions encompassing balanced branches, they will be less effective in digesting the profile information. As a result, it will be increasingly important that the ....
J. A. Fisher. Global code generation for instruction level parallelism. Tech. Rep. HPL-93-43, Hewlett Packard Labs, June 1993.
No context found.
J.A. Fisher, "Global code generation for instruction-level parallelism: Trace scheduling-2," Technical Report HPL-93-43, Hewlett Packard Laboratories, June 1993.
No context found.
Joseph A. Fisher, Global code generation for instruction-level parallelism: Trace Scheduling-2. Tech. Rep. HPL-93-43, Hewlett-Packard Laboratories, June 1993.
No context found.
Joseph A. Fisher, "Global Code Generation For Instruction-Level Parallelism: Trace Scheduling-2", Hewlett-Packard Technical Report, Palo Alto, California, HPL-93-43, Jun. 1993.
No context found.
J. Fisher, Global Code Generation for Instruction-Level Parallelism: Trace scheduling-2, Hewlett Packard Laboratories Technical Report HPL-93-43, June 1993.
No context found.
J. A. Fisher, "Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2," Tech. Rep. HPL-93-43, Hewlett Packard Computer Systems Laboratory, Palo Alto, CA, June 1993.