W.-m. W. Hwu, P. P. Chang, Efficient Instruction Sequencing with Inline Target Insertion, 41(12), 1992, pp. 1537-1551.
....less optimal than the unrolled code; if the loop typically executes few iterations, loop unrolling can be detrimental to performance. One way to overcome this problem is to add code specially designed to execute the loop a constant few iterations (Hwu calls this type of structure a superblock in [75]). There is also a secondary cost of loop unrolling in some architectures caused by the additional cache misses due to the increased code size [115, 116, 40, 171]. The efficiency of loop unrolling quickly drops in relation to the size of original loop inefficiency and the unroll count. It is easy ....
W.-m. W. Hwu, P. P. Chang, Efficient Instruction Sequencing with Inline Target Insertion, 41(12), 1992, pp. 1537-1551.
....the advantages of scheduling superblocks especially on superpipelined superscalar processors. Reference [10] shows the importance of function inlining in compiling C programs. Reference [11] shows how instruction placement may be improved after function inlining has been performed. Reference [12] is a later version of the report [11] published as an article. Reference [13] describes some of the early work of Po Hua Chang and Wen Mei Hwu applying trace selection to large C programs. The trace scheduling technology was later incorporated into the IMPACT I compiler. Reference [14] shows ....
W. W. Hwu and P. P. Chang, "Efficient instruction sequencing with inline target insertion," IEEE Transactions on Computers, accepted for publication.
....technical report showing the advantages of scheduling code prior to register allocation. Reference [9] shows the advantages of scheduling superblocks especially on superpipelined superscalar processors. Reference [10] shows the importance of function inlining in compiling C programs. Reference [11] shows how instruction placement may be improved after function inlining has been performed. Reference [12] is a later version of the report [11] published as an article. Reference [13] describes some of the early work of Po Hua Chang and Wen Mei Hwu applying trace selection to large C programs. ....
....especially on superpipelined superscalar processors. Reference [10] shows the importance of function inlining in compiling C programs. Reference [11] shows how instruction placement may be improved after function inlining has been performed. Reference [12] is a later version of the report [11] published as an article. Reference [13] describes some of the early work of Po Hua Chang and Wen Mei Hwu applying trace selection to large C programs. The trace scheduling technology was later incorporated into the IMPACT I compiler. Reference [14] shows how compiler technology may be used to ....
W. W. Hwu and P. P. Chang, "Efficient instruction sequencing with inline target insertion," Tech. Rep. CSG-123, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL, May 1990.
....such as loop unrolling, register renaming, and critical path reduction, have been successful in removing register dependences within applications [1]. Aggressive branch handling techniques, such as branch target insertion, are utilized to allow the execution of multiple branches per cycle [2]. The combination of these optimizations gives the code scheduler more freedom to reorder instructions. Unfortunately, the amount of static instruction reordering may be severely restricted due to dependences between memory instructions. Because memory references often occur on program critical ....
W. W. Hwu and P. P. Chang, "Efficient instruction sequencing with inline target insertion," IEEE Transactions on Computers, Dec. 1992.
....solutions such as a buffer into which future instructions can be prefetched. Much time has been invested in researching the various techniques of instruction prefetching [6]. Highly accurate branch prediction schemes, both static and dynamic, have been developed to make this process effective [7, 20, 12, 16]. A large latency in accessing the data cache presents a more difficult problem. Write buffers can eliminate the bottleneck in storing data to the memory system [6], but the loading of data cannot be effectively buffered in this way because the results are desired immediately. One method of hiding ....
W.-M. Hwu and P. P. Chang, "Efficient instruction sequencing with inline target insertion," IEEE Transactions on Computers, 1992. Accepted for Publication.
....may be inserted into the Hcode which is then reverse translated to C and compiled. The resulting executable will produce a profile database [3] which is merged back into the Hcode representation. In addition to program execution profiling, profile guided code layout and function inline expansion [4] may be performed at the Hcode level. After processing is completed at the Hcode representation level, the code is translated to the Lcode format. Lcode is a machine independent assembly-like representation similar to many load-store RISC instruction sets. The Lcode intermediate format will be ....
W. W. Hwu and P. P. Chang, "Efficient instruction sequencing with inline target insertion," IEEE Transactions on Computers, vol. 41, pp. 1537-51, December 1992.
No context found.
W. W. Hwu and P. P. Chang, "Efficient Instruction Sequencing with Inline Target Insertion", Technical Report CSG-103, Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, 1990.
....set is a superset of the MIPS R2000 instruction set with additional branching modes [15]. Table 2 shows the instruction latencies. Instructions are issued in order. Read after write hazards are handled by stalling the instruction unit pipeline. The microarchitecture uses a squashing branch scheme [14] and profile based branch prediction. For the base processor, one branch slot is allocated by the compiler for each predicted taken branch. The processor has 64 integer registers and 32 floating-point registers. The superscalar version of this processor fetches multiple instructions into an ....
....simultaneously is called the issue rate. The superscalar processor also contains multiple function units. In this study, unless otherwise specified, every instruction can be executed from every instruction slot. When the issue rate is greater than one, the number of branch slots increases [14]. The superpipelined version of this processor has deeper pipelining for each function unit. If the number of pipeline stages is increased by a factor P, the clock cycle is reduced by approximately the same factor. The latency in clock cycles is longer, but in real time it is the same as the base ....
W. W. Hwu and P. P. Chang, "Efficient Instruction Sequencing with Inline Target Insertion", Coordinated Science Laboratory Report, UILU-ENG-90-2215, CSG-123, May, 1990.
....are handled by stalling the instruction unit pipeline. [Table 3: Instruction latencies. Function, latency: integer ALU 1; barrel shifter 1; integer multiply 3; integer divide 25; load 2; store; FP ALU 3; FP conversion 3; FP multiply 4; FP divide 25.] The microarchitecture uses a squashing branch scheme [27] and profile based branch prediction. Branch prediction is used to lay out the superblocks such that the branches are likely not taken. If the branch is taken, the instruction(s) following the branch are squashed. If the branch is predicted taken, the base processor has one branch delay slot. The ....
....rate. The superscalar processor also contains multiple function units. In this study, unless otherwise specified, we assume uniform function units where every instruction can be executed from every instruction slot. When the issue rate is greater than one, the number of branch slots increases [27]. The superpipelined version of this processor has deeper pipelining for each function unit. If the number of pipeline stages is increased by a factor P, the clock cycle is reduced by approximately the same factor. The latency in clock cycles is longer, but in real time it is the same as the base ....
W. W. Hwu and P. P. Chang, "Efficient instruction sequencing with inline target insertion," Tech. Rep. CSG-123, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL, May 1990.
No context found.
W. W. Hwu and Pohua P. Chang, "Efficient Instruction Sequencing with Inline Target Insertion", Coordinated Science Laboratory Report, UILU-ENG-90-2215, CSG-123, May, 1990.
....there is at most one store or load per cycle, the performance of a four-issue machine approaches that of a two-issue machine. The IMPACT I C compiler is designed to support multiple branch operations per cycle. We have developed a variant of the squashing branch, called inline target insertion [Hwu 90, Chang 89.1], which allows concurrent execution of branch operations. Furthermore, inline target insertion allows branch operations to be fetched from branch slots and independent of the length of the control unit pipeline, only one program counter needs to be saved in order to return from an ....
W. W. Hwu and P. P. Chang, "Efficient Instruction Sequencing with Inline Target Insertion", Coordinated Science Laboratory Report, UILU-ENG-90-2215, CSG-123, May, 1990.