A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999.
....the values then allows all of the instructions in the reuse unit to be skipped. The reuse unit must be large enough so that the execution time saved by reuse can offset this reuse detection latency. Examples of reuse units include a single instruction [29, 31], a basic block [14, 15, 16], a trace [10, 26], and a function [22, 25]. Reuse at the single-instruction level keeps the inputs and results of previously executed instructions in a hardware buffer and tries to skip the execution of instructions that are subsequently re-executed with the same inputs. In a pipelined superscalar processor, the ....
....are subsequently re-executed with the same inputs. In a pipelined superscalar processor, the execution of an instruction could take as little as one cycle, which limits the effectiveness of instruction reuse for these low-latency instructions. Reuse at very large granularities, such as a trace [10], introduces the problem of insufficient opportunities to reuse results, since more instructions typically require more input values to repeat in order for these instructions to be skipped. Hence, a reuse granularity between an instruction and a trace could be very attractive. A natural choice is ....
[Article contains additional citation context not shown here]
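The contexts above describe single-instruction reuse as a buffer that keeps the inputs and results of previously executed instructions and skips re-execution when the same inputs recur. A minimal software sketch of that idea follows. The class name `ReuseBuffer`, its `execute` method, and the one-entry-per-address policy are illustrative assumptions, not the design of any cited paper.

```python
from typing import Callable, Dict, Tuple


class ReuseBuffer:
    """Hypothetical instruction-level reuse buffer, indexed by instruction address."""

    def __init__(self) -> None:
        # pc -> (operand values seen last time, result produced last time)
        self._entries: Dict[int, Tuple[Tuple[int, ...], int]] = {}
        self.hits = 0
        self.misses = 0

    def execute(self, pc: int, operands: Tuple[int, ...],
                op: Callable[..., int]) -> int:
        """Return the result of `op(*operands)`, skipping the call on a reuse hit."""
        entry = self._entries.get(pc)
        if entry is not None and entry[0] == operands:
            # Reuse test passed: same instruction, same input values.
            self.hits += 1
            return entry[1]
        # Reuse test failed: execute normally and remember inputs and result.
        self.misses += 1
        result = op(*operands)
        self._entries[pc] = (operands, result)
        return result
```

In this sketch, a second call such as `rb.execute(0x40, (2, 3), lambda a, b: a + b)` with unchanged operands returns the stored result without invoking `op`. The sketch does not model the detection latency that the quoted text identifies as the limiting cost.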
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the
....with specialization, a technique that requires run time code generation. 1 Introduction Recently, a number of studies have demonstrated that programs exhibit significant value locality, the phenomenon that a small number of values occur repeatedly in the same register or memory location [6, 8, 11, 17, 21]. Microarchitectural techniques exploiting value locality follow one of two paradigms: value prediction or computation reuse. While prediction based techniques improve performance by breaking data dependences [11, 12, 18] reuse based techniques improve performance by reducing computation latency ....
....constitutes one of the principal problems of reuse. A number of hardware techniques have been proposed to identify and exploit coarse-grained reuse: linking data-dependent instructions in a h/w table [21], detecting reuse at the granularity of basic blocks [8], and trace-level reuse [6]. More recently, a hybrid reuse technique using a combination of software and hardware was proposed by Connors and Hwu [3], wherein a compiler identifies reuse regions by consulting an off-line value profile. The hardware is then responsible for recording the execution instances of these regions ....
A. Gonzalez, J. Tubella, and C. Molina. Trace-Level Reuse. In Proceedings of the International Conference on Parallel Processing, September 1999.
....with specialization, a technique that requires run time code generation. 1 Introduction Recently, a number of studies have demonstrated that programs exhibit significant value locality, the phenomenon that a small number of values occur repeatedly in the same register or memory location [6, 8, 11, 17, 21]. Microarchitectural techniques exploiting value locality follow one of two paradigms: value prediction or computation reuse. While prediction based techniques improve performance by breaking data dependences [11, 12, 18] reuse based techniques improve performance by reducing computation latency ....
....one of the principal problems of implementing reuse. A number of hardware techniques have been proposed for identifying and exploiting coarse grained reuse: linking data dependent instructions in a hardware table [21] detecting reuse at the granularity of basic blocks [8] and trace level reuse [6]. More recently, a hybrid reuse technique using a combination of software and hardware was proposed by Connors and Hwu [3] wherein a compiler identifies reuse regions by consulting an off line value profile. The hardware is then responsible for recording the execution instances of these regions ....
A. Gonzalez, J. Tubella, and C. Molina. Trace-Level Reuse. In Proceedings of the International Conference on Parallel Processing, September 1999.
....is needed for measuring the profiling overhead. 5. 1 Example Application: Value Profiling Recently, a number of studies have demonstrated that programs exhibit significant value locality, the phenomenon that a small number of values occur repeatedly in the same register or memory location [23, 27, 31, 38, 41]. In the compiler domain, it has been known for some time that value locality can be used to speed up programs by exploiting the fixed invariant inputs. Partial evaluation [28] data specialization [30] DyC [24] Tempo [12] C [37] and code specialization using value profiles [36] are different ....
A. Gonzalez, J. Tubella, and C. Molina. Trace-Level Reuse. In Proceedings of the International Conference on Parallel Processing, September 1999.
....blocks with short locality histories. DTM manages value locality dynamically, by allocating Memo Table T entries as necessary to keep new trace instances (with different input contexts) and by releasing entries not frequently used. This results in a better utilization of hardware resources. In [17], Gonzalez et al. study the potential of value reuse at the trace level. A memory structure to store trace information, the Reuse Trace Memory (RTM) is described. However, some issues are not fully addressed, for example: how to incorporate the RTM into a real microarchitecture, how to keep the ....
....the actual reuse ranges from 11% to 48% for Sn+d and from 14% to 31% for block reuse. On the average (harmonic mean across the benchmark programs common to the three studies), DTM reuses 44% of the dynamic instructions, compared to 19% and 22% by Sn+d and block reuse, respectively. According to [17], the average (geometric mean) number of redundant instructions executed by the SPEC95 integer benchmarks accounts for 83% of the dynamic instructions. From Figure 9, recall that DTM and Sn+d performed equally for the go program and that Sn+d outperformed DTM for the vortex program. But, in ....
A. Gonzalez, J. Tubella, C. Molina, Trace-Level Reuse, Proc. of the International Conference on Parallel Processing, 1999, pp. 30--37.
....and is not cost effective for basic blocks with short locality histories. DTM manages value locality dynamically, by allocating Memo Table T entries as necessary to keep new trace instances and by releasing entries in a LRU basis. This results in a better utilization of hardware resources. In [10], Gonzalez et al. propose a trace reuse mechanism which employs a Reuse Trace Memory (RTM) to keep trace information. However, the issues of how the RTM is incorporated into a real microarchitecture are not considered. The work provides the speedup upper bounds achieved with an infinite RTM, but ....
....the actual reuse ranges from 11% to 48% for Sn+d and from 14% to 31% for Block Reuse. On the average (harmonic mean across the benchmark programs common to the three studies), DTM reuses 44% of the dynamic instructions, compared to 19% and 22% by Sn+d and Block Reuse, respectively. According to [10], the average (geometric mean) number of redundant instructions executed by the SPEC95 integer benchmarks accounts for 83% of the dynamic instructions. From Figure 8, recall that DTM and Sn+d performed equally for the go program and that Sn+d outperformed DTM for the vortex program. But, in ....
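The contexts above mention a table that keeps trace instances with different input contexts and releases entries on an LRU basis. The following is a rough software sketch of such a table, not the DTM Memo Table or the RTM as published. The name `TraceMemoTable`, the representation of machine state as a dictionary of location names to values, and the single global LRU order are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple

# A live-in context is stored as sorted (location, value) pairs so it can be a dict key.
Context = Tuple[Tuple[str, int], ...]
Outcome = Tuple[Dict[str, int], int]  # (live-out values, next pc)


class TraceMemoTable:
    """Hypothetical trace reuse table with LRU replacement."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # (start pc, live-in context) -> outcome; least recently used entry first.
        self._entries: "OrderedDict[Tuple[int, Context], Outcome]" = OrderedDict()

    def record(self, start_pc: int, live_in: Dict[str, int],
               live_out: Dict[str, int], next_pc: int) -> None:
        """Store one executed trace instance, evicting the LRU entry if full."""
        key = (start_pc, tuple(sorted(live_in.items())))
        self._entries[key] = (dict(live_out), next_pc)
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def lookup(self, start_pc: int, state: Dict[str, int]) -> Optional[Outcome]:
        """Reuse test: find an instance whose live-in values all match `state`."""
        match = None
        for key in self._entries:
            pc, live_in = key
            if pc == start_pc and all(state.get(loc) == val for loc, val in live_in):
                match = key
                break
        if match is None:
            return None
        self._entries.move_to_end(match)  # mark as most recently used
        return self._entries[match]
```

On a hit, a caller would write the returned live-out values into its state and continue at the returned next pc, which is how a whole trace is skipped. The linear scan in `lookup` is a simplification; it stands in for whatever indexing a hardware table would use.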
A. Gonzalez, J. Tubella, C. Molina, Trace-Level Reuse, Proc. of the International Conference on Parallel Processing, 1999, pp. 30--37.
....This approach complicates the mechanism for reducing the A stream, however. For the A stream to make correct forward progress, the effects of removed, value predictable computation must be emulated by updating the state of the A stream with values directly, similar to block trace computation reuse [9,8,6] but without the reuse test. This is why we focused initially on the special cases of ineffectual and branch predictable computation: this computation can be literally removed (i.e. replaced with nothing) and only the program counter needs to be updated to skip instructions. 3. ....
....as predictions. SRT improves on AR SMT in a variety of ways, including a formal and systematic treatment of SMT applied to fault tolerance (e.g. spheres of replication) Researchers have demonstrated a significant amount of redundancy, repetition, and predictability in general purpose programs [6,8,9,12,13,14,16,24,25,29]. This prior research forms a basis for creating the shorter program in slipstream processors. A technical report [21] showed 1) it is possible to ideally construct significantly reduced programs that produce correct final output, and 2) AR SMT is a convenient execution model to exploit this ....
A. González, J. Tubella, and C. Molina. Trace-Level Reuse. Int'l Conf. on Parallel Processing, Sep. 1999.
....Others explored the related area of computation reuse, introducing methods to nonspeculatively re-use previously computed results. This area was explored both at the granularity of individual instructions [16][17] as well as at coarser granularities of groups of instructions [4][3]. A third path of research attacked the observed value locality through profiling in order to allow the compiler to make value-locality-based optimizations [1]. Several recent works have also explored the important execution characteristics of Java programs, at the bytecode level as well as at ....
....the general notion of value predictability and the finite context method for value prediction in [15]. Computation reuse was explored at the individual instruction level by Sodani and Sohi in [16] and [17]. Others extended computation reuse to the level of blocks [4] and traces of instructions [3]. Calder et al. formalized value profiling and the Invariance M metric in [1], analyzing procedure parameters and load instruction result values in the SPEC benchmarks. Kalamatianos and Chaiken used this approach to characterize value locality in parameter values in Windows NT applications [8] ....
A. González, J. Tubella, and C. Molina, "Trace-Level Reuse," Technical Report UPC-DAC-199847, Universitat Politecnica de Catalunya.
....Since the reuse detection process cannot start before all inputs to a reuse unit are ready, the reuse unit must be large enough so that the execution time saved by reuse can offset the reuse detection latency. Examples of reuse units include a single instruction [19], a basic block [8], a trace [5], and a function [16]. A basic block can be viewed as a superinstruction that has some set of upward-exposed inputs and produces some set of live output values [8]. By definition, the number of instructions, as well as the number of inputs and outputs of a basic block, can be unlimited, since the ....
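The "large enough" condition in the context above can be made concrete with a simple cost model. The model and its symbols are assumptions introduced here, not taken from the quoted papers: every reuse attempt pays a detection latency $t_d$, normal execution of the unit takes $t_e$, and a fraction $p$ of attempts pass the reuse test.

```latex
% Expected time with reuse: detection is always paid; execution only on a failed test.
\[
  T_{\text{reuse}} = t_d + (1 - p)\, t_e ,
  \qquad
  T_{\text{base}} = t_e .
\]
% Reuse is profitable when T_reuse < T_base, which simplifies to:
\[
  t_d + (1 - p)\, t_e < t_e
  \quad\Longleftrightarrow\quad
  t_d < p \, t_e .
\]
```

Under this model, a single-cycle instruction ($t_e = 1$) with a one-cycle reuse test ($t_d = 1$) can never benefit, since $p \le 1$. Larger units raise $t_e$ but, as the contexts note, need more input values to repeat, which tends to lower $p$.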
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999 International Conference on Parallel Processing (ICPP'99), Japan, September, 1999.
....as predictions. SRT improves on AR SMT in a variety of ways, including a formal and systematic treatment of SMT applied to fault tolerance (e.g. spheres of replication) Researchers have demonstrated a significant amount of redundancy, repetition, and predictability in general purpose programs [6,9,10,17,18,19,30,32]. This prior research forms a basis for creating the shorter program in slipstream processors. A technical report [25] showed 1) it is possible to ideally construct significantly reduced programs that produce correct final output, and 2) AR SMT is a convenient execution model to exploit this ....
A. González, J. Tubella, and C. Molina. Trace-Level Reuse. Int'l Conf. on Parallel Processing, Sep. 1999.
....in the processor pipeline it can only start when all inputs to a reuse unit are ready. Thus, the reuse unit must be large enough so that the execution time saved by reuse can offset the reuse detection latency. Examples of reuse units include a single instruction, a basic block [15], a trace [11, 28], and a function [27]. 1.4.1 Instruction Level Reuse. Instruction reuse skips the execution of a single instruction for each reuse detection process, i.e., when the operands to an instruction are ready and these operand values are the same as the previously saved ones, the execution of the ....
....is indexed by instruction address. In addition, the SRC does not support the propagation of instruction skipping. 2.5 Value Reuse Techniques Several techniques have been proposed to dynamically reuse values produced by instructions. These include dynamic instruction reuse [33] trace level reuse [11], value cache [14] result cache [27] and primitive function reuse [27] Dynamic instruction reuse [33] stores the input operands and the output result of each instruction to eliminate the need for re executing an instruction when its operands are the same as the last time the instruction was ....
[Article contains additional citation context not shown here]
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999 International Conference on Parallel Processing (ICPP'99), Japan, September, 1999.
....A stream is left with a less predictable subset of the program but it can provide timely and accurate branch value predictions to the R stream by virtue of running ahead. Researchers have demonstrated a tremendous amount of redundancy, repetition, and predictability in general purpose programs [5,7,8,13,24,26]. This prior research forms a basis for creating the shorter program in the cooperating threads architecture. Speculative multithreading architectures [1,6,18,27,28,29] speed up a single program by dividing it into speculatively parallel threads. The speculation model uses one architectural ....
A. González, J. Tubella, and C. Molina. Trace-Level Reuse. Intl. Conf. on Parallel Processing, September 1999.
....in the processor pipeline it can only start when all inputs to a reuse unit are ready. Thus, the reuse unit must be large enough so that the execution time saved by reuse can offset the reuse detection latency. Examples of reuse units include a single instruction, a basic block [15], a trace [11, 28], and a function [27]. 1.4.1 Instruction Level Reuse. Instruction reuse skips the execution of a single instruction for each reuse detection process, i.e., when the operands to an instruction are ready and these operand values are the same as the previously saved ones, the execution of the ....
....is indexed by instruction address. In addition, the SRC does not support the propagation of instruction skipping. 2.5 Value Reuse Techniques Several techniques have been proposed to dynamically reuse values produced by instructions. These include dynamic instruction reuse [33] trace level reuse [11], value cache [14] result cache [27] and primitive function reuse [27] Dynamic instruction reuse [33] stores the input operands and the output result of each instruction to eliminate the need for re executing an instruction when its operands are the same as the last time the instruction was ....
[Article contains additional citation context not shown here]
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999 International Conference on Parallel Processing (ICPP'99), Japan, September, 1999.
....processor pipeline it can start only when all of the inputs to a reuse unit are ready. Thus, the reuse unit must be large enough so that the execution time saved by reuse can offset the reuse detection latency. Examples of reuse units include a single instruction [21], a basic block [9], a trace [5, 18], and a function [17]. A basic block can be viewed as a superinstruction that has some set of upward-exposed inputs and produces some set of live output values [7, 8, 9]. The number of instructions, as well as the number of inputs and outputs of a basic block, can be unlimited since the end of a ....
....of the instruction. Different entries in the value buffer are also linked based on data dependence information to provide reuse chaining. 3 2. 2 Value Reuse Techniques In addition to predicting values, several techniques have been proposed to dynamically reuse values produced by instructions [5, 9, 17, 21]. Dynamic instruction reuse [21] stores the input operands and the output result of each instruction to eliminate the need for re executing an instruction when its operands are the same as the last time the instruction was executed. This approach was introduced to make use of the squashed ....
[Article contains additional citation context not shown here]
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999 International Conference on Parallel Processing (ICPP'99), Japan, September, 1999.
....implementation [7] They proposed a highly interleaved prediction table with a fast distribution network to support high instruction issue rate. To skip the execution of sequences of instructions by reusing the results of their prior executions, basic block reuse [10] and trace level reuse [8] schemes are proposed. But their focus is more on skipping instruction execution than on predicting the values. Reinman et al. propose a Fetch Target Buffer (FTB) for fast instruction delivery [16] They decouple the FTB from the instruction fetch and decode stages for a more scalable design. 3. ....
A. Gonzalez, J. Tubella, and C. Molina, "Trace-Level Reuse", Proceedings of the International Conference on Parallel Processing, Sept. 1999
No context found.
A. González, J. Tubella and C. Molina, "Trace level Reuse". In Proceedings of the International Conference on Parallel Processing, 1999.
No context found.
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999.
No context found.
A. Gonzalez, Jordi Tubella, and Carlos Molina. Trace-level reuse. ICPP, 1999.
No context found.
A. Gonzalez, J. Tubella, and C. Molina. Trace-Level Reuse. In Proceedings of the International Conference on Parallel Processing, September 1999.
No context found.
A. Gonzalez, J. Tubella, and C. Molina. Trace-level reuse. In 28th ICPP, p. 30--37, 1999, IEEE CS.
No context found.
A. Gonzalez, Jordi Tubella and Carlos Molina. "Trace-Level Reuse". Proceedings of the 1999.