Y. Chou and J. P. Shen. Instruction path coprocessors. In The 27th Annual International Symposium on Computer Architecture, pages 270--281, June 2000.
....routing) on small trace fragments. The small trace lengths rendered compiler optimizations such as the ones considered in this paper largely ineffective for boosting performance. Subsequent work generalized the construction and optimization hardware with a separate co-processor to deal with traces [2]. The rePLay framework pushes in the direction of aggressive trace optimization by providing support for speculative optimizations on long, atomic instruction traces. Along with rePLay, there have been other recent proposals for hardware-assisted dynamic optimization, all of which offer different ....
Y. Chou and J. P. Shen. Instruction path coprocessors. In Computer Architecture, 2000.
....information. Also, the VMM can be given the capability of reading and writing implementation state. This information can be used for collecting additional dynamic program information or for putting software into the dynamic optimization loop (for example with a software-directed trace optimization [9]). This feature can also be used for saving and restoring parts of the implementation state, as we consider in this paper. 2.1 Microarchitecture The microarchitecture of a VMM implementation is in Fig. 2. In the figure, co-designed VM hardware has been integrated into a generic pipeline. The ....
Yuan Chou and John Paul Shen, "Instruction Path Coprocessors," Proc. of the 27th Intl. Sym. on Computer architecture, pp. 270-281, 2000.
....however, requires the addition of implementation-dependent code to the OS. One could also consider microcode in place of VMM software. The microcode can reside in ROM, but there must still be some hidden memory for maintaining data structures such as the phase table. A special-purpose co-processor [20] is another good candidate for managing the hardware configuration. It has the advantage of saving optimization time overhead at the expense of additional hardware. In the most straightforward implementation, working set signatures are collected by hardware, and then the raw signature data is ....
Y. Chou and J. Shen, "Instruction Path Coprocessors," Proc. of the 27th Intl. Sym. on Computer architecture, 2000, pp. 270-281.
....will involve a straightforward mapping of instructions. Consequently, the emphasis during translation is on identifying instruction inter-dependences and on making register assignments that reduce intra-processor communication. Binary translation can be performed either by a special co-processor [6] or by the main processor itself. 2 Instruction Set and Microarchitecture We begin with a brief description of the ILDP instruction set we propose. Following that is a description of the proposed microarchitecture. Then we discuss the specific features of both. 2.1 Instruction set overview ....
Yuan Chou and J. Shen, "Instruction Path Coprocessors," 27th Int. Symp. on Computer Architecture, pp. 270-281, Jun 2000.
....Examples of helper engines include the pre-load engine of Roth and Sohi [9], where pointer chasing can be performed by a special processing unit. Another is the branch engine of Reinman et al. [10]. An even more advanced helper engine is the instruction co-processor described by Chou and Shen [11]. Helper engines have also been proposed for garbage collection [12] and dynamic correctness checking [13]. [Figure labels: pre-branch, simple pipeline, preconstruction engine, pre-optimization engine, pre-load engine, performance collecting] Fig. 4. A heterogeneous ILDP chip architecture. 4. Managing ILDP: ....
Yuan Chou and J. P. Shen, "Instruction Path Coprocessors," 27th Int. Symposium on Computer Architecture, pp. 270-281, June 2000.
....current rePLay microarchitecture does not save optimized frames into persistent storage, it is possible to write useful frames into an unused section of the application's code segment, in a similar fashion to that proposed in [16]. Previous approaches to hardware-based dynamic optimization [8, 12, 4] have focused on simple microarchitectural optimizations. Such optimizations tuned a small trace of instructions (typically around 16 instructions) to the particulars of the execution microarchitecture. Example optimizations included instruction fusion (e.g., combining shifts with adds), cluster ....
....stream. In rePLay, frames are necessarily atomic to facilitate dynamic optimization. This fact allows far more aggressive optimization over an architecture that does not guarantee atomicity. Also, rePLay represents an effective implementation of a general Instruction Path Co-Processor, or ICOP [8, 12, 4]. ICOP frameworks provide programmable hardware support for trace formation and dynamic optimization. A few preliminary investigations into hardware support for dynamic optimization have been made [6, 17, 8, 4]. The notion of a frame is similar to other types of optimization regions, such as ....
[Article contains additional citation context not shown here]
Y. Chou and J. P. Shen. Instruction path coprocessors. In Computer Architecture, 2000.
....information. Also, the VMM can be given the capability of reading and writing implementation state. This information can be used for collecting additional dynamic program information or for putting software into the dynamic optimization loop (for example with a software-directed trace optimization [9]). This feature can also be used for saving and restoring parts of the implementation state, as we consider in this paper. 3 Saving and Restoring Implementation Context As stated in the introduction, many of the techniques for dynamic performance enhancement use a large amount of implementation ....
Yuan Chou and John Paul Shen, "Instruction Path Coprocessors," Proc. of the 27th Intl. Sym. on Computer architecture, pp. 270-281, 2000.
....of chaining slices in which a slice can, in essence, re-spawn itself. A similar chaining mechanism was proposed by Zilles and Sohi in [16] based on hand-optimized slices. Finally, an Instruction Path Coprocessor could potentially be used to support dynamic extraction and execution of slices [3]. This study also appears in our recent technical report [18]. To the best of our knowledge, no other work on the locality characteristics of the slice stream of mispredicted branches or loads that miss exists. Moreover, in their majority, most aforementioned slice-based execution models rely on ....
Y. Chou and J. Shen. Instruction path coprocessors. In Proc. 27th Intl. Symposium on Computer Architecture, pages 270--281, June 2000.
....coprocessor was introduced as a way to provide hardware support to speed up interpretive applications by executing the interpreter mapping concurrently with the CPU using a coprocessor. A programmable instruction path coprocessor is used to perform dynamic transformations to the object code in [18]. The proposed architecture may alternatively be viewed as a translate coprocessor. Along the style of decoupled access-execute architectures [69], it can constitute a decoupled translate-execute architecture [61], with the translate processor performing the conversion from bytecode to native ....
Yuan Chou and John Paul Shen. Instruction Path Coprocessors. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 270--281, June 2000.
....relocation of PEIs and branches while completely preserving precise exceptions. This new speculation model is designed so that it could be utilized by hardware-only run-time optimizers, as well as by software-only systems like Dynamo [2] and software-driven hardware coprocessor optimizers [5], where complete data flow information is usually either too costly or infeasible to derive. Precise speculation adheres to both the ordering and the liveness requirements for maintaining precise exceptions. Under the Precise Speculation model, PEIs and branches are for all practical purposes ....
Y. Chou and J. P. Shen. Instruction path coprocessors. In Proceedings 27th Annual International Symposium on Computer Architecture, pages 270--281, June 2000.
....### ######### was introduced as a way to provide hardware support to speed up interpretive applications by executing the interpreter mapping concurrently with the CPU using a coprocessor. A programmable instruction path coprocessor is used to perform dynamic transformations to the object code in [14]. The proposed architecture may alternatively be viewed as a translate coprocessor. Along the style of decoupled access-execute architectures [15], it can constitute a decoupled translate-execute architecture, with the translate processor performing the conversion from bytecode to native sequence. ....
Y. Chou and J. P. Shen, "Instruction Path Coprocessors," in Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 270-281, June 2000.
....fields. Other research has also explored the use of back-end instruction processing on behalf of the main thread. The trace cache fill unit [14] groups committed instructions together into program traces, possibly applying transformations on those instructions [7]. Instruction Path Coprocessors [5] have been proposed as a software-controlled back-end processor. Previously proposed hardware prefetchers [4, 10, 9], including those targeted at irregular memory access patterns [9, 15], all rely on pattern-based and history-based predictability to enable accurate prefetching. Sundaramoorthy et ....
Y. Chou and J. Shen. Instruction path coprocessors. In 27th Annual International Symposium on Computer Architecture, pages 270--281, June 2000.
....Recently, there is a proposal to perform run-time program re-layout in hardware [25]. We believe that in the quest for ever higher performance, increasingly sophisticated hardware code modification techniques will be needed in the future. An Instruction Path Coprocessor (I-COP), proposed in [3], is a programmable on-chip coprocessor that allows these hardware code modifications to be implemented in software much like microcode. An I-COP is analogous to a datapath coprocessor, except that it operates on the core processor's instructions themselves. The programmable nature of an I-COP ....
....I-COP implementations. Fourth, it makes it possible to modify and upgrade the machine simply by changing I-COP code without changing the hardware. We believe an I-COP can potentially be a valuable addition to the microarchitect's toolbox. In evaluating the feasibility of the I-COP concept, [3] showed that an I-COP programmed to implement trace construction and trace optimizations achieves good performance. The longer latency (as compared to hardwired logic) that the programmable I-COP takes to perform the code modifications had little impact on performance because the I-COP is located ....
[Article contains additional citation context not shown here]
Y. Chou and J. Shen, "Instruction Path Coprocessors," in Proc. of 27th International Symposium on Computer Architecture, June 2000.
....of the instruction. For example, instruction decode, register read, and register renaming are all necessary to instruction execution, but do not contribute to the final data result. Second, Turboscalar facilitates the dynamic transformation of the object code via an Instruction Path Coprocessor [3]. Finally, through dynamic code transformation, the Turboscalar microarchitecture can implement optimized execution cores that still efficiently execute legacy code. Ideally the execution core of a new microarchitecture is optimized for the new code base of a target machine, however this ....
....once to train the hot pipeline, for repeated subsequent executions. Once a group of instructions is gathered and all optimizations are complete it is inserted into the dynamic instruction cache for execution in the hot pipeline. The recently introduced Instruction Path Coprocessor (I-COP) [3] can be an effective and efficient way to implement the optimizing back end of the Turboscalar microarchitecture. 4 A Turboscalar Implementation The fundamental contribution of Turboscalar is the interaction of the hot-cold pipelines, the dynamic instruction cache and the optimizing back end. ....
Y. Chou and J. Shen, "Instruction Path Coprocessors," to appear in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
....I-COP implementation of this optimization, we simply insert additional lines of I-COP code to the basic fill unit code. In Section 4.3, we show that we can achieve additional performance over the basic fill unit code while using the same I-COP. The additional I-COP code comprises 423 instructions [31]. 3.3 Data Prefetch Trace Optimizations While the trace cache delivers good instruction fetch bandwidth, the performance of the core processor may still be limited by long-latency operations, especially loads that miss the first-level (L1) data cache. To alleviate this problem, data prefetching ....
....recorded in the RPT. The prefetch instruction is only inserted in the trace if there is an empty slot in that particular trace cache line. This ensures that the prefetch instructions do not consume extra fetch bandwidth. The I-COP code for implementing stride prefetching contains 130 instructions [31]. 3.3.2 Linked Data Structure Prefetching Linked data structures (LDS), also called recursive data structures, include linked lists, trees, and graphs, where individual nodes are dynamically allocated from the heap and linked together through pointers to form the overall structure. Several ....
[Article contains additional citation context not shown here]
Y. Chou and J. Shen, "Instruction Path Coprocessors", CMuART Tech. Report, Carnegie Mellon Univ., March 2000.
No context found.
Y. Chou and J. P. Shen. Instruction path coprocessors. In The 27th Annual International Symposium on Computer Architecture, pages 270--281, June 2000.
No context found.
Yuan Chou, John P. Shen, "Instruction Path Coprocessors," Computer Architecture, pp. 270-281, Jun. 2000.
No context found.
Y. Chou and J. Shen. Instruction path coprocessors. In 27th Annual International Symposium on Computer Architecture, 2000.
No context found.
Y. Chou and J. Shen, "Instruction Path Coprocessors," Proc. of the 27th Intl. Sym. on Computer architecture, 2000, pp. 270-281.
No context found.
Yuan Chou and John P. Shen, "Instruction Path Coprocessors," Proc. of the 27th Int. Symp. on Computer Architecture, Jun 2000.