PATEL, S. J.; FRIENDLY, D. H.; PATT, Y. N. Critical Issues Regarding the Trace Cache Fetch Mechanism. Technical Report CSE-TR-335-97, University of Michigan, 1997.
....A collapsing buffer [7] is an example of this approach. Studies have shown that this approach is also unable to deliver high fetch throughput [20]. A third way is to observe the dynamic execution order as the program executes and cache instructions in their dynamic execution order. A trace cache [14, 16, 20] is an example of this approach. Accessing a single entry in a trace cache returns multiple instructions that were not necessarily contiguous in the static program, thereby allowing a sequential fetch unit to achieve a high fetch throughput. This approach uses additional storage resources that ....
S. J. Patel, D. H. Friendly, and Y. N. Patt. Critical Issues Regarding the Trace Cache Fetch Mechanism. Technical Report CSE-TR-335-97, Department of Electrical Engineering and Computer Science, University of Michigan, May 1997.
.... can be divided into two categories: (a) augmenting the branch predictor to predict multiple branches per cycle [21] and the instruction cache to supply multiple discontinuous lines per cycle [4] and (b) storing instructions in dynamic execution order in the cache (i.e. using a trace cache) [11,12,16]. The first solution makes the branch predictor and the cache more complex, potentially increasing the cycle time; the second solution leads to inefficient use of cache space, potentially increasing cache miss rates. Both classes of solutions work by fetching a large number of contiguous ....
....he granularity of individual instructions we divide the instruction stream into coarser units called traces. A trace is a dynamic sequence of instructions in program order, potentially spanning control instructions. Control flow can be predicted on the granularity of traces using a trace predictor [11,16]. It has been demonstrated that trace predictors can achieve equivalent or higher prediction accuracies than conventional branch predictors [7] Once future control flow can be predicted at trace granularity, multiple traces can be fetched concurrently by using multiple instruction sequencers. ....
[Article contains additional citation context not shown here]
S. J. Patel, D. H. Friendly, and Y. N. Patt. Critical Issues Regarding the Trace Cache Fetch Mechanism. Technical Report CSE-TR-335-97, Department of Electrical Engineering and Computer Science, University of Michigan, May 1997.
....goal of providing a stronger basis for dynamic customization of the application's instruction stream to its run time behavior. The difference between a trace cache mechanism and the frame mechanism is largely in the way the two are delineated. Traces are formed once a certain number of branches [3, 12] (based on branch predictor bandwidth) or instructions (typically tied to the processor's fetch width, and therefore usually cache line size) is reached. Frame construction, on the other hand, terminates frames at unbiased (nonpro ....
S. J. Patel, D. H. Friendly, and Y. N. Patt, "Critical issues regarding the trace cache fetch mechanism," University of Michigan, Ann Arbor, Michigan, Technical Report CSE-TR-335-97, May 1997.
....also consider the integration of preconstruction with another trace specific mechanism (preprocessing) to produce a high performance frontend. When combined, preconstruction and trace preprocessing produce an average speedup of 14 for the SPECint95 benchmarks. 1. Introduction Trace caches [10][9] have been proposed as a mechanism to enable low latency, high bandwidth instruction fetching. Trace caches store programs in a representation that is a hybrid of the static program representation and the dynamic instruction stream. Traces are snapshots of short segments of the dynamic instruction ....
....instructions are provided from the trace cache, yielding a contiguous block of dynamic instructions that may correspond to noncontiguous blocks of code from the static representation. Previous work has shown the potential benefit of adding trace caches to traditional processor cores [10][9], and of developing processors specifically around the trace cache [11][4][8]. The latter approach provides reduced complexity and localized communication, as well as the ability to optimize programs dynamically. The dynamic behavior of traces, which enables the trace cache to provide high ....
[Article contains additional citation context not shown here]
S. Patel, D. Friendly and Y. Patt, "Critical Issues Regarding the Trace Cache Fetch Mechanism." University of Michigan Technical Report CSE-TR-335-97, 1997.
....basic blocks within a single trace cache line. A start trace address plus multiple branch predictions are used to access the trace cache. If the trace cache holds the trace of instructions, all instructions are delivered aligned to the processor core in a single access. Patel et al. [25], extended the organization of the trace cache to include associativity, partial matching of trace cache lines, and path associativity. 3 The Decoupled Front end To provide a decoupled front end, a Fetch Target Queue (FTQ) is used to bridge the gap between the branch predictor and the ....
S. Patel, D. Friendly, and Y. Patt. Critical issues regarding the trace cache fetch mechanism. CSE-TR-335-97, University of Michigan, May 1997.
....Figure 5.8: Percent Increase in IPC Using a 32 entry VSQ. Solution for Wide Issue Processors 6.1 Load Tokens The instruction trace cache has been shown to increase the instruction bandwidth, allowing issue rates previously not attainable [43, 45, 49]. As instruction fetch techniques improve, aggressive issue and execution techniques can be employed to exploit the available instruction stream. Data speculation has been presented as one method to increase processor throughput. Value prediction is a way to relax some data flow restrictions [20, ....
....all instructions following a misspeculated load depend on that value. The tradeoffs and implementation of these two methods are further discussed in [32, 57]. 6.2 Analyzing the Load Tokens 6.2.1 Wide Issue Microprocessor Model The trace cache implementation is based largely on the discussions in [43]. A fill unit collects instructions at issue time as in [43, 49]. Upon completing a trace cache line (also called a trace cache entry) the fill unit can then write the formatted trace cache line into the trace cache. Several additions to this basic trace cache architecture are required: the virtual ....
[Article contains additional citation context not shown here]
S. J. Patel, D. H. Friendly, and Y. N. Patt. Critical issues regarding the trace cache fetch mechanism. Technical report, University of Michigan, 1997.
....instructions and translate them into another internal format. For example, the Intel P6 [2] decoders translate the x86 instructions into an internal format called uops. The uops are then executed by the fast execution core. A third example is a recently proposed idea called the trace cache [3][4][5] The trace cache buffers a dynamic sequence of instructions that can span multiple basic blocks, and stores them as a trace. The fetch unit can fetch from the trace cache the entire trace in one cycle without requiring a multiported I cache nor instruction alignment and collapsing logic. ....
....code on a programmable engine. However, in this paper, we show that this is not a problem. 1.3 Potential I COP Applications To highlight the potential of the I COP concept, we now briefly describe some possible I COP applications. 1.3. 1 Trace Construction and Optimization The trace cache [3][4][5] stores frequently executed sequences of instructions into physically contiguous storage locations, thus allowing high bandwidth instruction fetch without significantly increasing fetch stage complexity and latency. This dynamic regrouping of instructions is performed by a hardware structure ....
[Article contains additional citation context not shown here]
S. Patel, D. Friendly and Y. Patt, "Critical Issues Regarding the Trace Cache Fetch Mechanism", Technical Report CSE-TR-335-97, University of Michigan, May 1997.
....mapped to the PipeRench computation model described in Section 2.2. They are then written in the DIL language and compiled by the DIL compiler to produce the configuration bits used to configure the physical stripes of the PipeRench I COP at run time. 3.2. 1 Trace Construction The trace cache [2][7][8] stores frequently executed sequences of instructions in physically contiguous storage locations, thus allowing high bandwidth instruction fetch without multiple cache ports nor instruction alignment logic. This dynamic regrouping of instructions is performed by a hardware structure called the ....
....of instructions is performed by a hardware structure called the fill unit which is located at the back end of the machine. A trace comprises not only of regrouped instructions but also the outcomes of the branches in the trace, the exit addresses of the trace (to facilitate partial matching [7]) and the type of the last instruction in the trace. In our I COP implementation, logic associated with the fill buffer examines its first 16 entries and determines the end of a new trace. It then copies those instructions from the fill buffer to the I COP memory and inserts a task into the task ....
[Article contains additional citation context not shown here]
S. Patel, D. Friendly and Y. Patt, "Critical Issues Regarding the Trace Cache Fetch Mechanism," Technical Report CSE-TR-335-97, University of Michigan, May 1997.
....the real goal in these strategies is to improve instruction fetch bandwidth and preferably take branch prediction off the critical path. Recent research has focused on trace caches as a mechanism to capture a long stream of sequential instructions that can be easily fetched at peak bandwidth [18, 15]. Branch prediction guides the trace selection in the instruction fetch engine, at times predicting multiple branches per cycle. A more radical approach is the Fetch Target Buffer (FTB) proposed by Reinman, et al. [16]. The FTB stores the addresses of predicted blocks of instructions and is ....
S. J. Patel, D. H. Friendly, and Y. N. Patt. Critical issues regarding the trace cache fetch mechanism. Technical Report CSE-TR-335-97, Department of Electrical Engineering and Computer Science, The University of Michigan, May 1997.
....hes may be encountered in a cycle, the branch prediction architecture must be able to recognize and predict multiple branches per cycle [8][13][16]. The trace cache has been proposed as a mechanism for providing increased bandwidth by allowing the processor to fetch across multiple branches in a single cycle [18]. The trace cache works in conjunction with a multiple branch predictor, and trace cache lines are constructed by the fill logic. To simplify fetch unit design for implementation, practical superscalar processors employ one fetch unit and only fetch one instruction cache line per cycle. So, we propose a Range Associative ....
....by using different extents of [illegible]. They also attempt to increase the tree depth by [illegible] three [illegible] branches to increase the fetch size. The trace cache has been proposed as a mechanism for providing increased bandwidth by allowing the processor to fetch across multiple branches in a single cycle [18]. Since the heart of the trace cache is its ability to fetch multiple basic blocks each cycle, effective multiple block branch prediction is critical to its performance. Trace cache lines are constructed by the fill unit. The fill unit will attempt to maximize the size of the segment by combining newly ....
S.J. Patel, D.H. Friendly, and Y.N. Patt, "Critical issues regarding the trace cache fetch mechanism," Technical Report CSE-TR-335-97, Dept. of Electrical Engineering and Computer Science, University of Michigan, 1997.
....use steering based on the basic block relationships of instructions, or how the micro instructions are broken down from macro instructions from the ISA. Examine the information that can be collected at the trace cache fill unit and used to simplify later issues of the packet. Patel's work in [11] indicates that machine performance is not significantly impacted by large fill unit latencies and so the fill unit can be used to gather and collate rather complex information. Examine the implementation issues in more detail, i.e. attempt to quantify how much hardware a certain configurations ....
S. J. Patel, D. H. Friendly, and Y. N. Patt, "Critical Issues Regarding the Trace Cache Fetch Mechanism," Draft Technical Report, Advanced Computer Architecture Laboratory, University of Michigan. March 25, 1997.
....higher level of traces. Traces are predicted using a next trace predictor (Jacobson et al. 1997) which implicitly predicts multiple branches each cycle with only a single trace prediction. Traces themselves are stored in a trace cache (Johnson, 1994; Peleg & Weiser, 1995; Rotenberg et al. 1996; Patel et al. 1997) for low latency, high bandwidth instruction fetching. Traces are also efficiently dispatched and renamed as a unit (Vajapeyam & Mitra, 1997); intra trace values are pre renamed in the trace cache, so only inter trace values (live in and live out registers) need to be dynamically renamed. The ....
....manually inserted, FGCI like trace selection hints conveyed in the benchmark binaries; PEs are managed in a fifo queue so CGCI is not explicitly exploited. Other related work includes trace selection studies for trace caches and trace processors (Peleg & Weiser, 1995; Rotenberg et al. 1996, 1997; Patel et al. 1997, 1998), trace selection for compilers (Fisher, 1981; Hwu & Chang, 1988), and task selection for multiscalar processors (Vijaykumar, 1998; Vijaykumar & Sohi, 1998). 1.3 Paper Organization Section 2 describes the trace processor's novel window management, i.e. support for instruction ....
Patel, S., Friendly, D., & Patt, Y. (1997). Critical issues regarding the trace cache fetch mechanism. Tech. rep. CSE-TR-335-97, Department of Electrical Engineering and Computer Science, University of Michigan.
....units, no memory parallelism, and no branch speculation. It turns out that we can modify the time stamping rules to handle many more interesting variations. We implemented time stamping rules that model the delays induced by a limited fetch width from an infinite instruction or trace cache [14, 11] with a hybrid branch predictor [7] and misprediction penalties, by a limited window size with three different window refilling policies (wrap around, compressing, and flushing) by a limited number of specialized, pipelined functional units assigned to oldest requesting instructions, and by an ....
....processor that we simulated, called , approximates a processor that we believe could be built using our redesigned superscalar components and other recent advances. The processor wakes up, schedules, and issues instructions from a single 128 instructions reordering buffer. Using a trace cache[14, 11], the processor fetches an unaligned dynamic sequence of eight instructions at a time. Fetching of instructions only incurs delays when a mispredicted branch is encountered, rather than on every branch. The can issue up to twenty instructions at a time. The functional units, their numbers, and ....
[Article contains additional citation context not shown here]
Sanjay Jeram Patel, Daniel Holmes Friendly, and Yale N. Patt. Critical issues regarding the trace cache fetch mechanism. Technical Report CSE-TR-335-97, Computer Science and Engineering, University of Michigan, 7 May 1997. http://www.eecs.umich.edu/HPS/hps tracecache.html.
....predicts multiple branches in a single cycle. To greatly simplify the process of fetching multiple, possibly noncontiguous basic blocks in a single cycle, the instructions that form a trace are stored together as a single contiguous unit in a special instruction cache, called the trace cache [41,75,80,71]. A conventional instruction cache distributes instructions from the same trace among multiple, noncontiguous cache lines, and requires several cycles to assemble the trace. The distinction between a conventional instruction cache and the trace cache is shown in Figure 1 2. Figure 1 2: ....
....Rotenberg, Bennett, and Smith [79,80] motivate the concept with comparisons to other high bandwidth fetch mechanisms (branch address cache and collapsing buffer, Section 2.1.1) both in terms of complexity and performance, and define some of the trace cache design space. Patel, Friendly, and Patt [71] expand upon and present detailed evaluations of this design space, arguing for a more prominent role of the trace cache. Two trace cache papers appear in a special issue on cache memories of the Transactions on Computers [73,84] 25 2.2 Processor paradigms 2.2.1 Multiscalar paradigm 2.2.1.1 ....
[Article contains additional citation context not shown here]
S. Patel, D. Friendly, and Y. Patt. Critical Issues Regarding the Trace Cache Fetch Mechanism. Technical Report CSE-TR-335-97, Department of Electrical Engineering and Computer Science, University of Michigan - Ann Arbor, 1997.
....Therefore, I have chosen to time share the instruction fetch engine between the A stream and R stream. This is where AR SMT diverges from recent SMT proposals [5] This decision is also justified with the advent of low latency, high bandwidth instruction fetch mechanisms such as trace caches [8,9,10,11]. Although instruction fetch is multiplexed between the two streams, when a given stream does access the fetch unit, it receives a large group of instructions. Instruction fetch and dispatch are really part of the same pipeline, and so dispatch is treated similarly. That is, the entire frontend ....
S. Patel, D. Friendly, and Y. Patt. Critical issues regarding the trace cache fetch mechanism. Technical Report CSE-TR-335-97, University of Michigan, EECS Department, 1997.
....instructions and translate them into another internal format. For example, the Intel P6 [2] decoders translate the x86 instructions into an internal format called uops. The uops are then executed by the fast execution core. A third example is a recently proposed idea called the trace cache [3][4][5] The trace cache buffers a dynamic sequence of instructions that can span multiple basic blocks, and stores them as a trace. The fetch unit can fetch from the trace cache the entire trace in one cycle without requiring a multiported Icache nor instruction alignment and collapsing logic. ....
....on a programmable engine. However, in this paper, we show that this is not a serious problem. 1.3 Potential I COP applications To highlight the potential of the I COP concept, we now briefly describe some possible I COP applications. 1.3. 1 Trace Construction and Optimization The trace cache [3][4][5] stores frequently executed sequences of instructions into physically contiguous storage locations, thus allowing high bandwidth instruction fetch without significantly increasing fetch stage complexity and latency. This dynamic regrouping of instructions is performed by a hardware structure ....
[Article contains additional citation context not shown here]
S. Patel, D. Friendly and Y. Patt, "Critical Issues Regarding the Trace Cache Fetch Mechanism," Technical Report CSE-TR-335-97, University of Michigan, May 1997.
....the real goal in these strategies is to improve instruction fetch bandwidth and preferably take branch prediction off the critical path. Recent research has focused on trace caches as a mechanism to capture a long stream of sequential instructions that can be easily fetched at peak bandwidth [51, 45]. Branch prediction guides the trace selection in the instruction fetch engine, at times predicting multiple branches per cycle. A more radical approach is the Fetch Target Buffer (FTB) proposed by Reinman, et al. [48]. The FTB stores the addresses of predicted blocks of instructions and is ....
Sanjay J. Patel, Daniel H. Friendly, and Yale N. Patt. Critical issues regarding the trace cache fetch mechanism. Technical Report CSE-TR-335-97, Department of Electrical Engineering and Computer Science, The University of Michigan, May 1997.
....must be properly aligned and merged before they are supplied for execution. Such a solution adds considerable logic complexity in an already critical execution path of the processor. Either cycle time will be affected or extra pipeline stages will be required. Recently proposed, the trace cache [20, 11, 22, 19] overcomes this bandwidth hurdle without requiring excessive logic complexity in the instruction delivery path. Like an instruction cache, the trace cache is accessed using the Program Counter. Unlike an instruction cache, a trace cache line contains instructions as they appear in execution ....
....al [22] They presented a thorough comparison between the trace cache scheme and several hardware based high bandwidth fetch schemes and showed the advantage of using a trace cache, both in performance and latency. Extensions and analysis of the trace cache mechanism were proposed by Patel et al. [19, 6, 18] and trace cache implications on processor design were presented in [23] A similar approach to caching dynamic instruction groups was presented in the DIF cache by Nair and Hopkins [17] 3 The Trace Cache Fetch Mechanism We divide the trace cache fetch mechanism into four major components: a ....
[Article contains additional citation context not shown here]
S. J. Patel, D. H. Friendly, and Y. N. Patt, "Critical issues regarding the trace cache fetch mechanism," Technical Report CSE-TR-335-97, University of Michigan, May 1997.
....work is the initial research performed on the trace cache by several groups. Its initial incarnations were developed by Melvin and Patt [8], Peleg and Weiser [12], and Johnson [7]. The concept was demonstrated by Rotenberg et al. [14] to be a low latency fetch device and developed by Patel et al. [11, 10] to be a very high bandwidth device. Franklin and Smotherman [3] as well as Nair and Hopkins [9] have explored the run time manipulation of the code stream by the fill unit. In both cases the fill unit is used to dynamically retarget a scalar instruction stream into pre scheduled instruction ....
S. J. Patel, D. H. Friendly, and Y. N. Patt. Critical issues regarding the trace cache fetch mechanism. Technical Report CSE-TR-335-97, University of Michigan, May 1997.
No context found.
PATEL, S. J.; FRIENDLY, D. H.; PATT, Y. N. Critical Issues Regarding the Trace Cache Fetch Mechanism. Technical Report CSE-TR-335-97, University of Michigan, 1997.
No context found.
S. Patel, D. Friendly and Y. Patt, "Critical Issues Regarding the Trace Cache Fetch Mechanism." University of Michigan Technical Report CSE-TR-335-97, 1997.
No context found.
S. Patel, D. Friendly, and Y. Patt. Critical issues regarding the trace cache fetch mechanism. CSE-TR-335-97, University of Michigan, May 1997.
No context found.
Sanjay J. Patel, Daniel H. Friendly, and Yale N. Patt. Critical issues regarding the trace cache fetch mechanism. Technical report, University of Michigan, May 1997.
No context found.
S.J. Patel, D.H. Friendly, and Y.N. Patt, Critical Issues Regarding the Trace Cache Fetch Mechanism, Tech. Report CSE-TR-335-97, Univ. of Michigan, Ann Arbor, 1997.