33 citations found.
L. Gwennap, "Intel's P6 uses decoupled superscalar design," Microprocessor Report, Vol. 9, No. 2, pp. 9--15, 1995.

This paper is cited in the following contexts:

Queue Machines: Hardware Compilation in Hardware - Schmit, Levine, Ylvisaker

....queue machine model. Conversion of binaries from one form to another is a topic of intense recent research. The Intel Pentium Pro and all subsequent Intel 32-bit processors have converted the external binary to a sequence of finer-grained RISC-like instructions for execution on the internal core [11]. Transmeta also performs conversion of Intel 32-bit code to an internal representation [10]. In a broader sense, many microarchitectural innovations are essentially embedding some compiler optimization in hardware. For example, a trace cache [13] reorders instructions and stores the decoded ....

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9, Issue 2, Feb 1995.


A High-Bandwidth Memory Pipeline for Wide Issue Processors - Cho, Yew, Lee (2001)   (1 citation)

....locality, instruction level parallelism, runtime stack, data stream partitioning, multiported data cache. I. INTRODUCTION. TECHNOLOGICAL and architectural innovations have enabled development of powerful microprocessors that can execute several instructions concurrently at a very high clock rate [11], [36], [12]. These processors select and execute independent instructions at runtime, assisted by hardware mechanisms for control speculation, register renaming, and data flow execution [15]. With ample on-chip hardware resources that will become available within a few years, researchers are ....

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, vol. 9, no. 2, Feb. 1995.


Computing Power into the 21st Century: Moore's Law and Beyond - Mitchell (1998)

....simple tasks as quickly as possible. There is no doubt that RISC won. The Intel Pentium II, which is officially a member of the x86 CISC architecture family, actually translates each x86 instruction into a series of RISC-like micro-operations, which are then performed by a RISC core processor [11]. But the ideas of RISC may have reached their limits. The first RISC processors used their transistors for caches and large, fast, pipelined, hardwired ALUs. Current RISC machines have so many functional units and such high frequencies that even the superscalar ALU is a small fraction of the ....

Linley Gwennap. "Intel's P6 Uses Decoupled Superscalar Design" Microprocessor Reports. 9, 2 (19 Feb 95).


Instruction Path Coprocessors - Chou, Shen (2000)   (16 citations)

....of the decode stage. The instructions appended with predecode bits can be viewed as a modified form of the original object code. Another example is the use of sophisticated decoders that take the original instructions and translate them into another internal format. For example, the Intel P6 [2] decoders translate the x86 instructions into an internal format called uops. The uops are then executed by the fast execution core. A third example is a recently proposed idea called the trace cache [3], [4], [5]. The trace cache buffers a dynamic sequence of instructions that can span multiple basic ....

Linley Gwennap, "Intel's P6 Uses Decoupled Superscalar Design", in Microprocessor Report, Vol 9, Issue 2, Feb 1995.


Towards A Simplified Database Workload For Computer.. - Keeton, Patterson (2000)   (1 citation)

....occur only after all previous instructions have been retired, and all of the instruction's constituent ops have completed. The Pentium Pro retires up to three ops per clock cycle, yielding a theoretical minimum cycles per op (CPI) of 0.33. More information on the Pentium Pro can be found in [6], [11], [13], [21], [34]. Measurements were performed using the Pentium Pro hardware counters [13]. We present aggregate (user operating system) activity, factoring out the idle loop. On the uniprocessor, this technique is possible because NT implements the idle loop using the HALT instruction. The event ....

L. Gwennap. "Intel's P6 uses decoupled superscalar design." Microprocessor Report, 9(2):9-15, 1995.


Computer Architecture Support for Database Applications - Keeton (1999)   (3 citations)

....completed. The Pentium Pro retires up to three ops per clock cycle, yielding a theoretical minimum cycles per op (CPI) of 0.33. Table 2-5 summarizes the characteristics of the Pentium Pro caches. More detailed descriptions of the Pentium Pro's architectural features can be found in [15], [25], [39], [50], [76]. We will also present additional details in subsequent sections, when discussing our measurement results. 2.5.2. Potential Sources of Pentium Pro Stalls. In practice, the 0.33 theoretical minimum CPI is seldom achieved, due to stalls from cache misses, oversubscription of certain ....

....If these misses hit in the L2 cache, the processor experiences an additional four cycle latency. If they miss in the L2 cache, the processor will experience a delay of tens to one hundred cycles to access memory. In addition, a branch misprediction can cause delays of 11 or more cycles. [39] suggests that the typical branch misprediction penalty is 15 cycles. Two types of branch mispredictions can occur. First, a branch can be found in the branch target buffer (BTB) but with an incorrect predicted target address. The second type of misprediction occurs when the branch misses the ....

[Article contains additional citation context not shown here]

L. Gwennap. "Intel's P6 uses decoupled superscalar design," Microprocessor Report, 9(2):9-15, 1995.


PipeRench Implementation of the Instruction Path Coprocessor - Yuan Chou Pazhani (2000)   (1 citation)

....approach to solve this problem is to add hardware in the microarchitecture to dynamically modify the object code into an internal format that can be more efficiently processed by fast execution cores. We refer to this general approach as hardware code modification. For example, the Intel P6 [1] decoders translate the x86 instructions into an internal format called uops that are then executed by the execution core. Another example is the trace cache [2] which rearranges the ordering of instructions so that frequently executed sequences of instructions are stored in contiguous locations. ....

Linley Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," in Microprocessor Report, Vol. 9, Issue 2, February 1995.


Restricted Dual Path Execution - Heil, Farrens, Smith, Tyson

....a mispredicted branch is immediately followed by a second, the distance is one. 5.2. Varying Branch Misprediction Penalty. There is a trend toward deeper processor pipelining; for example, the latest generation DEC processor, the 21264 [Kell96] and the latest Intel processor, the Pentium Pro [Gwen95], are both more deeply pipelined than their predecessors. As levels of processor pipelining increases, it is likely that the cycles lost for each mispredicted branch will also increase. Wider instruction fetching, register renaming and larger register files will also contribute to higher branch ....

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design", Microprocessor Report (February 1995).


Performance Issues in Correlated Branch Prediction Schemes - Gloy, Smith, Young (1995)   (1 citation)

....and thus link time optimizations are easier for us to implement than compile time optimizations. fall through paths of branches such that each branch falls through more frequently than it takes. If correctly predicted taken branches still result in a misfetch penalty (as in the DEC Alpha 21164 [10]), this branch alignment step results in fewer cycles lost due to misfetch penalties and an increase in the average length of straight line executed code which improves spatial locality. To offset the cache effects of code expansion in SCBP, we have implemented each of these code layout techniques ....

....R2000-like machine, which has less than one cycle of branch misprediction penalty (depending on how the branch delay slot is filled), verified this hypothesis, and so we concentrated our efforts on the next generation of machine models. Recently announced processors, like the DEC Alpha 21164 [10] and the Intel P6 [11], have implementations more favorable to trading cache misses for mispredictions. The 21164 has a five cycle mispredict penalty and a one cycle misfetch penalty (penalty for correctly predicted taken branches). It incorporates a small 8KB L1 instruction cache and a 96KB ....

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, MicroDesign Resources, 9(2), Feb. 16, 1995.


Speculative Updates of Local and Global Branch History.. - Skadron, Martonosi.. (1998)   (11 citations)

.... the table of 2-bit counters described by Smith [2] and found in several recent processors [3, 4]. Two-level schemes now appear in many high performance processors: the AMD K6 and Athlon [5, 6] and the soon to be released UltraSPARC III [7] use gshare [8] predictors; the PentiumPro/Pentium II [9] also uses a two-level scheme, but of a confidential nature; and the Alpha 21264 [10, 11] uses two different two-level predictors in conjunction with a selector that chooses between the components [8, 12]. Most branch prediction studies use instruction level simulations that assume the correct ....

L. Gwennap, "Intel's P6 uses decoupled superscalar design," Microprocessor Report, pp. 9--15, Feb. 16, 1995.


A Cache Line Fill Circuit for a Micropipelined.. - Mehra Garside Rmehra (1995)   (1 citation)

....may in principle proceed independently of the line fetch process, although cache access conflicts between these processes must be resolved. Simpler mechanisms resolve any conflicts by stalling the processor until the line fetch is complete; more sophisticated systems employ hit-under-miss [6], which allows the line fetch to proceed in the background whilst the processor continues in parallel. It is apparent that case (3) requires its own cache miss processing and line fetch. If there is contention for resources it must either abandon the current line fetch or wait until it has ....

....to request data from sequential addresses. The demand for greater performance has led to decoupled architectures becoming more common (PA8000, MIPS R10000, HaL R1, P6), many of which utilise non-blocking caches in order to retain memory bandwidth during cache miss processing (HaL [7], P6 [6]). A decoupled architecture is one where the CPU is fed instructions from a prefetch buffer large enough to hide some or all of the latency from a line fetch. This requires that the prefetch buffer can subsequently be refilled when the line fetch is complete so the cache must be able to supply ....

Linley Gwennap "Intel's P6 Uses Decoupled SuperScalar Design", Microprocessor Report, Vol. 9 Number 2, February 16 1995.


Multiple-Block Ahead Branch Predictors - Seznec, Jourdan, Sainrat, Michaud (1996)   (32 citations)

....block starting address. By the end of the cycle, the starting address of the next instruction block must be generated. In some of the processors, the I-cache access time is longer than the cycle time, leading to a pipeline structure depicted in figure 6 (a). For instance, the Intel PentiumPro [8] features a pipelined I-cache access completed within two and a half cycles. As far as the current instruction block address is used to predict the next instruction block, either the instruction address generator can compute the starting address of the next instruction block in a single cycle as ....

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9 Num. 2, 1995.


Miss Path Speculative Scheduling For High Issue Rates - Banerjia, Sathaye, Menezes, ..

....Entries in the reorder buffer maintain the original destination register id, so that the result is written to an architected register at writeback. Miss path scheduling can be used for CISC architectures as well as RISC architectures. As is done in contemporary multiple issue CISC processors [27], a CISC instruction can be decomposed into multiple RISC-like instructions (micro-ops) [1]. These instructions can then be individually scheduled. Care must be taken so that the newly created micro-ops are ordered such that the dependencies between them are honored. Exception handling is achieved ....

L. Gwennap, "Intel's P6 uses decoupled superscalar design," Microprocessor Report, vol. 9, Feb. 1995.


Memory-System Design Considerations For Dynamically-Scheduled.. - Farkas (1997)   (34 citations)

....1985] which is essentially a FIFO queue that is used to force speculative results to commit in order. Reorder buffers and reservation stations are used to implement dynamic scheduling in the PowerPC 604 processor [Song 1994], the PowerPC 620 processor [IBM 1994], the Intel Pentium Pro processor [Gwennap 1995], and the AMD K5 processor [Christie 1996]. The AMD K6 processor [Halfhill 1996] (previously the NexGen 686 processor) and the HP PA 8000 processor [Gwennap 1994b] also use a reorder buffer, but for these two processors the reorder buffer also provides the functionality of reservation stations. A ....

....of an exception can dramatically change the utilization of all resources. For dynamically scheduled processors built within other frameworks, a similar set of factors exist. For example, items 1, and 3 to 6 above apply to the PA 8000 [Gwennap 1994b], the PowerPC 604 [Song 1994], and the Intel P6 [Gwennap 1995] processors, which are built within the framework of reorder buffers. Because the state of a dynamically scheduled processor cannot be statically predicted, the compiler cannot make any decisions based on the relative order in which instructions will be processed. In other words, the inner ....

Gwennap, L. (1995). Intel's P6 Uses Decoupled Superscalar Design. Microprocessor Reports, 9(2):9--15.


Improving Pointer-Based Codes Through Cache-Conscious Data.. - Chilimbi, Larus, Hill (1998)   (14 citations)

....imbalance [35] 10000 2 disparity in memory access costs. Figure 2 shows the opportunity cost (in potential instruction executions) of referencing data at various levels of a memory hierarchy. The 1980 cost (1 4) is for a VAX 11 780 [14] and the 1997 cost (1 256) is for an UltraSparc 2 [20]. The difference between a cache hit and miss is now almost two orders of magnitude. As a result, many programs' performance is dominated by memory references. Moreover, the large cost disparity undercuts the fundamental random access memory (RAM) model used by most programmers to design data ....

Linley Gwennap. "Intel's P6 uses decoupled superscalar design." Microprocessor Report, 9(2):9--15, Oct. 16 1996.


Scholarly Paper in Computer Engineering - Non-Thesis Option For

....introduction of two level schemes [19], the prediction accuracy of dynamic branch predictors has been pushed above 90%. As a result, two level dynamic branch predictors have been incorporated in several recent high performance microprocessors. Perhaps the best known examples are the Pentium Pro [5] and Alpha 21264 [6]. Among two level predictors, those using global history schemes have been shown to yield the best performance for integer benchmarks [21]. However, to achieve high levels of accuracy, current dynamic branch predictors require con ....

Gwennap, L. "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9, No.2, Feb. 16, 1995.


The AMULET2e Cache System - Garside Temple Mehra (1996)   (3 citations)

....they must return results to the processor, preferably with as little delay as possible. Providing that the line fetch process and the processor do not conflict over cache accesses it is clear that case (2) may in principle proceed without impediment. This process, known as hit-under-miss [9], is sometimes employed in current synchronous designs, although many systems resolve potential conflicts by stalling the processor until the line fetch is complete. It is apparent that case (3) requires its own cache miss processing and line fetch. If a line fetch is still in progress there ....

Linley Gwennap "Intel's P6 Uses Decoupled SuperScalar Design", Microprocessor Report, Vol. 9 Number 2, February 16 1995.


Formal Methods in System Design, 20, 159--186, 2002 - Verification Of Out-Of-Order

No context found.

L. Gwennap, "Intel's P6 uses decoupled superscalar design," Microprocessor Report, Vol. 9, No. 2, pp. 9--15, 1995.


Kimberly Keeton - David Patterson Yong

No context found.

L. Gwennap. "Intel's P6 uses decoupled superscalar design." Microprocessor Report, 9(2):9-15, 1995.


The Effects of Mispredicted-Path Execution on Branch - Prediction Structures..

No context found.

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9 Num. 2, 1995.


Dynamic Binary Translation for Accumulator-Oriented Architectures - Kim, Smith (2003)   (1 citation)

No context found.

Linley Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Feb. 16, 1995.


Improving data caching for software MPEG video decompression - Feng, Sechrest (1996)   (5 citations)

No context found.

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design", Microprocessor Report, Feb. 16, 1995, pp. 9--15.


Data References for: "Branch Effect Reduction Techniques" - Uht, Sindagi, al. (1997)

No context found.

L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, vol. 9, pp. 9--15, February 16, 1995.

CiteSeer.IST - Copyright Penn State and NEC