GWENNAP, L. MIPS R10000 uses decoupled architecture. Microprocessor Report 8, 14 (October 1994).
....is especially well suited to applications that are inherently multithreaded, such as database and Web servers, as well as multiprogrammed and parallel scientific workloads. At the hardware level, SMT is a straightforward extension of modern, out-of-order superscalars, such as the MIPS R10000 [32] or the Alpha 21264 [33]. SMT duplicates the register file, program counter, subroutine stack and internal processor registers of a superscalar to hold the state of multiple threads (we call the set of hardware resources that contains the state of a thread a context). In addition to duplicating ....
GWENNAP, L. MIPS R10000 uses decoupled architecture. Microprocessor Report 8, 14 (October 1994).
....from multiple paths in limited ways. For example, the MIPS R10000 requires a one cycle delay to decode and predict branches. This cycle is used to fetch instructions sequentially following the branch. If the branch is predicted taken, the extra fetched instructions are stored in a Resume Cache [5] in a partially decoded state. If the branch is discovered to be not taken, then the sequential instructions are quickly recovered from the Resume Cache, eliminating one cycle from the misprediction penalty. The IBM POWER1 processor statically predicts all conditional branches not taken. However, ....
....implementations departs from the simulated model, or where a departure seems justifiable, are mentioned. 4.1 Overview. The basic pipeline of the proposed implementation, shown in Fig. 15, is similar to that used in current superscalar microprocessors, for example the MIPS R10000 or DEC 21264 [5, 7]. Instructions from both paths are fetched in the first pipeline stage. Next, instructions are decoded. Buffers are added before the decode stage to allow temporary separation of the two instruction paths. Each cycle, instructions from only one path are released from the decode stage to the ....
[Article contains additional citation context not shown here]
L. Gwennap. MIPS R10000 Uses Decoupled Architecture. Microprocessor Report, pages 18-22, October 1994.
....Some superscalar processors use a clustered organization to simplify issue logic [3], whereby instructions are directed to separate clusters with independent register files and functional units. Additionally, some designs are incorporating deep queues to decouple instruction fetch from execution [4]. These designs are being forced to take on some of the attributes of decoupled architectures as complexity becomes unmanageable. Historically, decoupled architectures have not gained wide appeal due to the difficulties associated with being able to decouple programs effectively. In this paper, ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, October 1994.
....architectures are beginning to use more complexity-effective designs [26]. Some superscalar processors [22] use a clustered organization to simplify issue logic, whereby instructions are directed to separate clusters with independent register files and functional units. Additionally, some designs [18] are incorporating deep queues to decouple instruction fetch from execution. These designs are being forced to take on some of the attributes of decoupled architectures as complexity becomes unmanageable. More traditional decoupled architectures have primarily been targeted at scientific codes. ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, October 1994.
....has established SMT as effective in increasing throughput on a variety of workloads, while still providing good performance for single-threaded applications [41, 22, 23, 21, 45]. At the hardware level, SMT is a straightforward extension of modern, out-of-order superscalars, such as the MIPS R10000 [15] or the Alpha 21264 [16]. SMT duplicates the register file, program counter, subroutine stack and internal processor registers of a superscalar to hold the state of multiple threads (we call the set of hardware resources that contains the state of a thread a context). In addition to duplicating ....
L. Gwennap. MIPS R10000 uses decoupled architecture, October 24 1994.
....of functional units, one tries to make the common case fast and simple. That often results in functional units with variable latency, i.e. the latency depends on the operation and the operands. For instance, many floating-point units allow for extra cycles in case of a high precision result [17, 16] or in case of an exception [48]. The latency and resources of an operation then only become apparent while the operation is being processed. Thus, variable latency introduces two scheduling problems. The global scheduler governs the interaction between the functional units (FUs) and the ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14):18--22, 1994.
....However, more recent microprocessors have backed away from this search mechanism, and have incorporated map tables, that record a pointer to the most recent rename of each architected register in a single direct-mapped table. The Intel Pentium Pro [7], IBM RIOS [2], HP PA8000 [9], and MIPS R10000 [5] all use map tables. Map table designs have also been investigated in the research community [8, 10]. There are two common ways to implement renaming with a map table. The Pentium Pro and HP parts store the rename registers separate from the architected registers in a combined RRF and reorder ....
L. Gwennap. MIPS R10000 Uses Decoupled Architecture. Microprocessor Report, Vol. 8, No. 14, October 24, 1994.
....the removal of instructions from the processor. Resolution of an exception can dramatically change the utilization of all resources. For dynamically scheduled processors built within other frameworks, a similar set of factors exist. For example, items 1, and 3 to 6 above apply to the PA 8000 [Gwennap 1994b], the PowerPC 604 [Song 1994], and the Intel P6 [Gwennap 1995] processors, which are built within the framework of reorder buffers. Because the state of a dynamically scheduled processor cannot be statically predicted, the compiler cannot make any decisions based on the relative order in which instructions will ....
Gwennap, L. (1994a). MIPS R10000 Uses Decoupled Architecture. Microprocessor Report, 8(14):18--22.
....environment, our benchmark suite, and our conventions in reporting experimental results. Simulation Environment. To evaluate our optimizations we used the SimpleScalar tool set [2]. The detailed out-of-order processor simulator was modified to support MIPS R10000 style register renaming [10]. Parameter / Value: Issue Width 4; Inst. Window 64; Func. Units 4 int (2 mul div), 2 fp (1 mul div); Cache Ports 2 (fully independent); L1 D Cache 64KB, 4 way, 1 cycle latency; L1 I Cache 64KB, 4 way, 1 cycle latency; L2 Cache 512KB, 4 way, 8 cycle latency; Branch Predictor 16 bit history, BTB, 256K entry ....
....512KB, 4 way, 8 cycle latency; Branch Predictor 16 bit history, BTB, 256K entry combinational gshare bimod. Figure 2: Machine configuration. Machine parameters used in our simulations. The values were chosen to be representative of current high performance uniprocessors such as the MIPS R10000 [10] and DEC Alpha 21264 [11]. Benchmark / Dynamic Inst / Call Inst / Mem Inst / Saves Restores: compress 0.5 × 10^9, 0.7, 9.5, 0.0; go 0.4 × 10^9, 1.1, 28.8, 4.0; ijpeg 0.6 × 10^9, 0.6, 27.1, 4.5; perl 1.0 × 10^9, 1.3, 47.1, 9.2; gcc 1.0 × 10^9, 1.2, 40.1, 12.7; vortex 1.0 ....
Linley Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, pages 18--22, October 24 1994.
....created by wider issue widths and faster clock speeds has forced many recent designs to increase the number of pipeline stages between instruction fetch and execute. Stages which we collectively call decode stages. For example, the DEC 21164 [Gwe94a] has three decode stages and the MIPS R10000 [Gwe94b] has two. Adding more decode stages increases the mispredicted branch penalty, however, architects have compensated for this penalty by increasing branch prediction accuracy through such means as larger branch target buffers or more effective predictors, e.g. two level adaptive. Given extra ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14):18--22, October 1994.
....using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture. 1 Introduction. Most current superscalar processors [17, 18, 16, 4] are based on the microarchitecture shown in Figure 1. The instruction fetch unit reads multiple instructions from the instruction cache, decodes them, and places them in instruction buffers for execution by the integer and floating point subsystems. The integer subsystem contains the integer ....
Linley Gwennap. MIPS R10000 Uses Decoupled Architecture. Microprocessor Report, 8(14), October 1994.
....of functional units, one tries to make the common case fast and simple. That often results in functional units with variable latency, i.e. the latency depends on the operation and the operands. For instance, many floating point units allow for extra cycles in case of a high precision result [1, 2] or in case of an exception [3]. Those functional units (FUs) introduce two scheduling problems, a global and a local one. The global scheduler, which governs the interaction between the FUs and the remaining data paths, does not know the latency of an instruction. However, schedulers which are ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14):18--22, 1994.
....decode created by wide issue and faster clock speeds has forced many recent designs to increase the number of pipeline stages between instruction fetch and execute. Stages collectively referred to as decode stages. For example, the DEC 21164 [Gwe94a] has three decode stages and the MIPS R10000 [Gwe94b] has two. Adding more decode stages increases the mispredicted branch penalty, however, architects have compensated for this penalty by increasing branch prediction accuracy through such means as larger branch target buffers or more effective predictors, ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14):18--22, October 1994.
....is to make the common case fast and simple, and to add extra cycles for the rare cases. The functional unit then has a variable latency, i.e. its latency depends on the operation and the operands. Many floating point units, for instance, allow for extra cycles in case of a high precision result [Gwe98, Gwe94] or in case of an exception [Sun92]. The latency of an operation and the resources required by the operation are no longer known at decode time; they only become apparent while the operation is being processed in a functional unit. Thus, variable latency introduces two scheduling problems, a ....
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14):18--22, 1994.
No context found.
L. Gwennap. MIPS R10000 Uses Decoupled Architecture. Microprocessor Report, October 1994.
No context found.
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14), October 24 1994.
No context found.
L. Gwennap. MIPS R10000 uses decoupled architecture. Microprocessor Report, 8(14), October 24 1994.