Results 1 - 10
of
35
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
- in Proceedings of the 29th Annual International Symposium on Computer Architecture
, 2002
"... Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 bench-marks, we find that for a high-performance architecture imple-mented in l ..."
Abstract
-
Cited by 100 (14 self)
- Add to MetaCart
Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 bench-marks, we find that for a high-performance architecture imple-mented in lOOnm technology, the optimal clock period is ap-proximately 8fan-out-of-four ( F04) inverter delays for integer benchmarks, comprised of 6 F04 of useful work and an over-head of about 2 F04. The optimal clock period for floating-point benchmarks is 6F04. We find these optimal points to be insensitive to latch and clock skew overheads. Our study indi-cates that further pipelining can at best improve performance of integer programs by a factor of 2 over current designs. At these high clock frequencies it will be difficult to design the instruction issue window to operate in a single cycle. Con-sequently, we propose and evaluate a high-frequency design called a segmented instruction window. 1
A Large, Fast Instruction Window for Tolerating Cache Misses
"... Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle ti ..."
Abstract
-
Cited by 94 (1 self)
- Add to MetaCart
Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
Cyclone: A broadcast-free dynamic instruction scheduler with selective replay
- In Proceedings of the 30th annual International Symposium on Computer Architecture
"... To achieve high instruction throughput, instruction schedulers must be capable of producing high-quality schedules that maximize functional unit utilization while at the same time enabling fast instruction issue logic. Many solutions exist to the scheduling problem, ranging from compile-time to run- ..."
Abstract
-
Cited by 38 (0 self)
- Add to MetaCart
To achieve high instruction throughput, instruction schedulers must be capable of producing high-quality schedules that maximize functional unit utilization while at the same time enabling fast instruction issue logic. Many solutions exist to the scheduling problem, ranging from compile-time to run-time approaches. Compile-time solutions feature fast and simple hardware, but at the expense of conservative schedules. Dynamic schedulers produce high-quality schedules that incorporate run-time information and dependence speculation, but implementing these schedulers requires complex circuits that can slow processor clock speeds. In this paper, we present the Cyclone scheduler, a novel design that captures the benefits of both compileand run-time scheduling. Our approach utilizes a listbased single-pass instruction scheduling algorithm, implemented by hardware at run-time in the front end of the processor pipeline. Once scheduled, instructions are injected into a timed queue that orchestrates their entry into execution. To accommodate branch and load/store dependence speculation, the Cyclone scheduler supports a simple selective replay mechanism. We implement this technique by overloading instruction register forwarding to also detect instructions dependent on incorrectly scheduled operations. Detailed simulation analyses suggest that with sufficient queue width, the Cyclone scheduler can rival the instruction throughput of similarly wide monolithic dynamic schedulers. Furthermore, the circuit complexity of the Cyclone scheduler is much more favorable than a broadcast-based scheduler, as our approach requires no global control signals. 1
Optimizing pipelines for power and performance
- in International Symposium on Microarchitecture (MICRO35), Nov. 2002. Selected as one of the four Best IBM Research Papers in Computer Science, Electrical Engineering and Math published in
, 2002
"... During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues co ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models. 1
Half-Price Architecture
- In Proceedings of the International Symposium on Computer Architecture
, 2003
"... Current-generation microprocessors are designed to process instructions with one and two source operands at equal cost. Handling two source operands requires multiple ports for each instruction in structures--such as the register file and wakeup logic--which are often in the processor critical timi ..."
Abstract
-
Cited by 32 (2 self)
- Add to MetaCart
Current-generation microprocessors are designed to process instructions with one and two source operands at equal cost. Handling two source operands requires multiple ports for each instruction in structures--such as the register file and wakeup logic--which are often in the processor critical timing paths. We argue that these structures are overdesigned since only a small fraction of instructions require two source operands to be processed simultaneously. [n this paper, we propose the half-price architecture that judiciously removes this overdesign by restricting the processor capability to handle two source operands in certain timing-critical cases. Two techniques are proposed and evaluated: one for the wakeup logic is sequential wakeup, which decouples half of the tag matching logic from the wakeup bus to reduce the load capacitance of the bus. The other technique for the register file is sequential register access, which halves the register read ports by sequentially accessing two values using a single port when needed. We show that a pipeline that optimizes scheduling and register access for a single operand achieves nearly the same performance as an ideal base machine that fully handles two operands, with 2.2% (worst case 4.8%) IPC degradation.
Register Write Specialization Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors
, 2002
"... With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. The register file read time, the wake up and selection logic traversal delay and the bypass network transit delay with also their respective power consumptions constitut ..."
Abstract
-
Cited by 27 (0 self)
- Add to MetaCart
With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. The register file read time, the wake up and selection logic traversal delay and the bypass network transit delay with also their respective power consumptions constitute major difficulties for the design of wide issue superscalar processors.
Macro-op scheduling: Relaxing scheduling loop constraints
- In Proceedings of the International Symposium on Microarchitecture
, 2003
"... Ensuring back-to-back execution of dependent instructions in a conventional out-of-order processor requires scheduling logic that wakes up and selects instructions at the same rate as they are executed. To sustain high performance, integer ALU instructions typically have singlecycle latency, consequ ..."
Abstract
-
Cited by 24 (2 self)
- Add to MetaCart
Ensuring back-to-back execution of dependent instructions in a conventional out-of-order processor requires scheduling logic that wakes up and selects instructions at the same rate as they are executed. To sustain high performance, integer ALU instructions typically have singlecycle latency, consequently requiring scheduling logic with the same single-cycle latency. Prior proposals have advocated the use of speculation in either the wakeup or select phases to enable pipelining of scheduling logic to achieve higher clock frequency. In contrast, this paper proposes macro-op scheduling, which systematically removes instructions with single-cycle latency from the machine by combining them into macro-ops, and performs nonspeculative pipelined scheduling of multi-cycle operations. Macroop
A complexity-effective approach to alu bandwidth enhancement for instruction-level temporal redundancy
- In Proceedings of the International Symposium on Computer Architecture (ISCA
, 2004
"... Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of upto 45 % in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insuffici ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of upto 45 % in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insufficient number of ALUs for handling the amplified load injected into the core. At the same time, increasing the number of ALUs can increase the complexity of the issue logic, which has been pointed out to be one of the most timing critical components of the processor. This paper proposes a novel extension of a prior idea on instruction reuse to ease ALU bandwidth requirements in a complexity-effective way by exploiting certain interesting properties of a dual (temporally redundant) instruction stream. We present microarchitectural extensions necessary for implementing an instruction reuse buffer (IRB) and integrating this with the issue logic of a dual instruction stream superscalar core, and conduct extensive evaluations to demonstrate how well it can alleviate the ALU bandwidth problem. We show that on the average we can gain back nearly 50% of the IPC loss that occurred due to ALU bandwidth limitations for an instruction-level temporally redundant superscalar execution, and 23 % of the overall IPC loss. Keywords: Complexity-effective design, Instruction Reuse, Temporal Redundancy.
Integrated analysis of power and performance for pipelined microprocessors
- IEEE Transactions on Computers
, 2004
"... been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be ..."
Abstract
-
Cited by 18 (8 self)
- Add to MetaCart
been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P.
Using dynamic binary translation to fuse dependent instructions
- In CGO ’04: Proceedings of the international symposium on Code generation and optimization
, 2004
"... Instruction scheduling hardware can be simplified and easily pipelined if pairs of dependent instructions are fused so they share a single instruction scheduling slot. We study an implementation of the x86 ISA that dynamically translates x86 code to an underlying ISA that supports instruction fusing ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
Instruction scheduling hardware can be simplified and easily pipelined if pairs of dependent instructions are fused so they share a single instruction scheduling slot. We study an implementation of the x86 ISA that dynamically translates x86 code to an underlying ISA that supports instruction fusing. A microarchitecture that is co-designed with the fused instruction set completes the implementation. In this paper, we focus on the dynamic binary translator for such a co-designed x86 virtual machine. The dynamic binary translator first cracks x86 instructions belonging to hot superblocks into RISC-style micro-operations, and then uses heuristics to fuse together pairs of dependent micro-operations. Experimental results with SPEC2000 integer benchmarks demonstrate that: (1) the fused ISA with dynamic binary translation reduces the number of scheduling decisions by about 30 % versus a conventional implementation that uses hardware cracking into RISC micro-operations; (2) an instruction scheduling slot needs only hold two source register fields even though it may hold two instructions; (3) translations generated in the proposed ISA consume about 30 % less storage than a corresponding fixed-length RISC-style ISA. 1.

