Results 1 - 10
of
96
A Large, Fast Instruction Window for Tolerating Cache Misses
"... Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle ti ..."
Abstract
-
Cited by 109 (1 self)
- Add to MetaCart
Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
Continual flow pipelines
- In International Conference on Architectural Support for Programming Languages and Operating Systems
, 2004
"... Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multip ..."
Abstract
-
Cited by 107 (0 self)
- Add to MetaCart
(Show Context)
Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multiple of these to be placed on the same die for high throughput while dynamically adapting for future applications? Conventional approaches for high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP) as a new nonblocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve benefits of a large instruction window, inefficiencies in management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps key processor structures affecting cycle time and power (scheduler, register file), and die size (second level cache) small. The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores for single-thread applications.
Reducing register ports for higher speed and lower energy
- In Proceedings of the International Symposium on Microarchitecture
, 2002
"... The key issues for register file design in high-performance processors are access time and energy. While previous work has focused on reducing the number of registers, we propose to reduce the number of register ports through two proposals, one for reads and the other for writes. For reads, we propo ..."
Abstract
-
Cited by 90 (2 self)
- Add to MetaCart
(Show Context)
The key issues for register file design in high-performance processors are access time and energy. While previous work has focused on reducing the number of registers, we propose to reduce the number of register ports through two proposals, one for reads and the other for writes. For reads, we propose bypass hint to reduce register port requirements by avoiding unnecessary register file reads for cases where values are bypassed. Current processors are unable to avoid these unnecessary reads due to timing constraints. For writes, we use register file banking. Current banking schemes assign different banks to instructions that are renamed together, which does not necessarily avoid conflicts among instructions that writeback together. We use decoupled rename, a technique which separates dependence and physical tagging of register operands. Decoupled rename allows us to perform physical register allocation just before writeback, avoiding bank conflicts. Our results show that combining bypass hint and write banking, our 1cycle register file with 6 read ports, and two 4-write-ported banks achieves a 9 % processor energy-delay savings over a system using a perfectly-pipelined, 2-cycle register file with 16 read ports and 8 write ports. 1
Banked Multiported Register Files for High-Frequency Superscalar Microprocessors
- In International Symposium on Computer Architecture
, 2003
"... Multiported register files are a critical component of high-performance superscalar microprocessors. Conventional multiported structures can consume significant power and die area. We examine the designs of banked multiported register files that employ multiple interleaved banks of fewer ported regi ..."
Abstract
-
Cited by 64 (5 self)
- Add to MetaCart
Multiported register files are a critical component of high-performance superscalar microprocessors. Conventional multiported structures can consume significant power and die area. We examine the designs of banked multiported register files that employ multiple interleaved banks of fewer ported register cells to reduce power and area. Banked register files designs have been shown to provide sufficient bandwidth for a superscalar machine, but previous designs had complex control structures that would likely limit cycle time and add to design complexity. We develop a banked register file with much simpler and faster control logic while only slightly increasing the number of ports per bank. We present area, delay, and energy numbers extracted from layouts of the banked register file. For a four-issue superscalar processor, we show that we can reduce area by a factor of three, access time by 20%, and energy by 40%, while decreasing IPC by less than 5%.
Dual-core execution: building a highly scalable single-thread instruction window
, 2005
"... Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a singl ..."
Abstract
-
Cited by 54 (3 self)
- Add to MetaCart
(Show Context)
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memoryintensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-oforder execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor. 1.
Increasing processor performance through early register release
- in Proceedings of ICCD
, 2004
"... Modern superscalar microprocessors need sizable register files to support large number of in-flight instructions for exploiting ILP. An alternative to building large register files is to use smaller number of registers, but manage them more effectively. More efficient management of registers can als ..."
Abstract
-
Cited by 37 (4 self)
- Add to MetaCart
(Show Context)
Modern superscalar microprocessors need sizable register files to support large number of in-flight instructions for exploiting ILP. An alternative to building large register files is to use smaller number of registers, but manage them more effectively. More efficient management of registers can also result in higher performance if the reduction of the register file size is not the goal. Traditional register file management mechanisms deallocate a physical register only when the next instruction with the same destination architectural register commits. We propose two complementary techniques for deallocating the register immediately after the instruction producing the register’s value commits itself, without waiting for the commitment of the next instruction with the same destination. Our design relies on the use of a checkpointed register file (CRF), where a local shadow copy of each bitcell is used to temporarily save the early deallocated register values should they be needed to recover from branch mispredictions or to reconstruct the precise state after exceptions or interrupts. The proposed techniques outperform the previously proposed schemes for early deallocation of registers. For the register-constrained datapath configurations, our techniques result in up to 35% performance increase with 23.3 % increase on the average across SPEC2000 benchmarks. 1.
Reducing register ports using delayed writeback queues and operand prefetch
- in: Proceedings of International Conference on Supercomputing, ICS-2003
"... Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, ..."
Abstract
-
Cited by 37 (1 self)
- Add to MetaCart
(Show Context)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Register Write Specialization Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors
, 2002
"... With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. The register file read time, the wake up and selection logic traversal delay and the bypass network transit delay with also their respective power consumptions constitut ..."
Abstract
-
Cited by 32 (1 self)
- Add to MetaCart
(Show Context)
With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. The register file read time, the wake up and selection logic traversal delay and the bypass network transit delay with also their respective power consumptions constitute major difficulties for the design of wide issue superscalar processors.
Use-based register caching with decoupled indexing
- In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04). ACM
, 2004
"... Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce t ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
(Show Context)
Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce this penalty suffer from several problems including poor insertion and replacement decisions and the need for a fully-associative cache for good performance. We present novel mechanisms for managing and indexing register caches that address these problems using knowledge of the number of consumers of each register value. The insertion policy reduces pollution by not caching a register value when all of its predicted consumers are satisfied by the bypass network. The replacement policy selects register cache entries with the fewest remaining uses (often zero), lowering the miss rate. We also introduce a new, general method of mapping physical registers to register cache sets that improves the performance of set-associative cache organizations by reducing conflicts. Our results indicate that a 64-entry, two-way set associative cache using these techniques outperforms multi-cycle monolithic register files and previously proposed hierarchical register files. 1.
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
, 2004
"... In the modern era of wire-dominated architectures, specific effort must be made to reduce needless communication within out-of-order pipelines while still maintaining binary compatibility. To ease pressure on highly-connected elements such as the issue logic and bypass network, we propose the dynami ..."
Abstract
-
Cited by 25 (5 self)
- Add to MetaCart
In the modern era of wire-dominated architectures, specific effort must be made to reduce needless communication within out-of-order pipelines while still maintaining binary compatibility. To ease pressure on highly-connected elements such as the issue logic and bypass network, we propose the dynamic detection and speculative execution of instruction strands--linear chains of dependent instructions without intermediate fan-out. The hardware required for detecting these chains is simple and resides off the critical path of the pipeline, and the execution targets are the normal ALUs with a self-bypass mode. By collapsing these strings of dependencies into atomic macro-instructions, the efficiency of the issue queue and reorder buffer can be increased. Our results show that over 25% of all dynamic ALU instructions can be grouped, decreasing both the average reorder buffer occupancy and issue queue occupancy by over a third. Additionally, these strands have several properties which make them amenable to simple performance optimizations. Our experiments show average IPC increases of 17% on a four-wide machine and 20% on an eight-wide machine in Spec2000int and Mediabench applications. Finally, strands ease the IPC penalties of multicycle issue and bypass by reducing dependency pressures, providing opportunity for clock frequency gains as well.