Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching
- In Proceedings of the 29th International Symposium on Microarchitecture
, 1996
Cited by 265 (11 self)
Highly Accurate Data Value Prediction using Hybrid Predictors
, 1997
Cited by 182 (3 self)
Data dependences (data flow constraints) present a major hurdle to the amount of instruction-level parallelism that can be exploited from a program. Recent work has suggested that the limits imposed by data dependences can be overcome to some extent with the use of data value prediction. That is, when an instruction is fetched, its result can be predicted so that subsequent instructions that depend on the result can use this predicted value. When the correct result becomes available, all instructions that are data dependent on that prediction can be validated. This paper investigates a variety of techniques to carry out highly accurate data value predictions. The first technique investigates the potential of monitoring the strides by which the results produced by different instances of an instruction change. The second technique investigates the potential of pattern-based two-level prediction schemes. Simulation results of these two schemes show improvements over the existing method of predicting the last outcome. In particular, some benchmarks show improvement with the stride-based predictor and others show improvement with the pattern-based predictor. To do uniformly well across benchmarks, we combine these two predictors to form a hybrid predictor. Simulation analysis of the hybrid predictor shows its overall prediction accuracy to be better than that of the component predictors across all benchmarks.
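The difference between last-outcome and stride-based value prediction described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's design: the class names, the unbounded dictionary tables, and the example traces are all assumptions made for illustration, and the pattern-based two-level and hybrid predictors are omitted.

```python
class LastValuePredictor:
    """Predicts that an instruction repeats the value it produced last time."""

    def __init__(self):
        self.last = {}  # hypothetical table: instruction address -> last value

    def predict(self, pc):
        return self.last.get(pc)  # None means "no prediction yet"

    def update(self, pc, value):
        self.last[pc] = value


class StridePredictor:
    """Predicts last value plus the most recently observed stride."""

    def __init__(self):
        self.table = {}  # hypothetical table: pc -> (last value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None
        last, stride = entry
        return last + stride

    def update(self, pc, value):
        entry = self.table.get(pc)
        # The first observation has no history, so assume a stride of 0.
        stride = value - entry[0] if entry is not None else 0
        self.table[pc] = (value, stride)


def accuracy(predictor, trace):
    """Fraction of (pc, value) pairs in the trace that were predicted exactly."""
    correct = 0
    for pc, value in trace:
        if predictor.predict(pc) == value:
            correct += 1
        predictor.update(pc, value)
    return correct / len(trace)


# Made-up traces: an induction variable stepping by 4, and a constant result.
loop_counter = [(0x40, 4 * i) for i in range(10)]
constant = [(0x44, 7)] * 10

stride_on_loop = accuracy(StridePredictor(), loop_counter)
last_on_loop = accuracy(LastValuePredictor(), loop_counter)
stride_on_const = accuracy(StridePredictor(), constant)
last_on_const = accuracy(LastValuePredictor(), constant)
```

On these toy traces the stride predictor needs two observations to learn the step of the loop counter, while the last-value predictor never matches it; both behave identically on the constant result.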
Path-based next trace prediction
- In Proceedings of the 30th International Symposium on Microarchitecture
, 1997
Cited by 81 (11 self)
Exploiting Instruction Level Parallelism in Processors by Caching Scheduled Groups
, 1996
Cited by 65 (0 self)
Modern processors employ a large amount of hardware to dynamically detect parallelism in single-threaded programs and maintain the sequential semantics implied by these programs. The complexity of some of this hardware diminishes the gains due to parallelism because of longer clock period or increased pipeline latency of the machine. In this paper we propose a processor implementation which dynamically schedules groups of instructions while executing them on a fast simple engine and caches them for repeated execution on a fast VLIW-type engine. Our experiments show that scheduling groups spanning several basic blocks and caching these scheduled groups results in significant performance gain over fill buffer approaches for a standard VLIW cache. This concept, which we call DIF (Dynamic Instruction Formatting), unifies and extends principles underlying several schemes being proposed today to reduce superscalar processor complexity. This paper examines various issues in designing such a p...
Multiple-Block Ahead Branch Predictors
, 1996
Cited by 61 (5 self)
A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block. This approach overcomes the instruction fetch bottleneck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures. Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch predictio...
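The core idea in this abstract, using the current block to predict the block after the next one, can be sketched in a few lines of Python. This is only an illustration under strong assumptions: a real predictor would be built from branch prediction structures, whereas the dictionary table, the class and function names, and the block addresses below are hypothetical.

```python
class TwoBlockAheadPredictor:
    """Maps the current block address to the block that follows the next one."""

    def __init__(self):
        self.table = {}  # hypothetical table: block address -> block two steps later

    def predict(self, current_block):
        return self.table.get(current_block)  # None means "no prediction yet"

    def train(self, block, block_two_later):
        self.table[block] = block_two_later


def count_correct(predictor, blocks):
    """Replay a sequence of fetched block addresses and count correct predictions."""
    correct = 0
    for i in range(len(blocks) - 2):
        if predictor.predict(blocks[i]) == blocks[i + 2]:
            correct += 1
        # Train once the block two steps ahead is known.
        predictor.train(blocks[i], blocks[i + 2])
    return correct


# Made-up loop over three instruction blocks, executed four times.
blocks = [0x100, 0x140, 0x200] * 4
hits = count_correct(TwoBlockAheadPredictor(), blocks)
```

Because the prediction for block i+2 depends only on block i, it can be started before block i+1 has been fetched, which is what allows two block addresses to be produced per cycle or the prediction to be pipelined.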
The Effect of Instruction Fetch Bandwidth on Value Prediction
- in 25th Annual International Symposium on Computer Architecture
, 1998
Cited by 43 (3 self)
Value prediction attempts to eliminate true-data dependencies by dynamically predicting the outcome values of instructions and executing true-data dependent instructions based on that prediction. In this paper we attempt to understand the limitations of using this paradigm in realistic machines. We show that the instruction-fetch bandwidth and the issue rate have a very significant impact on the efficiency of value prediction. In addition, we study how recent techniques to improve the instruction-fetch rate affect the efficiency of value prediction and its hardware organization.
1. Introduction
The fast-growing density of gates on a silicon die allows modern microprocessors to increasingly employ multiple execution units that are capable of executing several instructions in parallel. Most of the recent microprocessor architectures assume sequential programs as an input and a parallel execution model, where the hardware is expected to extract the parallelism at run-time out of the ins...
A Trace Cache Microarchitecture and Evaluation
- IEEE Transactions on Computers
, 1999
Cited by 37 (3 self)
As the instruction issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will eventually become necessary to fetch multiple basic blocks per clock cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. Trace caches overcome this limitation by caching traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. In this paper we present and evaluate a microarchitecture incorporating a trace cache. The microarchitecture provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of (1) control flow prediction and (2) instruction supply. For the SPEC95 integer benchmarks, trace-level sequencing improves performance by 15% to 35% over an otherwise equally sophisticated, but contiguous, multiple-block fetch mechanism. Most of this performance improvement is due to the trace cache. However, for one benchmark whose performance is limited by branch mispredictions, the performance gain is due almost entirely to improved prediction accuracy.
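A minimal Python sketch of the trace cache idea from this abstract follows: a trace is stored under its starting address plus the branch outcomes taken inside it, so that noncontiguous basic blocks come back from a single lookup. The class, the indexing scheme, and the instruction strings are assumptions for illustration, not the organization evaluated in the paper.

```python
class TraceCache:
    """Toy trace cache keyed by (start address, branch outcomes within the trace)."""

    def __init__(self):
        self.lines = {}  # hypothetical storage: (start_pc, outcomes) -> instructions

    def fill(self, start_pc, outcomes, instructions):
        """Store the dynamic instruction sequence observed from start_pc."""
        self.lines[(start_pc, tuple(outcomes))] = list(instructions)

    def fetch(self, start_pc, predicted_outcomes):
        """Return the whole trace on a hit, or None on a miss."""
        return self.lines.get((start_pc, tuple(predicted_outcomes)))


cache = TraceCache()

# Made-up trace: the block at 0x100 ends in a taken branch to the block at 0x200.
cache.fill(
    0x100,
    [True],
    ["0x100: add", "0x104: beq -> 0x200", "0x200: load", "0x204: sub"],
)

hit = cache.fetch(0x100, [True])    # same start, same predicted path
miss = cache.fetch(0x100, [False])  # same start, different predicted path
```

On a miss, a real front end would fall back to the conventional instruction cache and fetch one contiguous block at a time; that fallback path is not modeled here.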
Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures
- In Proceedings of the 29th International Symposium on Microarchitecture
, 1996
Cited by 36 (1 self)
To exploit larger amounts of instruction level parallelism, processors are being built with wider issue widths and larger numbers of functional units. Instruction fetch rate must also be increased in order to effectively exploit the performance potential of such processors. Block-structured ISAs provide an effective means of increasing the instruction fetch rate. We define an optimization, called block enlargement, that can be applied to a block-structured ISA to increase the instruction fetch rate of a processor that implements that ISA. We have constructed a compiler that generates block-structured ISA code, and a simulator that models the execution of that code on a block-structured ISA processor. We show that for the SPECint95 benchmarks, the block-structured ISA processor executing enlarged atomic blocks outperforms a conventional ISA processor by 12% while using simpler microarchitectural mechanisms to support wide-issue and dynamic scheduling.
1. Introduction
To achieve higher le...
An Empirical Study of Decentralized ILP Execution Models
- In 8th International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
Cited by 26 (0 self)
Recent fascination with dynamic scheduling as a means for exploiting instruction-level parallelism has generated significant interest in the scalability aspects of dynamic scheduling hardware. In order to overcome the scalability problems of centralized hardware schedulers, many decentralized execution models have recently been proposed and investigated. The crux of all these models is to split the instruction window across multiple processing elements (PEs) that do independent scheduling of instructions. The decentralized execution models proposed so far can be grouped under three categories, based on the criterion used for assigning an instruction to a particular PE. They are: (i) execution unit dependence based decentralization (EDD), (ii) control dependence based decentralization (CDD), and (iii) data dependence based decentralization (DDD). This paper investigates the performance aspects of these three decentralization approaches. Using a suite of important benchmarks and realistic sy...
Fetching instruction streams
- In Procs. of the 35th Intl. Symposium on Microarchitecture
, 2002
Cited by 16 (7 self)
Fetch performance is a very important factor because it effectively limits the overall processor performance. However, there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design, the target is to build the best possible fetch engine for the required performance level. A fetch engine will be better if it provides better performance, but also if it takes fewer resources, requires less chip area, or consumes less power. In this paper we propose a novel fetch architecture based on the execution of long streams of sequential instructions, taking maximum advantage of code layout optimizations. We describe our architecture in detail, and show that it requires less complexity and resources than other high performance fetch architectures like the trace cache, while providing a high fetch performance suitable for wide-issue superscalar processors. Our results show that using our fetch architecture and code layout optimizations obtains 10% higher performance than the EV8 fetch architecture, and 4% higher than the FTB architecture using state-of-the-art branch predictors, while being only 1.5% slower than the trace cache. Even in the absence of code layout optimizations, fetching instruction streams is still 10% faster than the EV8, and only 4% slower than the trace cache. Fetching instruction streams effectively exploits the special characteristics of layout optimized codes to provide a high fetch performance, close to that of a trace cache, but has a much lower cost and complexity, similar to that of a basic block architecture.

