Results 1 - 10 of 42
Pipeline gating: speculation control for energy reduction
- In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
Abstract - Cited by 288 (3 self)
Branch prediction has enabled microprocessors to increase instruction-level parallelism (ILP) by allowing programs to speculatively execute beyond control boundaries. Although speculative execution is essential for increasing instructions per cycle (IPC), it does come at a cost: a large amount of unnecessary work results from wrong-path instructions entering the pipeline due to branch misprediction. Results generated with the SimpleScalar tool set using a 4-way issue pipeline and various branch predictors show an instruction overhead of 16% to 105% for every instruction committed. This instruction overhead will increase in the future as processors use more aggressive speculation and wider issue widths [9]. In this paper, we present a method for power reduction which, unlike previous work that sacrificed flexibility or performance, reduces power in high-performance microprocessors without impacting performance. In particular, we introduce a hardware mechanism called pipeline gating to control rampant speculation in the pipeline. We present inexpensive mechanisms for determining when a branch is likely to mispredict, and for stopping wrong-path instructions from entering the pipeline. Results show up to a 38% reduction in wrong-path instructions with a negligible performance loss. Best of all, even in programs with high branch prediction accuracy, performance does not noticeably degrade. Our analysis indicates that there is little risk in implementing this method in existing processors, since it does not impact performance and can benefit energy reduction.
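As a rough illustration of the mechanism this abstract describes, the sketch below pairs a saturating-counter confidence estimator with a fetch gate that stalls when too many low-confidence branches are in flight. The table size, counter width, thresholds, and in-flight limit are illustrative assumptions, not the paper's tuned design.

```python
# Sketch of pipeline gating (assumed parameters, not the paper's design):
# a per-branch confidence counter counts consecutive correct predictions;
# fetch is gated when too many low-confidence branches are unresolved.

class ConfidenceEstimator:
    def __init__(self, entries=1024, max_count=15, threshold=8):
        self.table = [0] * entries
        self.max_count = max_count
        self.threshold = threshold

    def is_high_confidence(self, pc):
        return self.table[pc % len(self.table)] >= self.threshold

    def update(self, pc, prediction_was_correct):
        i = pc % len(self.table)
        self.table[i] = min(self.table[i] + 1, self.max_count) if prediction_was_correct else 0

class PipelineGate:
    def __init__(self, estimator, max_low_conf_in_flight=2):
        self.estimator = estimator
        self.low_conf_in_flight = 0
        self.limit = max_low_conf_in_flight

    def on_branch_fetched(self, pc):
        low = not self.estimator.is_high_confidence(pc)
        self.low_conf_in_flight += low
        return low                      # caller remembers this per branch

    def on_branch_resolved(self, was_low_confidence):
        self.low_conf_in_flight -= was_low_confidence

    def fetch_enabled(self):
        # Gate fetch: further instructions are likely wrong-path work.
        return self.low_conf_in_flight <= self.limit
```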
Load Latency Tolerance In Dynamically Scheduled Processors
- Journal of Instruction Level Parallelism, 1998
Abstract - Cited by 76 (2 self)
This paper provides a quantitative evaluation of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory hierarchy that dictates the latency. Although our policies delay load completion as long as possible, they produce performance (instructions committed per cycle, IPC) comparable to a processor with an ideal memory system where all loads complete in one cycle. Our simulations reveal that to produce IPC values within 12% of a processor with an ideal memory system, between 1% and 71% of loads need to be satisfied within a single cycle, and that up to 74% can be satisfied in as many as 32 cycles, depending on the benchmark and processor configuration.
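To make "latency tolerance" concrete, here is a toy, hedged measurement: treat a load's slack as the number of instructions between it and the first consumer of its result, computed over an abstract trace. The Instr record and register naming are invented for this sketch and are not the paper's simulator interface.

```python
# Toy measurement of load latency tolerance: a load's slack is the number
# of instructions between it and the first consumer of its result.
# The trace format (dest/src register fields) is an assumption.

from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs"])  # e.g. ("load", "r3", ["r1"])

def load_slack(trace):
    """Return {trace index of load: #instructions until first use of its dest}."""
    slack = {}
    for i, ins in enumerate(trace):
        if ins.op != "load":
            continue
        slack[i] = None                       # never used within this window
        for j in range(i + 1, len(trace)):
            if ins.dest in trace[j].srcs:
                slack[i] = j - i
                break
    return slack

trace = [
    Instr("load",  "r1", ["r9"]),
    Instr("add",   "r2", ["r5", "r6"]),
    Instr("add",   "r3", ["r1", "r2"]),       # first use of r1: slack = 2
    Instr("load",  "r4", ["r9"]),
    Instr("store", None, ["r4"]),             # first use of r4: slack = 1
]
print(load_slack(trace))                      # {0: 2, 3: 1}
```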
Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques
- IEEE Transactions on Computers, 1999
Abstract - Cited by 58 (18 self)
Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can understate or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed tradeoffs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction-cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.
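The sampling result in the last sentence can be illustrated with a sketch of windowed simulation: fast-forward functionally, warm up the caches and predictors, then simulate a chosen 50-million-instruction window in detail. The warm-up length and the phase interfaces are assumptions for illustration, not the paper's methodology verbatim.

```python
# Sketch of sampled simulation: detailed simulation only inside selected
# windows, with a warm-up region before each so caches and predictors are
# not cold. Window placement and warm-up length are illustrative.

WINDOW = 50_000_000     # detailed-simulation window (instructions)
WARMUP = 5_000_000      # warm-up before each window (assumption)

def simulate(total_instructions, window_starts, fast_forward, warm_up, detailed):
    results = []
    pos = 0
    for start in sorted(window_starts):
        fast_forward(pos, start - WARMUP)    # functional simulation only
        warm_up(start - WARMUP, start)       # update caches/predictors, no stats
        results.append(detailed(start, start + WINDOW))  # collect IPC etc.
        pos = start + WINDOW
    fast_forward(pos, total_instructions)
    return results

# Example with stub phases (placeholder callbacks and numbers):
stats = simulate(
    total_instructions=2_000_000_000,
    window_starts=[300_000_000, 900_000_000],
    fast_forward=lambda a, b: None,
    warm_up=lambda a, b: None,
    detailed=lambda a, b: {"window": (a, b), "ipc": 1.7},  # placeholder
)
print(stats)
```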
Multipath Execution: Opportunities and Limits
- In Proc. 12th ICS, 1998
Abstract - Cited by 29 (5 self)
Even sophisticated branch-prediction techniques necessarily suffer some mispredictions, and even relatively small mispredict rates hurt performance substantially in current-generation processors. In this paper, we investigate schemes for improving performance in the face of imperfect branch predictors by having the processor simultaneously execute code from both the taken and not-taken outcomes of a branch. This paper presents data regarding the limits of multipath execution, considers fetch-bandwidth needs for multipath execution, and discusses various dynamic confidence-prediction schemes that gauge the likelihood of branch mispredictions. Our evaluations consider executing along several (2-8) paths at once. Using 4 paths and a relatively simple confidence predictor, multipath execution garners speedups of up to 30% compared to the single-path case, with an average speedup of 14.4% for the SPECint suite. While associated increases in instruction-fetch-bandwidth requirements are not ...
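A rough sketch of the forking policy this abstract describes, under the simplifying assumption of at most one unresolved forked branch at a time; the Path structure and the stub address helpers are invented for illustration and are not the paper's machinery.

```python
# Sketch of confidence-gated multipath execution: fork both outcomes of a
# low-confidence branch while under the path limit; squash losing paths
# when the branch resolves. Simplification: the bookkeeping assumes one
# unresolved forked branch at a time.

MAX_PATHS = 4

def target_of(pc):      return pc + 100     # stub branch target (illustrative)
def fallthrough_of(pc): return pc + 4       # stub next sequential pc

class Path:
    def __init__(self, pc, pred_outcome=None):
        self.pc = pc
        self.pred_outcome = pred_outcome    # speculated direction, if any

def on_branch(paths, path, branch_pc, predicted_taken, high_confidence):
    """Fork on low confidence if capacity allows, else follow the prediction."""
    if not high_confidence and len(paths) < MAX_PATHS:
        paths.append(Path(target_of(branch_pc), pred_outcome=True))
        path.pc, path.pred_outcome = fallthrough_of(branch_pc), False
    else:
        path.pc = target_of(branch_pc) if predicted_taken else fallthrough_of(branch_pc)

def on_resolve(paths, actual_taken):
    """Keep paths that did not speculate, or that speculated the real outcome."""
    paths[:] = [p for p in paths if p.pred_outcome in (None, actual_taken)]
```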
Instruction Cache Fetch Policies for Speculative Execution
- In Proceedings of the 22nd International Symposium on Computer Architecture, 1995
Abstract - Cited by 19 (0 self)
Current trends in processor design are pointing to deeper and wider pipelines and superscalar architectures. The efficient use of these resources requires speculative execution, a technique whereby the processor continues executing the predicted path of a branch before the branch condition is resolved. In this paper, we investigate the implications of speculative execution on instruction cache performance. We explore policies for managing instruction cache misses ranging from aggressive policies (always fetch on the speculative path) to conservative ones (wait until branches are resolved). We test these policies and their interaction with next-line prefetching by simulating the effects on instruction caches with varying architectural parameters. Our results suggest that an aggressive policy combined with next-line prefetching is best for small latencies, while more conservative policies are preferable for large latencies.
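The policy spectrum in this abstract reduces to one decision point: what to do with an instruction-cache miss while fetch is still speculative. A hedged sketch, with invented function parameters standing in for the machine interface:

```python
# Sketch of the fetch-policy spectrum on a speculative I-cache miss:
# aggressive = always start the miss; conservative = stall until all
# older branches resolve. Policy names and callbacks are illustrative.

def handle_icache_miss(policy, unresolved_branches, start_miss, stall):
    if policy == "aggressive" or (policy == "conservative" and unresolved_branches == 0):
        start_miss()    # fetch the missing line even on a speculative path
    else:
        stall()         # wait: the miss may be on a wrong path

def maybe_next_line_prefetch(cache_lines, line_addr, prefetch):
    # Next-line prefetching pairs well with the aggressive policy at small
    # miss latencies (per the results above).
    nxt = line_addr + 1
    if nxt not in cache_lines:
        prefetch(nxt)
```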
Accurate timing analysis by modeling caches, speculation and their interaction
- In ACM Design Automation Conference (DAC), 2003
Abstract - Cited by 18 (8 self)
Schedulability analysis of real-time embedded systems requires worst-case timing guarantees of embedded software performance. This involves not only language-level program analysis, but also modeling the effects of complex micro-architectural features in modern processors. Speculative execution and caching are very common in current processors. Hence one needs to model the effects of these features on the Worst Case Execution Time (WCET) of a program. Even though the individual effects of these features have been studied recently, their combined effects have not been investigated. We do so in this paper. This is a non-trivial task because speculative execution can indirectly affect cache performance (e.g., speculatively executed blocks can cause additional cache misses). Our technique starts from the control flow graph of the embedded program, and uses integer linear programming to estimate the program's WCET. The accuracy of our modeling is illustrated by tight estimates obtained on realistic benchmarks.
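The ILP formulation can be shown on a toy control flow graph: maximize total cycles, the sum of cost[b] * count[b] over basic blocks, subject to flow conservation and a loop bound. The block costs and loop bound below are invented, and the "ILP" is solved by enumeration to avoid assuming any solver API.

```python
# Tiny WCET-as-ILP sketch for a loop CFG: entry -> body (<= N times) -> exit.
# Objective: maximize sum(cost[b] * count[b]) subject to flow constraints.
# Costs and the loop bound are illustrative assumptions.

COST = {"entry": 5, "body": 20, "exit": 3}   # per-block cycle costs (assumed)
LOOP_BOUND = 10                              # from loop-bound analysis (assumed)

best = 0
for body_count in range(LOOP_BOUND + 1):     # count[body] bounded by loop bound
    counts = {"entry": 1, "body": body_count, "exit": 1}  # flow conservation
    best = max(best, sum(COST[b] * n for b, n in counts.items()))

print("WCET bound (cycles):", best)          # 5 + 20*10 + 3 = 208
```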
Understanding the effects of wrong-path memory references on processor performance
- In Third Workshop on Memory Performance Issues, 2004
Abstract - Cited by 17 (6 self)
• Processors spend a significant portion of execution time on the wrong path: 47% of all cycles are spent fetching instructions on the wrong path for the SPEC INT 2000 benchmarks.
• Many memory references are made on the wrong path: 6% of all data references and 50% of all instruction references are wrong-path references.
• We would like to understand the effects of wrong-path memory references on processor performance; the goal is to build hardware/software mechanisms that take advantage of this understanding.
Questions we seek to answer:
1. How important is it to correctly model wrong-path memory references? How do memory latency and window size affect this?
2. What is the relative significance of the negative effects (cache pollution and bandwidth/resource contention) and the positive effects (prefetching) of wrong-path memory references on performance? (A sketch of this classification follows below.)
3. What kinds of code structures lead to the positive effects of wrong-path memory references?
Experimental methodology: execution-driven simulator with an accurate wrong-path model and a cycle-accurate, aggressive memory model. The baseline processor models the Alpha ISA with 8-wide fetch, decode, issue, and retire; a 128-entry instruction window; a hybrid conditional branch predictor (64K-entry PAs, 64K-entry gshare, 64K-entry selector); an aggressive indirect branch predictor (64K-entry target cache); a 20-cycle branch misprediction latency; and 64 KB, 4-way, 2-cycle L1 data and instruction caches with a maximum of 128 outstanding L1 misses.
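Question 2 above can be made operational: a wrong-path cache fill counts as a useful prefetch if the correct path later hits the line, and as pollution if the line it evicted is re-referenced. A hedged sketch over an invented event-trace format, ignoring eviction timing for brevity:

```python
# Sketch: classify wrong-path cache fills as useful prefetch or pollution.
# events: ("wp_fill", line, victim) for a wrong-path miss that filled `line`
#         and evicted `victim` (or None); ("cp_ref", line) for a
#         correct-path reference. Events are in time order. The format is
#         an assumption for this sketch.

def classify(events):
    wp_lines, victims = set(), set()
    prefetch = pollution = 0
    for kind, line, *rest in events:
        if kind == "wp_fill":
            wp_lines.add(line)
            if rest and rest[0] is not None:
                victims.add(rest[0])      # line evicted by a wrong-path fill
        elif kind == "cp_ref":
            if line in wp_lines:
                prefetch += 1             # wrong-path fill later hit: useful
                wp_lines.discard(line)
            elif line in victims:
                pollution += 1            # miss caused by wrong-path eviction
                victims.discard(line)
    return prefetch, pollution

events = [("wp_fill", 7, 3), ("cp_ref", 7), ("cp_ref", 3)]
print(classify(events))   # (1, 1): line 7 helped, evicting line 3 hurt
```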
Modeling Control Speculation for Timing Analysis
- Journal of Real-Time Systems, 2005
Abstract - Cited by 16 (5 self)
The schedulability analysis of real-time embedded systems requires Worst Case Execution Time (WCET) analysis for the individual tasks. Bounding WCET involves not only language-level program path analysis, but also modeling the performance impact of complex micro-architectural features present in modern processors. In this paper, we statically analyze the execution time of embedded software on processors with speculative execution. The speculation of conditional branch outcomes (branch prediction) significantly improves a program's execution time. Thus, accurate modeling of control speculation is important for calculating tight WCET estimates. We present a parameterized framework to model the different branch prediction schemes. We further consider the complex interaction between speculative execution and instruction cache performance, that is, the fact that speculatively executed blocks can generate additional cache hits/misses. We extend our modeling to capture this effect of branch prediction on cache performance. Starting with the control flow graph of a program, our technique uses integer linear programming to estimate the program's WCET. The accuracy of our method is demonstrated by tight estimates obtained on realistic benchmarks.
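One ingredient of such a parameterized model can be sketched directly: bounding the mispredictions a 2-bit saturating-counter predictor incurs on a loop branch, maximized over initial predictor states, since static analysis must assume the worst reachable state. The 2-bit scheme and loop pattern are stand-ins for illustration, not the paper's full framework.

```python
# Sketch: worst-case mispredictions of a 2-bit saturating-counter predictor
# for a loop branch taken n-1 times then not taken. States 0,1 predict
# not-taken; 2,3 predict taken. The scenario is an illustrative assumption.

def loop_mispredictions(n_iterations, initial_state=0):
    state, mispredicts = initial_state, 0
    outcomes = [True] * (n_iterations - 1) + [False]   # back-edge pattern
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredicts += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# Worst case over initial predictor states, as a WCET analysis must assume:
print(max(loop_mispredictions(10, s) for s in range(4)))  # 3
```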
Instruction Prefetching of Systems Codes With Layout Optimized for Reduced Cache Misses
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996
Abstract - Cited by 16 (3 self)
High-performing on-chip instruction caches are crucial to keep fast processors busy. Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches in loop-intensive engineering codes, they are less able to do so in large systems codes. To improve the performance of the latter codes, the compiler can be used to lay out the code in memory for reduced cache conflicts. Interestingly, such an operation leaves the code in a state that can be exploited by a new type of instruction prefetching: guarded sequential prefetching. The idea is that the compiler leaves hints in the code as to how the code was laid out. Then, at run time, the prefetching hardware detects these hints and uses them to prefetch more effectively. This scheme can be implemented very cheaply: one bit encoded in control transfer instructions and a prefetch module that requires minor extensions to existing next-line sequential prefetchers. Furthermore, the scheme can be turned off and on at ru...
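A minimal sketch of the guarded sequential prefetcher described above, assuming the guard bit is visible to the fetch stage on decoded control-transfer instructions; the line size, instruction record, and queue interface are illustrative assumptions.

```python
# Sketch of guarded sequential prefetching: the compiler sets one bit on a
# control-transfer instruction when the code layout makes the next
# sequential line useful; hardware prefetches the next line only when the
# bit is set. Fields and constants are illustrative.

LINE_SIZE = 64   # bytes per I-cache line (assumption)

def fetch_step(pc, icache_lines, prefetch_queue, instr_at):
    """icache_lines: set of resident line numbers; instr_at(pc) -> instruction."""
    line = pc // LINE_SIZE
    if line not in icache_lines:
        icache_lines.add(line)            # demand miss handled here
    ins = instr_at(pc)
    # Guarded sequential prefetch: one bit in the control-transfer encoding
    # says the compiler laid the likely successor out in the next line.
    if ins.is_control_transfer and ins.guard_bit and (line + 1) not in icache_lines:
        prefetch_queue.append(line + 1)
    return ins
```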
An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors
- IEEE Transactions on Computers, 2005