Results 1 - 10
of
29
Value Locality and Load Value Prediction
, 1996
"... Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently p ..."
Abstract
-
Cited by 330 (18 self)
- Add to MetaCart
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains. 1. Introduction and Related
Shade: A Fast Instruction-Set Simulator for Execution Profiling
, 1994
"... Tracing tools are used widely to help analyze, design, and tune both hardware and software systems. This paper describes a tool called Shade which combines efficient instruction-set simulation with a flexible, extensible trace generation capability. Efficiency is achieved by dynamically compiling an ..."
Abstract
-
Cited by 315 (2 self)
- Add to MetaCart
Tracing tools are used widely to help analyze, design, and tune both hardware and software systems. This paper describes a tool called Shade which combines efficient instruction-set simulation with a flexible, extensible trace generation capability. Efficiency is achieved by dynamically compiling and caching code to simulate and trace the application program. The user may control the extent of tracing in a variety of ways; arbitrarily detailed application state information may be collected during the simulation, but tracing less translates directly into greater efficiency. Current Shade implementations run on SPARC systems and simulate the SPARC (Versions 8 and 9) and MIPS I instruction sets. This paper describes the capabilities, design, implementation, and performance of Shade, and discusses instruction set emulation in general.
The predictability of data values
- IN PROCEEDINGS OF THE 30TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 1997
"... ..."
Dynamic Instruction Reuse
, 1997
"... This paper introduces the concept of dynamic instruction reuse. Empirical observations suggest that many instructions, and groups of instructions, having the same inputs, are executed dynamically. Such instructions do not have to be executed repeatedly --- their results can be obtained from a buffer ..."
Abstract
-
Cited by 164 (4 self)
- Add to MetaCart
This paper introduces the concept of dynamic instruction reuse. Empirical observations suggest that many instructions, and groups of instructions, having the same inputs, are executed dynamically. Such instructions do not have to be executed repeatedly --- their results can be obtained from a buffer where they were saved previously. This paper presents three hardware schemes for exploiting the phenomenon of dynamic instruction reuse, and evaluates their effectiveness using execution-driven simulation. We find that in some cases over 50% of the instructions can be reused. The speedups so obtained, though less striking than the percentage of instructions reused, are still quite significant. 1 Introduction There are three parameters that influence the execution time of a program. Microarchitecture has concentrated on two of them: (i) the number of instructions executed per clock cycle, i.e., the IPC, and (ii) the clock cycle time. The third parameter: (iii) the total number of instructi...
Value Locality And Speculative Execution
, 1997
"... This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detectio ..."
Abstract
-
Cited by 51 (1 self)
- Add to MetaCart
This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism without violating program correctness. Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists for condition codes and branch target addresses, and for general-purpose ...
Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance
- ACM Transactions on Computer Systems
, 2000
"... This article presents our observations demonstrating that operations on "narrow-width" quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this d ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
This article presents our observations demonstrating that operations on "narrow-width" quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, power-oriented optimization reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%--10% full-chip power savings. Our second, performance-oriented optimization improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%--6.2% for SPECint95 and 8.0%--10.4% for MediaBench. Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors
Improving Processor Performance by Simplifying and Bypassing Trivial Computations
, 2002
"... During the course of a program's execution, a processor performs many trivial computations; that is, computations that can be simplified or where the result is zero, one, or equal to one of the input operands. This paper shows that, despite compiling a program with aggressive optimizations (-O3), ap ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
During the course of a program's execution, a processor performs many trivial computations; that is, computations that can be simplified or where the result is zero, one, or equal to one of the input operands. This paper shows that, despite compiling a program with aggressive optimizations (-O3), approximately 30% of all arithmetic instructions, which account for 12% of all dynamic instructions, are trivial computations. The amount of trivial computation is not heavily dependent on the program's specific input values. Our results show that eliminating trivial computations dynamically at run-time yields an average speedup of 8% for a typical processor. Even for a very aggressive processor (i.e. one with no functional unit constraints), the average speedup is still 6%. It also is important to note that the area cost (i.e. hardware) required to dynamically detect and eliminate these trivial computations is very low, consisting of only a few comparators and multiplexers.
Dynamic Removal of Redundant Computations
- International Conference on Supercomputing
, 1999
"... A mechanism for dynamic instruction-level reuse in superscalar microprocessors is presented. The underlying concept that the mechanism exploits is the run-time removal of redundant computations, and in particular the elimination of common subexpressions and invariants. Removing redundant computation ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
A mechanism for dynamic instruction-level reuse in superscalar microprocessors is presented. The underlying concept that the mechanism exploits is the run-time removal of redundant computations, and in particular the elimination of common subexpressions and invariants. Removing redundant computation is a target of optimizing compilers but sometimes they do not succeed due to their limited knowledge of the data. Moreover, the proposed mechanism can also remove quasi-redundant computations, such as subexpressions that often produce the same result but sometimes they differ, depending on the data values, and thus they cannot be eliminated by the compiler. Experimental results for the Spec95 show that on average the mechanism can avoid the execution of about 32% of the dynamic instructions and provides a 1.10 speedup in a superscalar microprocessor. An extensive evaluation of different configurations and a comparison with previous schemes is presented, as well as the performance potential ...
Coming challenges in microarchitecture and architecture
- Proc. IEEE
, 2001
"... In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up this pace. New process technology requires more expensive megafabs and new performance levels require larger die, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture has to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper, we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges. Keywords—Design tradeoffs, microarchitecture, microarchitecture trends, microprocessor, performance improvements, power issues, technology scaling. I.

