Code Cache Management Schemes for Dynamic Optimizers, 2002
Cited by 25 (4 self)
A dynamic optimizer is a software-based system that performs code modifications at runtime, and several such systems have been proposed over the past several years. These systems typically perform optimization on the level of an instruction trace, and most use caching mechanisms to store recently optimized portions of code. Since the dynamic optimizers produce variable-length code traces that are modified copies of portions of the original executable, a code cache management scheme must deal with the difficult problem of caching objects that vary in size and cannot be subdivided without adding extra jump instructions. Because of these constraints, many dynamic optimizers have chosen unsophisticated schemes, such as flushing the entire cache when it becomes full. Flushing minimizes the overhead of cache management but tends to discard many useful traces. This paper evaluates several alternative cache management schemes that identify and remove only enough traces to make room for a new trace. We find that by treating the code cache as a circular buffer, we can reduce the code cache miss rate by half of that achieved by flushing. Furthermore, this approach adds very little bookkeeping overhead and avoids the problems associated with code cache fragmentation. These characteristics are extremely important in a dynamic system since more complex strategies will do more harm than good if the overhead is too high.
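The contrast between flushing and circular-buffer eviction can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: the class names, the byte-count capacity model, and the access trace are all hypothetical, and the sketch tracks only sizes and FIFO order rather than real buffer addresses or wraparound.

```python
from collections import deque


class CircularCodeCache:
    """Toy code cache that evicts the oldest traces first (FIFO order).

    Assumption: capacity is modeled as a byte count; real address
    wraparound inside the buffer is not simulated.
    """

    def __init__(self, capacity):
        self.capacity = capacity   # total bytes available
        self.used = 0              # bytes currently occupied
        self.traces = deque()      # (trace_id, size), oldest on the left
        self.resident = set()      # ids of traces currently cached
        self.misses = 0

    def _insert(self, trace_id, size):
        self.traces.append((trace_id, size))
        self.resident.add(trace_id)
        self.used += size

    def access(self, trace_id, size):
        """Return True on a hit; on a miss, cache the trace and return False."""
        if trace_id in self.resident:
            return True
        self.misses += 1
        if size > self.capacity:
            return False           # trace can never fit; leave it uncached
        # Evict only as many of the oldest traces as the new trace needs.
        while self.used + size > self.capacity:
            old_id, old_size = self.traces.popleft()
            self.resident.remove(old_id)
            self.used -= old_size
        self._insert(trace_id, size)
        return False


class FlushingCodeCache(CircularCodeCache):
    """Baseline policy: discard every trace when the new one does not fit."""

    def access(self, trace_id, size):
        if trace_id in self.resident:
            return True
        self.misses += 1
        if size > self.capacity:
            return False
        if self.used + size > self.capacity:
            self.traces.clear()
            self.resident.clear()
            self.used = 0
        self._insert(trace_id, size)
        return False
```

On the hypothetical access sequence A, B, C, B with 40-byte traces and a 100-byte cache, the circular policy evicts only A to admit C, so the second access to B hits; the flushing policy discards B along with A and misses again.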
IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems. In 36th International Symposium on Microarchitecture, 2003
Cited by 23 (0 self)
IA-32 Execution Layer (IA-32 EL) is a new technology that executes IA-32 applications on Intel Itanium processor family systems. Currently, support for IA-32 applications on Itanium-based platforms is achieved using hardware circuitry on the Itanium processors. This capability will be enhanced with IA-32 EL, software that will ship with Itanium-based operating systems and will convert IA-32 instructions into Itanium instructions via dynamic translation. In this paper, we describe aspects of the IA-32 Execution Layer technology, including the general two-phase translation architecture and the usage of a single translator for multiple operating systems. The paper provides details of some of the technical challenges such as precise exceptions, emulation of FP, MMX™, and Intel Streaming SIMD Extension instructions, and misalignment handling. Finally, the paper presents some performance results.
Dynamic Binary Translation for Accumulator-Oriented Architectures, 2003
Cited by 16 (3 self)
A dynamic binary translation system for a co-designed virtual machine is described and evaluated. The underlying hardware directly executes an accumulator-oriented instruction set that exposes instruction dependence chains (strands) to a distributed microarchitecture containing a simple instruction pipeline and issue logic. To support conventional program binaries, a source instruction set (Compaq Alpha in our study) is dynamically translated to the target accumulator instruction set. The binary translator identifies chains of inter-instruction dependences and assigns them to dependence-carrying accumulators. Because the underlying superscalar microarchitecture is capable of dynamic instruction scheduling, the binary translation system does not perform aggressive optimizations or re-schedule code; this significantly reduces binary translation overhead.
Characterizing Inter-Execution and Inter-Application Optimization Persistence, 2003
Cited by 7 (0 self)
Dynamic translation and optimization systems employ code caches to improve performance and to support reuse of dynamically generated code sequences within a single run of an application. However, these intra-application caching techniques are ineffective at amortizing runtime costs on short-running applications or scripts that repeatedly invoke the same application. For these situations, dynamic translation systems have successfully used persistent code caching (saving the code cache during one run to prime later executions) as a means for amortizing runtime translation costs. The success of persistent code caching in dynamic optimizers will depend heavily on the amount of inter-execution and inter-application optimization persistence that can be found in software applications. Our experiments with the DynamoRIO dynamic optimization system demonstrate that many of the most heavily executed code traces in SPEC CPU2000 are identically optimized during successive executions. Our results indicate that there is a significant opportunity for leveraging inter-execution persistence, and there is even a small opportunity for inter-application persistence for small programs.
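The idea of priming a later execution from a saved code cache can be sketched as follows. This is an illustrative toy, not DynamoRIO's mechanism: the function names, the JSON file format, the choice to key the cache file by a SHA-256 hash of the program image, and the placeholder "optimized" strings are all assumptions made for the example.

```python
import hashlib
import json
import os
import tempfile


def cache_file(binary: bytes, cache_dir: str) -> str:
    """Hypothetical naming scheme: one cache file per distinct program image."""
    digest = hashlib.sha256(binary).hexdigest()
    return os.path.join(cache_dir, digest + ".json")


def run(binary: bytes, hot_addresses, cache_dir: str) -> int:
    """Simulate one execution; return how many traces had to be translated."""
    path = cache_file(binary, cache_dir)
    cache = {}
    if os.path.exists(path):
        # Prime this run with traces saved by an earlier execution.
        with open(path) as fh:
            cache = json.load(fh)
    translated = 0
    for addr in hot_addresses:
        key = hex(addr)
        if key not in cache:
            # Stand-in for the real (expensive) translation/optimization step.
            cache[key] = f"optimized<{key}>"
            translated += 1
    # Persist the cache so the next execution can reuse it.
    with open(path, "w") as fh:
        json.dump(cache, fh)
    return translated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        image = b"\x90\x90"  # hypothetical program image
        print(run(image, [0x1000, 0x2000], d))  # first run translates both traces
        print(run(image, [0x1000, 0x2000], d))  # second run reuses the saved cache
```

Keying by program image captures only inter-execution reuse; reusing traces across different applications would need a finer-grained key (for example, per shared library), which this sketch does not attempt.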
2D-Profiling: Detecting input-dependent branches with a single input data set. In CGO-4, 2006
Cited by 6 (4 self)
Static compilers use profiling to predict run-time program behavior. Generally, this requires multiple input sets to capture wide variations in run-time behavior. This is expensive in terms of resources and compilation time. We introduce a new mechanism, 2D-profiling, which profiles with only one input set and predicts whether the result of the profile would change significantly across multiple input sets. We use 2D-profiling to predict whether a branch’s prediction accuracy varies across input sets. The key insight is that if the prediction accuracy of an individual branch varies significantly over a profiling run with one input set, then it is more likely that the prediction accuracy of that branch varies across input sets. We evaluate 2D-profiling with the SPEC CPU 2000 integer benchmarks and show that it can identify input-dependent branches accurately.
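The key insight can be sketched in a few lines: split one profiling run into time slices, measure a branch's prediction accuracy in each slice, and flag the branch when the per-slice accuracy varies a lot. The abstract does not give the paper's actual statistical tests, so the standard-deviation check, the slice length, and the threshold below are assumptions chosen only for illustration.

```python
from statistics import pstdev


def slice_accuracies(correct, slice_len):
    """Prediction accuracy of one branch in each consecutive slice of the run.

    `correct` is a list of booleans, one per dynamic execution of the branch,
    True when the predictor was right (hypothetical input format).
    """
    accuracies = []
    for start in range(0, len(correct), slice_len):
        chunk = correct[start:start + slice_len]
        accuracies.append(sum(chunk) / len(chunk))
    return accuracies


def looks_input_dependent(correct, slice_len=100, threshold=0.1):
    """Flag a branch whose accuracy fluctuates across slices of a single run."""
    accs = slice_accuracies(correct, slice_len)
    # With a single slice there is no variation over time to measure.
    return len(accs) > 1 and pstdev(accs) > threshold
```

A branch that is predicted correctly throughout the run is not flagged, while a branch that is predicted well in the first half and badly in the second half is.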
A Treegion-based Unified Approach to Speculation and Predication in Global Instruction Scheduling, 2001
Cited by 4 (2 self)
This paper presents a treegion-based global scheduling technique for wide issue VLIW/EPIC processors. A treegion is a single-entry/multiple-exit global scheduling scope that consists of basic blocks with control-flow forming a tree. We propose a two-phase approach to global scheduling within a treegion scope that enables speculative code motion in the first phase and uses predication of all instructions in the second phase. In the first scheduling phase, tree traversal scheduling (TTS) takes full advantage of speculation to speed up all possible paths in a treegion. Over-aggressive speculation is limited by scheduling block-ending branches as early as possible, enabled by downward code motion. A multiway branch transformation is also performed to reduce control dependence height. In the second scheduling phase, fully resolved predicates (FRPs) are used to enable branch barrier instructions, such as stores and subroutine calls, to move across branches. Selective if-conversion can also be applied to remove hard-to-predict branches in a treegion. The simulation results based on an 8-issue EPIC style machine model show an average speedup of 21% of TTS over BB scheduling, an additional speedup of 6.4% from multiway branch transformation, and another 1.9% speedup from FRP-guarded code motion. Other code transformations such as treegion code layout and the general operation combining are also presented in this paper.
Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors
Cited by 2 (2 self)
Dynamic predication has been proposed to reduce the branch misprediction penalty due to hard-to-predict branch instructions. A recently proposed dynamic predication architecture, the diverge-merge processor (DMP), provides large performance improvements by dynamically predicating a large set of complex control-flow graphs that result in branch mispredictions. DMP requires significant support from a profiling compiler to determine which branch instructions and control-flow structures can be dynamically predicated. However, previous work on dynamic predication did not extensively examine the tradeoffs involved in profiling and code generation for dynamic predication architectures. This paper describes compiler support for obtaining high performance in the diverge-merge processor. We describe new profile-driven algorithms and heuristics to select branch instructions that are suitable and profitable for dynamic predication. We also develop a new profile-based analytical cost-benefit model to estimate, at compile time, the performance benefits of the dynamic predication of different types of control-flow structures including complex hammocks and loops. Our evaluations show that DMP can provide 20.4% average performance improvement over a conventional processor on SPEC integer benchmarks with our optimized compiler algorithms, whereas the average performance improvement of the best-performing alternative simple compiler algorithm is 4.5%. We also find that, with the proposed algorithms, DMP performance is not significantly affected by the differences in profile- and run-time input data sets.
Code Cache Management in Dynamic Optimization Systems, 2004
Cited by 2 (0 self)
Dynamic optimization systems store optimized or translated code in software-managed code caches in order to maximize reuse of transformed code. Code caches store superblocks that are not fixed in size, may contain links to other superblocks, and carry a high replacement overhead. These additional constraints reduce the effectiveness of conventional cache management policies. This dissertation investigates the code cache management problem in dynamic optimization systems and presents three major advances that cover the design space of cache management decisions. Through code cache simulations, we show that a FIFO replacement policy outperforms other traditional policies, as it enables contiguous cache evictions, allows for a simple circular buffer implementation, and results in comparable cache miss rates to LRU. Furthermore, a pseudo-circular FIFO algorithm is presented, which handles the problem of un-deletable cache blocks. An investigation of cache eviction granularities also reveals that evicting more than the minimum number of superblocks from the code cache at a time results in

