Dynamo: A Transparent Dynamic Optimization System
ACM SIGPLAN Notices, 2000
Cited by 347 (1 self)
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of --O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their --O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
Trace-Driven Memory Simulation: A Survey
ACM Computing Surveys, 2004
Cited by 134 (0 self)
This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show that no single method is best when all criteria, including accuracy, speed, memory, flexibility, portability, expense, and ease of use, are considered. In a concluding section, we examine fundamental limitations to trace-driven simulation, and survey some recent developments in memory simulation that may overcome these bottlenecks.
Full-System Timing-First Simulation
In Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 2002
Cited by 56 (9 self)
Computer system designers often evaluate future design alternatives with detailed simulators that strive for functional fidelity (to execute relevant workloads) and performance fidelity (to rank design alternatives). Trends toward multithreaded architectures, more complex micro-architectures, and richer workloads make authoring detailed simulators increasingly difficult. To manage simulator complexity, this paper advocates decoupled simulator organizations that separate functional and performance concerns. Furthermore, we define an approach, called timing-first simulation, that uses an augmented timing simulator to execute instructions important to performance in conjunction with a functional simulator to ensure correctness. This design simplifies software development, leverages existing simulators, and can model microarchitecture timing in detail. We describe
Transparent dynamic optimization: The design and implementation of Dynamo
1999
Cited by 49 (4 self)
Keywords: dynamic optimization, compiler, trace selection, binary translation. © Copyright Hewlett-Packard Company 1999.
Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language,
FLASH vs. (Simulated) FLASH: Closing the Simulation Loop
2000
Cited by 47 (0 self)
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make inaccurate absolute performance predictions, it will still accurately predict architectural trends. This paper studies the source and magnitude of error in a range of architectural simulators by comparing the simulated execution time of several applications and microbenchmarks to their execution time on the actual hardware being modeled. The existence of a hardware gold standard allows us to find, quantify, and fix simulator inaccuracies. We then use the simulators to predict architectural trends and analyze the sensitivity of the results to the simulator configuration. We find that most of our simulators predict trends accurately, as long as ...
Efficient Memory Simulation in SimICS
1995
Cited by 39 (6 self)
We describe novel techniques used for efficient simulation of memory in SimICS, an instruction level simulator developed at SICS. The design has focused on efficiently supporting the simulation of multiprocessors, analyzing complex memory hierarchies and running large binaries with a mixture of system-level and user-level code. A software caching
Reducing Synchronization Overhead in Parallel Simulation
1995
Cited by 26 (0 self)
Synchronization is often the dominant cost in conservative parallel simulation, particularly in simulations of parallel computers, in which low-latency simulated communication requires frequent synchronization. This thesis presents local barriers and predictive barrier scheduling, two techniques for reducing synchronization overhead in the simulation of message-passing multicomputers. Local barriers use nearest-neighbor synchronization to reduce waiting time at synchronization points. Predictive barrier scheduling, a novel technique which schedules synchronizations using both compile-time and runtime analysis, reduces the frequency of synchronization operations. These techniques were evaluated by comparing their performance to that of periodic global synchronization. Experiments show that local barriers improve performance by up to 24% for communication-bound applications, while predictive barrier scheduling improves performance by up to 65% for applications with long local computation phases. Because the two techniques are complementary, I advocate a combined approach. This work was done in the context of Parallel Proteus, a new parallel simulator of message-passing multicomputers.
Efficient Performance Prediction for Modern Microprocessors
1999
Cited by 23 (1 self)
Performance estimation of computer systems is an important topic to a large number of people in the computer industry. Computer architects need to be able to study future machines, compiler writers need to be able to evaluate the compiler output before a machine exists, and developers need insight into the machine's performance in order to tune their code. There are many performance estimation techniques that range from profile-based approaches to full machine simulation. Detailed simulation is one of the most common methods for estimating performance. It suffers, however, from potentially long run times when simulating large applications using detailed processor models. This thesis
Transparent Dynamic Optimization
1999
Cited by 18 (0 self)
Dynamic optimization refers to the runtime optimization of a native program binary. This paper describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language, compiler, operating system or hardware support. The program binary is not instrumented and is left untouched during Dynamo's operation. Dynamo observes the program's behavior through interpretation to dynamically select hot instruction traces from the running program. The hot traces are optimized using low-overhead optimization techniques and emitted into a software code cache. Subsequent instances of these traces cause the cached version to be executed, resulting in a performance boost. Contrary to intuition, we ...
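The select-optimize-cache cycle described in this abstract can be illustrated with a toy model. The sketch below is not Dynamo's implementation: the class name `ToyDynamicOptimizer`, the threshold `HOT_THRESHOLD`, and the `drop_nops` pass are hypothetical stand-ins, and traces are modeled as plain lists of instruction names rather than native code.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # hypothetical value; the abstract does not state one


class ToyDynamicOptimizer:
    """Counts how often each trace head is interpreted and caches hot traces."""

    def __init__(self, optimize):
        self.optimize = optimize        # callable: list of ops -> list of ops
        self.counts = defaultdict(int)  # trace head -> times interpreted
        self.code_cache = {}            # trace head -> optimized trace
        self.cache_hits = 0

    def run_trace(self, head, ops):
        """Return the instruction list that would execute for this trace."""
        cached = self.code_cache.get(head)
        if cached is not None:
            # Later instances of a hot trace run the cached version.
            self.cache_hits += 1
            return cached
        # Otherwise "interpret" the trace and record that it was seen.
        self.counts[head] += 1
        if self.counts[head] >= HOT_THRESHOLD:
            self.code_cache[head] = self.optimize(ops)
        return ops


def drop_nops(ops):
    """Stand-in for a low-overhead optimization: remove no-op instructions."""
    return [op for op in ops if op != "nop"]


opt = ToyDynamicOptimizer(drop_nops)
loop_body = ["load", "nop", "add", "store"]
executed = [opt.run_trace(0x100, loop_body) for _ in range(5)]
```

In this toy run the first three iterations are interpreted, the third one fills the cache, and the last two are served from it.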
A Fast and Accurate Approach to Analyze Cache Memory Behavior
In Proceedings of European Conference on Parallel Computing (Europar'00), 2000
Cited by 16 (9 self)
The gap between processors and main memory performance increases every year. In order to overcome this problem, cache memories are very useful. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed. Cache Miss Equations (CME) make it possible to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Describing the cache behavior by means of diophantine equations allows us to use mathematical techniques to obtain cache misses. Unfortunately, a direct solution of the CME is computationally intractable due to its NP-hard nature. In this work, we propose a fast and accurate approach to estimate the solution of the CME, which is based on the use of sampling techniques. Statistical techniques allow us to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. The results show that only a few seconds are required to analyze most of the SPECfp benchmarks with an error smaller than 0.01.
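The idea of estimating a miss ratio from a sample of the iteration space, with the sample size driven by a user-chosen accuracy, can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the sample-size rule is the standard worst-case bound for estimating a proportion, `toy_is_miss` is a hypothetical stand-in for evaluating the CME at one iteration point, and points are drawn uniformly with replacement.

```python
import math
import random


def sample_size(half_width, z=1.96):
    """Worst-case sample size to estimate a proportion within +/- half_width.

    Assumes the normal approximation with variance bounded by 0.25;
    z=1.96 corresponds to roughly 95% confidence.
    """
    return math.ceil(z * z * 0.25 / (half_width * half_width))


def estimate_miss_ratio(is_miss, bounds, half_width, seed=0):
    """Estimate the miss ratio of one reference from random iteration points.

    is_miss: hypothetical predicate on an iteration point (tuple of indices).
    bounds:  list of (low, high) pairs, one per loop, high exclusive.
    """
    rng = random.Random(seed)
    n = sample_size(half_width)
    misses = 0
    for _ in range(n):
        point = tuple(rng.randrange(lo, hi) for lo, hi in bounds)
        misses += bool(is_miss(point))
    return misses / n


def toy_is_miss(point):
    """Toy miss pattern: one miss per 8 consecutive inner-loop iterations."""
    _, j = point
    return j % 8 == 0


estimate = estimate_miss_ratio(toy_is_miss, [(0, 1000), (0, 1024)], half_width=0.05)
```

The toy pattern has a true miss ratio of 1/8, so the estimate should land near 0.125 while inspecting only a few hundred of the roughly one million iteration points.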

