Automatically characterizing large scale program behavior, 2002. Cited by 778 (41 self).
Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution. Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.
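To make the Basic Block Vector idea concrete, here is a minimal sketch (toy interval profiles and scikit-learn's KMeans, not the paper's implementation): each interval of execution is summarized as a normalized vector of basic block execution counts, the vectors are clustered, and one representative interval per cluster becomes a simulation point.

```python
# Sketch: Basic Block Vectors (BBVs) clustered with k-means to pick
# simulation points. Interval data below is a toy stand-in for profiles
# gathered over fixed-size instruction intervals.
import numpy as np
from sklearn.cluster import KMeans

def basic_block_vector(block_counts, num_blocks):
    """Summarize one execution interval as a normalized vector of
    per-basic-block execution counts."""
    v = np.zeros(num_blocks)
    for block_id, count in block_counts.items():
        v[block_id] = count
    total = v.sum()
    return v / total if total > 0 else v

# One profile per interval (e.g. per 100M instructions) -- toy numbers.
intervals = [
    {0: 90, 1: 10},        # a loop-heavy phase
    {0: 88, 1: 12},
    {2: 50, 3: 45, 4: 5},  # a different phase
    {2: 55, 3: 40, 4: 5},
]
X = np.array([basic_block_vector(c, num_blocks=5) for c in intervals])

# Cluster intervals by BBV similarity; each cluster is one program phase.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per cluster, the interval closest to the centroid becomes the
# simulation point; the cluster's share of intervals is its weight.
for k in range(kmeans.n_clusters):
    members = np.flatnonzero(kmeans.labels_ == k)
    dists = np.linalg.norm(X[members] - kmeans.cluster_centers_[k], axis=1)
    rep = members[np.argmin(dists)]
    print(f"phase {k}: simulate interval {rep}, weight {len(members) / len(X):.2f}")
```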
Temperature-aware microarchitecture. In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003. Cited by 478 (52 self).
With power density and hence cooling costs rising exponentially, processor packaging can no longer be designed for the worst case, and there is an urgent need for runtime processor-level techniques that can regulate operating temperature when the package’s capacity is exceeded. Evaluating such techniques, however, requires a thermal model that is practical for architectural studies. This paper describes HotSpot, an accurate yet fast model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package. Validation was performed using finite-element simulation. The paper also introduces several effective methods for dynamic thermal management (DTM): “temperature-tracking” frequency scaling, localized toggling, and migrating computation to spare hardware units. Modeling temperature at the microarchitecture level also shows that power metrics are poor predictors of temperature, and that sensor imprecision has a substantial impact on the performance of DTM.
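As a rough illustration of the equivalent-circuit idea (a single lumped node with invented R, C, and power values, not HotSpot's actual multi-block model), the sketch below integrates dT/dt = P/C - (T - T_amb)/(RC) with forward Euler:

```python
# Sketch: a single lumped thermal RC node integrated with forward Euler.
# R (K/W), C (J/K), and the power trace are illustrative, not HotSpot's.

def simulate_temperature(power_trace, R=0.8, C=60.0, t_amb=45.0, dt=0.01):
    """Integrate dT/dt = P/C - (T - t_amb) / (R*C); return the trace."""
    T = t_amb
    trace = []
    for P in power_trace:
        T += dt * (P / C - (T - t_amb) / (R * C))
        trace.append(T)
    return trace

# A 50 s burst of activity followed by 50 s of near-idle.
power = [40.0] * 5000 + [5.0] * 5000
temps = simulate_temperature(power)
print(f"peak {max(temps):.1f} C, final {temps[-1]:.1f} C")
```

Temperature settles toward T_amb + P*R with time constant R*C, which is one way to see why instantaneous power metrics alone, without this low-pass behavior, are poor predictors of temperature.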
Phase Tracking and Prediction, 2003. Cited by 233 (19 self).
In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities.
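A hypothetical sketch of what run-time phase tracking can look like (the 3-element signatures, threshold, and last-phase predictor are illustrative assumptions, not the paper's hardware design): each interval is summarized as a small signature and matched against previously seen phases by Manhattan distance.

```python
# Sketch: on-line phase classification by signature similarity.
# The 3-element signatures and threshold are illustrative.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

class PhaseTracker:
    def __init__(self, threshold=0.3):
        self.phases = []  # one representative signature per known phase
        self.threshold = threshold

    def classify(self, signature):
        """Return the id of the closest known phase, or allocate a new id."""
        for pid, rep in enumerate(self.phases):
            if manhattan(signature, rep) < self.threshold:
                return pid
        self.phases.append(signature)
        return len(self.phases) - 1

tracker = PhaseTracker()
ids = [tracker.classify(sig) for sig in
       [(0.9, 0.1, 0.0), (0.88, 0.12, 0.0), (0.1, 0.2, 0.7), (0.9, 0.1, 0.0)]]
print(ids)  # [0, 0, 1, 0]: the fourth interval is a recurrence of phase 0
# A trivial "last phase" predictor then guesses ids[-1] for the next interval.
```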
Runtime power monitoring in high-end processors: Methodology and empirical data, 2003. Cited by 199 (4 self).
With power dissipation becoming an increasingly vexing problem across many classes of computer systems, measuring power dissipation of real, running systems has become crucial for hardware and software system research and design. Live power measurements are imperative for studies requiring execution times too long for simulation, such as thermal analysis. Furthermore, as processors become more complex and include a host of aggressive dynamic power management techniques, per-component estimates of power dissipation have become both more challenging and more important. In this paper we describe a coordinated measurement approach that combines real total power measurement with performance-counter-based, per-unit power estimation. The resulting tool offers live total power measurements for Intel Pentium 4 processors, and also provides power breakdowns for 22 of the major CPU subunits over minutes of SPEC2000 and desktop workload execution. As an example application, we use the generated component power breakdowns to identify program power phase behavior. Overall, this paper demonstrates a processor power measurement and estimation methodology and also gives experiences and empirical application results that can provide a basis for future power-aware research.
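To illustrate the counter-driven, per-unit estimation style described above (the units, max-power weights, idle fraction, and counter values below are invented, not the paper's Pentium 4 model):

```python
# Sketch: per-unit power estimated from access counters. The units,
# max-power weights, idle fraction, and counts are invented.

MAX_POWER = {"icache": 3.1, "dcache": 4.5, "alu": 6.0, "fpu": 7.2}  # watts
IDLE_FRACTION = 0.1  # share of max power burned even with no accesses

def unit_power(unit, accesses, cycles):
    rate = accesses / cycles  # accesses per cycle, assumed in [0, 1]
    pmax = MAX_POWER[unit]
    return IDLE_FRACTION * pmax + (1 - IDLE_FRACTION) * pmax * rate

counters = {"icache": 8_000_000, "dcache": 5_500_000,
            "alu": 9_000_000, "fpu": 400_000}
cycles = 10_000_000

breakdown = {u: unit_power(u, n, cycles) for u, n in counters.items()}
total = sum(breakdown.values())
for u, p in breakdown.items():
    print(f"{u:7s} {p:5.2f} W ({p / total:6.1%})")
# A live whole-package measurement would calibrate and bound this total.
```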
Managing Multi-Configurable Hardware via Dynamic Working Set Analysis. In 29th Annual International Symposium on Computer Architecture, 2002. Cited by 192 (3 self).
Microprocessors are designed to provide good average performance over a variety of workloads. This can lead to inefficiencies both in power and performance for individual programs and during individual phases within the same program. Microarchitectures with multi-configuration units (e.g. caches, predictors, instruction windows) are able to adapt dynamically to program behavior and enable/disable resources as needed. A key element of existing configuration algorithms is adjusting to program phase changes. This is typically done by "tuning" when a phase change is detected -- i.e. sequencing through a series of trial configurations and selecting the best. We study algorithms that dynamically collect and analyze program working set information. To make this practical, we propose working set signatures -- highly compressed working set representations (e.g. 32-128 bytes total). We describe algorithms that use working set signatures to 1) detect working set changes and trigger re-tuning; 2) identify recurring working sets and re-install saved optimal reconfigurations, thus avoiding the time-consuming tuning process; 3) estimate working set sizes to configure caches directly to the proper size, also avoiding the tuning process. We use reconfigurable instruction caches to demonstrate the performance of the proposed algorithms. When applied to reconfigurable instruction caches, an algorithm that identifies recurring phases achieves power savings and performance similar to the best algorithm reported to date, but with orders-of-magnitude savings in retunings.
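The working set signature idea lends itself to a compact sketch (signature size, hash, and re-tuning threshold are illustrative assumptions): hash each touched block into a small bit vector, then measure phase change as the relative difference between consecutive signatures.

```python
# Sketch: working set signatures as small bit vectors. The signature
# size, hash, and re-tuning threshold are illustrative.

SIG_BITS = 1024  # 128 bytes per signature

def signature(touched_blocks):
    """Hash each touched block address into a bit vector (int as bitset)."""
    sig = 0
    for addr in touched_blocks:
        sig |= 1 << (hash(addr) % SIG_BITS)
    return sig

def relative_distance(a, b):
    """|A xor B| / |A or B|: 0 means identical working sets, 1 disjoint."""
    return bin(a ^ b).count("1") / max(bin(a | b).count("1"), 1)

prev = signature(range(0, 200))      # blocks touched in the last interval
curr = signature(range(150, 350))    # blocks touched in this interval
if relative_distance(prev, curr) > 0.5:
    print("working set changed -> trigger re-tuning")
```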
Picking Statistically Valid and Early Simulation Points, 2003. Cited by 116 (15 self).
Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To address this issue we have recently proposed using Simulation Points (found by examining only basic block execution frequency profiles) to increase the efficiency and accuracy of simulation. Simulation points are a small set of execution samples that, when combined, represent the complete execution of the program.
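A sketch of one plausible way to bias selection toward early intervals (the slack parameter and data are invented; the paper's actual criterion may differ): among a cluster's intervals, take the earliest one that is still nearly as close to the centroid as the best match, which minimizes the fast-forwarding needed to reach it.

```python
# Sketch: bias simulation-point choice toward early intervals. Toy data;
# the slack parameter is an invented knob.
import numpy as np

def early_simulation_point(members, X, centroid, slack=1.5):
    """Among a cluster's intervals (in execution order), return the earliest
    one whose distance to the centroid is within `slack` times the best."""
    dists = np.linalg.norm(X[members] - centroid, axis=1)
    qualifying = members[dists <= dists.min() * slack]
    return qualifying.min()  # earliest => least fast-forwarding to reach it

X = np.array([[0.9, 0.1], [0.7, 0.3], [0.88, 0.12], [0.1, 0.9]])
members = np.array([0, 1, 2])  # intervals belonging to one phase
centroid = X[members].mean(axis=0)
print(early_simulation_point(members, X, centroid))  # 0, although 2 is closest
```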
Characterizing and Predicting Program Behavior and its Variability. In International Conference on Parallel Architectures and Compilation Techniques, 2003. Cited by 115 (4 self).
To reach the next level of performance and energy efficiency, optimizations are increasingly applied in a dynamic and adaptive manner. Current adaptive systems are typically reactive and optimize hardware or software in response to detecting a shift in program behavior. We argue that program behavior variability requires adaptive systems to be predictive rather than reactive. In order to be effective, systems need to adapt according to future rather than most recent past behavior. In this paper we explore the potential of incorporating prediction into adaptive systems. We study the time-varying behavior of programs using metrics derived from hardware counters on two different micro-architectures. Our evaluation shows that programs do indeed exhibit significant behavior variation even at a granularity of millions of instructions. In addition, while the actual behavior across metrics may be different, periodicity in the behavior is shared across metrics. We exploit these characteristics in the design of on-line statistical and table-based predictors. We introduce a new class of predictors, cross-metric predictors, that use one metric to predict another, thus making possible an efficient coupling of multiple predictors. We evaluate these predictors on the SPECcpu2000 benchmark suite and show that table-based predictors outperform statistical predictors by as much as 69% on benchmarks with high variability.
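A hypothetical sketch of a table-based cross-metric predictor in this spirit (history depth, quantization, and the IPC-to-L2-miss pairing are invented for illustration): the quantized recent history of one metric indexes a table that predicts another metric's next value.

```python
# Sketch: a table-based cross-metric predictor. Depth, quantization,
# and the IPC -> L2-miss-rate pairing are illustrative.
from collections import defaultdict

class CrossMetricPredictor:
    """Predict metric B's next value from the recent history of metric A."""
    def __init__(self, levels=8, depth=2):
        self.levels, self.depth = levels, depth
        self.table = defaultdict(lambda: (0.0, 0))  # A-history -> (sum_B, n)
        self.history = []

    def _key(self):
        return tuple(self.history[-self.depth:])

    def observe(self, a, b):
        if len(self.history) >= self.depth:
            s, n = self.table[self._key()]
            self.table[self._key()] = (s + b, n + 1)  # learn B after this A-pattern
        self.history.append(min(int(a * self.levels), self.levels - 1))

    def predict(self):
        s, n = self.table[self._key()]
        return s / n if n else None  # mean B seen after the current A-history

p = CrossMetricPredictor()
for ipc, l2_miss in [(0.9, 0.02), (0.9, 0.02), (0.9, 0.03), (0.9, 0.02)]:
    p.observe(ipc, l2_miss)
print(p.predict())  # 0.025
```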
Comparing program phase detection techniques. In Int. Symposium on Microarchitecture, 2003. Cited by 99 (1 self).
Detecting program phase changes accurately is an important aspect of dynamically adaptable systems. Three dynamic program phase detection techniques are compared: instruction working sets, basic block vectors (BBV), and conditional branch counts. Because program phases are difficult to define, we compare the techniques using a variety of metrics. BBV techniques perform better than the other techniques, providing higher sensitivity and more stable phases. However, the instruction working set technique yields 30% longer phases than the BBV method, although there is less stability within phases. On average, the methods agree on phase changes 85% of the time. During the 15% of the time they disagree, the BBV method is more efficient at detecting performance changes. The conditional branch counter technique provides good sensitivity, but is less effective at detecting major phase changes. Nevertheless, the branch counter technique correlates 83% of the time with the BBV-based technique. As an auxiliary result, we show that techniques based on procedure granularities do not perform as well as those based on instruction or basic block granularities. This is mainly due to their inability to detect changes within procedures.
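As an illustration of the lightest-weight technique in this comparison, a conditional-branch-counter detector can be sketched in a few lines (the per-interval counts and relative threshold are invented):

```python
# Sketch: phase change detection from conditional branch counts alone.
# Per-interval counts and the relative threshold are invented.

def detect_phase_changes(branch_counts, threshold=0.25):
    """Flag interval i when its branch count moves more than `threshold`
    (relative) away from the previous interval's count."""
    return [i for i in range(1, len(branch_counts))
            if abs(branch_counts[i] - branch_counts[i - 1])
               / max(branch_counts[i - 1], 1) > threshold]

# Conditional branches executed per 10M-instruction interval (toy numbers).
counts = [1_200_000, 1_210_000, 1_190_000, 2_400_000, 2_390_000, 900_000]
print(detect_phase_changes(counts))  # [3, 5]
```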
SimPoint 3.0: Faster and more flexible program analysis. Journal of Instruction Level Parallelism, 2005. Cited by 77 (4 self).
This paper describes the new features available in the SimPoint 3.0 release. The release provides two techniques for drastically reducing the run-time of SimPoint: faster searching to find the best clustering, and efficiently clustering large numbers of intervals. SimPoint 3.0 also provides an option to output only the simulation points that represent the majority of execution, which can reduce simulation time without much increase in error. Finally, this release provides support for correctly clustering variable length intervals, taking into consideration the weight of each interval during clustering. This paper describes SimPoint 3.0’s new features, how to use them, and points out some common pitfalls.
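A minimal sketch of the weighted-clustering idea for variable length intervals (toy data and a plain Lloyd's iteration; SimPoint itself layers random projection and BIC-based selection of k on top): each interval's weight, its share of executed instructions, scales its pull on the centroid.

```python
# Sketch: k-means where each interval's weight (its share of executed
# instructions) scales its pull on the centroid. Toy data, plain Lloyd's
# iteration; no random projection or BIC-based choice of k here.
import numpy as np

def weighted_kmeans(X, w, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each interval to its nearest center.
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # Recompute centers as weighted means: long intervals pull harder.
        for j in range(k):
            m = labels == j
            if m.any():
                centers[j] = np.average(X[m], axis=0, weights=w[m])
    return labels

X = np.array([[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]])
w = np.array([0.5, 0.1, 0.3, 0.1])  # each interval's fraction of execution
print(weighted_kmeans(X, w))  # two phases, e.g. [0 0 1 1] up to label order
```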
Using SimPoint for Accurate and Efficient Simulation. ACM SIGMETRICS Performance Evaluation Review, 2003. Cited by 72 (2 self).
Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of a single industry standard benchmark at this level of detail takes on the order of months to complete. This problem is exacerbated by the fact that a proper architectural evaluation requires multiple benchmarks to be evaluated across many separate runs. To address this issue we recently created a tool called SimPoint that automatically finds a small set of Simulation Points to represent the complete execution of a program for efficient and accurate simulation. In this paper we describe how to use the SimPoint tool, and introduce an improved SimPoint algorithm designed to significantly reduce the simulation time required when the simulation environment relies upon fast-forwarding.
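Once simulation points and their weights are in hand, whole-program metrics are reconstructed as a weighted sum; a short sketch with invented per-point numbers:

```python
# Sketch: reconstruct a whole-program estimate from per-point results.
# Intervals, weights, and CPIs are invented.

sim_points = [
    {"interval": 12,  "weight": 0.55, "cpi": 1.10},
    {"interval": 301, "weight": 0.30, "cpi": 2.40},
    {"interval": 507, "weight": 0.15, "cpi": 0.85},
]

assert abs(sum(p["weight"] for p in sim_points) - 1.0) < 1e-9
overall_cpi = sum(p["weight"] * p["cpi"] for p in sim_points)
print(f"estimated whole-program CPI: {overall_cpi:.2f}")  # ~1.45
```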