Aggressive program optimization requires accurate profile information, but such accuracy requires collecting many samples. We explore a novel profiling architecture that reduces the overhead of collecting each sample by adding a programmable co-processor that analyzes a stream of profile samples generated by the main microprocessor. From this stream, the co-processor can detect correlations between different instructions (e.g., memory dependence profiling) as well as between different dynamic instances of the same instruction (e.g., value profiling). The profiler's programmable nature allows a broad range of data to be extracted, post-processed, and formatted, and provides the flexibility to tailor the profiling application to the program under test. Because the co-processor is specialized for profiling, it can execute profiling applications more efficiently than a general-purpose processor. The co-processor should not significantly impact the cost or performance of the main processor because it can be implemented using a small number of transistors at the chip's periphery. We demonstrate the proposed design through a detailed evaluation of load value profiling. Our implementation quickly and accurately estimates the value invariance of loads, with time overhead roughly proportional to the size of the program's instruction working set. This algorithm demonstrates a number of general profiling techniques, including estimating the completeness of a profile, focusing profiling on particular instructions, and managing profiling resources.
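To make the notion of value invariance concrete, the following is a minimal software sketch and not the co-processor algorithm described above. It assumes each profile sample is a hypothetical `(pc, value)` pair for a load, and it assumes invariance can be approximated as the fraction of a load's samples that carry its most frequent value. The function name `value_invariance` and the sample data are illustrative only.

```python
from collections import Counter, defaultdict


def value_invariance(samples):
    """Estimate per-load value invariance from (pc, value) samples.

    Assumption: invariance is the share of a load's samples that
    equal its single most frequently observed value.
    Returns {pc: (most_frequent_value, invariance_fraction)}.
    """
    per_load = defaultdict(Counter)
    for pc, value in samples:
        per_load[pc][value] += 1  # count each observed value per load PC

    result = {}
    for pc, counts in per_load.items():
        total = sum(counts.values())
        top_value, top_count = counts.most_common(1)[0]
        result[pc] = (top_value, top_count / total)
    return result


# Hypothetical sample stream: two static loads at made-up addresses.
samples = [
    (0x400, 7), (0x400, 7), (0x400, 7), (0x400, 9),
    (0x404, 1), (0x404, 2),
]
inv = value_invariance(samples)
```

In this sketch the load at `0x400` looks mostly invariant (three of four samples share one value), while the load at `0x404` does not. A hardware profiler would need bounded tables rather than unbounded counters, which is one reason managing profiling resources matters.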