M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of trace sampling for performance debugging tools. In Proceedings of the 1993.
....to complete transactions in the OLTP benchmark. Clearly, a single short simulation run cannot capture the wide spectrum of commercial workloads' behavior. Time sampling is a well-known technique that may prove valuable to complete an architectural study within a reasonable simulation time [10, 12]. We intend to explore this further in future work. 8 Related Work Prior work has studied commercial workloads for their architectural and microarchitectural characteristics, and has used them for simulation studies and for performance evaluations. The characterization studies can be classified ....
Margaret Martonosi, Anoop Gupta, and Thomas Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the 1993.
....examine memory reference trace sampling and present new algorithms for trace reduction and compression. Other work has studied analytic models for estimating cache miss rates during the unprimed portion of the sample [15, 33] or described means for bounding errors by adjusting simulation lengths [21]. The most widely used sampling technique in the processor architecture community is to perform full-detail simulation for a single, large segment of execution, anywhere from tens of millions to billions of instructions long. When using this approach, the choice of a representative sample becomes ....
Margaret Martonosi, Anoop Gupta, and Thomas Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 248--59, May 1993.
.... is implemented in software, it limits the effectiveness of techniques such as time sampling [Laha88] and set sampling [Puzak85]. Martonosi investigates time sampling in a later paper by adding another check to the instrumented code to enable and disable monitoring at regular intervals [Martonosi93]. When enabled, instrumentation overheads are similar to those cited above, but when disabled, an instrumented reference executes only 6 extra instructions. When trapping is enabled for 10% of the entire execution time, MemSpy slowdowns drop to about 4 to 10, a factor of two improvement over ....
Martonosi, M., Gupta, A. and Anderson, T. Effectiveness of trace sampling for performance debugging tools. In Proceedings of the
....to 5 times speedup in simulation speed compared to our simulator without sampling. 2.3 Related Work on Sampling Sampling was first proposed by Laha et al. [5] to improve the speed of cache simulation, and has subsequently been used in a number of other studies to improve either cache simulation [6, 7, 8], or processor simulation [9, 10]. The main inaccuracies in all these studies stem from loss of state information about the simulated system due to sampling. To reduce these inaccuracies, these studies use a combination of state repair methods, larger sampling periods, and larger warm-up periods. ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proc. of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993.
....target architectures this accuracy holds. The IMPACT developers do not specify precise criteria regarding the acceptable range of target architectures. Other studies in trace sampling have found that sampling ratios of 10 typically work very well. A trace sampling evaluation by Martonosi, et al. [78], found that sampling with a ratio of 10 and sample sizes of 0.5M instructions gave an absolute error of less than 0.3 when using smaller cache sizes (of up to 128 KB) but much larger sampling sizes are needed for cache sizes of 1MB and up. In our own simulations, we also found that accuracy ....
Margaret Martonosi, Anoop Gupta, and Thomas Anderson, "Effectiveness of Trace Sampling for Performance Debugging Tools," Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993.
....memory behaviour at these bottlenecks. One problem with MemSpy is how long it takes to run: the tool was developed to use simulator traces of program execution and the overhead was so noticeable as to prompt work on using samples from traces rather than full runs [24]. Another recent tool, SM prof, also provides a hierarchy of detail levels [6]. At the top level, the tool gives a graphic profile of the different types of cache line sharing occurring during execution of the program. A programmer can then concentrate on suspected problem areas, and the tool ....
....cfd for one iteration when using the cat. Such sampling was valid because the memory access patterns of cfd are similar for each iteration. Performance has also been reported as a problem with MemSpy, and the solution adopted with that tool was also to take samples out of the simulation traces [24]. The other limitation of the cat is that it cannot report in terms of program source lines. In contrast, MemSpy reports in terms of procedures in the application program, and SM prof points the user to lines of code which use offending shared memory variables. This feature could be added to the ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of trace sampling for performance debugging tools. In Proceedings of the 1993 ACM SIGMETRICS, pages 248--259, 1993.
....it also demands large amounts of space and time, particularly for large caches and long running applications. These demands can be greatly reduced by employing sampling techniques at the expense of providing only a statistical estimate of the properties of a full trace. Previous studies [10, 7, 6] contain results for other workloads and caches and discuss the conditions under which sampling may, or may not, be used. This work was supported in part by NSF Grant MIP 9700970 and by a gift from Intel Corporation. Our interest in using sampling is three fold. First, we are interested in the ....
....of the integer SPEC95 suite can be found in [9] 3.2. Sampling Techniques After experimenting with various sample sizes and sampling ratios, we settled on a sample size of 500,000 references and a sampling ratio of 0.1. The process of tuning these parameters for a given workload is important [10, 6]. The rationale for our choice for these Windows NT desktop application traces is discussed in a technical report [3] Table 3 describes the sampling techniques considered in this study. As noted in the previous section, they differ by the state of the cache at the beginning of a sample, or, ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of trace sampling for performance debugging tools. In Proceedings of ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pages 248--259, 1993.
....and time, particularly for large caches and long running applications. These demands can be greatly reduced by employing sampling techniques at the expense of providing only a statistical estimate of the properties of a full trace. Previous studies contain results for various workloads and caches [Martonosi et al. 93, Laha et al. 88, Kessler et al. 94] and discuss the conditions under which sampling may, or may not, be used. Our interest is in the behavior of commonly used desktop applications. When compared to benchmarks such as SPEC95, these applications have larger working sets, are This work was ....
....the integer SPEC95 suite can be found in [Lee et al. 98]. Methodology: After experimenting with various sample sizes and sample ratios, we settled on a sample size of 500,000 references and a sampling ratio of 0.1. The process of tuning these parameters for a given workload is important [Martonosi et al. 93, Kessler et al. 94]. The rationale for our choice for these Windows NT desktop application traces can be found in the technical report. Table 3 describes the sampling techniques considered in this study. As noted in the previous section, they differ by the state of the cache at the beginning of a ....
Martonosi, M., Gupta, A., and Anderson, T. Effectiveness of trace sampling for performance debugging tools. In Proceedings of ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pages 248--259, 1993.
....each memory reference, the instrumentation adds procedure calls to a back end simulator that uses MemSpy to collect and maintain statistics about the memory behavior of the application. To reduce the overhead of simulating every memory reference, MemSpy uses an effective trace sampling technique [20] to simulate only randomly chosen portions of the reference trace. MemSpy's hierarchical approach to the detection of memory overheads and the categorization of cache misses is novel. It does not, however, provide information on the causes of the communication (what data object and in what ....
Margaret Martonosi, Anoop Gupta and Thomas Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the 1993 SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 248-259. May 1993.
....pollution [2, 22]. Furthermore, this overhead is almost always wasted, because in most simulations the common case, e.g. a cache hit, requires no action. Clearly, optimizing the lookup (step 2) to quickly detect these no-action cases can significantly improve simulation performance. MemSpy [11] builds on this observation by saving only the registers necessary to determine if a reference is a hit or a miss; hits branch around the remaining register saves and miss processing. MemSpy's optimization improves performance, but sacrifices trace-driven simulation's clean abstraction. The action ....
....we lump effective address calculation and action lookup into a single lookup term. Similarly, we lump action simulation and metric update into a single miss processing term. For trace-driven simulation, we consider an on-the-fly simulator that performs a procedure call to perform the lookup [21, 11]. To maintain a clean interface between the reference generator and the simulator, processor state is saved before invoking the simulator. Our implementation inserts two instructions before each memory reference that compute the effective address and jump to a stub; the stub saves processor state, ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 248--259, May 1993.
....no action is required for references to the most recently used (MRU) block in each set for set-associative caches with least recently used (LRU) replacement. Clearly, optimizing the lookup (step 2) to quickly detect these no-action cases can significantly improve simulation performance. MemSpy [51] builds on this observation by saving only the registers necessary to determine if a reference is a hit or a miss; hits branch around the remaining register saves and miss processing. MemSpy's optimization improves performance but sacrifices trace-driven simulation's clean abstraction. The action ....
....address calculation and action lookup into a single lookup term. Similarly, I lump action simulation and metric update into a single miss processing term. For trace-driven simulation, I consider two on-the-fly simulators: one invokes the simulator for each memory reference (via procedure call) [71, 51], and one buffers effective addresses, invoking the simulator only when the buffer is full. To maintain a clean interface between the reference generator and the simulator, processor state is saved before invoking the simulator. The procedure call implementation inserts two instructions ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 248--259, May 1993.
....no action is required for references to the most recently used (MRU) block in each set for set-associative caches with least recently used (LRU) replacement. Clearly, optimizing the lookup (step 2) to quickly detect these no-action cases can significantly improve simulation performance. MemSpy [16] builds on this observation by saving only the registers necessary to determine if a reference is a hit or a miss; hits branch around the remaining register saves and miss processing. MemSpy's optimization improves performance but sacrifices trace-driven simulation's clean abstraction. The action ....
....a place holder for register specifier and/or immediate operands of the specific memory reference. [Figure 8: Fast Cache Lookup with Live Condition Codes] For trace-driven simulation, we consider two on-the-fly simulators: one invokes the simulator for each memory reference (via procedure call) [29, 16], and one buffers effective addresses, invoking the simulator only when the buffer is full. To maintain a clean interface between the reference generator and the simulator, processor state is saved before invoking the simulator. The procedure call implementation inserts two instructions before ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 248--259, May 1993.
.... When hit bypassing is implemented in software, it limits the effectiveness of techniques such as time sampling [Laha88] and set sampling [Puzak85]. Martonosi investigated time sampling by adding an additional check to MemSpy's annotations that enabled and disabled monitoring at regular intervals [Martonosi93]. When enabled, annotation overheads are similar to those cited previously (25 instructions per hit), but when disabled, an annotated reference executes only 6 extra instructions. When trapping is enabled for 10% of the entire execution time, MemSpy slowdowns dropped to about 4 to 10, a factor of ....
Martonosi, M., Gupta, A. and Anderson, T. Effectiveness of trace sampling for performance debugging tools. In Proceedings of the 1993 SIGMETRICS Conference on the Measurement and Modeling of Computer Systems, Santa Clara, California, ACM, 248-259, 1993.
....dependent parameter is the sampling ratio, the ratio of the total number of references within the samples, divided by the total number of references in the run. In this paper, we present accuracy results for one setting of sampling parameters, and briefly summarize other results. References [6], [10], and [11] discuss reference trace sampling in more detail. [Figure 10: Estimated and true cache miss rates for sequential and parallel applications.] 6.4.1 MemSpy ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, May 1993.
....lower level of detail. Only the model's caches, branch predictor, and architectural state are updated. Other work has studied analytic models for estimating cache miss rates during the unprimed portion of the sample [25, 64] or described means for bounding errors by adjusting simulation lengths [34]. Iyengar and Trevillyan have derived the R metric for measuring the representativeness of a trace [18] and they generate traces by scaling basic block transition counts and adjusting selected instructions to optimize the R metric. Their technique incorporates cache and TLB behavior as well as ....
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 248--59, May 1993.
....(in essence, a handler) responds to the trigger by incrementing appropriate counts in a set of histograms. These banks of memory-mapped histograms form the statistics state information for this performance monitoring system. For comparison, consider a software-based approach such as MemSpy [MGA95] or CProf [LW94]. In these tools, the events to be monitored are memory references in the code. Trigger points are created at these events by instrumenting them with calls to software procedures, or handlers. These monitoring routines update their data structures with statistics about the ....
....monitoring are present in the standard cache coherence mechanisms. 3 FlashPoint: A Case Study As a concrete example of our ideas, we now describe a tool called FlashPoint. The tool gives data-oriented breakdowns of memory overhead in the programs being run. That is, similar to tools like MemSpy [MGA95] and CPROF [LW94], it presents program performance information in terms of data, as well as code, structures in the program. FlashPoint maintains data structures that map each memory location accessed by the monitored program to its corresponding program data structure identifier. The mappings ....
[Article contains additional citation context not shown here]
M. Martonosi, A. Gupta and T. Anderson. Effectiveness of Trace Sampling for Performance Debugging Tools. Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems. May, 1993.
No context found.
M. Martonosi, A. Gupta, and T. Anderson. Effectiveness of trace sampling for performance debugging tools. In Proceedings of the 1993.
No context found.
Martonosi, M., Gupta, A. and Anderson, T. Effectiveness of trace sampling for performance debugging tools, In SIGMETRICS, Santa Clara, California, ACM, 248-259, 1993.