30 citations found.
A. J. Goldberg and J. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL, " Proceedings of Supercomputing'91, pp. 481--490, Nov. 1991.

This paper is cited in the following contexts:

Visual Assistance for Concurrent Processing - Erbacher (2000)

....events occur [Lehr89, McDow89]. In addition, by forcing the program to execute more slowly (while it executes instrumentation code), we may remove contention for resources such as a global bus. Reduction of resource contention is a consequence of reducing the frequency of accesses to resources [Goldb91]. Different processors can also execute instrumentation code a different number of times, changing their relative speeds [Goldb91]. It has been shown that adding any amount of instrumentation can change the execution patterns of a program. Merely adding a print statement can cause a malfunctioning ....

....code), we may remove contention for resources such as a global bus. Reduction of resource contention is a consequence of reducing the frequency of accesses to resources [Goldb91]. Different processors can also execute instrumentation code a different number of times, changing their relative speeds [Goldb91]. It has been shown that adding any amount of instrumentation can change the execution patterns of a program. Merely adding a print statement can cause a malfunctioning application to execute correctly [Harde92, McDow89]. For these reasons, it is important that if monitoring code must introduce ....

[Article contains additional citation context not shown here]

Aaron Goldberg and John Hennessy, Performance Debugging Shared Memory Multiprocessor Programs with MTOOL,in Proceedings of Supercomputing 91, IEEE Press, 1991, pp. 481-490.


Effectiveness of Trace Sampling for Performance Debugging.. - Margaret Martonosi And (1993)   (16 citations)

....programmer has considerable flexibility in tuning the program for better memory system performance. However, tuning the memory behavior of large programs is a complex task requiring detailed information on the program's access patterns. Some performance monitoring systems (such as MTOOL [3, 4]) give only code-oriented information indicating the amount of memory overhead in particular loops or procedures. This information is useful for initial queries about application behavior; however, it is often not detailed enough to help the user fix the application's performance bottlenecks. ....

....one tenth of the total references, we get 4- to 6-fold speedups compared to non-trace-sampled MemSpy. For the benchmarks studied here, this reduces MemSpy's overhead to a factor of 3 to 8. With execution time overheads in this range, MemSpy's performance becomes competitive with other tools [4] that present less detailed statistics. This paper makes several main contributions. As previously stated, we show that within the context of a performance debugging tool, reference trace sampling can be used effectively to improve the tool's performance. We go on to present results showing how ....

[Article contains additional citation context not shown here]

A. J. Goldberg and J. Hennessy. Performance Debugging Shared Memory Multiprocessor Programs with MTOOL. In Proc. Supercomputing, pages 481--490, Nov. 1991.


A Fast and Accurate Approach to Analyze Cache Memory.. - Vera, Llosa.. (2000)   (2 citations)

....provide different trade-offs between: accuracy, speed, flexibility (i.e., is adaptable to different memory configurations) and information provided. Memory simulation techniques are very accurate, flexible and can provide rich information. They are usually based on trace-driven simulation [11], [9], [17], [20], [6], [13], [10], [3], [16], [23]. However, these techniques are very slow (usually several orders of magnitude). For instance, the slowdown exhibited by all simulators surveyed in [22] is in the range of 45 to 6250. There are some innovative methods that have been proposed with the ....

A.J. Goldberg and J. Hennessy. Performance debugging shared memory multiprocessor programs with mtool. In Procs. of Supercomputing '91 Conf. (SC'91), pages 481--490, 1991.


LBF: A Performance Metric for Program Reorganization - Eom, Hollingsworth (1998)

....and allocate time to specific operations or program components. Performance prediction uses a model or simulation to predict the execution time of an algorithm or program. There are three major types of performance measurement tools: profilers, visualizations, and search tools. Profile metrics[1, 6, 15, 22] associate a value with each component of a distributed or parallel application (frequently procedures) and are presented as sorted tables. Visualizations[8, 13, 14, 18, 23] explain application performance using pictures. Search tools[10, 17, 21] help users to manage performance data information ....

A. J. Goldberg and J. L. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL," Supercomputing'91. Nov. 18-22, 1991, Albuquerque, NM, pp. 481-490.


An Adaptive Cost System for Parallel Program Instrumentation - Hollingsworth, Miller

....the logging of data. However, our technique has the advantage that disabling data collection completely removes the instrumentation code and so there is no latent perturbation due to instrumentation code that is disabled but must execute code to learn that it is disabled. Goldberg and Hennessy [3] used the difference between the measured and predicted time of a code region to quantify the effects of the memory hierarchy. Our approach differs in two ways from theirs. First, since we need to be able to characterize the impact of small, but (potentially) frequently accessed instrumentation ....

A. J. Goldberg and J. L. Hennessy. Performance debugging shared memory multiprocessor programs with MTOOL. Supercomputing 1991, pages 481--490, Nov. 18-22 1991.


Experiment Management Support for Parallel Performance Tuning - Karavanic (1999)

....is inherent to the performance prediction problem. We seek to include performance predictions and models in the scope of our tuning approach by broadening our definition of Program Execution to include models and predictions. Some examples of predictive tools are the MK Toolkit [3] and MTOOL [21]. The MK Toolkit developed by Block and Sarukkai automates the predictive analysis task. They compare actual versus predicted total execution time. Their work does not include any effort to compare results across platforms, environments, or code changes. MTOOL includes a metric that compares ....

A.J. Goldberg and J.L. Hennessy. Performance debugging shared memory multiprocessor programs with MTOOL. Proceedings of Supercomputing '91, pages 481--490, Albuquerque, NM, November 1991.


Cache Profiling and the SPEC Benchmarks: A Case Study - Lebeck, Wood (1994)   (103 citations)

....simple example of nested loops where the outer loop iterates L times and the inner loop sequentially accesses an array of N 4-byte integers. [Figure 1: Determining Expected Cache Behavior. Panels: a) Cache, b) Small Array, c) Large Array] Sequentially accessing an array that fits in cache (Figure 1b) should produce M cache misses, where M is the number of cache blocks required to hold ....

....where the outer loop iterates L times and the inner loop sequentially accesses an array of N 4-byte integers. [Figure 1: Determining Expected Cache Behavior. Panels: a) Cache, b) Small Array, c) Large Array] Sequentially accessing an array that fits in cache (Figure 1b) should produce M cache misses, where M is the number of cache blocks required to hold the array. Accessing an array ....

[Article contains additional citation context not shown here]

A. J. Goldberg and J. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL," Proceedings Supercomputing '91, pp. 481-490 (November 1991).


Symbolic Cache Analysis for Real-Time Systems - Blieberger, Fahringer, Scholz (1999)   (2 citations)

....[Figure 6. C program fragment] ....can be done at the source code level or machine code level. For our framework we need a source code level instrumentation. In the past a variety of different cache profilers were introduced, e.g. MTOOL (Goldberg and Hennessy, 1991), PFC Sim (Callahan et al. 1990), CPROF (Lebeck and Wood, 1994). The novelty of our approach is to compute the trace data symbolically at compile time without executing the program. A symbolic tracefile is a constructive description for all possible memory references in chronological order. It ....

Goldberg, A. and J. Hennessy: 1991, `Performance Debugging Shared Memory Multiprocessor Programs with MTOOL'. In: Proc. of the Supercomputing Conference.


Automatic Annotation Of Instructions With Profiling Information - Johnson (1995)   (2 citations)

.... expansion [15], [16]. Memory dependence profiling has been used to aid ILP enhancing optimizations by allowing the compiler to reorder ambiguous memory references [2]. Profiling has also been used to identify procedures, basic blocks or source lines with high memory overheads or cache misses [17], [18], [19], [20]. Profiling information specifying the number of cache misses incurred by each access has been proposed to guide the compiler to selectively prefetch data [21], [22] and has been recently used to hand tune code [20]. 1.2 The IMPACT Compiler. The tool described in this thesis is ....

A. J. Goldberg and J. Hennessy, "Performance debugging shared-memory multiprocessor programs with mtool," in Proceedings of Supercomputing '91, pp. 481--490, 1991.


Tiling for Parallel Execution - Optimizing Node Cache.. - Kaplow, Szymanski (1996)   (1 citation)

....a compiler. The symbolic methods include mainly analytic approaches, with the recent addition of compile time simulation [7]. The execution driven methods measure the run time of a compiled program to determine the effect of optimization choices, e.g. tile size selection. The method presented in [5,6] involves capturing memory load and store addresses during execution and processing them via a cache simulation model to determine the miss rate of a range of parameters. However, such methods involve program execution and therefore are not suitable for embedding within a compilation system. The ....

....cycle has the following steps: i) get the next event, ii) access the guarded cache model, iii) perform miss processing, if necessary. [Fig. 4: Event List and Guarded Cache Model. Labels: Sorted Event List, Event Guards, Guarded Cache Model, Cache Probe, Miss Data, Next Event, Affected Event, Update NextLex, Simulation Loop, NextCand, Miss Processing, Global Clock, Insert Event] The event list object is a ....

A. Goldberg and J. Hennessy. Performance debugging shared-memory multiprocessor programs with mtool. In Proceedings of Supercomputing 91, 1991.


Program Optimization Based on Compile-Time Cache Performance.. - Wesley Kaplow (1996)   (8 citations)

....cache (cf. [6]). Although actions of the cache over the string are very fast (they are taken by the actual hardware), the method requires repetitive compilations and executions of the application codes; therefore it is too time consuming to be used at compile time. Cache Simulation, described in [8,9,10], uses the reference string generated by an execution of an application and a model of a cache. The generation of the reference string from the program allows the inverse mapping, i.e. from a reference to the source code. Thus, the programmer can be informed of the number and types of cache ....

A.J. Goldberg and J. Hennessy, Performance Debugging Shared--Memory Multiprocessor Programs with Mtool, in Proceedings Supercomputing 91, (IEEE Computer Science Press, Los Alamitos, CA) 481--490.


Understanding the Memory Behavior Performance on Software.. - Poulos

....analyzes and instruments arbitrary Fortran code to generate memory reference traces and the SHMAPA program that post-processes the trace files and provides memory behavior animations. SHMAP is a simple visualization tool to be used with Fortran programs; it only animates matrix accesses. MTOOL [13] is a tool specifically for analyzing performance losses in shared-memory parallel programs. By instrumenting the program with basic block counters and using estimates about the execution time of each block, as well as measurements from a profile run of the program to approximate how many times ....

Aaron J. Goldberg and John L. Hennessy. Performance Debugging Shared Memory Multiprocessor Programs with MTOOL. In Proceedings of Supercomputing 1991, pp. 481-490. November 1991.


MemSpy: Analyzing Memory System Bottlenecks in Programs - Martonosi, Gupta, Anderson (1992)   (46 citations)

....reasons for why those misses occurred. Understanding the cause of misses is important, since some of those misses may be essential misses (e.g. cold start misses) while others may be more easily optimized away (e.g. replacement or invalidation misses). Most existing performance debugging tools [2, 3, 5, 7, 8, 9, 15] do not provide the detailed information mentioned above. In this paper, we describe MemSpy, a prototype tool that provides such information and helps programmers improve the memory reference behavior of applications. The paper outlines two case studies showing the usefulness of MemSpy's detailed ....

....memory system time together, making it difficult to determine when the memory behavior is a bottleneck. However, Quartz is quite good at focusing the user's attention on those procedures that are most critical to performance; we have incorporated some of Quartz's functionality into MemSpy. MTOOL [7, 8] is a system specifically designed to detect memory bottlenecks in both sequential and parallel programs. MTOOL's basic performance metric is the difference between a program's actual execution time with nonideal memory system behavior, and the execution time of the same code with an ideal memory ....

A. J. Goldberg and J. Hennessy. Performance Debugging Shared Memory Multiprocessor Programs with MTOOL. In Proc. Supercomputing, pages 481--490, Nov. 1991.


Parallel Program Performance Metrics: A Comparison and.. - Hollingsworth, Miller (1992)   (8 citations)

....improve it. Profiling metrics (in sorted lists) also have the nice property that they scale well to massively parallel systems. These metrics are a natural complement to display and visualization tools. Many metrics have been developed to help in the performance debugging of parallel programs [1, 9, 15, 17, 18]. Typically, new metrics either are compared to existing sequential tools or used in a case study to provide testimonials to their usefulness. Unfortunately, with testimonial case studies it is impossible to isolate the quality of the metric from the quality of the programmer using the metric. ....

....artifacts of implementation variations. The goal of increasing the precision of data collected directly conflicts with the goal of reducing instrumentation overhead. To avoid this problem, tools need to incorporate better algorithms to reduce the amount of data collected. AE[13], QP[3], and Mtool[9] are examples of this approach to the problem. Another option is to alter dynamically the level of instrumentation during a program's execution, depending on the desired information. This approach offers the greatest flexibility, but requires additional effort by the programmer (or a sophisticated ....

A. J. Goldberg and J. L. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL", Proc. of Supercomputing'91, Albuquerque, NM, Nov. 18-22,


Mapping Performance Data for High-Level and Data Views of.. - Irvin, al. (1995)   (2 citations)

....programming libraries, or system level code) may add and remove sentences from the SAS and need not know about the existence of other layers to do so. Our use of the SAS resembles the way in which some performance tools for sequential programs make use of a monitored program's function call stack [6,7,8,17,23]. A program's function call stack records the functions that are active at any given point in time. By exploring the call stack, a performance tool can relate performance measurements for a function to each of its ancestors in the program's call graph. Users of such a performance tool can then ....

A. J. Goldberg and John Hennessy. Performance debugging shared memory multiprocessor programs with mtool. In Supercomputing 1991, pages 481--490, November 1991.


Predicting Application Behavior in Large Scale Shared-memory .. - Harzallah, Sevcik (1995)   (4 citations)

.... Because accurate modeling of communication is a critical element in providing accurate estimates of execution times of applications running on shared memory platforms, several execution driven simulation environments have been specifically designed for isolating communication overheads [MGA92, GH91, SSRV94]. Although these tools are useful when considering short programs with small data sets, they are impractical when dealing with realistic applications. A 100 to 1000 fold slowdown between direct execution and execution driven simulation is quite common [GH91], depending on how many of the ....

....communication overheads [MGA92, GH91, SSRV94]. Although these tools are useful when considering short programs with small data sets, they are impractical when dealing with realistic applications. A 100 to 1000 fold slowdown between direct execution and execution driven simulation is quite common [GH91], depending on how many of the application's instructions make memory references and on the number of processors being simulated. For example, simulating the SP application (the uniprocessor execution time of which is 2.7 hours) of the NAS parallel benchmark suite [BBB 94] on a 64 processor ....

A. J. Goldberg and J. L. Hennessy. Performance debugging shared memory multiprocessor programs with Mtool. In Supercomputing'91, pages 481--490. ACM, November 1991.


Parallel Hierarchical Radiosity on Cache-Coherent Multiprocessors - Richard, Singh (1997)

....that was about to be dereferenced did not have a null value. Null pointers were usually symptomatic of a critical region not having been protected by locks. We attempted to measure the lock and memory overhead of the program using two parallel performance monitoring tools (Memspy [6] and Mtools [7]) on both real machines and simulators. However, the tools have not been exercised previously on a program of this complexity, and they either produced no output or output that was inconsistent and clearly spurious. This may have been because of the program's large and complex memory requirements. ....

Aaron J. Goldberg and John L. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL", Proceedings of Supercomputing '91, pp. 481-490, November 1991.


Dynamic Program Instrumentation for Scalable Performance Tools - Jeffrey Hollingsworth (1994)   (9 citations)

....points are the procedure entry, exit and individual call statements. In future versions of our instrumentation, points will be extended to include basic blocks and individual statements. [Figure 3. Example Showing Two Different Metrics. Code labels: foo(), SendMsg(dest, ptr, cnt, size), addCounter(bytes, param[3] param[4]), addCounter(fooCount,1)] Figure 4 shows a slightly more complex example of dynamic instrumentation. Four instrumentation points are used to compute the waiting time due to message passing constrained to a single procedure. The top two primitives ....

....at different levels than just procedures (e.g. modules and loops) makes it possible to collect data at an appropriate granularity for each application. 5. Related Work. Several systems have been built that defer instrumentation until after compilation. Both QPT[1] and Mtool[4] use binary rewriting to insert instrumentation into an object file after it has been compiled and assembled. These systems require data collection decisions to be made prior to program execution. One system that defers instrumentation until the program has started to execute is the TAM ....

A. J. Goldberg and J. L. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL", Supercomputing'91, Albuquerque, NM, Nov. 1991, pp. 481-490.


Dynamic Control of Performance Monitoring on Large Scale.. - Hollingsworth, Miller (1993)   (32 citations)

....performance metrics, visualization and data collection. Performance metrics address the user side of the performance problem by reducing large volumes of performance data into single values or tables of values. Many metrics have been proposed for parallel programs: Critical Path[21], NPT[1], MTOOL[8], Gprof[9]. Each of these metrics can provide useful information; however, in an earlier paper[12] we compared several of these metrics (and a few variations) and concluded that no single metric was optimal for all programs. However, we did discover several factors that can be used to help select ....

A. J. Goldberg and J. L. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL", Proc. of Supercomputing'91 , Albuquerque, NM, Nov. 18-22, 1991, pp. 481-490.


Program Optimization Based on Compile-Time Cache Performance.. - Wesley Kaplow (1996)   (8 citations)

....[7]. Although actions of the cache over the string are very fast (they are taken by the actual hardware), the method requires repetitive compilations and executions of the application codes; therefore it is too time consuming to be used at compile time. Execution Driven Simulation, described in [8,9,10], runs a model of a cache over the reference string generated by an execution of an application. Such a solution allows for mapping from a reference to the source code. Thus, the programmer can be informed of the number and types of cache misses caused by each source line. However, the cache model ....

A.J. Goldberg and J. Hennessy, Performance Debugging Shared--Memory Multiprocessor Programs with Mtool, in Proceedings Supercomputing 91, (IEEE Computer Science Press, Los Alamitos, CA) 481--490.


Improving the Cache Locality of Memory Allocation - Grunwald, Zorn, Henderson (1993)   (33 citations)

....two level cache that requires 200 cycles to service a second level cache miss [19]. Although the performance of caches is improving, new processors commonly use a smaller on-chip primary cache, with a larger secondary cache. Increased cache misses are difficult to detect. Some recent tools [7, 17] indicate what regions of a program incur excessive cache misses. However, they do not indicate the reason for those misses. Although the cache misses may be seen in one region of the program, the cause may arise elsewhere. More insidiously, the increased cache misses may be spread over all ....

A. J. Goldberg and J. Hennessy. Performance debugging shared memory multiprocessor programs with MTOOL. In Proceedings Supercomputing '91, pages 481--491, 1991.


Load Balancing and Data Locality in Adaptive.. - Singh, Holt.. (1995)   (30 citations)  Self-citation (Hennessy)

....large enough problems that speedups are meaningful. In addition to speedups, we also present results that separately compare the load balancing and communication behavior of different schemes. These results are obtained on the simulator, and we have corroborated their trends with the MTOOL [12] performance debugger on DASH. We compare load balancing behavior by measuring the time that processes spend waiting at synchronization points. In comparing communication behavior, we focus on inherent communication in the program that implements a given scheme, rather than on how these inherent ....

....The scheme that incorporates patch stealing is called cost estimates patch steal (C PS). Figure 25 shows that the incorporation of patch stealing makes a dramatic difference to parallel performance and provides very good speedups. Measurements obtained with the MTOOL performance debugger [12] on DASH (Figure 27) show that the techniques used to maintain locality while stealing cause the stealing to contribute no appreciable increase in memory referencing overhead for these numbers of processors (16, in the figures). Figure 27 shows the overheads due to time spent in the memory system ....

Aaron J. Goldberg and John L. Hennessy. Performance debugging shared memory multiprocessor programs with MTOOL. In Proceedings of Supercomputing '91, pages 481-490, November 1991.


Exploiting Cache Locality At Run-Time - Yan (1998)

No context found.

A. J. Goldberg and J. Hennessy, "Performance Debugging Shared Memory Multiprocessor Programs with MTOOL, " Proceedings of Supercomputing'91, pp. 481--490, Nov. 1991.


TAPE: A Transactional Application Profiling Environment - Chi (2005)

No context found.

A. J. Goldberg and J. L. Hennessy. Performance debugging shared memory multiprocessor programs with MTOOL. In Supercomputing '91: Proceedings of the 1991.


Software---Practice And Experience, Vol. 24(8).. -..

No context found.

A. Goldberg and J. Hennessy, `Performance debugging shared memory multiprocessor programs with MTOOL', Proc. Supercomputing, November 1991, pp. 481--490.

CiteSeer.IST - Copyright Penn State and NEC