17 citations found.
M. Martonosi, A. Gupta, and T. Anderson. Tuning memory performance in sequential and parallel programs. IEEE Computer, April 1995.


This paper is cited in the following contexts:
Techniques for Accurate, Accelerated Processor Simulation: .. - Haskins, Jr., Skadron (2002)

....run in an average of 4 hours, with the worst case finishing in just less than 6.5 hours. 3 Sampling A large body of prior work has explored sampling techniques for computer architecture research. For example, Laha, Patel, and Iyer [19]; Crowley and Baer [6]; Martonosi, Gupta, and Anderson [22]; Kaplan, Smaragdakis, and Wilson [14]; and Elnozahy [9] examine memory reference trace sampling and present new algorithms for trace reduction and compression. Other work has studied analytic models for estimating cache miss rates during the unprimed portion of the sample [15, 33] or described ....

Margaret Martonosi, Anoop Gupta, and Thomas Anderson. Tuning Memory Performance in Sequential and Parallel Programs. IEEE Computer, pages 32--40, Apr. 1995.


Education at a Distance: A Report From the Front - Lilja (1999)

....a new architectural feature improves system performance. 2 The students are assigned readings from the textbook The Art of Computer Systems Performance Analysis [2] and several relevant papers selected by the instructor from the current literature [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. The outline for this course is shown in Table I. It is assumed that, before taking the course, the students have developed an understanding of computer organization and design and basic computer architecture and have the ability to program in a general purpose ....

M. Martonosi, A. Gupta, and T. E. Anderson, "Tuning memory performance of sequential and parallel programs," in IEEE Computer, April 1995, vol. 28, pp. 32--40.


Hardware Performance Monitoring in Memory of NUMAchine.. - Pin (1997)

....factors. There are many aspects of performance loss that are invisible to the programmer of parallel applications. Software tools can provide estimates of some of the sources of loss, but hardware is in general better suited to the task. For instance, tools such as CProf [21] 17 and MemSpy [8] can effectively do memory tuning to gain performance, but their simulation based property make them slow and cumbersome. Hardware is likely to be faster, more efficient and more accurate. In addition, these software tools cannot model contention well and they make assumptions about uniform ....

....[Table 4.6: Four Different Modes of Histogram Statistics.] [Figure 4.13: Histogram Statistics Mode: Generating Histogram Stats[15..7].] ....

[Article contains additional citation context not shown here]

M. Martonosi, A. Gupta, T.E. Anderson, "Tuning Memory Performance of Sequential and Parallel Programs," Computer, 28(4), April 1995


Performance Tuning Of Programs For Shared-Memory Multiprocessors - Talbot (1995)

....monitoring tools such as Gprof [15] and Mtool [13] are intended to produce simple, high level statistics with minimal overhead. At the other extreme are tools like SHMAP [10] which provides a reference by reference animated picture of program memory behaviour. In addition, tools such as MemSpy [23, 25] and SM prof [6] provide a range of detail levels. In Section 2.5, some existing performance debugging tools are examined to provide a basis for developing a new tool. 1.2. Project motivation [Figure: cfd with ideal memory and real memory, plotted against processing elements] ....

....at the most basic level, depends on the intrinsic nature of the application. However, the programmer still has considerable flexibility in manipulating the algorithm, data structures, and program structure to change the memory reference patterns in order to better exploit the memory hierarchy [25]. There has been a surge of interest in recent years in developing tools to support application performance tuning, rather than simply correctness debugging of programs on shared-memory. Broadly stated, the purpose of a performance debugging tool is to focus the user's attention on where a program ....

[Article contains additional citation context not shown here]

M. Martonosi, A. Gupta, and T. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4):32--40, April 1995.


Prescriptive Performance Tuning: The RX Approach - Rajamony (1998)

....affected by the placement of its data structures in memory. Changing this layout can reduce conflict misses and also increase implicit prefetching by improving spatial locality. Consequently, the application performance can be improved. Existing tools that target memory performance [GH93, LW94, MGA95, SB94] present only descriptive information, such as cache statistics, to the user. In contrast, a prescriptive tool can treat this as an optimization problem where the goal is to optimally arrange the program data structures, and directly specify the best layout to use. The first step is to ....

....Cprof [LW94] is a cache profiling system for sequential programs that presents statistics in terms of both code and data structures. It categorizes cache misses obtained from a simulation of the program into compulsory, capacity or conflict misses and presents them to the programmer. MemSpy [MGA95] also presents data oriented statistics, for tuning the memory performance of parallel programs. At compile time, instrumentation is added into the program which calls an event simulator at interesting events. The program is then simulated using Tango Lite [Gol93], an address reference ....

[Article contains additional citation context not shown here]

M. Martonosi, A. Gupta, and T. E. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4), April 1995.


Performance Implications of Context Switches on Misses to DRAM - Meerdervoort (1999)

....1991] Reorganizing software with the aim of making better use of caches is another way of tackling the problem. This involves optimizing programs for spatial and temporal locality. For example, MemSpy is a software package that provides programmers with the means to use locality more effectively [Martonosi et al. 1995]. 4.3.4 Compiler based methods The compiler can assist in restructuring code to improve its memory locality. Most efforts are directed at optimizing loops since this is an area with potential for locality improvement. This is however a static optimization and is limited in this respect. In ....

M Martonosi, A Gupta and T Anderson. Tuning Memory Performance of Sequential and Parallel Programs. Computer, vol. 28 no. 4, April 1995, pp 32-40.


An SRAM Main Memory Model - Salverda (1997)   (1 citation)

....In particular, emphasis is placed on those strategies which represent competing approaches to the RAMpage hierarchy. Software-oriented techniques such as code reorganization, either through compiler optimization [3] or through tools which assist the programmer in manually optimizing code [29], are therefore not discussed. Although such techniques are indeed important contributors to improved memory system performance, they can be applied equally well within the context of the new hierarchy, and therefore have no bearing upon this research work. Section 4.2 is devoted to the primary ....

M. Martonosi, A. Gupta, and T.E. Anderson. Tuning memory performance of sequential and parallel programs. Computer, 28(4):32--40, April 1995.


Annai/PMA Instrumentation Intrusion Management of Parallel.. - Endo, al. (1995)

....as seen for Cenju 3 in graph (b). Our experience is that cache management is largely effective, such that cache misses occur infrequently: unfortunately run time cache miss information is generally not available and therefore this cannot be included within PMA performance and intrusion analyses ([MGA95] considers such cache analysis). The basic processing costs of instrumentation functions are subsequently well defined, either as a fixed cost for all instances, or with a function or instance dependent part which is determinable (and provides an ....

Margaret Martonosi, Anoop Gupta, and Thomas E. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4):32--40, April 1995.


Design Issues and Tradeoffs for Write Buffers - Skadron (1997)   (18 citations)

....gmtry and cholsky, they obtain good speedups for several other SPEC92 benchmarks, notably tomcatv. But they also point out that tuning cache performance is difficult. Most programs are not as easily analyzed as the NASA kernels; finding opportunities for improvement often requires cache profiling [16, 17], and even then conflict misses can make cache behavior vary from input to input. We therefore focus on simple changes to the write buffer itself, and not on compiler techniques or application specific modifications. Results are for the SPEC benchmarks as shipped, without cache optimizations. ....

M. Martonosi, A. Gupta, and T. E. Anderson. Tuning memory performance in sequential and parallel programs. IEEE Computer, pages 32--40, Apr. 1995.


Performance Debugging Shared Memory Parallel Programs.. - Ramakrishnan Rajamony (1997)   (1 citation)

....the performance problem. Quartz [3] uses the normalized processor time metric to rank the contribution of procedures in the program to the overall execution. The resultant listing of the importance of different procedures to the execution a la gprof can be used for performance tuning. MemSpy [13] and ParaView [22] concentrate on the memory performance of programs. Both these tools simulate the program being tuned and present the collected trace information in various formats. MemSpy categorizes cache misses and presents data oriented statistics. ParaView presents the times spent in ....

M. Martonosi, A. Gupta, and T. E. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4), April 1995.


Cautious, Machine-Independent Performance Tuning for.. - Talbot, Bennett, Kelly

....trace analysis, describes our implementation, and presents our experience of using the tool. 1 Introduction There has been considerable recent interest in developing tools to support manual performance optimisation of applications running on coherent cache shared-memory multiprocessors (e.g. [1, 3]). The purpose of a performance tuning tool is to direct the programmer's attention to where a program is spending its time and to give as much guidance as possible into how to reduce the performance bottlenecks. Existing performance tools measure (using special monitoring circuitry) or predict ....

....by cfd new. In addition, the simulations showed that the cache miss rate was always lower for cfd new in comparison with cfd orig. The reduction in false sharing led to a significant improvement in performance, even though active sharing increased slightly. 6 Related Shared Memory Tools MemSpy [3] assists in locating bottlenecks by providing detailed information that focuses the programmer's attention on the problem areas in the application. SM prof [1] is similar to that presented here, but has the drawback that it does not distinguish between active and false sharing of cache lines. It ....

M. Martonosi, A. Gupta, and T. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4):32--40, April 1995.


Informing Memory Operations: Providing Memory Performance.. - Horowitz (1996)   (26 citations)  Self-citation (Martonosi)

....such as cache blocking [AKL79,GJMS87,WL91] and prefetching [MLG92, Por89] use static program analysis to predict which references are likely to suffer misses. Memory performance tools have relied on sampling or simulation based approaches to gather memory statistics [CMM 88,DBKF90,GH93,LW94,MGA95]. Operating systems have used coarse grained system information to reduce latencies by adjusting page coloring and migration strategies [BLRC94,CDV 94]. Knowledge about memory referencing behavior is also important for cache coherence and data access control; for example, Wisconsin's Blizzard ....

....of each technique, and we quantify their impact on performance in Section 4.2. 4.1.1 Performance Monitoring Performance monitoring tools collect detailed information to guide either the programmer or the compiler in identifying and eliminating memory performance bottlenecks [BM89, GH93, LW94, MGA95]. A major difficulty with such tools is how to collect sufficiently detailed information quickly and without perturbing the monitored program. The high overheads of today's memory observation techniques have resulted in tools that either provide coarse-grained information (e.g. at loop level ....

M. Martonosi, A. Gupta, and T. E. Anderson. Tuning Memory Performance of Sequential and Parallel Programs. IEEE Computer, pp 32-40. April 1995.


Software Methods to Improve Data Locality and Cache Behavior - Beyls (2004)

No context found.

M. Martonosi, A. Gupta, and T. Anderson. Tuning memory performance in sequential and parallel programs. IEEE Computer, April 1995.


Using Set Sampling for Level Three Cache Studies - Thornock (1999)

No context found.

Margaret Martonosi, Anoop Gupta, and Thomas E. Anderson, Tuning memory performance of sequential and parallel programs, Computer, pages 32--40, April 1995.


A Proposal for a New Hardware Cache Monitoring Architecture - Schulz, Tao, Jeitner, Karl (2002)

No context found.

M. Martonosi, A. Gupta, and T. E. Anderson. Tuning Memory Performance in Sequential and Parallel Programs. IEEE Computer, pages 32-40, April 1995.


Data Locality Optimization of Shared Memory Programs on NUMA.. - Tao

No context found.

M. Martonosi, A. Gupta, and T. E. Anderson. Tuning Memory Performance in Sequential and Parallel Programs. IEEE Computer, pages 32--40, April 1995.


CiteSeer.IST - Copyright Penn State and NEC