15 citations found.
C. Hristea, D. Lenoski, and J. Keen, "Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks," in Proc. ICS, pp. 1--12, Nov 1997.

This paper is cited in the following contexts:
Compiler-Controlled Memory - Cooper, Harvey (1998)   (17 citations)  (Correct)

....less time than a reference to an element not in the cache. A reference that hits the cache typically completes in a single cycle, while a reference that misses takes five to ten cycles on a simple uniprocessor machine, and as long as hundreds of cycles in a distributed memory multiprocessor [14, 15, 22, 17, 1, 2]. This difference in access time has a strong impact on the performance of individual programs. Accordingly, much recent research in compilation has been directed at techniques that improve the likelihood of references hitting in the cache. Most of this work falls into two major categories. ....

Cristina Hristea, Daniel Lenoski, and John Keen. Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks. In ACM, editor, SC'97: High Performance Networking and Computing: Proceedings of the 1997.


Evaluating the Memory Performance of a ccNUMA System - Prestor (2001)   (2 citations)  (Correct)

....transactions, and the interconnect network traffic, there are no tools that use this information to aid in performance analysis of parallel programs on the Origin. Furthermore, very little information exists about the memory system performance of large Origin systems. The existing publications [19, 50] focus on small systems, and they do not evaluate the impact of the directory protocol on memory performance. The contribution of this thesis is twofold. First, it gives a detailed analysis of the memory performance on the Origin 2000, evaluating the system architecture, the directory cache ....

....restarts the pipeline which they call latency in isolation) is too optimistic and not likely to be relevant for actual commercial programs. Accurately estimating the lower bound on load latency is a nontrivial task, especially when the microbenchmark is intended to be portable. Hristea et al. [19] have written another collection of microbenchmarks that focuses entirely on memory performance. They distinguish between the back-to-back latency (pointer-chasing kernel, similar to the one used by lmbench) and restart latency (i.e., latency in isolation). They estimate the restart latency by ....

[Article contains additional citation context not shown here]

HRISTEA, C., LENOSKI, D., AND KEEN, J. Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks. In Proceedings of Supercomputing '97 (San Jose, CA, November 1997).


Performance Prediction for Random Write Reductions: A Case.. - Jin, Agrawal (2002)   (1 citation)  (Correct)

....described by Saavedra et al. [16]. The key idea in this method is to measure the time required for accessing arrays of increasing size, accessed using a stride that equals the page size. For computing cache coherence miss latency, we use a method similar to the one described by Hristea et al. [9]. Finally, we also computed the service time of the memory system in processing a read and write back (RWB) request. The method we used is again similar to the one developed by Hristea et al. [9]. Our experience with random write reductions showed that memory contention is insignificant on Sun ....

....computing cache coherence miss latency, we use a method similar to the one described by Hristea et al. [9]. Finally, we also computed the service time of the memory system in processing a read and write back (RWB) request. The method we used is again similar to the one developed by Hristea et al. [9]. Our experience with random write reductions showed that memory contention is insignificant on Sun Fire 6800. Therefore, the RWB service time was evaluated only for Sun Enterprise 450. 4.3 Results on the Large SMP Machine We now present experimental results on Sun Fire 6800. We were able to ....

[Article contains additional citation context not shown here]

Cristina Hristea, Daniel Lenoski, and John Keen. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks. In Proceedings of SC 97, 1997.


Performance Experiences on Sun's WildFire Prototype - Noordergraaf, van der Pas (1999)   (7 citations)  (Correct)

....node) 704 ns 2000 ns 1762 ns Remote cache2cache (3hop nearest nodes) 1272 ns 2500 ns 2150 ns Latency for extra router hop 50 ns 20 ns No router # router hops grows Log Linearly No router TABLE 1. Latencies as Measured by the lm bench c Benchmark. a. The Origin 2000 numbers are from [Hristea97]. b. Numa Q numbers are not explicitly published, and have been extracted from [Lovett96] c. The lm bench Benchmark is described in [McVoy96] 3 when CMR can improve performance over cc NUMA, and vice versa. The Solaris Operating Environment makes use of integrated hardware counters to ....

C. Hristea, D. Lenoski, and J. Keen, Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks, In Proceedings of Supercomputing 1997.


FLASH vs. (Simulated) FLASH: Closing the Simulation Loop - Gibson, Kunz, Ofelt.. (2000)   (8 citations)  (Correct)

....bus timing to accommodate Mipsy's new and more accurate secondary cache interface, adjusting the latency through the network router, and tuning the latencies from the network to the node controller and vice versa. We also used snbench's restart time test, based on Hristea's microbenchmark suite [8], to set Mipsy parameters that determine delays between the processor core and the processor pins. 3.1.3 Results with Tuned Simulator Figure 3 shows the uniprocessor results with the simulators tuned as we describe above. We expect that since these ....

C. Hristea, D. Lenoski, and J. Keen. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks. In Proceedings of Supercomputing 1997 , November 1997.


Timestamp Snooping: An Approach for Extending SMPs - Martin, Sorin, Ailamaki.. (2000)   (3 citations)  (Correct)

....their scalability by adding a level of indirection to handle cache to cache transfers, significantly increasing average miss latency. For example, a cache to cache transfer takes 1036 ns on the SGI Origin 2000 [26] compared to 742 ns on the Sun UE10000 [12] despite a comparable memory access time [21]. Instead of fading away, SMPs have continued to dominate the multiprocessor market, largely due to a wealth of techniques for building high bandwidth, low latency buses that deliver transactions in an implicit, physical total order. These techniques include split transaction buses, physically ....

C. Hristea, D. Lenoski, and J. Keen. Measuring Memory Hierarchy Performance of Cache-coherent Multiprocessors Using Micro Benchmarks. In Proceedings of Supercomputing '97, Nov. 1997.


A Case for User-Level Dynamic Page Migration - Nikolopoulos, Papatheodorou.. (2000)   (1 citation)  (Correct)

....reference counters. For each node that contends for the page, we add a constant factor of 50ns to the base uncontended memory latency, in order to compute the contended memory latency. This value is extracted from a previous study that computed the latencies in the Origin2000 memory hierarchy [11]. Our competitive criterion is accurate, in the sense that a page migration will always reduce the maximum latency due to remote accesses for the processors that compete for the page. Assuming that each remote memory access from the same node to the same page has constant latency, our competitive ....

C. Hristea, D. Lenoski, and J. Keen. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks. In Proc. of Supercomputing '97, November 1997.


Using Switch Directories to Speed Up Cache-to-Cache Transfers .. - Ravi Iyer Laxmi (2000)   (1 citation)  (Correct)

....transfers supported in most modern CC-NUMA systems. The service time for these dirty cache to cache transfers on current systems is roughly 1.5 to 2 times longer than a clean memory access. For example, the access latency for dirty blocks is 1.6 times that for clean blocks on the SGI Origin 2000 [3]. This increase in access latency is due to the slow DRAM directory lookup, several message transfers over the interconnect and .... [Footnote: This research has been supported by NSF grants MIP 9622740, CCR9810205, and an IBM Partnership Award. This work was done while Ravi Iyer was at Texas A&M University.]

....We model a release consistent system by assuming that all write requests are cache hits. Additionally, we model the switch directory interconnect. We use constant memory access latencies for cache miss service time. These parameters were derived from latencies measured on recent multiprocessors [3, 4]. The detailed trace driven simulation parameters are shown in Table 3. [Figure 8. Reduction in Home Node CtoC Transfers] ....

C. Hristea, et al., "Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using MicroBenchmarks, " Proceedings of Supercomputing: High Performance Networking and Computing, 1997.


Comparing the Memory System Performance of the HP.. - Iyer, Amato.. (2000)   (Correct)

[Footnote: ....in part by NSF CAREER Award CCR 9734471 and NCSA Grant ASC980006N. Supported in part by NSF Grant ACI 9872126 and DOE ASCI ASAP (Level 2 Program) Grant B347886. To appear in the Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, June 1999, Rhodes, Greece.] ....work [1, 2, 6, 10] in this area mainly uses microbenchmarks and typically focuses on understanding the performance of a single multiprocessor. In this paper, we focus on an in-depth understanding of the cache memory performance of two state-of-the-art multiprocessors, the HP V-Class and the SGI Origin 2000. Our ....

C. Hristea and D. Lenoski, "Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks,"Proceeding of Supercomputing: High Performance Networking and Computing, 1997.


Memory System Characterization of Commercial Workloads - Barroso, Gharachorloo.. (1998)   (103 citations)  (Correct)

....our system for dirty misses is 125 cycles (417 ns) as opposed to 80 cycles (267 ns) for memory. This trend is prevalent on other current multiprocessors with latencies of 742 ns (dirty) vs. 560 ns (clean) on the Sun Enterprise 10000 and of 1036 ns (dirty) vs. 472 ns (clean) on the SGI Origin 2000 [8]. Second, the fraction of dirty misses increases with both the size of Bcaches and with the number of CPUs. The fraction of OLTP DSS Q1 DSS Q4 DSS Q5 DSS Q6 DSS Q8 DSS Q13 AltaVista Icache (global) 19.9 9.7 8.5 4.6 5.9 3.7 6.7 1.8 Dcache (global) 42.5 6.9 22.9 11.9 11.3 11.0 12.4 ....

C. Hristea, D. Lenoski, and J. Keen. Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks. In Proceedings of Supercomputing '97, November 1997.


Effectiveness of Simple Memory Models for Performance.. - Irina Chihaia Thomas (2004)   (Correct)

No context found.

C. Hristea, D. Lenoski, and J. Keen, "Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks," in Proc. ICS, pp. 1--12, Nov 1997.


Performance Prediction for Random Write Reductions: A Case.. - Jin, Agrawal (2002)   (1 citation)  (Correct)

No context found.

Cristina Hristea, Daniel Lenoski, and John Keen. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks. In Proceedings of SC 97, 1997.


Cross-Architecture Performance Predictions for Scientific.. - Marin, Mellor-Crummey (2004)   (Correct)

No context found.

C. Hristea, D. Lenoski, and J. Keen. Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks. In Proceedings of the 1997.


A Study of Implicit Data Distribution Methods for.. - Nikolopoulos..   (Correct)

No context found.

D. Lenoski C. Hristea and J. Keen. Measuring Memory Hierarchy Performance on Cache-Coherent Multiprocessors Using Microbenchmarks. In Proc. of the ACM/IEEE Supercomputing'97: High Performance Networking and Computing Conference (SC'97), San Jose, California, November 1997.


Comparing the Memory System Performance of the HP.. - Iyer, Amato.. (2000)   (Correct)

No context found.

C. Hristea and D. Lenoski, "Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks," Proceeding of the SuperComputing: High Networking and Computing," 1997.

CiteSeer.IST - Copyright Penn State and NEC