16 citations found.
J. Chapin, et al., "Memory system performance of Unix on CC-NUMA multiprocessors", Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of Computer Systems, May 1995, pp. 1-13.


This paper is cited in the following contexts:
Clustered Objects: Initial Design, Implementation and Evaluation - Appavoo

.... system accounted for as much as 32-47% of the non-idle execution time [36]. Similarly, Xia and Torrellas showed that for a different set of workloads, 42-54% of time was spent in the operating system [43], while Chapin et al. found that 24% of total execution time was spent in the operating system [6] for their workload. To avoid the operating system from limiting application performance, it must be highly concurrent. The traditional approach to developing SMP operating systems has been to start with a uniprocessor operating system and to then successively tune it for concurrency. This is ....

....al. 35] The traditional approach of adding locks and selectively redesigning also does not explicitly lead to increased locality. Chapin et al. studied the memory system performance of a commercial Unix system, parallelized to run efficiently on the 64 processor large Stanford DASH multiprocessor [6]. They found that the time spent servicing operating system data misses was three times higher than time spent executing operating system code. Of the time spent servicing operating system data misses, 92% was due to remote misses. Kaeli et al. showed that careful tuning of their operating system ....

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proc. of the 1995.


revEELing Solaris - Remzi Arpaci Manuel (1996)   (1 citation)

....goal is to collect complete instruction and data traces of large applications. The particular problem addressed by this project is to include the operating system's activity in the trace. Operating system traces can be obtained in several ways. Past efforts have utilized hardware monitors [CHRG95], complete simulation (SimOS [WR96]), and binary instrumentation [BKLW90, CB93]. Binary instrumentation is a low-cost, medium-effort approach, which has the advantage of not requiring particular monitoring hardware. Furthermore, we can build on top of existing binary instrumentation tools, leading ....

John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In Proceedings of the Joint International Conference on Measurement & Modeling of Computer Systems (Sigmetrics '95/Performance '95), pages 1--13, May 1995.


Studies of Windows NT Performance using Dynamic Execution Traces - Perl, Sites (1996)   (65 citations)

....similar to other inline tracing efforts, but differs significantly in at least one dimension. Most published studies are of user code only [EKKL90, LB94] or are done on a single processor [BKW90, CB93] or require rebuilding source code [SJF92] or trace only cache misses, not all instructions [CHRG95, TGH92]. None use Windows NT. The excellent Shade paper [CK94] summarizes about thirty previous tools. Using that paper's classification, PatchWrx, like ATUM, traces executables, user and system code, multiple domains, multiple processors, signals, dynamic linking, and bugs, with performance ....

....and is therefore invisible to the operating system. The log buffer holds about 5.9 million eight-byte log entries, which is enough for 5-20 seconds of real time. There is so much information in a single reconstructed trace that we have not been motivated to try stitching multiple traces together [CHRG95, AH90]; a single reconstructed trace contains about 650 MB of dynamic i-stream with instruction and data addresses. Recording the log in main memory is much faster than recording on disk or tape. Recording in physical memory instead of virtual memory allows us to trace the lowest levels of the ....

John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In ACM SIGMETRICS, pages 1--13, May 1995.


Tornado: Maximizing Locality and Concurrency in a.. - Gamsa, Krieger.. (1999)   (12 citations)

....when other resources are shared, and study the performance of our system for real applications. 8 Related work. A number of papers have been published on performance issues in shared-memory multiprocessor operating systems, mostly in the context of resolving specific problems in a specific system [5, 6, 9, 22, 26, 28]. These systems were mostly uniprocessor or small-scale multiprocessor systems trying to scale up to larger systems. Other work on locality issues in operating system structure were mostly either done in the context of earlier non-cache-coherent NUMA systems [8] or, as in the case of Plan 9, were ....

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proc. ACM SIGMETRICS Intl. Conf. on Measurement and Modelling of Computer Systems, 1995.


Studies of Windows NT Performance using Dynamic Execution Traces - Perl, Sites (1997)   (65 citations)

....similar to other inline tracing efforts, but differs significantly in at least one dimension. Most published studies are of user code only [EKKL90, LB94] or are done on a single processor [BKW90, CB93] or require rebuilding source code [SJF92] or trace only cache misses, not all instructions [CHRG95, TGH92]. None use Windows NT. The excellent Shade paper [CK94] summarizes about thirty previous tools. Using that paper's classification, PatchWrx, like ATUM, traces executables, user and system code, multiple domains, multiple processors, signals, dynamic linking, and bugs, with performance ....

....and is therefore invisible to the operating system. The log buffer holds about 5.9 million eight-byte log entries, which is enough for 5-20 seconds of real time. There is so much information in a single reconstructed trace that we have not been motivated to try stitching multiple traces together [CHRG95, AH90]; a single reconstructed trace contains about 650 MB of dynamic i-stream with instruction and data addresses. Recording the log in main memory is much faster than recording on disk or tape. Recording in physical memory instead of virtual memory allows us to trace the lowest levels of the ....

John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In ACM SIGMETRICS, pages 1--13, May 1995.


Tornado: Maximizing Locality and Concurrency in a.. - Gamsa, Krieger.. (1999)   (12 citations)

....when other resources are shared, and study the performance of our system for real applications. 8 Related work. A number of papers have been published on performance issues in shared-memory multiprocessor operating systems, mostly in the context of resolving specific problems in a specific system [5, 6, 8, 21, 24, 26]. These systems were mostly uniprocessor or small-scale multiprocessor Unix systems trying to scale up to larger systems. Two .... [Figure: slowdown versus number of processors (1 to 16), panels a) to c), for the sgi, convex, ibm, sun, and numa systems] ....

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proc. ACM SIGMETRICS Intl. Conf. on Measurement and Modelling of Computer Systems, 1995.


Searching for the Sorting Record: Experiences in.. - Arpaci-Dusseau.. (1998)   (2 citations)

....Though most modern processors have a reasonable set of performance counters [12, 22] that have been shown to be useful for detailed performance profiling [27], other components of the machine are ignored. For example, researchers have shown that network packet counters can be extremely useful [8, 20]. However, just monitoring incoming and outgoing packets is not enough. Minimally, 32-bit counters should be available for every interconnection in the system, from the memory and I/O bus of each workstation, out into the switches of the network. These counters should track the number of bytes ....

J. Chapin, S. Herrod, M. Rosenblum, and A. Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In 1995 ACM SIGMETRICS/Performance Conference, pages 1--13, May 1995.


Improving the Data Cache Performance of Multiprocessor Operating .. - Chun Xia (1996)   (3 citations)

....loads, Torrellas et al. [19] reported that the operating system is responsible for a large fraction of the data cache misses. In addition, they showed that the dominant sources of data misses in the operating system are coherence activity and block operations. Similarly, Chapin et al. [8] have recently reported similar findings for a NUMA multiprocessor running UNIX. While all this past work has successfully characterized the problem, very little work has been done toward eliminating it [8, 19]. In this paper, we focus exclusively on eliminating most of the data misses in a ....

....in the operating system are coherence activity and block operations. Similarly, Chapin et al. [8] have recently reported similar findings for a NUMA multiprocessor running UNIX. While all this past work has successfully characterized the problem, very little work has been done toward eliminating it [8, 19]. In this paper, we focus exclusively on eliminating most of the data misses in a multiprocessor operating system. Any changes that we propose, however, should be compatible with the use of off-the-shelf processors. We use a performance monitor to examine traces of a 4-processor shared-memory ....

[Article contains additional citation context not shown here]

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 1--13, May 1995.


STiNG: A CC-NUMA Computer System for the Commercial Marketplace - Lovett (1996)   (76 citations)

....scalability limitations. When one considers systems running modified versions of UNIX, it is apparent that NUMA-aware page allocation policies, including page replication and migration, can provide in software much of the benefit of the hardware page replication and migration that COMA provides [9]. Furthermore, it should be noted that the advantage of COMA is perceived to be a large reduction in capacity misses at each node when compared to a directory-based CC-NUMA system. However, when sufficient cache size is provided in combination with OS restructuring, capacity misses are virtually ....

....combination with OS restructuring, capacity misses are virtually eliminated in the CC-NUMA system, and node-to-node traffic becomes dominated by communication (or coherency) misses. COMA systems offer no advantage over directory-based CC-NUMA systems for handling communication misses, as noted in [9] and [10]. This is discussed further in Sec. 4 below. 2.1.4 CC-NUMA. CC-NUMA systems, e.g., DASH [11], FLASH [12], and Alewife [13], provide a viable method of scaling using small single-processor or SMP building blocks without the drawbacks of the other options. The familiar shared memory ....

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 1-13, May 1995.


Exploiting Multiprocessor Memory Hierarchies For Operating Systems - Xia (1996)   (1 citation)

....that OS causes a large fraction of the cache misses in a bus-based shared-memory multiprocessor. They pointed out that the OS code suffers considerable self-interference in the instruction cache and the dominant sources of data misses in OS are coherence activity and block operations. Chapin et al. [10] have recently reported similar findings for a NUMA multiprocessor running UNIX. They showed that a surprisingly large fraction of OS time (79%) is spent on memory system stalls that is divided equally between instruction and data cache miss time. For data cache misses, they found that a small ....

....architectures are utilized. While all this past work has successfully characterized the problem, very little work has been done towards eliminating it. In discussions on support for block operations, for example, while Torrellas et al. [41] suggested cache bypassing and prefetching, Chapin et al. [10] suggested cache bypassing and some OS policies to reduce the remote caching of data, and Cheriton et al. [14] proposed the deferred copy scheme, none of them actually evaluated their proposed schemes. 1.2.2 Comparison of OS and Application Cache Performance. Some researchers have studied and ....

[Article contains additional citation context not shown here]

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 1--13, May 1995.


Performance Issues for Multiprocessor Operating Systems - Gamsa, Krieger, Parsons.. (1995)

....between the two processors. This problem is known as false sharing and can contribute significantly to the cache miss rate. Chapin et al., in investigating the performance of IRIX ported to a 32-processor experimental system, found that many of the worst-case hot spots were caused by false sharing [7]. Although strategies for dealing with misses in uniprocessors by maximizing temporal and spatial locality and by reducing conflicts are well known, if not always easily applied, techniques for reducing true and false sharing misses in multiprocessors are less well understood. Semi-automatic ....

....of a cache line, further increasing both latency and network load. This effect is clearly illustrated in the operating system investigation by Chapin et al., which found that local data miss costs were twice the local instruction cache miss costs, due to the need to send multiple remote messages [7]. In the case of large systems, the physical distribution of memory has a considerable effect on performance. Such systems are generally referred to as NUMA systems, or Non-Uniform Memory Access time systems, since the time to access memory varies with the distance between the processor and the ....

[Article contains additional citation context not shown here]

John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modelling of Computer Systems, May 1995.


Comprehensive Hardware and Software Support for Operating.. - Xia, Torrellas (1999)   (4 citations)

....be provided by the memory hierarchy of these machines. Unfortunately, previous work in the literature does not target this question directly or completely. A large group of researchers have examined the cache performance of the operating system without focusing much on proposing optimizations [1, 2, 5, 6, 9, 10, 11, 12]. There is some work specifically focused on optimizing the performance of the operating system [13, 15, 16]. However, it examines part of the problem only, for example, instruction accesses only or prefetching only. The combined effect of all the optimizations proposed is unknown, especially for ....

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 1--13, May 1995.


Fast Messages (FM): Efficient, Portable Communication.. - Pakin, Karamcheti, Chien (1997)   (8 citations)

....implementation by discarding packets. When networks were unreliable, this practice made sense, but modern networks are highly reliable, so such discarding is the major source of data and therefore, performance loss. Experience with messaging layers in multicomputers [39], shared-memory systems [11], and high-speed wide-area networks indicate that cache interference is a critical effect for both communication and local computational performance. [Footnote 2: FM_send() and FM_send_4() call FM_extract() only when necessary to avoid buffer deadlock.] Providing control over the scheduling of ....

John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proceedings of SIGMETRICS/PERFORMANCE, May 1995. Available from http://www-flash.stanford.edu/OS/papers/SIGMETRICS95/numa-os.ps.Z.


OS Support for Improving Data Locality on CC-NUMA.. - Verghese, Devine.. (1996)   (14 citations)  Self-citation (Rosenblum Gupta)

No context found.

J. Chapin, S. A. Herrod, M. Rosenblum, and A. Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In ACM SIGMETRICS '95, pages 1-13, May 1995.


Coherent Block Data Transfer in the FLASH Multiprocessor - Heinlein, Bosch, Jr.. (1997)   (3 citations)  Self-citation (Rosenblum Gupta)

No context found.

John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory System Performance of UNIX on CC-NUMA Multiprocessors. In SIGMETRICS/PERFORMANCE '95, May 1995.


Proceedings of the 21st International Conference on.. - Dynamic Load Sharing

No context found.

J. Chapin, et al., "Memory system performance of Unix on CC-NUMA multiprocessors", Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of Computer Systems, May 1995, pp. 1-13.


CiteSeer.IST - Copyright Penn State and NEC