K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....NCC NUMA machines are only slightly harder to build, but they provide two important advantages for implementing software coherence: they permit very fast access to remote directory information, and they allow data to be moved in cache line size chunks. We also build on the work of Petersen and Li [26, 27], who developed an efficient software implementation of release consistency for small scale multiprocessors. The key observation of their work was that NCC NUMA machines allow the coherence block and the data transfer block to be of different sizes. Rather than copy an entire page in response to ....
....4 and compare our work to other approaches in section 5. We summarize our findings and conclude in section 6. 2 The Software Coherence Protocol In this section we present a scalable algorithm for software cache coherence. The algorithm was inspired by Karin Petersen's thesis work with Kai Li [26, 27]. Petersen's algorithm was designed for small scale multiprocessors with a single physical address space and non coherent caches, and has been shown to work well for several applications on such machines. Like most behavior driven software coherence schemes, Petersen's relies on address ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
.... and communication architecture) most of the efforts so far have been in the lower two: relaxed consistency models and protocol implementations to reduce communication frequency and traffic [4, 21, 47, 2] and additional hardware support in the communication architecture to reduce communication costs [39, 11, 35, 19, 18, 7, 30, 29]. With the relative maturity of protocols, in the last couple of years SVM research has moved to greater emphasis on the application layer and the synergies available across layers. New areas are being emphasized like application driven performance evaluation, application restructuring for SVM ....
....well they will scale. The Brazos system uses multicast support in a mostly update protocol, with protocol mechanisms for reducing the drawbacks of updates [42]. 4.2 Fine grain Remote Writes Several papers have suggested hardware support for fine grain remote operations in the network interface [11, 35, 30, 29]. Recent real implementations include the AURC home based protocol on the SHRIMP multicomputer, which uses the automatic update hardware mechanism [7] to snoop writes off the memory bus and propagate them to the home of the page if it is remote. This eliminates diffs, but can generate more traffic on ....
K. Petersen and K. Li. Cache coherence for shared memory multiprocessors based on virtual memory support. In Proceedings of the IEEE 7th International Parallel Processing Symposium, April 1993.
.... [Figure 2: Research in Shared Virtual Memory] .... approach in the early 1990s and lead to eager release consistency (ERC) ....
....models which relax the consistency requirements beyond release consistency by limiting the data to which they apply. In the 1990s, software shared memory became a very active research area, with many groups designing new protocols, consistency models and systems [BH90, CBZ91, KCZ92, ZSLW92, BZ91, PL93, PL94, SFL 94, ZIL96, ISL96c, ACDZ97, SGT96, KHS 97, SB97]. These research groups inspired and motivated one another, building on each other's ideas to push performance higher. During this time, the largest number of SVM publications occurred in the protocol ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the IEEE 7th International Parallel Processing Symposium, April 1993.
....a synchronization release point. ParaNet (Treadmarks) [14] relaxes the Munin protocol further by postponing the posting of write notices until the subsequent acquire. Both Munin and ParaNet are designed to run on networks of workstations, with no hardware support for coherence. Petersen and Li [19] have presented a lazy release consistent protocol for small scale multiprocessors with caches but without cache coherence. Their approach posts notices eagerly, using a centralized list of weak pages, but only processes notices at synchronization acquire points. The protocol presented in this ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....merge buffers. Software mediated cache fills and software doubling of writes cannot exploit the additional bandwidth; performing these operations in software increases their latency but leaves bandwidth requirements constant. 5 Related Work Our work is closely related to that of Petersen and Li [24, 25]; we both use the notion of weak pages, and purge caches on acquire operations. The main difference is scalability: we distribute the directory and weak list, distinguish between safe and unsafe pages, check the weak list only for unsafe pages mapped by the local processor (i.e. for those ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. 7th Intl. Parallel Processing Symp., Newport Beach, CA, Apr. 1993.
....and application suite in section 3 and present results in section 4. We compare our protocol to a variety of existing alternatives, including release consistent hardware, straightforward sequentially consistent software, and a coherence scheme for small scale NCC NUMAs due to Petersen and Li [26]. We show that certain simple program modifications can improve the performance of software coherence substantially. Specifically, we identify the need to mark reader writer locks, to avoid certain interactions between program synchronization and the coherence protocol, to align data structures ....
....In this section we present a protocol for software cache coherence on large scale NCC NUMA machines. As in most software coherence systems, we use virtual memory protection bits to enforce consistency at the granularity of pages. As in Munin [6], Treadmarks [18], and the work of Petersen and Li [26], we allow more than one processor to write a page concurrently, and we use a variant of release consistency [23] to limit coherence operations to synchronization points. Between these points, processors can continue to use stale data in their caches. As in the work of Petersen and Li, we ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....a new consistent copy of the page. This process usually involves a word by word comparison of the modified copies of the page with respect to an unmodified, shadow copy. The existence of a shared physical address space on NUMA machines permits several optimizations in software coherence systems [64]. Simple directory operations can be performed via (uncached) remote reference, rather than by sending a message. Pages can also be mapped remotely, so that cache fills bring lines into the local node on demand, eliminating the need for a copy in local memory. If the caches use write through, or ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....in our simulations. The CC NUMA machine uses the directory based write invalidate coherence protocol of the Stanford DASH machine [13]. This protocol employs an eager implementation of release consistency. Our software coherent NUMA machine uses a scalable extension of the work of Petersen and Li [15], with additional ideas from the work of Keleher et al. [8]. It employs a lazy implementation of release consistency, in which invalidation messages are sent only at synchronization release points, and processed (locally) only at synchronization acquire points. At an acquire, a processor is ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....section we present a protocol for software cache coherence on large scale NCC NUMA machines. As in most software coherence systems, we use virtual memory protection bits to enforce consistency at the granularity of pages. As in Munin [16, 8, 7], Treadmarks [26], and the work of Petersen and Li [37, 36, 38], we allow more than one processor to write a page concurrently, and we use a variant of release consistency to limit coherence operations to synchronization points. Between these points, processors can continue to use stale data in their caches. As in the work of Petersen and Li, we exploit ....
.... [Figure 2: Comparative performance of different software protocols on 64 processors] [Figure 3: Overhead analysis of different software protocols on 64 processors] .... Li [36, 37], with the exception that while the weak list is conceptually centralized, its entries are distributed physically among the nodes of the machine. rel.centr.del: Same as rel.distr.del, except that write notices are propagated by inserting weak pages in a global list which is traversed on acquires. ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....cache less shared memory multiprocessors [5, 10, 22]. Lazy, multi writer protocols were pioneered by Keleher et al. [19] and later adopted by several other groups. Several of the ideas in Cashmere were based on Petersen's coherence algorithms for small scale, non hardware coherent multiprocessors [30]. Recent work by the Alewife group at MIT has addressed the implementation of software coherence on a collection of hardware coherent nodes [39]. Wisconsin's Blizzard system [34] maintains coherence for cache line size blocks, either in software or by using ECC. It runs on the Thinking Machines ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....across processors only at synchronization points. Compilers do this as a matter of course. Hardware implementations of relaxed consistency [18, 30] typically initiate coherence operations as soon as possible, but only wait for them to complete when synchronizing. Software implementations [8, 15, 27] are often more aggressive, delaying the initiation of the operations as well. Among other things, the delay serves to mitigate the effects of false sharing. It also supports programs that can correctly utilize stale data, allows messages to be batched, and ....
K. Petersen and K. Li, "Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support," Proceedings of the Seventh International Parallel Processing Symposium, 13-16 April 1993.
....in all cases. We are currently investigating hybrids; we believe we can choose the strategy that works best on a page by page basis, dynamically. Our work borrows ideas from several other systems, including Munin [2], TreadMarks/ParaNet [7], Platinum [3], and the thesis work of Karin Petersen [15, 16]. It is also related to ongoing work on the Wind Tunnel [19] and the Princeton Shrimp [1] project and, less directly, to several other DSM and multiprocessor projects. Full protocol details and comparisons to related work can be found in other papers [8, 9, 10]. 2 Results and Project Status We ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....the amount of interprocessor communication required to signal the transition of a coherence block to an inconsistent state. We have seen reductions in running time of well over 50% for our protocol when compared with similar software coherence protocols that do not perform these optimizations [11]. A more detailed comparison of software coherence protocols can be found in [8]. We have found the choice of cache architecture to also have a significant impact on the performance of software cache coherence. Write back caches provide the best performance for almost all cases, however they ....
....to eliminate coherence overhead. The flexibility of software coherence provides designers with the ability to incorporate all mechanisms in a single protocol and choose the one that best fits the sharing pattern at hand. 5 Related work Our work is most closely related to that of Petersen and Li [11]: we both use the notion of weak pages, and purge caches on acquire operations. The difference is scalability: we distribute the coherent map, distinguish between safe and unsafe pages, check the weak bits in the coherent map only for unsafe pages mapped by the current processor, and multicast ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....not required to multicast write notices, but acquiring processors must actively query directory information to see if invalidations are necessary. The protocol borrows ideas from several other systems, including Munin [2], Treadmarks/ParaNet [6], Platinum [3], and the thesis work of Karin Petersen [12, 13]. Full details can be found in other papers [7, 8]. We have evaluated the performance of our protocol via detailed execution driven simulations. Sample results appear in figure 1. This graph compares the performance of our protocol to that of the Stanford Dash protocol [10] on simulated ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....transparent shared memory with locks in software. Their system, Ivy, uses a directory based scheme to manage coherence on the page level. Later systems, such as Munin [10], seek to reduce communication by using relaxed consistency models, such as release consistency [18]. Petersen and Li [33] and Kontothanassis and Scott [25] implement release consistency using the operating system's virtual memory mechanisms. Most papers that use the term software coherence have referred to the insertion of invalidation instructions by a compiler. Darnell and Kennedy [15], Cheong and Veidenbaum ....
K. Petersen and K. Li. Cache coherence for shared memory multiprocessors based on virtual memory support. In Intl. Parallel Processing Symp., April 1993.
....following release. Instead, it requires performing the access only at the next acquire to the same location as a following release, and only with respect to the acquiring processor. Petersen and Li have proposed alternative implementations of lazy release consistency using virtual memory support [PeL92a, PeL92b]. Bershad et al. [BeZ91] have proposed a relaxation of release consistency for the Midway software based shared virtual memory system [BeZ91]. This model, called entry consistency, requires programmers to associate each data operation with a lock variable that should protect the operation. Like ....
....with weak ordering. Petersen and Li use trace based simulation to study sequential consistency and lazy release consistency implemented using virtual memory hardware support and compare it with a snooping protocol on a bus based system (they also give a few results for a crossbar system) [PeL92a, PeL92b]. They show that the lazy release consistency scheme is competitive with the snooping scheme in terms of performance, and recommend it because of its implementation simplicity and flexibility. Lee and Ramachandran use a probabilistic workload to compare buffer consistency with sequential ....
K. PETERSEN and K. LI, Cache Coherence for Shared Memory Multiprocessors Based On Virtual Memory Support, Technical Report Tech. Rep.-400-92, Princeton University, December 1992.
....or protocol data structures, nor do we place shared memory in Memory Channel space. Cashmere [11] is a software coherence system expressly designed for memory mapped network interfaces. It was inspired by Petersen's work on coherence for small scale, non hardware coherent multiprocessors [17]. Cashmere maintains coherence information using a distributed directory data structure. For each shared page in the system, a single directory entry indicates one of three possible page states: uncached, read shared, or write shared. At a release operation a processor consults the directory ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proc. of the 7th Intl. Parallel Processing Symp., Apr. 1993.
....cache less shared memory multiprocessors [4, 19]. Lazy, multi writer protocols were pioneered by Keleher et al. [16] and later adopted by several other groups. Several of the ideas in Cashmere were based on Petersen's coherence algorithms for small scale, non hardware coherent multiprocessors [26]. Recent work by the Alewife group at MIT [33] and the FLASH group at Stanford [10] has addressed the implementation of software coherence on a collection of hardware coherent nodes. Wisconsin's Blizzard system [29] maintains coherence for cache line size blocks, either in software or by using ECC. ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proc. of the 7th Intl. Parallel Processing Symp., Apr. 1993.
....bandwidth requirements constant. As in the latency graphs we have omitted results for the remaining applications to avoid cluttering the figures. The applications shown here are representative of the whole application suite. 5 Related Work Our work is closely related to that of Petersen and Li [24, 25]; we both use the notion of weak pages, and purge caches on acquire operations. The main difference is scalability: we distribute the coherent map and weak list, distinguish between safe and unsafe pages (those that are unlikely or likely, respectively, to be weak) [17], check the weak list ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....a synchronization release point. ParaNet (Treadmarks) [13] relaxes the Munin protocol further by postponing the posting of write notices until the subsequent acquire. Both Munin and ParaNet are designed to run on networks of workstations, with no hardware support for coherence. Petersen and Li [19] have presented a lazy release consistent protocol for small scale multiprocessors with caches but without cache coherence. Their approach posts notices eagerly, using a centralized list of weak pages, but only processes notices at synchronization acquire points. The protocol presented in this ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
....Coherence Protocol In this section we present a scalable protocol for software cache coherence. As in most software coherence systems, we use virtual memory protection bits to enforce consistency at the granularity of pages. As in Munin [26], Treadmarks [14], and the work of Petersen and Li [20, 21], we allow more than one processor to write a page concurrently, and we use a variant of release consistency to limit coherence operations to synchronization points. Between these points, processors can continue to use stale data in their caches. As in the work of Petersen and Li, we exploit the ....
....notice. Use of a distributed coherent map and per processor weak lists enhances scalability by minimizing memory contention and by avoiding the need for processors at acquire points to scan weak list entries in which they have no interest (something that would happen with a centralized weak list) [20]. However it may make the transition to the weak state very expensive, since a potentially large number of remote memory operations may have to be performed (serially) in order to notify all sharing processors. Ideally, we would like to maintain the low acquire overhead of per processor weak lists ....
K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April 1993.
.... a software only shared memory system significantly [14, 7] and that on distributed memory architectures with remote memory reference capability, the performance of software cache coherence maintained at the virtual memory page level is competitive with that of hardware cache coherence schemes [23, 24, 15]. These results suggest that appropriate architectural support may improve the performance of shared virtual memory substantially. The SHRIMP project at Princeton studies how to provide high performance communication mechanisms to integrate unmodified, commodity desktops such as PCs and ....
....Ph.D. thesis in 1986 [17, 19]. It was first implemented on a network of workstations [18] and then applied to large scale multicomputer systems [20, 19, 21, 29]. Relaxed consistency models [9] allow shared virtual memory to reduce the cost of false sharing. Recent research in this area includes [5, 6, 14, 1, 23, 24, 14, 7]. New coherency protocols improving the performance of release consistency include lazy release consistency [14] and entry consistency [1], in which all shared data is explicitly associated with some synchronization variable. The TreadMarks library [13] is an example of state of the art ....
K. Petersen and K. Li. Cache coherence for shared memory multiprocessors based on virtual memory support. In Proceedings of the IEEE 7th International Parallel Processing Symposium, April 1993.
....time for the snoopy scheme represents 100%, and the execution time of all other schemes is normalized to it. Execution time is measured as the number of cycles required for all processors to simulate the whole trace. A detailed analysis of the results presented in this section can be found in [PL93] and [Pet93]. For brevity we will summarize the most important points of the comparison among the coherency schemes when implemented on the bus based architecture, labeled Bus in Figure 4: 1. Although the most economical and the simplest method of cache coherence is implemented by not ....
Karin Petersen and Kai Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the 7th International Parallel Processing Symposium, April 1993.