Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood, "Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation," in Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, (Las Vegas), pp. 193 -- 205, June 1997.
....of the paper to change substantially. We are also including a fine grained protocol with delayed consistency or single writer, eager release consistency rather than sequential consistency (SC) in the simulator, and we may present results for that protocol as well in the final version. Results in [31] show that the results for such protocols are a little better than SC for most granularities, since they alleviate the effects of read write false sharing, but not by a very large amount. requires some programmer intervention) The granularities used are 64 bytes in all other cases than the ....
....especially in applications that use locks frequently like Barnes Hut and to a lesser extent Water Spatial and Volrend and those that interact very poorly with coarse granularity like Radix. This is not surprising given the earlier similar results for this comparison on the Typhoon zero platform [31]. The simulated platform has somewhat more bandwidth relative to processor speed than Typhoon zero, helping HLRC, but much more efficient access control, helping SC. A rough comparison that includes instrumentation cost can be obtained by adding in the instrumentation costs from the literature ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in dsm systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
.... to handle asynchronous incoming messages, which would otherwise cause expensive and frequent interrupts (interrupts are much less frequent in the coarser grained SVM, so the tradeoffs between interrupts and polling are less clear there). Since we do not have access to high performance instrumentation for the x86 instruction set, and since it is unclear what cost to ascribe to hardware access .... [Footnote 1: Protocols using more complex, delayed consistency or single writer eager release consistency were found to perform only a little better in [90].]
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
....the applications or (in SMP systems) by reserving one processor for protocol processing. Recent results for interrupts versus polling in SVM systems vary. One study finds that polling may add a significant overhead, leading to worse performance than interrupts for page grain SVM systems [101]. On the other hand, Stets et al. find that polling gives generally better results than interrupts [91]. We believe more research is needed on modern systems to understand the role of polling. Another interesting direction that we are exploring is moving some of the protocol processing itself to ....
....Cashmere 2L [54] where the main processor polls the network queues for incoming messages on program back edges. This requires that the code be instrumented to add polling instructions, and affects the time when messages are handled. A similar polling method was used and compared with interrupts in [101]. The tradeoffs between interrupts and polling are not very clear (the performance differences are found to be small for the platform used in [101]). Both methods use the main processor at the destination and both have disadvantages; interrupts rely on the operating system which makes performance ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
....compute processors poll the network queues for incoming messages via code instrumentation on program back edges. They find that polling and interrupt based asynchronous protocol processing performs comparably on their system. A similar polling method was also used and compared with interrupts in [52], with similar results. Our approach avoids the need for interrupts or polling altogether. 7 Conclusions We have used network interface support to decouple asynchronous message handling from protocol processing and to thereby eliminate the need for expensive interrupts or polling in SVM ....
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
....implementation has been shown to be a promising and desirable paradigm for exploiting parallel execution. We adopt this paradigm for the bulky synchronous structure and OLTP workloads. The shared memory is supported by hardware in SMPs. For clusters of workstations or clusters of SMPs, some work [18, 21, 29] has been done on the emulation of shared memory. We assume there is a software layer for programmers, which emulates the shared memory in the cluster. Our execution model of cluster computing is mainly based on the probabilities of references to different levels of the memory hierarchy in Figure ....
Y. Zhou, et al., "Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation", Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
....of release consistency which allow multiple writers and postpone communication until synchronization points. The performance tradeoffs between the use of such page based release consistent systems and hybrid systems with support for fine grain sharing have been compared by Zhou et al. [86]. They find mixed results, but several aspects, such as operating system overhead, memory consumption, and support for multiple coherence granularity are ignored. To alleviate problems with page level sharing in software DSMs, some systems abandon the use of the virtual memory support for ....
Y. Zhou, L. Iftode, K. Li, J.P. Singh, B.R. Toonen, I. Schoinas, M.D. Hill, and D.A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. In Proc. of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'97), pages 193--205, June 1997.
....factor than we expected in a few cases. This effect is partly due to continued improvements in the Alpha compiler that lead to more efficient code, thus increasing the relative overhead of instrumentation code in some cases. The only relevant study that we are aware of is by Zhou et al. [23], which also examines performance tradeoffs between fine and coarse grain software coherence. However, several critical differences between the studies lead to differing performance results and a number of novel observations in our work. Section 5 contains a detailed comparison of the two ....
....of the performance of the various applications on Shasta and Cashmere. To better understand the reasons for performance differences, we discuss the applications in groups based on their spatial data access granularity and temporal synchronization granularity (similar to notions used by Zhou et al. [23]). Applications with coarse grain data access tend to work on contiguous regions at a time, while fine grain applications are likely to do scattered reads or writes. The temporal synchronization granularity is related to the frequency of synchronization in an application on a given platform. An ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 1997.
....we introduce the Spark98 kernels, a set of 10 SMVP kernels for shared memory and message passing systems. We expect that builders of cache coherent shared memory multiprocessors [15, 6], distributed memory systems [22], message passing libraries [18], and distributed shared memory libraries [1, 25] will find that the Spark98 kernels are useful tools for understanding the performance of irregular codes on their systems. In our own experience with the Spark98 kernels we notice that efficient parallel programming of sparse codes requires careful partitioning of data references, regardless of ....
ZHOU, Y., IFTODE, L., SINGH, J., LI, K., TOONEN, B., SCHOINAS, I., HILL, M., AND WOOD, D. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) (Las Vegas, NV, June 1997), ACM, pp. 193--205.
....DSM systems only. In our paper, we consider both hardware and software DSM systems together, and study the relationship between coherence protocol and memory consistency model. Recently, Zhou et al. discussed the relationship between the relaxed consistency model and coherence granularity in DSM systems [44]. They consider only the granularity of the coherence protocol, and never the coherence protocol itself. Dubois et al. [18] proposed the delayed consistency model for a release consistent system, where an invalidation is buffered at the receiving processor until a subsequent acquire is executed by the ....
Y. Zhou, L. Iftode, K. Li, J.P. Singh, B.R. Toonen, I. Schoinas, M.D. Hill, and D.A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. To appear in Proceedings of the 6th Symposium on Principles and Practice of Parallel Programming, June 1997.
....that the adjustable cache block size implementation did better than the best fixed size implementations for most of the programs in their suite. In an earlier study [10] Goodman also evaluated the effect of the size of the consistency units on the behavior of a virtual address cache. Zhou et al. [20] discuss the relationship between relaxed consistency and coherence granularity in DSM systems. They conclude that sequential consistency with small consistency units and lazy release consistency with larger consistency units perform comparably for the applications used in the study. They only ....
Y. Zhou, L. Iftode, K. Li, J.P. Singh, B.R. Toonen, I. Schoinas, M.D. Hill, and D.A. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, June 1997. To appear.
....then apply heuristics based on max cut to insert barrier synchronization and satisfy dependence. Stoher and O'Boyle [25] extend this work, presenting an optimal algorithm for eliminating barrier synchronization in perfectly nested loops. There has been a large amount of research on software DSMs [1, 6, 19, 24, 30]. More recently, groups have examined combining compilers and software DSMs. Viswanathan and Larus developed a two part predictive protocol for iterative computations for use in the data parallel language C [29]. Chandra and Larus evaluated combining the PGI HPF compiler and the Tempest software ....
Y. Zhou, L. Iftode, J. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, June 1997.
....mainly of a simple, clean protocol that handles page faults. Prologue The main disadvantage of page based dsm systems is that it restricts the coherence granularity to be a virtual memory page size. For systems with large page sizes, false sharing and fragmentation will occur. [28]; The unit of sharing in a page based dsm is a virtual memory page. The larger coherence granularities used by dsms cause them to suffer increased coherence traffic because of false sharing. [14] the conventional wisdom remains that the overhead of false sharing, as well as ....
....operating system and the communication network, beyond that which is required by the applications. It was recently shown that reducing the granularity in systems which implement strict consistency may achieve performance comparable to that of systems implementing relaxed consistency memory models [22, 28]. In accordance with these findings, Sequential Consistency was employed in millipage: initial performance evaluation shows results comparable or superior to those obtained in systems which employ relaxed consistency models. When false sharing is efficiently avoided using the MultiView method, the ....
Y. Zhou, L. Iftode, K. Li, J. P. Singh, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed consistency and coherence granularity in dsm systems: A performance evaluation. In Proc. of the Sixth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP'97), pages 193--205, June 1997.
....encapsulating shared memory initialization, synchronization primitives, and task creation within PARMACS ANL macros [16]. These macros should then be adapted for any shared memory platform. The data access patterns of the programs in the SPLASH 2 suite have been characterized in earlier research [17, 18]. FFT performs a one dimensional Fast Fourier Transform of n complex data points. Three all to all interprocessor communication phases are required for a matrix transpose. The data access pattern is hence regular. Two programs for the blocked LU factorization of a dense matrix form part of the ....
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood, "Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation," in Proceedings of the ACM Symposium on the Principles and Practice of Parallel Programming, (Las Vegas), pp. 193 -- 205, June 1997.
....factor than we expected in a few cases. This effect is partly due to continued improvements in the Alpha compiler that lead to more efficient code, thus increasing the relative overhead of instrumentation code in some cases. The only relevant study that we are aware of is by Zhou et al. [22], which also examines performance tradeoffs between fine and coarse grain software coherence. However, several critical differences between the studies lead to differing performance results and a number of novel observations in our work. Section 5 contains a detailed comparison of the two ....
....the performance of the various applications on Shasta and Cashmere. To better understand the reasons for performance differences, we discuss the applications in groups based on their spatial data access granularity and temporal synchronization granularity (similar to notions used by Zhou et al. [22]). Applications with coarse grain data access tend to work on contiguous regions at a time, while fine grain applications are likely to do scattered reads or writes. The temporal synchronization granularity is related to the frequency of synchronization in an application on a given platform. An ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. In Proc. of the Sixth PPOPP, June 1997.
....the page may ping pong between the nodes. One solution to this problem is to hold on to a freshly arrived page for some time before releasing it to another requester [2]. Relaxed memory consistency models that allow multiple concurrent writers have also been proposed to alleviate this symptom [9,10,4,11]. These systems ensure that all nodes see the same data at well defined points in the program, usually when synchronization occurs. Extra effort is required to ensure program correctness in this case. One technique that has been investigated recently to improve DSM performance is the use of ....
....computing, as well as a kernel for solving partial differential equations by the successive over relaxation technique and the classical traveling salesman problem. 3.1. SPLASH 2 programs The data access patterns of the programs in the SPLASH 2 suite have been characterized in earlier research [19,11]. FFT performs a transform of n complex data points and requires three all to all interprocessor communication phases for a matrix transpose. The data access is regular. LU c and LU n perform factorization of a dense matrix. The non contiguous version has a single producer and multiple consumers. ....
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood, "Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation," in Proceedings of the ACM Symposium on the Principles and Practice of Parallel Programming, (Las Vegas), pp. 193 -- 205, June 1997.
....by encapsulating shared memory initialization, synchronization primitives, and task creation within ANL macros [24]. These macros should then be adapted for any shared memory platform. The data access patterns of the programs in the SPLASH 2 suite have been characterized in earlier research [62, 63]. FFT performs a one dimensional Fast Fourier Transform of n complex data points. Three all to all interprocessor communication phases are required for a matrix transpose. The data access pattern is hence regular. Two programs for the blocked LU factorization of a dense matrix form part of the ....
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood, "Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation," in Proceedings of the ACM Symposium on the Principles and Practice of Parallel Programming, (Las Vegas), pp. 193 -- 205, June 1997.
....migratory behavior may indicate false sharing, which a programmer can eliminate by aligning and padding data structures. Moreover, recent research has focused on new tools, such as custom protocols and extensible memory systems, that offer programmers far greater control over shared memory systems [3, 28]. For example, a programmer can greatly reduce the cost of producer consumer sharing by increasing the cache block size or using an invalidate, rather than update, protocol [7]. Again, understanding a program's access pattern is the necessary first step to improving its shared memory performance. ....
....block. The text in the right corner shows information about the selected block. Other people have also used the Blizzard version of Paradyn to find shared memory performance bottlenecks. Brian Toonen used the tool to search for and validate performance bottlenecks in the DSM system itself [28]. Satish Chandra used the tool to monitor the performance of the custom protocols he wrote for his HPF compiler for Blizzard [3]. Trishul Chilimbi used the tool to tune the performance of a database storage management system. 4.5 Discussion The memory performance tool successfully found the ....
Y. Zhou, L. Iftode, K. Li, J. P. Singh, B. R. Toonen, I. Schoinas, M.D. Hill and D. A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. 6th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Las Vegas, June 1997.
....greatly compare with LRC used in the TreadMarks. Whether this scheme is better or worse than a COMA like scheme is not yet clear. Some evaluation results can be found in [58], [24], [59]. The results reported by Zhou et al. [59] demonstrate that HLRC has comparable or better performance than other families of SVM protocols. However, because the evaluation is platform dependent, the reported results are not fully convincing. An alternative memory management strategy which is similar to .... [Footnote 1: Although snoopy coherence is used within a node in 2 level SVM systems, here we consider internode coherence only.]
....Whether this scheme is better or worse than a COMA like scheme is not yet clear. Some evaluation results can be found in [58], [24], [59]. The results reported by Zhou et al. [59] demonstrate that HLRC has comparable or better performance than other families of SVM protocols. However, because the evaluation is platform dependent, the reported results are not fully convincing. An alternative memory management strategy which is similar to the CC NUMA scheme is proposed by the I ACOMA .... [Footnote 1: Although snoopy coherence is used within a node in 2 level SVM systems, here we consider internode coherence only.]
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, K. Li, J. P. Singh, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed consistency and coherence granularity in dsm systems: A performance evaluation. In Proc. of the Sixth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP'97), pages 193--205, June 1997.
No context found.
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M.D. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
....the applications or (in SMP systems) by reserving one processor for protocol processing. Recent results for interrupts versus polling in SVM systems vary. One study finds that polling may add a significant overhead, leading to worse performance than interrupts for page grain SVM systems [24]. On the other hand, Stets et al. find that polling gives generally better results than interrupts [20]. We believe more research is needed on modern systems to understand the role of polling. Another interesting direction that we are exploring is moving some of the protocol processing itself to ....
Y. Zhou, L. Iftode, K. Li, J. P. Singh, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. Technical Report TR-535-96, Department of Computer Science, Princeton University, December 1996, 10 Pages.
No context found.
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B.R. Toonen, I. Schoinas, M.D. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
.... and frequent interrupts (interrupts are much less frequent in the coarser grained SVM, so the tradeoffs between interrupts and polling are less clear there). Since we do not have access to high performance instrumentation for the x86 instruction set, and since it is unclear what cost to ascribe to hardware access control which can be implemented in various ways, we assume access control and polling are free in the FG protocol (access control .... [Footnote 1: Protocols using more complex, delayed consistency or single writer eager release consistency were found to perform only a little better in [22].]
....the other in different applications. Applications with more complex pointer references may incur higher instrumentation overheads. These results for FG versus SVM would also be similar to the results obtained on the Typhoon zero platform, which provides commodity-oriented hardware support [22]. The communication parameters here are different than on Typhoon zero: the C0P0 platform has somewhat more bandwidth relative to processor speed, helping SVM, but much more efficient access control, helping FG. If we disallow application specific granularities, using 128 bytes or 256 bytes (the ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
.... Figure 2: Simulated node architecture. The fine grained access control needed for FG can be provided via either code instrumentation [7, 18] or hardware support [17]. Code instrumentation is also used for polling to handle asynchronous incoming messages, which would otherwise cause expensive and frequent interrupts (interrupts are much less frequent in the coarser grained SVM, so the tradeoffs between interrupts and polling are less clear .... [Footnote 1: Protocols using more complex, delayed consistency or single writer eager release consistency were found to perform only a little better in [22].]
....the other in different applications. Applications with more complex pointer references may incur higher instrumentation overheads. These results for FG versus SVM would also be similar to the results obtained on the Typhoon zero platform, which provides commodity-oriented hardware support [22]. The communication parameters here are different than on Typhoon zero: the C0P0 platform has somewhat more bandwidth relative to processor speed, helping SVM, but much more efficient access control, helping FG. If we disallow application specific granularities, using 128 bytes or 256 bytes ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
....DC protocol, a single writer LRC protocol (similar to that in [28]), a multiple writer LRC protocol (similar to that in Section 3.3.2), and a sequentially consistent protocol (all invalidation based) have been compared on a number of SPLASH 2 applications that cover most of the key sharing patterns [48]. Table 2 shows some of the results as measured on the Wisconsin Typhoon zero cluster [39]. The lazier application and propagation of the LRC protocol have significant advantages over the DC protocol, especially in complex irregular applications that use substantial lock synchronization. Also, the ....
....
Application       SC     DC     SW LRC   HLRC
LU                8.6    8.6    8.4      8.3
Ocean             2.7    3.8    5.7      8.3
Water Nsquared    11.2   11.3   11.3     11.2
Volrend           0.8    1.7    2.9      9
Water Spatial     4.9    5.8    7.3      12
Raytrace          6.6    8      9        13
Barnes            0.9    1.9    2.2      6
Table 2: Speedups for SC, Single writer DC and LRC, and Multiple writer LRC Protocols, 16 processors [48]. The multiple writer solution used is the software home based one described in Section 3.3.2. The trend today is toward lazy, multiple writer protocols. However, the performance storage complexity tradeoffs in laziness are not yet clear for emerging platforms and real applications and bear ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J.P. Singh, K.Li, B.R. Toonen, I.Schoinas, M.D. Hill, and D.A. Wood. Relaxed consistency and coherence granularity in dsm systems: A performance evaluation. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997.
....Cashmere 2L [22] where the main processor polls the network queues for incoming messages on program back edges. This requires that the code be instrumented to add polling instructions, and affects the time when messages are handled. A similar polling method was used and compared with interrupts in [33]. The tradeoffs between interrupts and polling are not very clear (the performance differences are found to be small for the platform used in [33]) and the instrumentation needed for x86 platforms is not easily available. Our approach of using network interface support to reduce processor ....
....to add polling instructions, and affects the time when messages are handled. A similar polling method was used and compared with interrupts in [33]. The tradeoffs between interrupts and polling are not very clear (the performance differences are found to be small for the platform used in [33]) and the instrumentation needed for x86 platforms is not easily available. Our approach of using network interface support to reduce processor involvement eliminates interrupts, for both data and synchronization transfers, and at the same time improves performance further by reducing other ....
Y. Zhou, L. Iftode, J. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed consistency and coherence granularity in dsm systems: A performance evaluation. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
No context found.
Y. Zhou, L. Iftode, J.P. Singh, K.Li, B.R. Toonen, I.Schoinas, M.D. Hill, and D.A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997.
....the next acquire operation (a typical example is the next barrier) though it is different from LRC in that it does not postpone invalidation application until the next causally related acquire. The queue mechanisms needed by DC can be used in hardware protocols as well as in software DC protocols [11, 38]. Two RC based protocols that allow only a single writer to a page at a time (see next subsection) but propagate and apply invalidations at different times have been compared with SC on a number of SPLASH 2 applications that cover most of the basic sharing patterns [38]. One is an eager propagation ....
.... in software DC protocols [11, 38]. Two RC based protocols that allow only a single writer to a page at a time (see next subsection) but propagate and apply invalidations at different times have been compared with SC on a number of SPLASH 2 applications that cover most of the basic sharing patterns [38]. One is an eager propagation but lazy application delayed consistency protocol as described above, and the other is a single writer LRC protocol (i.e. lazy propagation and application) similar to the one proposed in [25]. Table 3 shows some of the speedup comparisons, obtained on the Wisconsin ....
[Article contains additional citation context not shown here]
Y. Zhou, L. Iftode, J.P. Singh, K.Li, B.R. Toonen, I.Schoinas, M.D. Hill, and D.A. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997.
....the applications or (in SMP systems) by reserving one processor for protocol processing. Recent results for interrupts versus polling in SVM systems vary. One study finds that polling may add a significant overhead, leading to worse performance than interrupts for page grain SVM systems [24]. On the other hand, Stets et al. find that polling gives generally better results than interrupts [20]. We believe more research is needed on modern systems to understand the role of polling. Another interesting direction that we are exploring is moving some of the protocol processing itself to ....
Y. Zhou, L. Iftode, K. Li, J. P. Singh, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. Technical Report TR-535-96, Department of Computer Science, Princeton University, December 1996, 10 Pages.
....to be cost effective when faced with the high latencies (and overheads) of traditional local area networks. Recent work shows that, on the Typhoon 0 prototype, fine grain sequentially consistent DSM performs comparably to a coarse grain DSM using lazy release consistency on the same hardware [58]. We expect that fine grain DSM will show a performance advantage on systems with lower overheads and lower network latencies, such as those simulated in Section 4. Other approaches to optimizing communication in DSM systems include relaxed memory models [25], prefetching [35], and special writes ....
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. "Relaxed consistency and coherence granularity in DSM systems: A performance evaluation." In Sixth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, June 1997.
No context found.
Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood, "Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation," in Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, (Las Vegas), pp. 193 -- 205, June 1997.