P. Stenstrom, M. Brorsson, and L. Sandberg. Adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Int. Symp. on Computer Architecture, pages 109--118, May 1993.
....instruction retirement is in program order). The predictor is indexed by instruction address. Instruction-based predictors for optimizing read-modify-write patterns as above have been proposed earlier [84]. Address-based techniques for optimizing read-modify-write patterns have also been proposed [32, 157]. Cache blocks that are only read within critical sections are brought into the cache in a shared state. If repeated upgrade-induced violations occur, the processor can issue exclusive requests for all blocks accessed within the critical section, obtain the blocks in owned state and defer ....
Per Stenström, Mats Brorsson, and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....primitive (a global reduction implemented using message passing) with far more favorable scalability properties required little work. Although there has been some work related to optimizing coherence protocols to improve performance on migratory data in the context of hardware-based DSM systems [20, 76], it remains unclear whether migratory data sharing will dominate for applications other than those studied in this thesis and, if so, how frequently it can be replaced with carefully selected, efficient message-based primitives. [Footnote: John D. Kubiatowicz, personal communication, August 1995.] The ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....patterns in Oracle when running our OLTP workload. Our key observations are summarized below. First, we observed that 88% of all shared write accesses and 79% of read-dirty misses are to data that exhibit a migratory sharing pattern. We use the following heuristic to identify migratory data [25, 114]. A cache line is marked as migratory when the directory receives a request for exclusive ownership to a line, the number of cached copies of the line is 2, and the last writer to the line is not the requester. Because our base system uses a relaxed memory consistency model, optimizations for ....
....request for exclusive ownership to a line, the number of cached copies of the line is 2, and the last writer to the line is not the requester. Because our base system uses a relaxed memory consistency model, optimizations for dealing with migratory data such as those suggested by Stenstrom et al. [114] will not provide any gains since the write latency is already hidden. OLTP is characterized by fine grain updates of meta data and frequent synchronization that protects such data. As a result, data structures associated with the most actively used synchronization tend to migrate with the passing ....
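The directory-side heuristic quoted in the two excerpts above can be sketched roughly as follows. This is a minimal Python illustration, not code from any of the cited papers: the names DirectoryEntry and on_exclusive_request are hypothetical, and treating a line that has never been written as non-migratory is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DirectoryEntry:
    """Hypothetical per-line directory state (names are illustrative only)."""
    sharers: Set[int] = field(default_factory=set)  # processors holding a copy
    last_writer: Optional[int] = None               # None means never written
    migratory: bool = False


def on_exclusive_request(entry: DirectoryEntry, requester: int) -> bool:
    """Apply the quoted heuristic when an exclusive-ownership request arrives.

    The line is marked migratory when exactly two cached copies exist and
    the last writer is a different processor than the requester.
    Assumption: a line with no previous writer is never marked.
    """
    if (
        len(entry.sharers) == 2
        and entry.last_writer is not None
        and entry.last_writer != requester
    ):
        entry.migratory = True
    # The requester becomes the only holder and the new last writer.
    entry.sharers = {requester}
    entry.last_writer = requester
    return entry.migratory
```

For example, if processor 0 wrote the line and processor 1 then read it, the directory sees two copies; the exclusive request from processor 1 marks the line migratory in this sketch.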
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....substantial, consuming a non-negligible portion of the network bandwidth. Previous studies have shown that coherency traffic exhibits some regular patterns [CBZ90]. Corresponding optimizations have been proposed to address some of these specific patterns [Per93], [Kax98]. More general schemes have also been proposed, but they remain costly in hardware, require on-chip modification, or require a large extension of the directory structure (memory overhead) [MH98]. In the scientific computing world, it is well known that data accesses are highly structured. This ....
Per Stenstrom, Mats Brorsson and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.
....complexity to reduce the frequency of messaging. Adaptive cache coherence shared memory protocols, for instance, minimize communication by monitoring the sharing behavior of data at runtime and selecting one out of many protocols suitable for enforcing coherence on a given data item [FW97b, CF93, SBS93, BCZ90]. Network interface cards may provide hardware support for high-level abstractions. For instance, CNI's [MFHW96] hardware moves data in cache block granularities between the network interface card and processor caches. Typhoon-1's [RPW96] shared memory hardware atomically moves a ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. Adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....a piece of data is read and written by a succession of processors 19 in a lockstep manner. This pattern results in the transfer of data from one processor to another, and usually involves two coherence operations (each with multiple messages), one for the read and one for the write. Recent work [24, 7, 19] in both hardware and software coherent systems discusses methods to classify migratory data and then collapse the two coherence messages into one. This technique could be built into our system, and may be very helpful in reducing the overhead due to unnecessary migration requests. 5 ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.
....consistency model that the system provides. Cache coherent architectures therefore provide latency hiding either through separate mechanisms such as data prefetching or by modifying the cache coherence mechanism. For example, the cache coherence protocol could be made adaptive [Archibald 88, Stenstrom et al. 93, Bennett et al. 90, Carter et al. 91] or it could implement a weaker memory consistency model [Hutto Ahamad 90, Gharachorloo et al. 90]. In this section, we evaluate our three communication architectures with respect to the amount of communication latency they incur. We begin by describing, ....
....of the cache-coherent shared memory architecture without changing the programming model. For example, researchers have studied adaptive or user/compiler selectable cache coherence mechanisms that use different coherency protocols for different sharing patterns [Carter et al. 91, Bennett et al. 92, Stenstrom et al. 93]. One way of combining synchronization and data transfer in the framework of a shared-memory architecture is through the use of full/empty bits on memory words [Agarwal et al. 91, Alverson et al. 90], though we would argue that this approach can be very costly, and certainly is overkill for the ....
[Article contains additional citation context not shown here]
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of 20th International Symposium on Computer Architecture, pages 109--118, 1993.
....the batch code. A relaxed memory model simplifies handling the above corner cases in an efficient manner. 3.6 Detecting Migratory Sharing Patterns. The Shasta protocol provides a sophisticated mechanism for detecting data that is shared in a migratory fashion and optimizing accesses to such data [6, 23]. Migratory sharing occurs when data is read and modified by different processors, leading to the migration of the data from one processor to another. By keeping extra information at each directory entry, the protocol detects whether the data in each line exhibits migratory behavior. A line is ....
....latencies are much larger. The primary reason for this is that the processors are utilized for handling protocol messages while they wait for data and synchronization, thus making it more difficult to improve performance through latency hiding. Cox and Fowler [6] and Stenstrom et al. [23] independently proposed the idea of optimizing the transfer of migratory data and evaluated the performance of this optimization (through simulation) for a small number of applications in the context of hardware DSM systems. Both studies focus on a small block size of 16 bytes. They both ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....implement DSM. This programmable feature is used to support both the CC-NUMA and SCOMA memory models. However, it does not provide a way for the user to select the model; this selection is performed when the machine is booted. There have been several systems that support more than one protocol [14, 27]. All these systems use information collected during runtime to determine the protocol to use for individual cache lines. These systems, in addition to incurring run-time overhead, require all programs to follow the consistency model that the system adapts to. 3 Design of a Multi Protocol DSM ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....a line in an exclusive state on a read miss. This reduces the number of message transit delays associated with sharing, though does not reduce second cache misses as directly as DSI and SLID. Such schemes were described by Cox and Fowler [5] and in a similar paper Stenstrom, Brorsson, and Sandberg [23]; they observe that a pattern of read invalidate misses observed at a line might indicate that the data is migratory, exclusively read and written over a span of time by one processor at a time. A read by a processor to such a line might fetch an exclusive copy (rather than a shared copy) ....
....downgraded only if another line last accessed by the same instruction is downgraded. This could happen long after the write executed or after some lines accessed by the write were subsequently written by other instructions. Other schemes attempt to determine the behavior of accesses to a location [1,5,16,23]. In an adaptive caching scheme described by Bennett, Carter, and Zwaenepoel [1] data sharing behavior is divided into classes. In a system based on this idea, the history of memory accesses to a location would be used to determine its class. A coherence mechanism appropriate to the class would ....
Stenstrom, P., Brorsson, M., and Sandberg, L. An adaptive cache coherence protocol optimized for migratory sharing. Proceedings of the International Symposium on Computer Architecture. May 1993, pp. 109--118.
....into our adaptive protocols. Using per word timestamps [ACD 96a, KFJ, ZSB94] addresses the problem of diff accumulation directly. The problem is alleviated in our system because we switch to using whole pages whenever the diffs are large. Cox and Fowler, and Stenstrom and Brorsson [CF93, SBS93] describe hardware cache coherence protocols that adapt to migratory sharing patterns. Migratory cache blocks are detected automatically. If a processor first reads and then writes a block, these protocols invalidate the old copy and migrate ownership of the block to the new processor on the read ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
....wide sharing, but I also examine migratory sharing, and producer consumer sharing. I propose a new optimization for wide sharing called GLOW and a new optimization based on speculative execution for producer consumer sharing. For migratory sharing I study an optimization inspired by previous work [28,93]. Because these optimizations are specific to a sharing pattern they should not be applied indiscriminately for all accesses. Doing so may result in performance loss. Therefore, we need to identify the data affected by a sharing pattern, or the accesses that belong to a sharing pattern, and ....
....update protocols in GLOW. 1.1.2 Migratory sharing. Migratory sharing refers to data that are read, modified, and written by a single processor at a time. Typically, such data are accessed within critical sections. In previous work (by Cox and Fowler [28] and by Stenström, Brorsson, and Sandberg [93]) the optimization is to collapse the coherent read (that first accesses the migratory data) and the coherent write (that updates the migratory data) into a single transaction. This optimization is performed by the home node directory, which is responsible for dynamically detecting migratory sharing. ....
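The effect of collapsing the coherent read and the coherent write into one transaction can be illustrated with a small counting model. This is only a sketch under simplifying assumptions that are not taken from the cited papers: every accessor reads the block once and then writes it once, the block is already known to be migratory when the optimization is on, and count_transactions is a hypothetical name.

```python
def count_transactions(accessors, migratory):
    """Count coherence transactions reaching the home directory.

    accessors: sequence of processor ids; each one reads the block and
               then writes it (the read-modify-write pattern in the excerpt).
    migratory: if True, a read miss is answered with an exclusive copy,
               so the following write needs no separate ownership request.
    """
    owner = None      # processor holding the block exclusively, if any
    sharers = set()   # processors holding read-only copies
    transactions = 0
    for p in accessors:
        # Read phase: a miss occurs unless p already holds a copy.
        if p != owner and p not in sharers:
            transactions += 1
            if migratory:
                # Old copy is invalidated; p receives exclusive ownership.
                owner, sharers = p, set()
            else:
                # Previous owner is downgraded to a read-only sharer.
                if owner is not None:
                    sharers.add(owner)
                    owner = None
                sharers.add(p)
        # Write phase: needs ownership unless p already has it.
        if p != owner:
            transactions += 1
            owner, sharers = p, set()
    return transactions
```

In this model, three processors taking turns generate six transactions without the optimization and three with it, which matches the "two operations collapsed into one" description in the excerpt.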
[Article contains additional citation context not shown here]
Per Stenström, Mats Brorsson, Lars Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing." In Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993.
....broadcast with remote writes. In the future, we would like to examine the impact of other basic network issues on SDSM performance. These issues include DMA versus programmed I/O interfaces, messaging latency, and bandwidth. We are also interested in incorporating predictive migration mechanisms [8, 21, 27] that would identify migratory pages and then trigger migration at the time of an initial read fault. Acknowledgement: The authors would like to thank Ricardo Bianchini and Alan L. Cox for many helpful discussions concerning this paper. ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proc. of the 20th Intl. Symp. on Computer Architecture, San Diego, CA, May 1993.
....Predictions can be made by programmers [51, 136], compilers [84, 113], or hardware. Specialized predictors in hardware include read-modify-write operation prediction in the SGI Origin protocol [66], pair-wise sharing prediction in SCI [116], dynamic self-invalidation [67], and migratory protocols [28, 120]. Existing predictors, however, are directed at specific sharing patterns known a priori. Furthermore, the protocol implementation is often made more complex by intertwining one or more predictors with the standard coherence protocol. This thesis seeks a more general predictor to accelerate ....
....timing simulation of a protocol that can be accelerated using prediction. Second, I do not want initial results in this area obscured by implementation idiosyncrasies. Nevertheless, I expect such integration to be successful because the integration of directed predictions has been successful [66, 67, 28, 120]. Section 6.3 briefly discusses possibilities for such integration. The second contribution of this chapter is a detailed evaluation of the Cosmos coherence message predictor. Section 6.4 states methodological assumptions, including the use of five scientific benchmarks on a target shared memory ....
[Article contains additional citation context not shown here]
Per Stenstrom, Mats Brorsson, and Lars Sandberg. Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....the retention of the miss classification while the line remains in the cache. We examine many more applications of the technique. Architectures have been proposed that dynamically attempt to classify other aspects of references, including temporal or spatial locality [5, 11] or migratory behavior [4, 16]. 3. Classifying Misses Miss classification identifies the following conflict miss scenario. Cache line B is accessed, resulting in a cache miss, and evicts line A from the cache. The next miss to the same cache set is an access to line A. The second miss is a conflict miss which can be ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In 20th Annual International Symposium on Computer Architecture, pages 109--118, San Diego, CA, May 1993. ACM.
....One simple reference event is a write hit on shared data in the local memory. Here, note that the history of past events can also characterize the current event, for example, the number of occurrences of a particular event [CDV 94, FW97] and the dynamic detection of migratory data [CF93, SBS93]. In theory, any fresh data can be bound to any memory location in any event of a reference. In the example of Figure 2.3, at each memory reference, the data arrangement on the memory locations could be any of the 24 possible data arrangements. DSM architectures exploit the implicit method. By ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118. Computer Architecture News, 21(2), May 1993.
....node accesses frequently. They perform poorly when data is heavily shared and writes are frequent, because after each write data must be reloaded when next accessed by remote nodes. Migratory protocols slightly improve performance for applications where memory is concurrently shared infrequently [13, 24]. Write update protocols work well when writes are frequent and written data is typically read by remote nodes prior to being overwritten, exactly those cases handled poorly by write invalidate. They generate excessive communication overhead when modifications to shared data are not typically read ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
.... (Migratory); (iii) a release-consistent [16] implementation of a conventional multiple-reader, single-writer, write-invalidate protocol (Dash); (iv) a protocol that adapts to the way that it is used to dynamically switch between Dash and Migratory, depending on how a cache line is being used [13, 27] (Adaptive); and (v) a release-consistent multiple-reader, multiple-writer, write-update protocol (Munin). We selected these five protocols because they covered a wide spectrum of options available to system designers. Table 2 summarizes the design parameters of the four protocols. The rest of ....
....single processor at a time, such as data always accessed via exclusive RW locks, because it avoids unnecessary invalidations or updates when the data is written after it is read. Several researchers have found that the provision of a migratory protocol can improve system performance significantly [13, 27]. Dash is identical to the Conventional protocol, except that the new owner of a cache block does not have to stall while it waits for acknowledgements to its invalidation messages. This optimization assumes that the program is written using sufficient synchronization to avoid data races, which ....
[Article contains additional citation context not shown here]
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....transactions performed by the transaction buffer. Second, dirty cache lines are invalidated at the time that they are read, rather than being left in the read-only state. This policy provides a simple alternative to various adaptive protocols that have been proposed to handle migrating data [15, 34]; in fact, this policy may partially account for the good performance seen for MP3D in Section 4. LimitLESS cache coherence involves a close interaction between hardware and software. The hardware invokes software handling for remote requests by making use of the Alewife message-passing interface: ....
....in Section 3.1, the Alewife coherence protocol is optimized for this type of sharing. Eliminating these requests improves the performance of sequentially consistent architectures by reducing the number of (costly) write operations, and by reducing overall network traffic. Results presented in [34] and [15] show that protocols aware of migratory sharing can achieve a reduction of more than 80% in the number of exclusive access requests for MP3D and 16-byte cache lines. The other reason for the good performance of the original MP3D code on Alewife is its relatively low, 60-cycle latency for ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual Symposium on Computer Architecture 1993, pages 109--118, New York, May 1993. ACM.
....broadcast with remote writes. In the future, we would like to examine the impact of other basic network issues on SDSM performance. These issues include DMA versus programmed I/O interfaces, messaging latency, and bandwidth. We are also interested in incorporating predictive migration mechanisms [8, 21, 27] into the protocol. Such mechanisms would identify migratory pages and then trigger migration at the time of an initial read fault, thereby eliminating the overhead of a subsequent migration request. ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.
....Their simulations indicated that adaptive protocols could almost halve the inter-node communication compared to conventional protocols for the applications that had a larger portion of data sharing as migratory, and reduce execution time by almost 19% for some applications. Stenstrom et al. [18] also looked at adaptive protocols for optimizing coherence actions for migratory sharing. Their idea is very similar to that of Cox and Fowler. They suggest an implementation scheme for extending a protocol similar to the directory-based protocol of Stanford DASH [13] to include the adaptive ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
....for sending a copy of a page to a remote node, on which a page fault occurs, is directly followed by an invalidation request from the remote node. An adaptive migratory algorithm should eliminate the overhead for invalidation by self-invalidation on sending a copy of a page to the remote node [3, 12]. The adaptive scheme presented in section 2 is modified to include the migratory protocol as one of the protocol choices. Our adaptive migratory scheme requires each node to periodically estimate the cost of using each candidate protocol for each page in its local memory as in section 2; the ....
P. Stenstrom, M. Brorsson, and L. Sandberg, "An adaptive cache coherence protocol optimized for migratory sharing," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 109--118, May 1993.
....(at run time) a different protocol (migratory, invalidate, or competitive update protocol) for each page by using the local information only. Experimental results show that the performance is improved by dynamically selecting the migratory protocol. Adaptive protocols for migratory sharing [4, 12, 14] and self-invalidation [11] have been proposed previously for hardware cache-coherent schemes. [4, 12, 14] dynamically identify migratory shared data and switch to a migratory protocol in order to reduce the overhead. [11] can predict blocks to be invalidated and perform self-invalidation. Dynamic ....
....page by using the local information only. Experimental results show that the performance is improved by dynamically selecting the migratory protocol. Adaptive protocols for migratory sharing [4, 12, 14] and self-invalidation [11] have been proposed previously for hardware cache-coherent schemes. [4, 12, 14] dynamically identify migratory shared data and switch to a migratory protocol in order to reduce the overhead. [11] can predict blocks to be invalidated and perform self-invalidation. Dynamic page placement schemes [3, 10] including page migratory have been implemented in OS level NUMA memory ....
[Article contains additional citation context not shown here]
P. Stenstrom, M. Brorsson, and L. Sandberg, "An adaptive cache coherence protocol optimized for migratory sharing," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 109--118, May 1993.
....is sent. The scenario described is that of a migratory access pattern: a sequence of reads followed by a sequence of writes by one processor with no intervening accesses by other processors [26]. Detecting migratory access and eliminating the explicit ownership message is straightforward [9], [24]. If a page is migratory, when a processor performs its first read from the page, it will fault because the page is invalid. Its request for the page will go to the processor that still owns the page. If that processor accessed the page in a similar, migratory fashion, it will preemptively send ....
....according to recent reference patterns. They found that the adjustable cache block size implementation did better than the best fixed size implementations for most of the programs in their suite. The adaptation to migratory behavior was first suggested by Cox and Fowler [9] and Stenstrom et al. [24] in the context of hardware shared memory machines. Another form of adaptivity that is important in networks of workstations is adapting to environmental characteristics such as processor and network load [6], [7]. This form of adaptivity is orthogonal to the one discussed in this paper. VIII. ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
.... access patterns into a number of specific sharing patterns, e.g. the producer-consumer pattern and the migratory pattern [1]. Adaptive shared memory systems allow multiple coherence protocols to run at the same time, or allow the coherence protocol to adapt to some identifiable access patterns [3, 11]. The main difference in these systems is regarding what and how access patterns are detected. Some heuristic mechanisms have been proposed to predict and trigger appropriate protocol behavior [8]. The implementation of an adaptive cache coherence protocol involves two issues: what adaptivity can ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
....statements right before starting the hash search. Alternatively, instead of keeping one central list of free buckets, we can distribute the list into per processor sublists. If no software change is possible, we can support the well known migratory data optimization in the cache coherence protocol [7]. In the data part of the Xid Hash hash table (Xid Hash Data) hash buckets are never shared with other processors once allocated, the misses result from their being reused after deallocation. These, therefore, are artificial sharing misses that can be eliminated if the buckets are privatized. In ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....techniques into our adaptive protocols. Using per word timestamps [1, 15, 21] addresses the problem of diff accumulation directly. The problem is alleviated in our system because we switch to using whole pages whenever the diffs are large. The work of Cox and Fowler, and Stenstrom and Brorsson [7, 20] adapts to migratory sharing patterns in hardware cache coherence protocols. Migratory cache blocks are detected dynamically (how ) If a processor first reads and then writes a migratory block, they invalidate the old copy and migrate ownership of the block to the new processor on the read miss ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
.... is well known that depending on the application, either write invalidate (WI) or write update (WU) performs best [EK88]. Indeed, even within applications, various forms of sharing behavior exist: for example, widely write-shared data (like locks), largely read-only data, and migratory data [CF93, SBS93]. Because of these different sharing patterns, varying the protocol as the application executes has the potential to increase application performance by decreasing both the number of bus transactions and the number of bytes transferred over the bus, leading to decreased bus contention and reduced ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proc. of 20th Int. Symp. on Computer Architecture, pages 109--118, 1993.
....by identifiable synchronization. The above observations suggest two possible solutions for reducing the performance loss due to migratory dirty read misses. First, a software solution that identifies accesses to migratory data struc- [footnote 2] We use the following heuristic to identify migratory data [3, 25]. A cache line is marked as migratory when the directory receives a request for exclusive ownership to a line, the number of cached copies of the line is 2, and the last writer to the line is not the requester. Because our base system uses a relaxed memory consistency model, optimizations for ....
....request for exclusive ownership to a line, the number of cached copies of the line is 2, and the last writer to the line is not the requester. Because our base system uses a relaxed memory consistency model, optimizations for dealing with migratory data such as those suggested by Stenstrom et al. [25] will not provide any gains since the write latency is already hidden. [main text] -tures can schedule prefetches to the data, enabling the latency to be overlapped with other useful work. Support for such software-directed prefetch instructions already exists in most current processors. Second, a solution ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....reduce the impact of coherence overhead. Prefetching cache blocks before their expected use hides the latency to obtain a cache block [55, 27]. Multithreading [69, 27] tolerates latency by rapidly switching to a new computation thread when a remote miss is encountered. Migratory data optimizations [14, 72] speculate about future write requests by the same processor when responding to a read request. Self-invalidation is complementary to these optimizations and could be combined with them. For example, the SPARC V9 prefetch read once instruction [7] indicates that a block should be prefetched, but ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....in conjunction with, ii) the removal of coarse-grain synchronisation points, and, iii) the latency-hiding update mechanism. SOC is applicable to any invalidation-based consistency model and complements other VSM optimisations such as distributed invalidation [6, 10], migratory data optimisations [4, 21], and pre-fetching [23]. We show how, in particular, distributed invalidation and SOC combine to improve performance. In Section 2 we review work related to our research. Section 3 describes our implementation of SOC and the applications used in our experiments. Section 4 shows our experimental ....
P. Stenstrom, M. Brosson, and L.Sandberg. Adaptive cache coherence protocol optimized for migratory sharing. In Proc. 20th Intl. Symp. on Computer Architecture, pp 109--118, 1993.
....WI protocols is two bus transactions: a bus read transaction that creates a new copy of the block followed by an invalidation request at the next write to the block. If a block is known to be migratory, it can be fetched with a single (read exclusive) bus operation. There are both adaptive [8, 27] and non adaptive [21] protocols that do reasonably well at exploiting this. In contrast, AXP requires three bus operations (read, update, update) to migrate a cache block. In programs with a lot of migratory data, this is considerably more expensive. We found that, among several programs in the ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
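The bus-operation counts stated in the context above (two transactions per migration under write-invalidate, one read-exclusive when the block is known to be migratory, three under AXP) can be tallied with a short sketch. The dictionary keys and the function name `bus_ops` are hypothetical labels chosen for illustration, and the model deliberately ignores everything except the per-migration counts given in the quoted text.

```python
# Bus operations needed to migrate one cache block, as listed in the quoted
# context. The protocol labels are illustrative names, not identifiers taken
# from the cited papers.
BUS_OPS_PER_MIGRATION = {
    "write_invalidate": ("read", "invalidate"),
    "migratory": ("read_exclusive",),
    "axp_update": ("read", "update", "update"),
}


def bus_ops(protocol: str, migrations: int) -> int:
    """Return the total bus operations for a given number of block migrations."""
    return len(BUS_OPS_PER_MIGRATION[protocol]) * migrations


if __name__ == "__main__":
    for name in BUS_OPS_PER_MIGRATION:
        print(name, bus_ops(name, 1000))
```

Under this simplified count, a workload with many migrations issues half as many bus operations with the migratory optimization as with plain write-invalidate, and a third as many as with the update-based scheme.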
....on the first fault on any of the group's pages. The extent to which prefetching techniques can outperform update strategies such as IUA is one of the subjects of our future work. Several researchers have proposed techniques for adapting to sharing patterns in the context of hardware DSMs, e.g. [7, 11, 22, 23]. Cox and Fowler [7] and Stenstrom et al. [22] have looked at optimizing the handling of migratory data in the coherence protocol. Dahlgren and Stenström [11] have studied a hybrid invalidate-update protocol where each processor makes a local decision to invalidate or update a cache block when ....
....to which prefetching techniques can outperform update strategies such as IUA is one of the subjects of our future work. Several researchers have proposed techniques for adapting to sharing patterns in the context of hardware DSMs, e.g. [7, 11, 22, 23]. Cox and Fowler [7] and Stenstrom et al. [22] have looked at optimizing the handling of migratory data in the coherence protocol. Dahlgren and Stenström [11] have studied a hybrid invalidate-update protocol where each processor makes a local decision to invalidate or update a cache block when it receives an update message. Trancoso and ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
....prediction in the SGI Origin protocol [19], pair-wise sharing prediction in SCI [35], dynamic self-invalidation [20], and migratory protocols [10, 37]. Existing predictors, however, are directed at specific sharing patterns known a priori. Furthermore, the protocol implementation is often made more complex by intertwining one or more predictors with the standard coherence protocol. This paper seeks a more general predictor to accelerate coherence ....
....timing simulation of a protocol that can be accelerated using prediction. Second, we do not want initial results in this area obscured by implementation idiosyncrasies. Nevertheless, we expect such integration to be successful because the integration of directed predictions has been successful [19, 20, 10, 37]. Section 4 briefly discusses possibilities for such integration. The second contribution of this paper is a detailed evaluation of the Cosmos coherence message predictor. Section 5 states methodological assumptions, including the use of five scientific benchmarks on a target shared-memory machine ....
[Article contains additional citation context not shown here]
Per Stenstrom, Mats Brorsson, and Lars Sandberg. Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....reduce the impact of coherence overhead. Prefetching cache blocks before their expected use hides the latency to obtain a cache block [31,20]. Multithreading [37,20] tolerates latency by rapidly switching to a new computation thread when a remote miss is encountered. Migratory data optimizations [12,38] speculate about future write requests by the same processor when responding to a read request. Self-invalidation is complementary to these optimizations and could be combined with them. For example, the SPARC V9 prefetch read once instruction [6] indicates that a block should be prefetched, but ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....by guessing the stride of memory accesses. Once determined, prefetch can occur as far in advance as needed. The effectiveness of such schemes depends upon regular access to memory and so may only work for certain programs. Other schemes attempt to determine the behavior of accesses to a location [1,5,11,17]. In an adaptive caching scheme described by Bennett, Carter, and Zwaenepoel, data sharing behavior is divided into classes. In a system based on this idea, the history of memory accesses to a location would be used to determine its class. A coherence mechanism appropriate to the class would then ....
....be used to determine its class. A coherence mechanism appropriate to the class would then be chosen for the location. Trace driven simulations show that classes can be detected and that performance gains are possible. Cox and Fowler [5] and in a similar paper Stenstrom, Brorsson, and Sandberg [17], observe that a pattern of read invalidate miss observed at a line might indicate that the data is migratory, exclusively read and written over a span of time by one processor at a time. A read by a processor to such a line might fetch an exclusive copy (rather than a shared copy) anticipating ....
Stenstrom, P., Brorsson, M., and Sandberg, L. An adaptive cache coherence protocol optimized for migratory sharing. Proc. of the Intl. Symp. on Computer Arch. May 1993, 20th, pp. 109--118.
....This avoids a read miss when the data is first accessed, and allows data motion and synchronization to be merged into a single message exchange. Several researchers have found that migratory data is fairly common and that direct support for it can improve the performance of shared memory [CF93, SBS93]. Write-shared variables are frequently written by multiple threads concurrently, without intervening synchronization to order the accesses, because the programmer knows that each thread reads from and writes to independent portions of the data. Because of the way that the data is laid out in ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....transparently offer high performance while preserving programmers' sanity. Thus, there is compelling reason to examine transparent hardware optimizations. Indeed, many adaptive cache coherence protocols that optimize various sharing patterns at run time have been proposed: for migratory data [16][17], for pairwise sharing and producer-consumer sharing [2][5], and for widely shared data [8]. Recently Mukherjee and Hill [22] showed that address-based prediction in coherence protocols can be generalized using two-level adaptive predictors which were proposed in the context of branch ....
....prediction can lead to optimization of migratory sharing patterns. The reasoning is that migratory sharing patterns often generate load misses closely followed by store write faults. The optimization we propose (inspired by the work of Cox and Fowler [16] and of Stenström, Brorsson, and Sandberg [17]) is to convert the coherent read miss to a coherent write miss. We examine three variations of this scheme and show that it works well for programs with migratory sharing while requiring no more than 72 entries in any node's prediction table. Predict whether a load will access widely shared ....
[Article contains additional citation context not shown here]
Per Stenstrom, Mats Brorsson, Lars Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing," In Proc. of the 20th ISCA, 1993.
....primitive (a global reduction implemented using message passing) with far more favorable scalability properties required little work. Although there has been some work related to optimizing coherence protocols to improve performance on migratory data in the context of hardware-based DSM systems [20, 76], it remains unclear whether migratory data sharing will dominate for applications other than those studied in this thesis [footnote 2: John D. Kubiatowicz, personal communication, August 1995], and, if so, how frequently can be replaced with carefully selected, efficient message-based primitives. The ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....in supporting the memory model. The final optimization we study reduces overhead for one particular pattern of data sharing. General cache coherence protocols may not optimally handle the majority of common data sharing patterns. Consequently, researchers have investigated many protocol extensions [7, 9, 11, 24] that allow existing protocols to perform better under specific classes of data sharing patterns. These extensions attempt to reduce network traversals and/or memory accesses. Two common classes of sharing patterns are migratory data [16] and producer-consumer data sharing. In this paper we study ....
....of interconnect messages and remote memory accesses are compared. This analysis shows that QOLB outperforms both of these algorithms, both in terms of interconnect messages and memory accesses needed to gain access to a critical section. Cox and Fowler [9] and Stenström, Brorsson, and Sandberg [24] present studies that propose different solutions for dealing with the problem of migratory sharing patterns. Both studies present adaptive schemes that can be implemented by a hardware cache coherence protocol. QOLB, which predates these studies, captures the same opportunities for ....
Per Stenström, Mats Brorsson, and Lars Sandberg. Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proc. of the 20th Annual International Symposium on Computer Architecture, pp. 109--118, May 1993.
....12-14]. These models, while simple, are too inaccurate in their predictions to be useful for our purposes. The second group of models [3, 10, 15-20] is mostly experimental in nature in that the models use information from complete memory traces of real applications, or fully simulate the applications [1, 8, 21-24]. They define data access parameters based on the precise interleaving characterization of the memory accesses performed by different processors. For example, they predict performance based on how many different processors perform reads between two writes, or on the number of reads and/or writes ....
P. Stenstrom, M. Brorsson, and L. Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing", In Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, pages 109-118, May 1993.
....Figure 15: Experiment 8: Performance Comparison: Total costs over the entire application (axis: Total Amount of Data (Bytes)). 6 Related Work. Many schemes have been proposed to reduce overhead by adapting to memory access patterns [2, 33, 11, 8, 29, 24, 9, 1, 14, 10, 4, 7, 20, 5, 26, 27, 3, 21, 16, 6]: • The approach proposed in this paper is related to the work by Veenstra and Fowler [31]. [31] evaluates the performance of three types of off-line algorithms: i) an algorithm that chooses statically, at the beginning of the program, either invalidate or update protocols on a per-page basis, ....
....Archibald [2] dynamically chooses to update or invalidate copies of a shared data object. If there are three writes by a single processor without intervening references by any other processor, all other cached copies are invalidated. • Optimizations for migratory sharing have also been proposed [8, 29, 9, 22]. These protocols dynamically identify migratory shared data and switch to a migratory protocol in order to reduce the overhead. [8, 29] are based on an invalidate protocol, and [22, 9] are based on a competitive update protocol. • Quarks [16] limits the number of updates independently for each page ....
[Article contains additional citation context not shown here]
P. Stenstrom, M. Brorsson, and L. Sandberg, "An adaptive cache coherence protocol optimized for migratory sharing," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 109--118, May 1993.
....we compare the performance of the proposed protocols. As a reference, we also evaluate state-of-the-art non-delayed invalidate and update protocols. The protocols that we examine are: Protocols used as a reference: • Inval Mg: Non-delayed invalidate with migratory data handling as described in [13]. We call this protocol base invalidate protocol. This protocol is quite sophisticated. • CompUp: Non-delayed competitive update with threshold of one. We call it base update protocol. Delayed competitive update protocols: • DCU: Delayed competitive update with a threshold of one. • ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 109--118, May 1993.
....have extended cache coherence protocols to detect and optimize migratory sharing behavior. Cox et al. evaluated adaptive protocols for bus-based and directory-based systems [5]. Stenstrom et al. presented an adaptive protocol to take advantage of migratory behavior of a shared memory application [26]. Our pattern detection serves a different role: to detect particular access patterns, but not optimize their communication, which makes our implementation simpler and cheaper. Moreover, our mechanisms detect patterns other than migratory behavior. 7 Conclusion. This paper describes a new ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. 20th Annual Int'l Symp. on Comp. Architecture, May 1993.
....in which a block is read shared. These annotations are only advisory; the system still maintains coherence even if they are not used or they are used incorrectly. The Queue On Sync Bit (QOSB, called Queue On Link Bit in [10]) mechanism implements a similar scheme in hardware. Stenstrom et al. [20] have developed a directory-based adaptive protocol for migratory data that is very similar to the one we describe. Their rule for shifting into migratory mode is identical to the one we use. Both protocols shift out of migratory mode on read miss to a clean and migratory block. Their protocol ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
....future work we will modify the protocol to work in a hierarchical bus system. We plan to evaluate the effectiveness of using different transfer and coherence sizes at different levels in the system. We will also investigate incorporating the detection of migratory sharing into our subblock protocol [SBS93]. ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proc. of 20th Int. Symp. on Computer Architecture, pages 109--118, 1993.
....memory. The studies show that given appropriate annotations, a large class of applications can perform well on Dir 1 H 1 S B,LACK. [24] demonstrates a compiler annotation scheme for optimizing the performance of protocols that dynamically allocate directory pointers. Dynamic detection [12] and [27] propose a hardware mechanism that dynamically adapts to migratory data. Protocol extension software could perform similar optimizations. In addition, there are some classes of data that create severe performance bottlenecks. These classes tend to be the result of a simplistic programming style or ....
Per Stenstrom, Mats Brorsson, and Lars Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual Symposium on Computer Architecture 1993, New York, May 1993. ACM.
....these techniques into our adaptive protocols. Using per word timestamps [1, 15, 22] addresses the problem of diff accumulation directly. The problem is alleviated in our system because we switch to using whole pages whenever the diffs are large. Cox and Fowler, and Stenstrom and Brorsson [7, 20] describe hardware cache coherence protocols that adapt to migratory sharing patterns. Migratory cache blocks are detected automatically. If a processor first reads and then writes a block, these protocols invalidate the old copy and migrate ownership of the block to the new processor on the read ....
P. Stenstrom, M. Brorsson, and L. Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
No context found.
P. Stenstrom, M. Brorsson, and L. Sandberg. Adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Int. Symp. on Computer Architecture, pages 109--118, May 1993.
No context found.
P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th International Symposium on Computer Architecture, pages 109--118, May 1993.