M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 219--228, June 1997.
....can significantly increase performance over uniprocessor systems by coordinating work among multiple processor nodes. A key feature of cache-coherent scalable shared memory systems is a Coherence Controller (CC) at each node, which ensures that cached data are kept coherent. Past research [6, 13] shows that the occupancy of CCs can be a performance bottleneck for applications with high communication requirements. A high CC occupancy hinders performance by inducing contention and reducing CC throughput. As microprocessors used in multiprocessors become more aggressive by generating This ....
....enabled by these enhancements serves to increase the overall throughput of CCs. To further enhance CC throughput, these three optimization techniques can be combined with previously proposed CC optimizations such as multiple protocol engines, pipelining, and split request-response streams [1, 2, 8, 9, 11, 13, 14, 15]. Multiple protocol engines replicate the core processing engine of a CC to increase concurrency. Hardwired CCs are pipelined to increase their bandwidth. Split request-response streams allocate dedicated hardware resources in a CC to process a request and a response in parallel. We evaluate ....
[Article contains additional citation context not shown here]
M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 219--228, June 1997.
....to understand the performance tradeoffs between using customised hardware and/or a programmable protocol processor to implement the coherence protocol. Michael et al. have carried out a study of the performance and convenience tradeoffs between using hardwired or programmable node controllers [28]. The Wisconsin Typhoon [34] and Stanford Flash [24] are research examples of programmable network controllers which aim to combine the speed and concurrency of existing hardware mechanisms with the flexibility of software coherence. The Sequent NUMA-Q is a commercial system with a programmable ....
Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. Twenty-Fourth Annual International Symposium on Computer Architecture, Denver, in Computer Architecture News, 25(2):219-228, June 1997.
....system. Decreasing the number of network interface cards also increases the protocol traffic into a node. The combined effect of a higher protocol request rate and faster incoming protocol traffic quickly makes software protocol execution a communication bottleneck as SMP nodes become larger [MNLS97,LC96]. One approach to mitigate the software protocol bottleneck is to parallelize the protocol execution. Software protocols may run on either multiple embedded processors on the network interface card, or several SMP node commodity processors. Parallel protocol execution using SMP processors ....
....handlers. Providing high bandwidth access paths to protocol resources in tightly integrated custom devices also may significantly increase cost. Rather than provide high bandwidth access paths, some systems opt to specialize resource accesses to specific embedded processors on the custom device [MNLS97,CAA 95]. Specializing resource accesses, however, may result in load imbalance in protocol execution and offset the advantages of parallel protocol execution. Resource synchronization and contention in multi-threaded protocol execution may also significantly increase protocol occupancy ....
[Article contains additional citation context not shown here]
Maged Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
....events on the same memory block. Alternatively, other proposals avoid handling multiple operations on a memory block simultaneously by partitioning the shared address space and exploiting event parallelism across partitions. Such an address partitioning can happen either statically at design time [11,8] or dynamically at runtime [4]. The static schemes simply use address demultiplexers to decide which engine will service a coherence event. The dynamic schemes require an address synchronization mechanism built into the coherence event queue to serialize servicing multiple events for the same ....
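The dynamic scheme described in this context can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the cited papers: the class name `DynamicDispatcher`, the 128-byte block size (`block_bits=7`), and the policy of skipping queued events whose block is already being serviced.

```python
from collections import deque


class DynamicDispatcher:
    """Hypothetical single event queue with per-block serialization."""

    def __init__(self, num_engines, block_bits=7):
        self.queue = deque()            # pending coherence events (addresses)
        self.busy_blocks = set()        # blocks currently being serviced
        self.free_engines = list(range(num_engines))
        self.block_bits = block_bits    # assumed block size: 2**7 bytes

    def enqueue(self, addr):
        self.queue.append(addr)

    def dispatch(self):
        """Start as many events as possible; events for a busy block wait."""
        started = []
        waiting = deque()
        while self.queue and self.free_engines:
            addr = self.queue.popleft()
            block = addr >> self.block_bits
            if block in self.busy_blocks:
                waiting.append(addr)    # same block in flight: must serialize
            else:
                engine = self.free_engines.pop(0)
                self.busy_blocks.add(block)
                started.append((engine, addr))
        # Return skipped events to the front, preserving their arrival order.
        self.queue.extendleft(reversed(waiting))
        return started

    def complete(self, engine, addr):
        """Mark an event finished so later events for its block may run."""
        self.busy_blocks.discard(addr >> self.block_bits)
        self.free_engines.append(engine)
```

In this sketch, two events for the same block are never in service at once, while events for different blocks can proceed in parallel on any free engine.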
....to search and synchronize addresses in a single event queue. 3.3 Static Home-Based Partitioning. In home-based partitioning (Figure 4), protocol events for (local) home memory addresses are handled by one set of FSMs and the protocol events for remote memory addresses are handled by another set [8]. Within a set, static block-interleaved partitioning is used to partition the protocol events among the FSMs. DSM clusters typically implement a global physical address space in which the upper address bits include a home identifier. An address demultiplexer uses the home identifier bits to ....
M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
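The home-based static partitioning in this context can be sketched as a small address demultiplexer. The bit layout is an assumption made for illustration only (an 8-bit home identifier starting at bit 32, 128-byte blocks, two FSMs per set); the excerpt does not give these values, and `DemuxConfig` and `select_engine` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemuxConfig:
    home_shift: int = 32       # assumed position of the home identifier bits
    home_bits: int = 8         # assumed width of the home identifier
    block_bits: int = 7        # assumed block size: 2**7 = 128 bytes
    engines_per_set: int = 2   # assumed number of FSMs in each set


def select_engine(addr, local_node, cfg=DemuxConfig()):
    """Return (set_name, fsm_index) for a protocol event on `addr`."""
    # Step 1: the home identifier picks the local or the remote FSM set.
    home = (addr >> cfg.home_shift) & ((1 << cfg.home_bits) - 1)
    engine_set = "local" if home == local_node else "remote"
    # Step 2: block-interleaved partitioning picks an FSM within the set.
    index = (addr >> cfg.block_bits) % cfg.engines_per_set
    return engine_set, index
```

Because the FSM index depends only on the block number, every event for a given block reaches the same FSM, so no cross-engine synchronization on that block is needed in this sketch.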
....or compiler restructuring application data and did not compare performance against a custom controller. Unfortunately, recent studies have shown that the high occupancy of a programmable controller can result in a performance penalty of 4% to 93% in an SMP node compared to a custom controller [10, 21]. In a generic workstation environment, this can impact the performance of applications that do not use shared memory. This brings into question whether the benefits of this degree of flexibility are outweighed by the higher overheads caused by using a general purpose processor as opposed to a ....
....controller is not particularly sensitive to the size of the modeled configuration. 5.5 Custom versus Programmable DSM Controllers. Although programmable DSM controllers have appeared in research machines [8, 25], an argument has been raised that occupancy issues make them inherently inefficient [21, 10]. Since support for multiple coherence mechanisms would be relatively easy to provide on a programmable DSM controller, we simulated three DSM controller configurations: (i) a single protocol custom controller (SPCC), (ii) a multiple protocol custom controller (MPCC), and (iii) a programmable DSM ....
[Article contains additional citation context not shown here]
M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 219--228, June 1997.
....code. Adding the compute and miss component times, we get the times in Table 2. The numbers in the table do not include any contention time. We have written the handlers to minimize the latency, even though that may have increased the occupancy. These numbers are similar to those of Michael et al. [15]. Handler Latency Occupancy Read 40 80 80 Read Exclusive 40 80 80 10 per inval. Acknowledgment 40 40 140 Write Back 40 140 Table 2: Latencies and occupancies in processor cycles for the major types of protocol handlers in AGG. NUMA and COMA execute the protocol in hardware, while AGG ....
M. Michael, A. Nanda, B.-H. Lim, and M. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 219-228, June 1997.
....and node controller as follows: L: the latency experienced in each communication event, 10 cycles for long messages (which include 64 bytes of data, i.e. one cache line) and 5 cycles for all other messages. This represents a fast network, comparable to the point-to-point latency used in [11]. o: the occupancy of the node controller. Like Holt et al. [6], we have adapted the LogP model to recognise the importance of the occupancy of a node controller, rather than just the overhead of sending and receiving messages. The processes which cause occupancy are simulated in more detail ....
....better performance can be obtained for some benchmarks, the strategy relies on judicious marking of widely shared data for each application. 5 Related Work. A number of measures are available to alleviate the effects of contention for a node, such as improving the node controller service rate [11], and combining in the interconnection network for fetch and update operations [4]. Architectures based on clusters of bus-based multiprocessor nodes provide an element of read combining since caches in the same cluster snoop their shared bus. Caching extra copies of data to speed up retrieval ....
Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. 24th Annual International Symposium on Computer Architecture, Denver, in Computer Architecture News, 25(2):219--228, June 1997.
....shows an example of message exchange and state transitions in two caches and a directory. Unfortunately, the finite state machine that implements the coherence logic often incurs multiple long latency operations. These latencies can become severe if coherence actions are implemented in software [104, 100, 80] or firmware [71]. Additionally, a directory may need to exchange messages with other caches before it can respond to a processor's request for a memory block. Such message exchange can also introduce substantial delay in the critical path of a remote access. For example, Figure 6 1a shows that a ....
Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 219--228, 1997.
....uniprocessor-node parallel computers. As such, the protocol handlers either executed on the node's commodity processor along with computation or an embedded processor on the network interface card. Multiple SMP processors, however, increase the demand on fine-grain protocol execution on a node [23,16,14,6]. To maintain the balance between computation and communication, protocol execution performance must increase commensurate to the number of SMP processors. One approach to increase software protocol performance is to execute protocol handlers in parallel. Legacy stack protocols (e.g. TCP/IP) have ....
....significant shortcoming of the multiple protocol queues model is that individual processors do not take advantage of the tight coupling of resources within an SMP. Michael et al. recently observed that static partitioning of messages into two protocol queues leads to a significant load imbalance [16]. Rather than partition the resources among protocol queues, processors on one node can collaborate in handling messages from a single queue. It follows from a well-known queueing theory result that single-queue multi-server systems inherently outperform multi-queue multi-server systems [13]. In ....
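The queueing claim in this context can be checked numerically under textbook assumptions. The sketch below assumes Poisson arrivals, exponential service times, and an even random split of traffic across the separate queues (that is, M/M/c against c independent M/M/1 queues). These modeling choices and the function names are illustrative assumptions, not details from the cited work.

```python
from math import factorial


def erlang_c(c, a):
    """Probability that an arrival must wait in an M/M/c queue.

    c: number of servers; a: offered load (arrival rate / service rate).
    """
    if not 0 <= a < c:
        raise ValueError("offered load must satisfy 0 <= a < c")
    top = a**c / factorial(c) * c / (c - a)
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom


def mean_wait_shared(lam, mu, c):
    """Mean queueing delay with one shared queue feeding c servers."""
    return erlang_c(c, lam / mu) / (c * mu - lam)


def mean_wait_split(lam, mu, c):
    """Mean queueing delay with c separate queues, each an M/M/1."""
    lam_i = lam / c                 # assumed even random split of arrivals
    rho = lam_i / mu
    if rho >= 1:
        raise ValueError("each queue must be stable (rho < 1)")
    return rho / (mu - lam_i)
```

For example, with two servers, a service rate of 1.0 and a total arrival rate of 1.6, `mean_wait_shared(1.6, 1.0, 2)` is smaller than `mean_wait_split(1.6, 1.0, 2)`, which is consistent with the claim for this particular model.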
[Article contains additional citation context not shown here]
M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
....and message passing, e.g. the Alewife machine, do not consider system level issues like protection and multitasking. Our use of a commodity microprocessor to execute coherence protocol code (also used in Typhoon [26]) is controversial. Many feel that global cache miss latency would be too high [23] to be practical, motivating designers of commercial DSMs such as the Origin 2000 [17] to implement the entire cache protocol in hardware. This belief, however, may be based on inefficient designs for hardware supporting the protocol processor. For example, requiring the sP to poll for completion ....
M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence Controller Architectures for SMP-based CC-NUMA Multiprocessors. In The 24th Annual International Symposium on Computer Architecture Conference Proceedings, Denver, CO, June 1997.
....and Hennessy in [11]. In particularly severe cases, such as the Gaussian Elimination benchmark discussed below, contention for the node controller to which ownership has migrated dominates execution time (Bianchini and LeBlanc [4]). A number of measures are available to avoid this. Michael et al. [16] evaluate the effect of improving the node controller service rate. Combining in the interconnection network has been used to avoid contention for reads, writes and fetch and update operations, for example in the NYU Ultracomputer [7] and the Saarbrucken SB PRAM prototype [1]. Attempts have been ....
....similar issues [15] uses hardware rather than simulation to study the bottlenecks in a workstation cluster architecture (rather than shared memory). Scalable distributed shared memory architectures rely on the node controllers to synthesise cache-coherent shared memory across the entire machine. In [16] an investigation is made into the performance tradeoffs between using customised hardware or a programmable protocol processor to implement the coherence protocol. Proxies allow read requests for data to be combined in controllers away from the home node which is suffering from contention. This is a ....
Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In 24th International Symposium on Computer Architecture (ISCA), Denver, CO, June 1997.
....the flexibility and generality of a programmable controller leads to slower coherence protocol execution, which in turn increases controller occupancy and memory latency [11]. The extent to which this degrades application performance has been the subject of several detailed simulation studies [12, 23, 19]. The analytic model can quickly assess the impact of higher controller occupancy. We evaluate the impact of programmable controllers by modeling a decoupled node architecture with increased .... [Figure: axis label "network occupancy"; series "fft, baseline" and "fft, prog"] ....
M. Michael, A. Nanda, B. Lim, and M. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proc. 24th Int'l Symp. on Computer Architecture, pages 219--228, June 1997.
....1 shows an example of message exchange and state transitions in two caches and a directory. Unfortunately, the finite state machine that implements the coherence logic often incurs multiple long latency operations. These latencies can become severe if coherence actions are implemented in software [31, 30, 23] or firmware [22]. Additionally, a directory may need to exchange messages with other caches before it can respond to a processor's request for a memory block. Such message exchange can also introduce substantial delay in the critical path of a remote access. For example, Figure 1a shows that a ....
Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 219--228, 1997.
....compared to FLASH because hardware implements most communication mechanisms as well as accesses to data in local DRAM that requires no further remote actions. Commercial DSMs, such as the Origin 2000 [11], implement the entire cache protocol in hardware to reduce remote cache miss latency [16]. Our design provides hardware to handle common case scenarios and to reduce software overheads to the point that it is feasible to execute coherence protocol code on a commodity microprocessor, as proposed in the Typhoon design [19]. After an initial discussion of the requirements for an NIU that ....
M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence Controller Architectures for SMP-based CC-NUMA Multiprocessors. In Conference Proceedings of the 24th Annual International Symposium on Computer Architecture, Denver, CO, June 1997.
....Center as part of the HighT project. FLASH [3] and the MIT Alewife [1] machines. A key component of this type of machine is the coherence controller on each node that provides cache-coherent access to memory that is distributed among the nodes of the multiprocessor. Recent research results [4, 11] show that the occupancy of the coherence controller (CC) can be the performance bottleneck for applications with high communication requirements. Motivated by these results, we study three approaches to alleviating this problem: multiple protocol engines (PEs); split request-response streams; and ....
....to local addresses and one for remote addresses. The architects of the Sequent STiNG [8] system also considered a similar approach as one of the ways to reduce controller occupancy. However, the impact of this approach on the performance of these systems was not studied. Michael et al. [11, 10] evaluate systems with one local and one remote PE in the context of comparing the performance of hardwired and programmable coherence controllers. Their results show significant improvements in the performance of such systems over systems with single protocol engine controllers. They also find ....
[Article contains additional citation context not shown here]
M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 219--228, June 1997.
No context found.
M.M. Michael, A.K. Nanda, B-H. Lim, and M.L. Scott. Coherence Controller Architectures for SMP-based CC-NUMA Multiprocessors. In Proc. of the 24th Annual Int'l Symposium on Computer Architecture (ISCA'97), pages 219--228, June 1997.