HOLT, C., HEINRICH, M., SINGH, J. P., ROTHBERG, E., AND HENNESSY, J. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Tech. Rep. CSL-TR-95-660, Stanford University, Jan. 1995.
....can significantly increase performance over uniprocessor systems by coordinating work among multiple processor nodes. A key feature of cache-coherent scalable shared memory systems is a Coherence Controller (CC) at each node, which ensures that cached data are kept coherent. Past research [6, 13] shows that the occupancy of CCs can be a performance bottleneck for applications with high communication requirements. A high CC occupancy hinders performance by inducing contention and reducing CC throughput. As microprocessors used in multiprocessors become more aggressive by generating This ....
....for coherence protocol handling. Except for a prior study [14] the performance impact of pipelining in coherence controllers has not been discussed in the literature. We focus on hardwired CCs because they have been shown to yield higher performance than programmable protocol processors [6, 13]. 7 Conclusion Based on a comparison between general purpose microprocessors and coherence controllers (CCs) we have proposed three optimizations to enhance the throughput and reduce the occupancy of hardwired CCs: nonblocking execution, early fetching, and Protocol Engine (PE) superpipelining. ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical report, Stanford University, January 1995.
....and/or memory. Furthermore, these cards typically do not provide support for simultaneous dispatch of multiple messages. Handler execution, therefore, is inherently a serial operation and parallelizing it would incur high software synchronization overhead, result in high protocol occupancies [HHS 95], and lower overall performance. Blizzard serializes protocol handler execution on SMP nodes. Serializing handler execution obviates the need for synchronization of accesses for resources used only by the protocol, i.e. the message queues and the directory state. Accesses to fine grain tags, ....
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
....are not tightly synchronized. For example, Dusseau et al. used LogP to analyze a variety of sorting algorithms with irregular communication patterns [11]. They found that some of their models underestimated execution time and attributed the difference to contention costs. Furthermore, Holt et al. [18] used LogP as a framework for an experimental study of contention in memory controllers for shared memory; for a variety of SPLASH benchmark applications and a variety of controller speeds and network latencies they find that contention in the memory controller dominates the costs of handler ....
....the processor nodes. The model can be extended to include analysis of contention in the network, as noted in Section 3.2. However, a number of researchers have found that for many real applications contention in current interconnection networks accounts for only a minimal portion of total runtime [10, 18, 26]. Furthermore, for the algorithms investigated in this paper we found that network contention is insignificant. To simplify the explanation of the LoPC model, the analyses in this paper will assume: 1) a single CPU per node, 2) a single computation thread per CPU, 3) message handlers run on the ....
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Stanford Computer Systems Laboratory, January 1995.
....it can potentially be better than adaptive sequential prefetching for a software only directory protocol. Unfortunately, the results in [9] show that the most common stride is one, and then stride and adaptive sequential prefetching behave similarly in terms of read stall time reduction. In [22], Holt et al. evaluate how the occupancy of the directory controller, the network latency, and the bandwidth impact the performance of distributed shared memory systems. They evaluate a range of controller implementations, from a general purpose coprocessor on the I/O bus to hardwired controllers, ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
....wide area and/or low bandwidth networks, the use of LANs in distributed systems (DSM, RPC, etc.) causes most operating systems researchers to focus on LANs. There is nothing inherently wrong with this, and there is a strong correlation between message latency and (for example) DSM performance [11]. However, those operating systems researchers who make claims regarding WAN applications (such as HTTP servers) seem to ignore the special characteristics of WANs, especially the long connection durations. 2.3. Emerging realistic benchmarks I do not want to give the impression that nobody is ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Tech. Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, Jan. 1995.
....The read and write accesses are both fine grained, so it suffers high false sharing and fragmentation in SVM systems. It also has a lot of lock based synchronization when building its shared tree, which is very expensive in SVM systems. Radix sorts a series of integer keys in ascending order [26, 85]. Its inherent data referencing pattern is one producer with one consumer, but its induced pattern at page granularity (in the permutation phase of the sort) is multiple producers with one consumer. Read accesses are coarse grained but write accesses are fine grained and scattered. It suffers ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Stanford University, January 1995.
....or compiler restructuring application data and did not compare performance against a custom controller. Unfortunately, recent studies have shown that the high occupancy of a programmable controller can result in a performance penalty of 4% to 93% in an SMP node compared to a custom controller [10, 21]. In a generic workstation environment, this can impact the performance of applications that do not use shared memory. This brings into question whether the benefits of this degree of flexibility are outweighed by the higher overheads caused by using a general purpose processor as opposed to a ....
....controller is not particularly sensitive to the size of the modeled configuration. 5. 5 Custom versus Programmable DSM Controllers Although programmable DSM controllers have appeared in research machines [8, 25] an argument has been raised that occupancy issues make them inherently inefficient [21, 10]. Since support for multiple coherence mechanisms would be relatively easy to support on a programmable DSM controller, we simulated three DSM controller configurations: i) a single protocol custom controller (SPCC) ii) a multiple protocol custom controller (MPCC) and (iii) a programmable DSM ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Stanford University, 1995.
....Center as part of the HighT project. FLASH [3] and the MIT Alewife [1] machines. A key component of this type of machines is the coherence controller on each node that provides cache coherent access to memory that is distributed among the nodes of the multiprocessor. Recent research results [4, 11] show that the occupancy of the coherence controller (CC) can be the performance bottleneck for applications with high communication requirements. Motivated by these results, we study three approaches to alleviating this problem: multiple protocol engines (PEs) split request response streams; and ....
....is custom architected for coherence protocol handling. The performance impact of pipelining in coherence controllers has not been discussed in the literature in the past. We focus on hardwired coherence controllers because they have been shown to yield superior performance over protocol processors [4, 11]. The results show that each approach is highly effective at reducing controller occupancy by as much as 28% and improving execution time by as much as 16%, for applications with high communication bandwidth requirement on a system with 1 SMP node (or 4 processors) per CC. A combination of ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical report, Stanford University, January 1995.
....assume the underlying hardware supports wormhole routing and is deadlock free. In order to give a qualitative analysis, we adopt 4 parameters to compare different multicast schemes: the number of invalidation messages n_inv, the number of acknowledgement messages n_ack, the host occupancy h_o [16], i.e. the time spent on sending or receiving messages at the host processor, and the number of communication steps (step) in the invalidation message sending phase. Although those parameters will be analyzed respectively, they are interdependent with each other. The first two parameters can be ....
C. Holt, Mark Heinrich, J. P. Singh, E. Rothberg, and John Hennessy. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Stanford University, 1995.
....assume that low level compilers take care of details such as batching messages when possible. Several studies have examined how different aspects of network performance affect program performance. Cypher et al. [5] examined the performance of several message passing scientific codes. Holt et al. [13] used simulation to examine the performance of the FLASH multiprocessor as its architectural parameters were varied and found that performance was heavily dependent on message latency and overhead. The Wisconsin Wind Tunnel was also built to examine the impact of different communication ....
C. Holt, M. Heinrich, J. Singh, E. Rothberg, and J. Hennessy. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
....into the local bus, like in Typhoon 0 [22] or START NG [5]. Our results show that a software implemented protocol engine can perform very well, even when compared to an ideal hardware implementation. Addressing some of the concerns about performing protocol actions on the main processor [10][11], we show that efficient switching of the main processor between application and protocol can be done without hardware support for context switching. We also show that the overhead of protocol actions does not disrupt application performance to a significant degree. The protocol engine occupancy, ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
....by software overhead. The only change will be where the delay takes place. While the importance of reducing the overhead on the main processor associated with messaging has already been mentioned, it has also been shown to be important to reduce messaging overhead on the network processor [17]. One way to reduce the overhead, as measured by the amount of time spent by the NI processor per message sent, is to increase the performance of the NI processor itself. If the network interface processor performance is increased, the time delay between when a message is queued and when the ....
C. Holt et al. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Stanford University, Computer Systems Laboratory, Jan. 1995.
....FFT. Other differences between the two studies are: i) they compare Simple COMA systems, while we compare CC NUMA systems, ii) they assume a slower network with a latency of 500 ns, which mitigates the penalty of protocol processors, and iii) they considered only uniprocessor nodes. Holt et al. [3] perform a study similar to ours on comparing various coherence controller architectures and the effect of latency, occupancy and bandwidth on application performance. They also find that the occupancy of coherence controllers is critical to the performance of high bandwidth applications. However, ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical report, Stanford University, January 1995.
....limited directory coherence protocols have been proposed. Most of these protocols only use a point to point communication mechanism. However, when a write invalidation occurs, such an approach requires a large number of control messages, generates heavy network traffic, and increases occupancy [11] at home nodes. These factors lead to severe performance degradation. Recently, multidestination message passing mechanisms have been introduced for wormhole networks to achieve low latency multicast [12, 18] and gather [17] operations on distributed memory systems. Most recently, we have ....
C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, 1995.
....31] estimating accuracy vs. performance in simulating DSM systems [2], software DSM systems [23, 1, 19], and explicit communication primitives [29]. Research towards reducing network latency has been largely left to the (interconnection) network community. However, most recently, several papers [18, 9, 31] have reported that network latency is becoming a key architectural bottleneck in designing large scale DSM systems after integrating some of the above techniques. Under a closer examination, it can be observed that network latency contains two components: minimal communication latency and ....
C. Holt et al. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Stanford University, 1995.
....messages to all sharing nodes and receives unicast acknowledgments from them. Such unicast message passing incurs high traffic and contention in the network. It also makes the home nodes hot spots in the system. This has considerable impact on the occupancy of messages at directories [3]. Such overheads get reflected as high latency for write operations, leading to degradation of the overall system performance. In WWT [5] the invalidation requests are broadcast using a dedicated broadcast network. This leads to a costly design. Recently, Bhuyan et al. [6] have proposed an ....
....in an e cube mesh DSM supporting unicast communication: a) the request phase and (b) the acknowledgment phase. 2.3 Latency and Traffic Estimates In order to analyze the latency of an invalidation transaction quantitatively, let us use the following four simple measures. Home node occupancy [3] reflects the amount of processing time required at a home node in order to send the requests and receive the acknowledgments. It is proportional to the number of messages sent from and received by the home node. Average distance of a sharer from the home node reflects the network latency ....
C. Holt et al. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Stanford University, 1995.
....controllers and active memory elements. Our initial work with active memory controllers using the FLASH prototype indicates that the occupancy of an active memory controller would be significantly reduced by the introduction of active memory elements, thereby improving overall system performance [15]. We introduced a two level approach to active memory systems that focuses on designing active memory elements that can assist an active memory controller in performing data intensive operations in the memory system itself. While the data intensive calculations are best performed in the active memory ....
C. Holt, M. Heinrich, J. P. Singh, et al. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
No context found.
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Stanford University Technical Report No. CSL-TR-95-660. January 1995.
No context found.
HOLT, C., HEINRICH, M., SINGH, J. P., ROTHBERG, E., AND HENNESSY, J. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Tech. Rep. CSL-TR-95-660, Stanford University, Jan. 1995.
No context found.
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Stanford University Technical Report No. CSL-TR-95-660. January 1995.
No context found.
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Stanford University, January 1995.
No context found.
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
No context found.
C. Holt, M. Heinrich, J.P. Singh, E. Rothberg, J. Hennessy. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.