| J. B. Andrews, C. J. Beckmann, and D. K. Poulsen, "Notification and multicast networks for synchronization and coherence," Journal of Parallel and Distributed Computing, vol. 15, no. 4, pp. 332-350, August 1992. |
....at each processor improved hot spot performance considerably with adaptive routing, though the routing bottleneck dominated when dimension order routing was used [12]. Hardware resources in the network may also be used to support synchronization in a way that cuts down on hot spot accesses. In [8], Andrews, Beckmann and Poulsen described hardware designs that cut down on hot spot accesses by allowing notification and multicast to be used instead of polling or spinwaiting on a synchronization variable. Two hardware designs for implementing notification and multicast in packet switched ....
John B. Andrews, Carl J. Beckmann, and David K. Poulsen. Notification and multicast networks for synchronization and coherence. Journal of Parallel and Distributed Computing, 15, August 1992.
....to efficiently perform reliable multicasting. Hardware multicast has been studied for both direct [26] and indirect networks [39]. Research has included switch design [37], flow control [5], and deadlock avoidance [24]. Multicast has been proposed for efficient support of synchronization variables [3]. Isotach networks provide totally ordered multicasts and groups of operations which are atomic in logical time [33]. Isotach networks were originally proposed to allow pipelined implementations of sequential consistency without caches and powerful synchronization without locks. The ....
John B. Andrews, Carl J. Beckmann, and David K. Poulsen. Notification and Multicast Networks for Synchronization and Coherence. Journal of Parallel and Distributed Computing, 15(4):332--350, August 1992.
....This method is effective only for short messages as prevalent in a distributed shared memory system. The architectures proposed in the current paper allow multicast for packets as large as the buffer size at the switches and the technique works well for both long and short messages. Andrews et al. [1] have proposed a method for tree based multicast using bit string encoding in the context of dance hall architectures. However, this work focuses only on store and forward networks and short message lengths. Some parallel systems like the CM 5 [19], Meiko CS2 [4], etc. provide facilities for ....
J. B. Andrews, C. J. Beckmann, and D. K. Poulsen. Notification and multicast networks for synchronization and coherence. Journal of Parallel and Distributed Computing, 15:332--350, Aug. 1992.
....This method is effective only for short messages as prevalent in a distributed shared memory system. The architectures proposed in the current paper allow multicast for packets as large as the buffer size at the switches and the technique works well for both long and short messages. Andrews et al. [1] have proposed a method for tree based multicast using bit string encoding in the context of dance hall architectures. However, this work focuses only on store and forward networks and short message lengths. Some parallel systems like the CM 5 [16], Meiko CS2 [4], etc. provide facilities for ....
ANDREWS, J. B., BECKMANN, C. J., AND POULSEN, D. K. Notification and multicast networks for synchronization and coherence. Journal of Parallel and Distributed Computing 15 (Aug. 1992), 332--350.
....The authors claim that such change allows a Cray Y MP like system to scale up to 64 processors. Franklin and Dhar [2] present some considerations on physical constraints and modularity issues in the design of a large (2048 × 2048) interconnection network. Andrews, Beckmann and Poulsen [1] have developed some networks that provide efficient cache coherence schemes for systems with hundreds and thousands of processors. Since the use of caches reduces the memory bandwidth required by processors, this is another solution for the problem of providing enough memory bandwidth. Hsu and ....
John B. Andrews, Carl J. Beckmann, and David K. Poulsen. Notification and Multicast Networks for Synchronization and Coherence. Journal of Parallel and Distributed Computing, (15), 1992.
....packets. The hardware modifications and protocols are equally applicable to both direct and indirect (switch based) networks. We do not focus on the parallel application performance improvement possible via network hardware barrier operations. Several papers have adequately addressed this topic [3, 1, 20, 21] and our hardware scheme should be similar in performance to other tree based hardware combining schemes. However, we do provide a potential implementation, and we compare the barrier sync latency of this implementation to software based schemes. In addition, we assess the effect of moving ....
J. B. Andrews, C. J. Beckmann, and D. K. Poulsen. Notification and multicast networks for synchronization and coherence. J. of Parallel and Distributed Computing, 15:332--350, Aug. 1992.
....packets. The hardware modifications and protocols are equally applicable to both direct and indirect (switch based) networks. We do not focus on the parallel application performance improvement possible via network hardware barrier operations. Several papers have adequately addressed this topic [1, 10] and our hardware scheme should be similar in performance to other tree based hardware combining schemes. However, we do provide a potential implementation, and we compare the barrier sync latency of this implementation to software based schemes. Section 2 describes our architectural requirements ....
....wormhole routing and flit wide links, where a flit is the sub packet unit upon which network flow control is performed [3]. Each input port contains a FIFO that can buffer 16 flits. A switch modified for packlet multicast and combining is shown in Fig. 2 (this is similar to the switch described in [1]). One additional input and output have been added to the internal crossbar, providing paths to and from a new functional unit: the barrier unit. [Figure 2. A switch incorporating packlet barrier logic.] The barrier unit ....
J. B. Andrews, C. J. Beckmann, and D. K. Poulsen. Notification and multicast networks for synchronization and coherence. Journal of Parallel and Distributed Computing, 15:332--350, Aug. 1992.
....of Access Sequences .... the read mask to determine whether to route the response on one or both outputs. Designs for switches that can decode read masks have already been proposed in support of unordered multicasts [ABP92, Ste89]. An additional benefit of using the network to fan out responses is a reduction in traffic in the stages closest to memory. In computing the space overhead for the heap implementation we assume, for simplicity, that each variable is of the same size vsize. The space overhead per MM is then ....
....[KMR86] are write invalidate write update hybrids. Because the concurrency of write invalidate protocols is inherently limited, the delta cache protocols use a write update policy. A disadvantage of the update policy is the cost of distributing cache updates. Hardware support for multicasting [ABP92, Ste89] can reduce this cost in a way that is compatible with isotach networks. To maintain consistency among the copies of the same block and ensure processes observe updates to different blocks in a consistent order, updates must appear to be executed as an indivisible step. One way to obtain this ....
J. B. Andrews, C. J. Beckmann and D. K. Poulsen, Notification and Multicast Networks for Synchronization and Coherence, Journal of Parallel and Distributed Computing 15(August 1992), 332-350.
....a single shared bus, making it difficult to extend them to large scale multiprocessors. In recent years, a number of coherence schemes have been proposed and implemented for large scale multiprocessors which allow the coexistence of updating and invalidating. The scheme proposed by Andrews et al. [3] provides the supporting hardware for updating and invalidating within the interconnection network. In addition, it recognizes the potential for the compiler to select updating instead of invalidating. The scheme proposed by Goshe and Simhadri [14] uses only run time information to choose be ....
....selection of updating or invalidating eliminates the need for a counter or some other additional hardware mechanism as is required in the Competitive [17], the EDWP [4], and the Galactica Net [31] schemes. Furthermore, this compile time optimization does not require a sophisticated network [3, 14] to maintain cache coherence. It requires only a coherence directory to keep track of the processors with a valid copy of each block. There are several variations of directories [5, 9, 10, 16], any one of which can be used with this compiler optimization. This study uses a directory structure ....
J. B. Andrews, C. J. Beckmann, and D. K. Poulsen. Notification and multicast networks for synchronization and coherence. Journal of Parallel and Distributed Computing, 15(4):332--350, August 1992.
No context found.
J. B. Andrews, C. J. Beckmann, and D. K. Poulsen, "Notification and multicast networks for synchronization and coherence," Journal of Parallel and Distributed Computing, vol. 15, no. 4, pp. 332-350, August 1992.
....requesting caches from other caches. Caches can hide memory latency for sharing accesses only by exploiting spatial locality; this may lead to undesirable false sharing [4]. Data prefetching has the potential to hide memory latency for both sharing and nonsharing accesses; however, data forwarding [5] may be a more effective technique than prefetching for reducing the latency of sharing accesses. Many different prefetching architectures and algorithms have been described in the literature. This paper focuses on software initiated non binding prefetching into cache [2, 3]. In these schemes, ....
....a copy of a cache block to explicitly specified cluster caches. Other types of forwarding mechanisms have been proposed; for example, directory based schemes allow receiver initiated forwarding without requiring that sending processors explicitly specify the processors to receive forwarded data [5]. This paper presents two different multiprocessor software initiated data prefetching algorithms and a data forwarding scheme for a cache coherent, shared memory multiprocessor executing large, parallel, numerical application codes. The effectiveness of these schemes in reducing memory latency ....
Andrews, J. B., Beckmann, C. J., and Poulsen, D. K., "Notification and Multicast Networks for Synchronization and Coherence", Journal of Parallel and Distributed Computing, 15(4), August 1992, pp. 332-350.
.... sizes may lead to undesirable false sharing [3]. Data prefetching has been shown to be effective in reducing memory latency in shared memory multiprocessors [4, 5]; however, while data prefetching has the ability to hide memory latency for both sharing and nonsharing accesses, data forwarding [6, 7] may be a more effective technique than data prefetching for reducing the latency of sharing accesses. While many of the data forwarding mechanisms previously described in the literature have been intended primarily for use in optimizing synchronization operations, this paper proposes the use of ....
Andrews, J. B., Beckmann, C. J., and Poulsen, D. K., "Notification and Multicast Networks for Synchronization and Coherence", Journal of Parallel and Distributed Computing, 15(4), August 1992, pp. 332-350.
.... sizes may lead to undesirable false sharing [3]. Data prefetching has been shown to be effective in reducing memory latency in shared memory multiprocessors [4, 5]; however, while data prefetching has the ability to hide memory latency for both sharing and nonsharing accesses, data forwarding [6, 7] may be a more effective technique than data prefetching for reducing the latency of sharing accesses. This paper studies and compares the performance advantages of data prefetching and data forwarding for reducing memory latency caused by interprocessor communication in cache coherent, shared ....
....loop iterations. This assumption facilitates the construction of a forwarding compiler algorithm and of a hybrid scheme that can integrate data forwarding with data prefetching. Other mechanisms have been proposed that could be used for data forwarding in a dynamic, self scheduling environment [7, 14, 16, 18]. 3. PREFETCHING AND FORWARDING SCHEMES This section describes the data prefetching algorithm, the data forwarding scheme, and the new hybrid prefetching and forwarding scheme. These schemes are intended for use in shared memory multiprocessors executing parallel codes such as Cedar Fortran [11] ....
Andrews, J. B., Beckmann, C. J., and Poulsen, D. K. Notification and multicast networks for synchronization and coherence. J. Parallel Distrib. Comput. 15, 4 (August 1992), 332-350.
....nonsharing misses. Sharing accesses cause communication between processors when cache misses cause shared blocks to be brought to requesting caches from other caches. Data prefetching has the potential to hide memory latency for both sharing and nonsharing accesses; however, data forwarding [5] may be a more effective technique than prefetching for reducing the latency of sharing accesses. Many different prefetching architectures and algorithms have been described in the literature. This paper focuses on software initiated non binding prefetching into cache [2, 3]. In these schemes, ....
Andrews, J. B., Beckmann, C. J., and Poulsen, D. K., "Notification and Multicast Networks for Synchronization and Coherence", Journal of Parallel and Distributed Computing, 15(4), August 1992, pp. 332-350.
.... sizes may lead to undesirable false sharing [3]. Data prefetching has been shown to be effective in reducing memory latency in shared memory multiprocessors [4, 5]; however, while data prefetching has the ability to hide memory latency for both sharing and nonsharing accesses, data forwarding [6, 7] may be a more effective technique than data prefetching for reducing the latency of sharing accesses. This paper studies and compares the performance advantages of data prefetching and data forwarding for reducing memory latency caused by interprocessor communication in cache coherent, shared ....
....loop iterations. This assumption facilitates the construction of a forwarding compiler algorithm and of a hybrid scheme that can integrate data forwarding with data prefetching. Other mechanisms have been proposed that could be used for data forwarding in a dynamic, self scheduling environment [7, 14, 16, 18]. 3. PREFETCHING AND FORWARDING SCHEMES This section describes the data prefetching algorithm, the data forwarding scheme, and the new hybrid prefetching and forwarding scheme. These schemes are intended for use in shared memory multiprocessors executing parallel codes such as Cedar Fortran [11] ....
Andrews, J. B., Beckmann, C. J., and Poulsen, D. K. Notification and multicast networks for synchronization and coherence. J. Parallel Distrib. Comput. 15, 4 (August 1992), 332-350.
CiteSeer.IST - Copyright Penn State and NEC