| Cheriton, D.R., H.A. Goosen and P.D. Boyle, "Multi-Level Shared Caching Techniques for Scalability in VMP-MC," Proc. 16th Annual International Symposium on Computer Architecture, Jerusalem, Israel, May 1989, p. 16-24. |
....without the severe bus contention or network expense problems of existing multiprocessors. DSM systems pass messages, but hide the underlying mechanism. They maintain local memory, or cache, copies of shared data for rapid reading. Older demand-driven mechanisms [LGL 90, GW88, CD90, CGB89, ALKK90] for DSM idle the processor for a long time whenever remote data is fetched across a network, but min.... [Figure 2: Network Power (in CPUs) for Gaussian Elimination: 400 .... Axes: #cpu, network power. Curves: Eager Share, Prefetch, Demand Fetch.]
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. The 16th Ann. Int. Symp. on Comp. Arch., pages 16--24, May 1989.
....of scalability, including uniformly, architecturally and implementationally scalable. On the other hand, Hill [Hill90] questions whether scalability can be usefully defined at all. He challenges the technical community to either define the term rigorously, or stop using it altogether. Others [Leno90, Cher89, Hage89] use the term without accompanying definition, relying on the reader's intuitive definitions. I believe that a rigorous definition of scalability may be of little use, but that we can arrive at a useful working definition. Several qualifications need to be offered at the start, however. First, ....
....An inclusion cache simply contains tags for each cache line residing within the subtree rooted at the corresponding node. The logical inclusion bit for a line is considered to be set if an entry for that line exists in the inclusion cache, and cleared if there is no entry. The VMP-MC system [Cher89] enforces multi-level inclusion within a VMP node, but uses directory entries similar to pruning vectors rather than inclusion bits. A pruning vector at a given node is simply the collection of its children's inclusion bits for the same line. The directory entries in VMP-MC are associated ....
Cheriton, D. R., H. A. Goosen, and P. D. Boyle, Multi-Level Shared Caching Techniques for Scalability in VMP-MC, Proc. 16th Annual International Symposium on Computer Architecture, May 1989, 16-24.
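The inclusion-bit and pruning-vector definitions quoted in the excerpt above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the VMP-MC directory design: the `Node` class, its field names, and the use of integer line numbers are all hypothetical, and the inclusion cache is modelled simply as the set of lines found anywhere in a node's subtree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node in a cache hierarchy (illustration only)."""
    name: str
    children: list = field(default_factory=list)  # child Node objects
    lines: set = field(default_factory=set)       # lines cached at this node itself

    def subtree_lines(self):
        """All cache lines residing in the subtree rooted at this node."""
        result = set(self.lines)
        for child in self.children:
            result |= child.subtree_lines()
        return result

    def inclusion_bit(self, line):
        """Set if the line has an entry anywhere in this node's subtree."""
        return line in self.subtree_lines()

    def pruning_vector(self, line):
        """The children's inclusion bits for one line, in child order."""
        return [child.inclusion_bit(line) for child in self.children]


# Two leaf caches under one shared parent.
leaf_a = Node("a", lines={1, 2})
leaf_b = Node("b", lines={2, 3})
root = Node("root", children=[leaf_a, leaf_b])
print(root.pruning_vector(2))
```

In this model a request for a line only needs to be forwarded to the children whose entry in the pruning vector is set, which is the pruning that gives the vector its name.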
....correctness, allowing a higher degree of speculation in DPC's access stream. The use of shared data caches in multiprocessor systems has been described in detail by Nayfeh, Olukotun and Singh [34]. Systems based on the closely related idea of hierarchical caches have been proposed and built [10,56]. The shared caches in these systems were introduced to provide a natural location for data that was shared among the nodes of a multiprocessor. Although a separate prefetch processor was not used, shared data is implicitly prefetched by the first processor to reference the given data block. It ....
Cheriton, D.R., H.A. Goosen and P.D. Boyle, "Multi-Level Shared Caching Techniques for Scalability in VMP-MC," Proc. 16th Annual International Symposium on Computer Architecture, Jerusalem, Israel, May 1989, p. 16-24.
....cache system, and a RAMpage system. Note that the major components are the same. Both systems use the same amount of SRAM, and both have a TLB to cache recent page translations; the difference is in the way they are managed. The RAMpage hierarchy extends earlier work on software-managed caches [8, 7, 6, 19] by going all the way to implementing what was previously the lowest level of cache as a paged memory. There are two major differences between the RAMpage strategy and earlier work on software-managed caches: • hits can be handled immediately if the TLB hits, without the overhead of cache tag ....
....various approaches to software-managed caches, though these approaches are focused on different problems than the RAMpage model. Software-managed caches on VMP were designed to reduce memory traffic in a multiprocessor context, and most of the issues involved do not apply to a uniprocessor system [8, 7, 6]. Unlike RAMpage, which implements full associativity in software, VMP used 4-way set-associative caches [6]. Other more recent work on software-controlled caches was designed to do efficient address translation on a miss with a virtually addressed cache, and is therefore not closely related to ....
D Cheriton, H Goosen, and P Boyle. `Multi-level shared caching techniques for scalability in VMP-MC'. In Proceedings of the 16th International Symposium on Computer Architecture, pp. 16--24, Jerusalem, (May/June 1989).
....the affected block or not. Because snoopy protocols rely on broadcasting, they scale poorly. The ICN, typically but not necessarily a shared bus, becomes saturated with only a few PEs. Researchers are exploring ways to delay this saturation point by assuming multiple buses arranged hierarchically [CGB89, HaH89, Wil87] or in a grid [CaD90, GoW88]. This approach is promising for programs with access patterns that allow most broadcasts to be restricted to a local cluster of PEs. Directory protocols represent a more general approach to the scalability problem. These protocols do not require broadcasting. They ....
....snoopy protocols. Some protocols are directory-snoopy hybrids, e.g. the limited directory protocols proposed by Agarwal, et al. [Aga88] and several protocols for multiple bus or hybrid bus MIN architectures, such as the protocol proposed by Algudady, et al. [ADT90] and the DASH [Len90], VMP-MC [CGB89], and Aquarius [CaD90] protocols. Directory protocols are so named because they maintain a directory for each block listing the location of all copies of the block. The first directory protocol, due to Tang [Tan76], specified a single centralized directory. Censier and Feautrier improved on ....
D. R. Cheriton, H. A. Goosen and P. D. Boyle, Multi-Level Shared Caching Techniques for Scalability in VMP-MC, Proc. 16th ISCA, June 1989, 16-24.
....of associating state information with each cache block, state information is stored within the nodes of the interconnection. Each node in the hierarchy typically keeps information about cached memory locations in the subsystem below. Examples of such systems include the VMP multiprocessor [20][21] and the KSR [45]. In the VMP system design, each node of the hierarchy contains an intermediate cache, and each intermediate cache contains a superset of all data contained within the caches below it in the hierarchy. At the lowest level of the hierarchy, caches are kept consistent using a snoopy ....
....intermediate shared caches to supplement private processor caches. Two levels in the private shared cache hierarchy are very common, as in DASH [51] and NUMAchine [80]. More than two cache levels (excluding private secondary caches as a separate level) are less common, but have also been proposed [21][83]. Within each of these categories of systems, a variety of coherence alternatives are available that will be described in the subsequent sections of this chapter. Our emphasis will be on the three cache-based systems. Replica-based systems are beyond the scope of this dissertation. Note also ....
D. Cheriton, H. Goosen, and P. Boyle. Multi-level shared caching techniques for scalability in VMP-MC. In 16th Int'l. Symp. on Computer Architecture, 1989.
....remote access. Network traffic is minimized by passing only needed data and data requests. At the other extreme are eager-sharing mechanisms, built on the principle that whenever shared data changes, it is immediately sent to all processors that may need it. There are many demand-driven DSM systems [11, 19, 9, 5, 6, 1]. Notable is the Stanford Directory Architecture for SHared memory (DASH) [11]. DASH is a scalable two-level on-demand shared memory network. Each processing node is a bus-based multiprocessor with a snooping protocol for cache coherence. Nodes are connected through an interconnection network with ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. 16th Int. Symp. on Computer Arch., pages 16--24, May 1989.
....advantages of shared memory while scaling to high degrees of parallelism. DSM systems pass messages, but hide the underlying mechanisms to provide users a shared memory programming paradigm. Based on how to handle remote memory accesses, a DSM system can be classified as a demand-driven DSM system [8, 5, 3, 4] or an eager DSM system [11, 13, 14, 1]. Whenever a remote datum is needed, a processor in a demand-driven system fetches the datum over the network, causing a long delay. Eager DSM systems pass shared data values whenever they change. The Sesame [Scalable Eager ShAring MEmory] project [14] is ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. The 16th Annual International Symposium on Computer Architecture, 16--24, May 1989.
....sharing. Older on-demand mechanisms introduce long delays whenever processors wait to fetch data across the network. On the other hand, as the name indicates, eager-sharing memories immediately ("eagerly") send each changed shared datum to whatever processors may need it. Demand-driven mechanisms [10, 6, 2, 3] for DSM minimize network traffic by passing only needed data, whereas eager sharing may pass data values that are never used. Eager sharing is simple. Each processor sharing variable s has a local copy of s. Whenever it changes s, the new value is sent to the other processors that share it. With ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. The 16th Ann. Int. Symp. on Comp. Arch., pages 16--24, May 1989.
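The eager-sharing rule quoted in the excerpt above (each sharer keeps a local copy of s, and a writer pushes the new value to the other sharers) can be sketched as follows. This is a toy, single-process model: the class name `EagerSharedVariable`, the processor identifiers, and the message counter are assumptions made for illustration and are not taken from any of the cited systems.

```python
class EagerSharedVariable:
    """Toy model of eager sharing for one shared variable s (hypothetical)."""

    def __init__(self, processor_ids, initial=0):
        # Every sharing processor holds its own local copy of s.
        self.copies = {pid: initial for pid in processor_ids}
        # Counts update messages, to expose the network cost of eagerness.
        self.messages_sent = 0

    def write(self, writer, value):
        """Update the writer's copy and push the value to all other sharers."""
        self.copies[writer] = value
        for pid in self.copies:
            if pid != writer:
                self.copies[pid] = value
                self.messages_sent += 1

    def read(self, reader):
        """Reads are always local, so they never wait on the network."""
        return self.copies[reader]


s = EagerSharedVariable(["p0", "p1", "p2"])
s.write("p0", 7)
print(s.read("p1"), s.messages_sent)
```

The counter makes the trade-off in the excerpt visible: every write costs one message per other sharer whether or not that sharer ever reads the value, while reads never leave the local copy.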
....idle the processor and introduce long latency whenever remote data is fetched across the network, but minimize network traffic by passing only needed data. Eager-sharing systems immediately send shared data changes to all processors that may need it. There are many demand-driven DSM systems [12, 10, 4, 5, 1]. Most notable is the Stanford Directory Architecture for SHared memory (DASH) [12]. DASH is a 64-processor two-level on-demand shared memory network and the origin of release consistency [9]. A few parallel computer systems provide some hardware support for eager sharing, which has been made more ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. The 16th Ann. Int. Symp. on Comp. Arch., 16--24, May 1989.
....software based. Hybrid hardware-software support for shared memory was first employed in VMP [CSB86, CGBG88], a bus-based shared memory system that uses software at the processor caches to handle misses and support bus snooping. A follow-on system, Paradigm [CGB91], originally called VMP-MC [CGB89], adds a simple hardware directory at main memory to efficiently support a hierarchical bus organization. Hybrid protocols for distributed shared memory were first proposed for Alewife [ABC 95], whose LimitLESS protocol [CKA91] implements a few pointers in hardware and traps to software ....
David R. Cheriton, Hendrik A. Goosen, and Patrick D. Boyle. Multi-level shared caching techniques for scalability in VMP-MC. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 16--24, June 1989.
....and Motivation. As sequential computer technology approaches its physical limitations, parallel computing on multicomputers and multiprocessors becomes a more attractive method for meeting demands for additional computational speed. Distributed Shared Memory (DSM) systems have been proposed [33, 61, 22, 15, 16, 1, 36, 37, 54, 53, 55, 10, 40] to keep the programming advantages of shared memory systems while scaling to high degrees of parallelism. DSM systems pass messages, but hide the underlying mechanisms to provide users a shared memory programming paradigm. DSM solves many contention and scaling problems of parallel systems. Based ....
....hide the underlying mechanisms to provide users a shared memory programming paradigm. DSM solves many contention and scaling problems of parallel systems. Based on how remote memory accesses in a logically shared data space are handled, a DSM system can be classified as a demand-driven DSM system [33, 61, 22, 15, 16, 1] or an eager DSM system [36, 37, 54, 55, 53, 10]. Demand-driven mechanisms halt a processor to fetch needed data across the network, introducing long delays for remote accesses. To improve system performance, large-scale coherent cache prefetch techniques [20, 23, 32] are ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. The 16th Annual International Symposium on Computer Architecture, pages 16--24, May 1989.
....or softvm for short. It dispenses with hardware such as the translation lookaside buffers (TLBs) found in every modern microarchitecture and the page table walking state machines found in x86 and PowerPC architectures. It uses a software-handled cache miss, as in the VMP multiprocessor [11, 12, 13], except that VMP used the mechanism to explore cache coherence in a multiprocessor, while we use it to simplify memory management hardware in a uniprocessor. It also resembles the in-cache address translation mechanism of SPUR [26, 43, 56] in its lack of TLBs, but takes the design one step ....
....information. However, it replaced the TLB with another specialized hardware translation mechanism: a finite state machine that searched for PTEs in general-purpose storage (the cache) instead of special-purpose storage (TLB slots). 3.5 VMP: Software-controlled caches. The VMP multiprocessor [11, 12, 13] places virtual caches under software control. Each processor node contains several hardware structures, including a central processing unit, a software-controlled virtual cache, a cache controller, and special memory. Objects the system cannot afford to have causing faults, such as root page ....
D. R. Cheriton, H. A. Goosen, and P. D. Boyle. "Multi-level shared caching techniques for scalability in VMP-MC." In Proc. 16th Annual International Symposium on Computer Architecture (ISCA 16), June 1989.
....each cache update or invalidation, goes to all PEs, whether they hold a copy of the affected block or not. Since they rely on broadcasting, snoopy protocols scale poorly. Researchers are exploring ways to improve the scalability of snoopy protocols by using multiple buses arranged hierarchically [CGB89, HLH92, Wil87] or in a grid [CaD90, GoW88]. This approach is promising for programs with access patterns that allow most broadcasts to be restricted to a local cluster of PEs. Directory protocols represent a more general approach to the scalability problem. These protocols do not require broadcasting. They are ....
....block. Some protocols are directory-snoopy hybrids, e.g. the limited directory protocols proposed by Agarwal, et al. [Aga88] and several protocols for multiple bus or hybrid bus MIN architectures, such as the protocol proposed by Algudady, et al. [ADT90] and the DASH [Len90], VMP-MC [CGB89], Aquarius [CaD90] and Galactica Net [WiL92] protocols. Directory protocols are so named because they maintain a directory for each block listing the location of all copies of the block. The first directory protocol, due to Tang [Tan76], specified a single centralized directory. Censier and ....
D. R. Cheriton, H. A. Goosen and P. D. Boyle, Multi-Level Shared Caching Techniques for Scalability in VMP-MC, Proc. 16th ISCA, June 1989, 16-24.
....shown in Table 3, between 52% and 85% of the prefetch streams initiated at run time contain multiple strides. Such reference patterns typically arise from indexing multi-dimensional arrays or arrays that are accessed in a striped or blocked manner. Shared data caches [15] and hierarchical caches [5,30] have been proposed as a natural store for data that is shared among the nodes of a multiprocessor. In such a system, it may be possible to have one processor mimic the functionality of a DPC by prefetching data for another processor sharing the same cache, although it is not clear how the ....
Cheriton, D.R., H.A. Goosen and P.D. Boyle, "Multi-Level Shared Caching Techniques for Scalability in VMP-MC," Proc. 16th Annual International Symposium on Computer Architecture (ISCA 89), Jerusalem, Israel, May 1989, p. 16-24.
....manner. We show that this approach can reduce the number of misses on shared data by about 10% on average. 1 Introduction. Scalable machines that support a shared memory paradigm are a promising way of attaining the benefits of large-scale multiprocessing without surrendering programmability [1, 2, 3, 4, 5, 6]. An interesting subclass of these machines is the class that provides hardware cache coherence, which makes programming easier, while reducing storage access latencies by caching shared data. While these machines can do well on problems with low levels of data sharing, it is unclear how well ....
D. R. Cheriton, H. A. Goosen, and P. D. Boyle, "Multi-Level Shared Caching Techniques for Scalability in VMP-MC," in Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 16--24, June 1989.
....networks. Examples include Hector [23] and Cedar [10]. • hierarchy of shared caches: Another type of hierarchical architecture, though not strictly NUMA, consists of processors connected by multiple levels of buses with shared caches, and global memory at the bottom level. An example is VMP-MC [6]. • multi-grid of buses: processors are connected by buses running in two or more dimensions which are used to maintain cache consistency. An example is Multicube [11]. • general network: Each processor has a limited number of links through which processors are connected, either directly or ....
David R. Cheriton, Hendrik A. Goosen, and Patrick D. Boyle. Multi-level shared caching techniques for scalability in VMP-MC. In Proc. ACM Intl. Conf. on Computer Architecture, 1989.
....performance on multigrain systems. Lastly, the Cox study uses bus-based multiprocessors. Our study provides an implementation for NUMA machines. Other systems exploit clustering at a level closer to the processor, typically in the first or second level cache. These systems include VMP-MC [12], DASH [13] and KSR [14]. The benefit of clustering on these systems has been studied [15]. Interference misses due to limited cache capacity and associativity can reduce the benefits of clustering close to the processor. MGS clusters at the main memory level and thus does not suffer from these ....
David R. Cheriton, Hendrik A. Goosen, and Patrick D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. In Proceedings of the 16th International Symposium on Computer Architecture, pages 16--24, Jerusalem, Israel, June 1989.
....application structuring techniques that maximize locality and minimize contention. This paper describes the design of ParaDiGM, focusing on the novel techniques which support scalability. We identify the key performance issues with this design and summarize some results from our work to date [2] and experience with the VMP architecture [4, 5] design. We argue that ParaDiGM provides a promising approach to a highly scalable architecture. The next section describes the ParaDiGM system model. Section 3 describes the building block components used to assemble a ParaDiGM system. Section 4 ....
....clusters of processors within a node. Determining the appropriate configuration of the shared bus and cache hierarchy is a key aspect of our research, and is discussed further in Section 4. The following sections describe these modules and their interaction; additional detail is available in [2]. The MM is described first to present the bus data transfer and consistency protocols. [Footnote 3: We will refer to modules closer to the processor as higher level modules.] 3.1 Memory Module (MM). The memory module (MM) is physically addressed and provides the bulk memory for the system. It includes ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-level shared caching techniques for scalability in VMP-MC. In Proc. 16th Int. Symp. on Computer Architecture, pages 16--24, May 1989.
....interconnect. Moreover, optimized memory-based messaging obviates the need for conventional interprocessor interrupts, separate message mechanisms and I/O subsystem hardware. 7 Related Work. The original architectural support for optimized memory-based messaging was described by Cheriton et al. [10] in a design that was refined and implemented as the ParaDiGM architecture [11]. While the basic design has remained largely the same, a number of refinements were made as part of the ParaDiGM implementation and measurements. As an example, we discovered that it was faster to invalidate a received ....
D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-level shared caching techniques for scalability in VMP-MC. In Proc. 16th Int. Symp. on Computer Architecture, pages 16--24, May 1989.
No context found.
Cheriton, D.R., H.A. Goosen and P.D. Boyle, "Multi-Level Shared Caching Techniques for Scalability in VMP-MC," Proc. 16th Annual International Symposium on Computer Architecture, Jerusalem, Israel, May 1989, p. 16-24.