| D. R. Cheriton, A. Gupta, P. D. Boyle, and H. A. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. Proceedings of the 15th Annual International Symposium on Computer Architecture, 410--421, 1988. |
....halve the page size. The various algorithms described here might perform best at different page sizes. The effect of a varying page size can be accomplished on hardware with a small page size. In the VMP system, the translation buffer and the cache are the same thing, with a 128-byte line size [8]; this architecture might be well suited to many of the algorithms described in this paper. For prot and unprot operations, the small pages would be used; for disk paging, .... [Footnote 2] This algorithm must be carefully implemented to handle the case in which a page is referenced after it is put in the reserve ....
David R. Cheriton. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual Symposium on Computer Architecture, 1988.
....to maintain coherency) for a write-invalidate protocol increased by over 75 for two applications, and decreased by 48 and 66 for the other two programs. Simulation results of three parallel applications on the VMP multiprocessor also show that no single block size is best for all programs [6]. The best block size for the three programs considered varied between 32 and 132 bytes. Gupta and Weber [11] examined the effect of cache block size on the number and size of invalidations in a multiprocessor system with a directory-based cache coherency protocol. With four-byte blocks, most ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements, and performance evaluation. In 15th International Symposium on Computer Architecture, pages 410--421, June 1988.
....to maintain coherency) for a write-invalidate protocol increased by over 75 for two applications, and decreased by 48 and 66 for the other two programs. Simulation results of three parallel applications on the VMP multiprocessor also show that no single block size is best for all programs [7]. The best block size for the three programs considered varied between 32 and 132 bytes. Gupta and Weber [13] examined the effect of cache block size on the number and size of invalidations in a multiprocessor system with a directory-based cache coherency protocol. With four-byte blocks, most ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements, and performance evaluation. In Proc. 15th International Symposium on Computer Architecture, pages 410--421, June 1988.
.... Alpha processor includes an instruction set extension mechanism called PALcode (Privileged Architecture Library) to support features like memory management [3]. In the 1980s, there was some work on software-based management of caches, with emphasis on reduction of misses in a shared-memory system [9, 8]. More recently, work on managing the interface between cache and DRAM in software has focused on address translation [35, 22]. Other software-based approaches have not gone as far as treating the lowest level of cache as a fully software-managed paged memory (in effect, an SRAM main memory) 4 ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proc. 15th Int. Symp. on Computer Architecture (ISCA '88), pages 410--421, Honolulu, May/June 1988.
....halve the page size. The various algorithms described here might perform best at different page sizes. The effect of a varying page size can be accomplished on hardware with a small page size. In the VMP system, the translation buffer and the cache are the same thing, with a 128-byte line size [8]; this architecture might be well suited to many of the algorithms described in this paper. For prot and unprot operations, the small pages would be used; for disk paging, contiguous multi-page blocks would be used (as is now common on the Vax). When small pages are used, it is particularly ....
David R. Cheriton. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual Symposium on Computer Architecture, 1988.
....starting point for the reader interested in learning more about cache coherence algorithms. A number of other alternative cache management strategies have also been proposed. For example, the cache management for the VMP multiprocessor being developed at Stanford is controlled by software [CSB86, CGBG88]. Owicki and Agarwal compare software-controlled cache coherency mechanisms to hardware mechanisms in [OA89]. Their results show that software schemes scale well, but their performance is more sensitive to the sharing patterns of the workload than hardware schemes. For workloads in which a ....
....caching, however. The VMP-MC system proposed by Cheriton, Goosen, and Boyle in [CGB89] and the Encore Gigamax architecture (previously known as the Ultramax) proposed by Wilson [Wil87] are based on hierarchical busses and caches for scalability. The VMP-MC, like its ancestor the VMP [CSB86, CGBG88], uses software-controlled caching. The Gigamax uses an extended snoopy caching protocol that ensures consistency throughout the bus hierarchy. The Data Diffusion Machine (DDM) at the Swedish Institute of Computer Science is another proposed architecture based on hierarchical busses [HH89]. In ....
D. Cheriton, A. Gupta, P. Boyle, and H. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, June 1988.
....are possible (hardware vs. software implementations) and the ability (in a NUMA architecture) to directly access remote memory. Software-controlled hardware caching with a large cache line size has been studied in the context of the VMP multiprocessor both experimentally and with simulation [18]. The idea of providing a shared memory abstraction on distributed-memory, message-passing architectures in software has been studied by several groups. Kai Li's work [36, 37] based on an on-demand copying of pages between memories (page-granularity software caching) with a directory-based ....
D. Cheriton, A. Gupta, P. Boyle, and H. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, June 1988.
....for different setups. In all cases, the total cache size is 8 Kbytes and the line size 32 bytes. An alternative to the previous scheme is to provide a very small cache dedicated to the important sections of the operating system only. A similar idea has been suggested for the VMP multiprocessor [10]. We have set up a 1 Kbyte such cache (about the size of SelfConfFree) where the most important parts of the sequences are saved. A 7 Kbyte cache has been made available to the application and rest of the operating system. The operating system is now laid out without SelfConfFree area. The total ....
D. Cheriton, A. Gupta, P. Boyle, and H. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, May 1988.
....during a miss to DRAM. Characteristics of DRAM are very different to those of disk, which means that the opportunities for interesting activities on a miss are not as great as with a traditional page fault to disk, but kinds of activities proposed in the 1980s for software-managed caches [CSB86, CGBG88] are now more feasible, and there is the potential to explore even more interesting possibilities. This paper presents some data on performance of the RAMpage hierarchy in which DRAM is modelled as a simplified version of the proposed Direct Rambus design [Cri97]. In order to illustrate the ....
....without requiring operating system changes. Software-based approaches on the other hand may require operating system modification, which is clearly a harder sell. In the 1980s, there was some work on software-managed caches, with emphasis on reduction of misses in a shared-memory system [CSB86, CGBG88]. More recently, work on managing the interface between cache and DRAM in software has focussed on address translation [JM97]. In neither case has the major focus been on achieving a higher degree of associativity than is common in caches; the space created by high miss costs has been exploited ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th International Symposium on Computer Architecture, pages 410--421, Honolulu, May/June 1988.
....efficiently. The traditional problem with shared-memory machines is that contention for shared resources becomes a limiting bottleneck beyond a few tens of processors. Recent research efforts have investigated the feasibility of providing efficient large-scale shared-memory multiprocessors [54, 3, 20, 42]. This work augments these efforts with a view to studying the advantages of exploiting the snooping ability of bus-based architectures. Preliminary results contributing to the design of the Willow multiprocessor are presented in [12]. We explore the advantages of an architecture based on a ....
D. R. Cheriton, A. Gupta, P. D. Boyle, and H. A. Goosen. The VMP Multiprocessor: Initial Experience, Refinements, and Performance. In Proceedings of the 15th International Symposium on Computer Architecture. IEEE, May 1988.
....consistency. To force strong ordering amongst processors, the programmer can use explicit fence operations. PLUS provides a set of delayed read-modify-write operations for synchronisation. The delayed synchronisation allows software pipelining and fast context switching. VMP and Paradigm: The VMP [46] is a precursor to the Paradigm multiprocessor project [45], the main feature of which is the software-based cache management. Each VMP processor board contains a virtually addressed cache with a 128-byte block size and a local memory that stores cache management code and data structures. The VMP ....
D. R. Cheriton, A. Gupta, P. D. Boyle, and H. A. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. Proceedings of the 15th Annual International Symposium on Computer Architecture, 410--421, 1988.
....systems one of the major hindrances is excessive coherency overhead, i.e. additional bus traffic generated by coherency-related operations. This overhead is largely determined by the pattern of memory references to write-shared data. Several studies have documented that the pattern is bimodal [1, 5, 9]: under processor locality [1] a single processor makes multiple, consecutive accesses, most importantly, writes, to the words within a cache block, uninterrupted by accesses from other processors; in inter-processor contention, multiple processors contend for one or more words within the block, ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In 15th Annual International Symposium on Computer Architecture, pages 410--421, 1988.
....on these machines is the overhead of maintaining cache coherency, i.e. the additional bus traffic generated by the protocols. This overhead is largely determined by the pattern of memory references to write-shared data. Several studies have documented that the pattern is bimodal [AG88, CGBG88, EK89b]: under processor locality [AG88], or migratory sharing [WG89], a single processor makes multiple, consecutive accesses, most importantly, writes, to the words within a cache block, uninterrupted by accesses from other processors; in inter-processor contention, multiple processors contend ....
....within them, since cache coherency is very often maintained on a cache block basis. Several studies have shown that large cache blocks may often increase coherency overhead, even to the point where it more than negates any benefit of the prefetching provided by the larger cache block size [LYL87, CGBG88, EK89a]. This additional coherency overhead is caused by a phenomenon known as false sharing [TLH90, EJ91]. False sharing in multiprocessor caches occurs when the cache block is larger than a single word, and different processors access (read and write) different words in the same cache block. It ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In 15th Annual International Symposium on Computer Architecture, pages 410--421, May 1988.
....and shared memory primitives interact in a fixed way and do not allow arbitrary coherence policies. The Tempest mechanisms may be implemented in hardware, but Tempest protocols are by definition software-based. Hybrid hardware-software support for shared memory was first employed in VMP [CSB86, CGBG88], a bus-based shared memory system that uses software at the processor caches to handle misses and support bus snooping. A follow-on system, Paradigm [CGB91], originally called VMP-MC [CGB89], adds a simple hardware directory at main memory to efficiently support a hierarchical bus ....
David R. Cheriton, Anoop Gupta, Patrick D. Boyle, and Hendrik A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, June 1988.
....in lost bandwidth of initiating a new, non-contiguous access is also considerably lower. Accordingly, the opportunities for interesting activities on a miss are not as great as with a traditional page fault to disk, but kinds of activities proposed in the 1980s for software-managed caches [CSB86, CGBG88] are now more feasible, and there is the potential to explore even more interesting possibilities, such as context switches on misses. At time of writing, the first author was on sabbatical at Advanced Computer Architecture Laboratory (ACAL) Electrical Engineering and Computer Science ....
....requiring operating system changes. Software-based approaches on the other hand may require operating system modification, which is clearly a harder sell. In the 1980s, there was some work on software-based management of caches, with emphasis on reduction of misses in a shared-memory system [CSB86, CGBG88]. More recently, work on managing the interface between cache and DRAM in software has focused on address translation [JM97]. In neither case has the major focus been on achieving a higher degree of associativity than is common in caches; the space created by high miss costs has been exploited ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proc. 15th Int. Symp. on Computer Architecture (ISCA '88), pages 410--421, Honolulu, May/June 1988.
....block as read only in the TLB; when a write is attempted, an exception occurs and the copy is performed. Unfortunately, this scheme cannot be easily applied to blocks smaller than a page. However, Cheriton et al. proposed a deferred copy scheme for blocks of various granularities in the VMP machine [10]. The VMP machine has special cache management mechanisms that support deferred copy. The authors, however, did not evaluate the gains of this mechanism. To evaluate this mechanism, we first identify all operations that copy blocks whose size is smaller than a page. As shown in the first row of ....
D. Cheriton, A. Gupta, P. Boyle, and H. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, May 1988.
....work has been done towards eliminating it. In discussions on support for block operations, for example, while Torrellas et al. [41] suggested cache bypassing and prefetching, Chapin et al. [10] suggested cache bypassing and some OS policies to reduce the remote caching of data, and Cheriton et al. [14] proposed the deferred copy scheme, none of them actually evaluated their proposed schemes. 1.2.2 Comparison of OS and Application Cache Performance Some researchers have studied and compared the different characteristics of OS and applications, and have pointed out that OS performs poorer than ....
....as expected, the Sep setup is not desirable. The total number of misses increases relative to OptA in all workloads. An alternative to the previous scheme is to provide a very small cache dedicated to the important sections of the OS code only. A similar idea has been suggested in the literature [14]. We have set up a 1 Kbyte such cache (about the size of SelfConfFree) where the most important parts of the sequences are saved. An additional 7 Kbyte cache has been made available to the application and rest of the OS code. The OS code is now laid out without SelfConfFree area. The number of ....
[Article contains additional citation context not shown here]
D. Cheriton, A. Gupta, P. Boyle, and H. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, May 1988.
....of misses for different setups. In all cases, the total cache size is 8 Kbytes and the line size 32 bytes. An alternative to the previous scheme is to provide a very small cache dedicated to the important sections of the operating system only. A similar idea has been suggested in the literature [9]. We have set up a 1 Kbyte such cache (about the size of SelfConfFree) where the most important parts of the sequences are saved. An additional 7 Kbyte cache has been made available to the application and rest of the operating system. The operating system is now laid out without SelfConfFree area. ....
D. Cheriton, A. Gupta, P. Boyle, and H. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, May 1988.
No context found.
D. Cheriton, A. Gupta, P. Boyle, and Hendrik Goosen. The VMP Multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 410--421, June 1988.
....affected by high latency. Specialized interconnection networks such as crossbar or shuffle exchange networks are not the solution, since they are not mainstream products, and are therefore expensive (and also suffer from the bandwidth-latency tradeoff). Our solution is to cluster processors together in nodes, and provide each node with an optimized high-speed shared bus cache hierarchy. This allows .... [Footnotes: Now with Digital Equipment Corporation's Western Research Laboratory. 1 An earlier version of this work used the name VMP-MC, indicating an extension of the original VMP [4, 5] work.]
....maximize locality and minimize contention. This paper describes the design of ParaDiGM, focusing on the novel techniques which support scalability. We identify the key performance issues with this design and summarize some results from our work to date [2] and experience with the VMP architecture [4, 5] design. We argue that ParaDiGM provides a promising approach to a highly scalable architecture. The next section describes the ParaDiGM system model. Section 3 describes the building block components used to assemble a ParaDiGM system. Section 4 evaluates the benefits of a shared cache and bus ....
[Article contains additional citation context not shown here]
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proc. 15th Int. Symp. on Computer Architecture, pages 410--421. ACM SIGARCH, IEEE Computer Society, June 1988.
....because the logger can generate arbitrary amounts of data. 3.3 Deferred Copy Implementation: The prototype implements the deferred copy mechanism using extensions in the second-level cache to associate a source address and a destination address with each cache line, as developed earlier in VMP [5]. A deferred copy mapping at the software level associates a source page address with each page of the destination segment corresponding to the appropriate page frame in the source segment. When a cache line in the destination segment is referenced, it is loaded into the second-level cache from ....
D.R. Cheriton, A. Gupta, P.D. Boyle, and H.A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proc. 15th Int. Symp. on Computer Architecture, pages 410--421. ACM SIGARCH, IEEE Computer Society, June 1988.
No context found.
D. R. Cheriton, A. Gupta, P. D. Boyle, and H. A. Goosen. The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation. Proceedings of the 15th Annual International Symposium on Computer Architecture, 410--421, 1988.
No context found.
Cheriton, D. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual Symposium on Computer Architecture, 1988.
No context found.
D. R. Cheriton, A. Gupta, P. D. Boyle, and H. A. Goosen. The VMP multiprocessor: Initial experience, refinements and performance evaluation. In Proceedings of the 15th Annual Symposium on Computer Architecture, pages 410--421, May 1988.
No context found.
D. R. Cheriton, A. Gupta, P. D. Boyle, and H. A. Goosen. The VMP Multiprocessor: Initial Experience, Refinements, and Performance. In Proceedings of the 15th International Symposium on Computer Architecture, May 1988.