A. Gupta, W.-D. Weber, and T. Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," Proceedings of the International Conference on Parallel Processing, pp. I-312-321, 1990.


This paper is cited in the following contexts:


Using Compiler Assistance to Reduce the Network Traffic - Requirements Of..   (Correct)

....23, 25, 30] With extra cost, multiprocessors may employ dedicated hardware for cache coherence maintenance by allowing processors to communicate with each other about the data reference status, and to invalidate or update cached copies. Snoopy buses [12, 15, 29, 35, 38] and memory directories [2, 4, 5, 14, 37] are two prominent hardware coherence mechanisms. With run time interprocessor dataflow information, the coherence hardware never over invalidates the cached data like the software schemes, and therefore generally outperforms the software schemes. Moreover, since the hardware schemes do not ....

....The last subsection discusses related work. 2.1 Required Hardware Support In addition to the five instructions i write, n write, t write, c read, and m read, a directory is required to monitor which processors have valid copies of each block. There are several variations of directories [2, 4, 5, 14], any of which can be used with this compiler optimization. This study uses a directory structure similar to Censier and Feautrier's [4] in which each memory module is associated with its own directory. Each directory entry consists of a P-bit vector and a dirty bit, where P is the number of ....

A. Gupta, W. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directorybased cache coherence schemes. International Conference on Parallel Processing, pages 312--321, 1990.
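
The directory entry described in the excerpt above (a P-bit vector plus a dirty bit) can be sketched as follows. This is a minimal illustration only, not the protocol of the citing paper: the class name `FullMapEntry`, its methods, and the simplified read/write handling are assumptions made for this example.

```python
class FullMapEntry:
    """Hypothetical full-map directory entry: P presence bits plus a dirty bit."""

    def __init__(self, num_processors: int) -> None:
        self.presence = [False] * num_processors  # one bit per processor
        self.dirty = False                        # set while one cache holds a modified copy

    def read(self, proc: int) -> None:
        # Assumption: a read of a dirty block forces a write-back, so the
        # block becomes clean and shared again.
        self.dirty = False
        self.presence[proc] = True

    def write(self, proc: int) -> list[int]:
        # Every other cache with a valid copy must be invalidated.
        to_invalidate = [p for p, has_copy in enumerate(self.presence)
                         if has_copy and p != proc]
        self.presence = [False] * len(self.presence)
        self.presence[proc] = True
        self.dirty = True
        return to_invalidate


entry = FullMapEntry(num_processors=4)
entry.read(0)
entry.read(2)
print(entry.write(1))  # processors whose copies must be invalidated
```

Because the vector records exactly which processors hold a copy, a write invalidates only those caches, which is the property the excerpt contrasts with software schemes.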


Design Trade-Offs in High-Throughput Coherence Controllers - Anthony-Trung Nguyen..   (Correct)

....2(a) For coherence operations to local memory lines, the PE needs to access directory entries that correspond to these lines. Since the directory maintains state for all the local memory lines, the directory is large and its entries are stored in main memory. The PE uses a Directory Cache (DC) [3, 12] to mitigate both the latency and bandwidth constraints of the directory. Our baseline design uses a 4way, 16K entry directory cache per node. Each line in the directory cache contains state for 16 memory lines. The Tag Cache (TC) in Figure 2(a) is a new structure that we propose to reduce the ....

A. Gupta, W.-D. Weber, and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In International Conference on Parallel Processing, pages 312--321, August 1990.


Efficient Integration of Compiler-directed Cache Coherence And.. - Lim, Yew (2000)   (1 citation)  (Correct)

....applying stale reference analysis alone. Finally, the HWD scheme uses a full map hardware directory [1] with a standard three-state (invalid, read shared, write exclusive) invalidation based coherence protocol. The directories are distributed across the nodes and are organized as pointer caches [14] to reduce storage. We augment the HWD scheme with software controlled data prefetching so as to provide a fair comparison with the CCDP scheme. However, the HWD scheme cannot differentiate between potentially stale and nonstale data references, and it does not use the prefetch hardware ....

A. Gupta, W.-D. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proceedings of the 1990.


A New Limited Directory Cache Coherence Scheme for Shared .. - Mannava, Kumar, Bhuyan (1995)   (Correct)

....of this block. The directory entry for a block consists of the state of the block and a presence bit vector indicating the caches with a copy of the block. The scalability of such a system is limited by the size of the directory. This drawback can be overcome by designing limited directory schemes [4, 5, 3]. Limiting the directory size forces these schemes to resort to either broadcasting or limit the sharing of a memory block. It is observed that limiting the directory size could result in significant degradation in performance for some applications [6] Recently some researchers have proposed ....

....a write on a readshared copy where invalidations have to be sent individually to each node having a valid copy and the acknowledgments from these have to be processed serially at home once again. Several limited directory schemes have been proposed in the literature to reduce the storage overhead [1, 4, 5]. They all have a common theme: limiting the number of entries in the directory. They differ in how directory overflow situation is handled. In one scheme [1] called Dir i B scheme, the pointer overflow is handled by adding a broadcast bit to the state information of each block. If the number of ....

[Article contains additional citation context not shown here]

A. Gupta, W.D.Weber, and T.Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," In Proceedings of International Conference on Parallel Processing, volume I, pp. 312--321, 1990.


Owner Prediction for Accelerating Cache-to-Cache.. - Acacio, Gonzalez, .. (2002)   (2 citations)  (Correct)

....transfer misses than traditional directory based multiprocessors by exploiting the extra ordering properties of the switch. On the contrary, our proposal does not require any interconnection network with special ordering. Finally, caching directory information was originally proposed in [9] and [20] as a means of reducing the memory overhead entailed by directories. In [1] it is proposed a two level directory architecture as a means of obtaining the performance of a non scalable full map directory. Subsequently, we studied the effect that the integration into the processor die of ....

A. Gupta, W.-D. Weber and T. Mowry. "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes". Proc. Int'l Conference on Parallel Processing (ICPP'90), pp. 312--321, August 1990.


ADAM: A Decentralized Parallel Computer Architecture Featuring.. - Huang (2002)   (Correct)

....they also have their limits. With a 64 byte block size, a simple directory based cache coherence protocol has a memory overhead of over 200% for a 1024 processor system [CS99] p.565. Techniques such as limited pointer schemes [ASHH88], extended pointer schemes [ALKK91] and sparse directories [GWM90] can all be used to mitigate the overhead of cache coherence in large parallel systems, but at the cost of more complex protocols or the need for special mechanisms to handle corner cases where the protocol breaks down. The other problem with caches is that technology scaling is not ideal; ....

....scalability issues of the KSR 1 interlocking rings, it still relies on a directory lookup architecture. This means that either large cache lines or a high memory overhead must be paid for storing the presence bit vectors in the cache memories. While there are mechanisms such as sparse directories [GWM90] or limited pointers [ASHH88] that can reduce this overhead, these mechanisms introduce more complexity into the system. The ADAM architecture, on the other hand, presents programmers with a virtual shared memory space and no caches. Coherence in ADAM is trivial, as there is only one location for ....

A Gupta, W.D. Weber, and T Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proceedings of the International Conference on Parallel Processing, pages 312--321, August 1990.
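
The overhead figure quoted in the first excerpt above can be checked with a short calculation. The sketch assumes a full bit-vector directory with one presence bit per processor plus one state bit per block; the function name and the `state_bits` parameter are illustrative, not taken from the cited text.

```python
def full_map_overhead(num_processors: int, block_bytes: int, state_bits: int = 1) -> float:
    """Ratio of directory bits to data bits for one memory block."""
    directory_bits = num_processors + state_bits  # presence vector + assumed state bit(s)
    data_bits = block_bytes * 8
    return directory_bits / data_bits


# 1024 processors, 64-byte blocks: 1025 directory bits against 512 data bits.
print(f"{full_map_overhead(1024, 64):.2%}")
```

The presence vector alone is twice the size of the block it describes, so any additional state pushes the overhead past 200%.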


Dynamic Pointer Allocation for Scalable Cache Coherence.. - Simoni, Horowitz (1991)   (14 citations)  (Correct)

....providing a directory entry for each block of main memory is inefficient from a storage standpoint. Perhaps the most promising solution in this regard uses a relatively small number of directory entries, each with a corresponding tag that stores the address of the block to which the entry applies [2, 9]. This organization is often called a directory cache since its addressing and tags are similar to traditional caches, though in fact no backing store is needed [9] While the length of the directory still grows O(z) with the number of processors in the system, the constant of proportionality is ....

.... small number of directory entries, each with a corresponding tag that stores the address of the block to which the entry applies [2, 9] This organization is often called a directory cache since its addressing and tags are similar to traditional caches, though in fact no backing store is needed [9]. While the length of the directory still grows O(z) with the number of processors in the system, the constant of proportionality is much smaller since it is related to the size of the caches instead of the size of main memory. There are also decentralized approaches that do not maintain all ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In Proc. of the 1990.


Design And Analysis Of Update-Based Cache Coherence Protocols For .. - Glasco (1995)   (1 citation)  (Correct)

....of cached copies of each memory line. When this limit is exceeded, the limited pointer schemes either invalidates one of the copies to make room for the new request [4] assumes all caches now have a copy of the line [4] switches to a coarse grain mode where each bit represents several caches [39] or traps to software to extend the directory list [14] With a limited pointer scheme, the centralized directory scales as O(N Limited NMemoryLines ) where N Limited is the number of bits in the limited directory entry. The other approach notes that the maximum number of cached copies of a ....

....number of bits in the limited directory entry. The other approach notes that the maximum number of cached copies of a memory line is limited by the total size of all caches and not by the size of memory. In this case, a directory cache could be used to cache this smaller set of directory entries [39]. Also, the bits for each directory entry can be dynamically allocated out of a pool of directory bits [62] Several studies have suggested that the average number of shared copies of a memory line is small [13, 57, 71, 4, 26] The results presented by the researchers demonstrate that the limited ....

[Article contains additional citation context not shown here]

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. Technical Report No. CSL-TR-90-417, Computer Systems Laboratory, Stanford University, 1990.


A New Scalable Directory Architecture for Large-Scale .. - Acacio, González..   (Correct)

....introduced by directory increases the latency of these protocols. This overhead does not appear in snooping protocols, because they broadcast all coherence transactions to all the nodes in the system. Directory schemes must satisfy two requirements to provide support for scalable multiprocessors [9]. First, the bandwidth needed to access directory information must scale well with the number of processors. This requirement can be achieved by distributing the physical memory and the directory among all the system nodes, and by using a scalable interconnection network. In this way, each memory ....

....remote accesses remain the major hurdle on the scalability. Memory overhead is usually managed from two orthogonal points of view: reducing directory width and reducing directory height. Some authors proposed to reduce the width of directory entries by using compressed sharing codes: Coarse Vector [9], which is currently employed in the SGI Origin 2000 multiprocessor [15], Tristate [1], Gray Tristate [16] and Home [16]. Other proposals reduce directory width by having a limited number of pointers per entry to keep track of sharers [1, 4, 22]. Differences between them are mainly found in the ....

[Article contains additional citation context not shown here]

A. Gupta, W.-D. Weber and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. Proc. Int'l Conference on Parallel Processing, pp. I: 312-321, August 1990.


A Superassociative Tagged Cache Coherence Directory - Lilja, Ambalavanan (1994)   (1 citation)  (Correct)

....with cached copies of a block are associated with each block in the memory. Since the data caches are significantly smaller than the main memory, however, most of the memory blocks will not be cached at any given time. As a result, most of these pointer bits will be unused. The tagged directories [5, 8, 10, 12], in contrast, dynamically allocate pointers from a special purpose pointer cache to individual memory blocks when the block is moved from the memory to a data cache. Each entry in the pointer cache requires two fields: 1) an address tag to identify the memory block to which the pointer is ....

A. Gupta, W.-D. Weber, and T. Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," Intl. Conf. Parallel Processing, I:312-321, 1990.


Piranha: A Scalable Architecture Based on.. - Barroso.. (2000)   (53 citations)  (Correct)

....and pins, and provides simpler system scaling. In addition, we leverage the low latency, high bandwidth path provided by the integration of memory controllers on the chip. We use two different directory representations depending on the number of sharers: limited pointer [1] and coarse vector [14]. Two bits of the directory are used for state, with 42 bits available for encoding sharers. The directory is not used to maintain information about sharers at the home node. Furthermore, directory information is maintained at the granularity of a node (not individual processors) Given a 1K node ....

A. Gupta, W.-D. Weber, and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In International Conference on Parallel Processing, July 1990.


Toward The Design Of Large-Scale, Shared-Memory Multiprocessors - Scott (1992)   (3 citations)  (Correct)

....concurrent read requests; all read requests must be processed at the directory entry. This will prevent workloads with any significant read contention from performing well on large systems. 1.3. Coarse vectors Another way to limit the size of directories is to use coarse vectors for the entries [Gupt90]. A coarse vector is similar to a full width directory entry, except that each bit represents a region of two or more processors. When a processor reads a line, the bit for that processor's region is set in the line's directory entry. When a line is invalidated, messages are sent to all processors ....

....entry, except that each bit represents a region of two or more processors. When a processor reads a line, the bit for that processor's region is set in the line's directory entry. When a line is invalidated, messages are sent to all processors in the regions whose bits are set. Gupta, et al. [Gupt90], suggest that the directories be structured such that an entry can hold one or more processor pointers and then switch over to a coarse vector representation when the pointers overflow. They analyzed the performance of a coarse vector scheme for four programs running on a simulated 32 processor ....

[Article contains additional citation context not shown here]

Gupta, A., W. Weber, and T. Mowry, Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes, Proc. 1990 International Conference on Parallel Processing, August 1990, I312-I321.
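
The coarse vector behavior described in the excerpts above (one bit per region, set on a read, expanded to every processor in the region on invalidation) can be sketched as follows. The class and method names are hypothetical, and the mapping of consecutive processor numbers to regions is an assumption made for this example.

```python
class CoarseVectorEntry:
    """Hypothetical coarse-vector directory entry: one bit per region of processors."""

    def __init__(self, num_processors: int, region_size: int) -> None:
        self.num_processors = num_processors
        self.region_size = region_size
        self.bits = 0  # bit r is set if any processor in region r may hold a copy

    def record_read(self, proc: int) -> None:
        # Assumption: processors are grouped into regions by consecutive number.
        self.bits |= 1 << (proc // self.region_size)

    def invalidation_targets(self) -> list[int]:
        # Every processor in a marked region receives an invalidation,
        # whether or not it actually holds a copy.
        num_regions = -(-self.num_processors // self.region_size)  # ceiling division
        targets: list[int] = []
        for region in range(num_regions):
            if (self.bits >> region) & 1:
                start = region * self.region_size
                end = min(start + self.region_size, self.num_processors)
                targets.extend(range(start, end))
        return targets


entry = CoarseVectorEntry(num_processors=32, region_size=4)
entry.record_read(5)
entry.record_read(17)
print(entry.invalidation_targets())
```

The entry needs only one bit per region instead of one per processor; the cost is extra invalidation messages to region members that never cached the line.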


Compiling Techniques for Improving Decoupled Virtual Shared Memory.. - Zhu   (Correct)

....have been proposed. For more scalable multiprocessors with general interconnection network between processors, directory based protocols [2, 4, 14, 44, 73, 86] and compiler assisted software protocols [20, 51, 57, 79] have been suggested. Recently, dynamically tagged directory protocols [15, 41, 55, 56] have evolved from previous directory based schemes. Snoopy Protocols Snoopy protocols are also called bus based protocols. All processors in the system can observe any memory access by snooping on the bus. The first snoopy protocol [3] uses write through strategy to keep a consistent global ....

....broadcast bit is set. If this bit is set when exclusive access is requested, invalidations are broadcasted to all p processors in the system. It has shown that this directory can produce good performance with n = 2 to 4 pointers per block [54]. Tagged Directory Protocols Tagged directories [15, 41, 55, 56] dynamically allocate pointers to individual memory blocks when the block is moved from the memory to a data cache. The directory entry keeps an address tag to identify the memory block to which the pointer is allocated and an array of bits to actually point to the processors with a cached copy of ....

A. Gupta, W. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proceedings of 1990 International Conference on Parallel Processing, Vol. I: Architecture, pages 312--321, August 1990.


Communication Mechanisms in Shared Memory Multiprocessors - Byrd (1998)   (Correct)

....For large systems, this can require an excessive amount of storage in the directory. Other schemes have been developed that reduce the amount of storage needed in the directory by broadcasting invalidates [8] extending the directory with software [14] tracking copies for groups of processors [40], or maintaining distributed linked lists of cache lines [41, 96] Because of scalability and performance issues, many large scale systems do not attempt to provide uniform memory access. These NUMA (Non Uniform Memory Access) architectures take on many forms, ranging from hierarchies of busses ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. Technical Report CSL-TR-90-417, Computer Systems Laboratory, Stanford University, March 1990.


Highly Concurrent Cache Coherence Protocols - Williams, Reynolds, Jr. (1990)   (Correct)

....a significant bottleneck from Tang's protocol, the protocol still scales poorly since the size of the bit vector increases linearly with the number of PEs. The focus of much of the subsequent work on directory protocols has been on improving the scalability of the directory representation [Aga88, ArB84, CKA91, GWM90, Jam90, LiY90, OKN90, SiH91, Ste89, ThD91]. Although reducing the space complexity of the directory representation is an important problem, our focus is different: on improving the scalability of cache coherence protocols by increasing their concurrency. For simplicity, we assume the bit vector representation proposed by Censier and ....

A. Gupta, W. Weber and T. Mowry, Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes, Proc. 1990 ICPP, August 1990, I312 -I-321.


Multicast Snooping: A New Coherence Method Using a .. - Bilir, Dickson.. (1999)   (9 citations)  (Correct)

....predictor in action. Efficiently encoding the multicast mask is also an important implementation issue. For this paper we simply assume a full map directory entry, similar to most directory protocol studies [23]. However, many of the techniques developed for limited directory protocols [16, 27] can be adapted to multicast snooping. 3 Multicast Address Networks A key technology for multicast snooping is a multicast address network. A sufficient condition is that it creates the illusion of a total order of reliable multicasts. That is, multicasts can be conceptually numbered in such ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In Proceedings of the 1990 International Conference on Parallel Processing (Vol. I Architecture), pages 312--321, 1990.


Architectures For Distributed Shared Memory . . . - Dowd, al. (1994)   (Correct)

....protocols being studied in this paper. First the definitions of the states of cache and memory blocks are presented followed by the description of the two protocols. There has been no attempt to reduce the size of the tables for cache coherence protocols since the improvements suggested in [15, 16] that reduce storage requirements without a significant reduction in performance are applicable here. 4.1 Definitions Due to the high bandwidth provided by the photonic network, it is possible to increase the block size and increase the capture of spatial locality. This is possible since the per ....

A. Gupta, W.-D. Weber, and T. Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," in International Conference on Parallel Processing, pp. I--312--I--321, 1990.


An Innovative Implementation for Directory-based Cache.. - Shi, Hu, Zhu (1997)   (Correct)

....[Figure 1: Basic multiprocessor organization and directory scheme.] For directory schemes to be successful for scalable multiprocessors, they must satisfy two requirements [7]. The first is that the bandwidth to access directory information must scale well with the number of processors. This requirement can be achieved by distributing the physical memory and the corresponding nodes, and by using a scalable interconnection network. The second requirement is that the ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In ICPP'90, pp.I-312-I321.


Data Prefetching And Data Forwarding In Shared Memory.. - Poulsen, Yew (1994)   (13 citations)  (Correct)

....multiple outstanding (6) prefetches (regular reads use blocking loads) Cache coherence is implemented using a three state, directory based invalidation protocol. Directories are distributed across the main memory modules, are full mapped, and are organized as pointer caches to reduce their size [23]. 3.2 Simulation Environment Experimental results are acquired using EPG sim execution driven simulation of various parallel Perfect codes. Events simulated include global and private memory accesses, parallel loop setup and scheduling operations, and synchronization operations. Events caused by ....

Gupta, A., Weber, W.-D., and Mowry, T., "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes", Proceedings of ICPP, 1990, pp. I-312-321.


The Interaction Of Compilation Technology And Computer.. - Edited By   (Correct)

....coherence schemes are described in Section 2. In Section 3, several previous studies, and some of their shortcomings, are discussed. The simulation methodology and results are presented in Section 4, and Section 5 concludes the paper. 2 COHERENCE SCHEMES The directory based coherence schemes [4, 5, 11] use the directory to keep track of processors with a valid copy of each memory block, by relying exclusively on run time information to maintain coherence. The compiler directed mechanisms, however, detect access to stale data at compile time through data and control dependence analysis, and ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. International Conference on Parallel Processing, pages 312--321, 1990.


Integrating Fine-Grained Message Passing In Cache Coherent.. - Yew, Poulsen (1996)   (1 citation)  (Correct)

....that are non blocking to the issuing processor; regular reads block the issuing processor. Cache coherence is implemented by using a three state, directory based invalidation protocol. Directories are distributed across the main memory modules, are full mapped, and are organized as pointer caches [27] to reduce their size. 3.2 Application Codes The application codes studied in this work were selected from the Perfect Benchmarks (Table 1) The particular versions of the codes used in these experiments are parallel, Cedar Fortran versions of the applications [17] that contain parallel loops ....

Gupta, A., Weber, W.-D., and Mowry, T., "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes", Proceedings of the International Conference on Parallel Processing, 1990, pp. I-312-321.


Improving Performance of Bus-Based Multiprocessors - Anderson (1995)   (1 citation)  (Correct)

....chapter) is used, the first re read fills all other copies of the invalidated block, making it easier to fix the cost of an invalidation. In the absence of read snarfing, we must estimate the number of re reads done. Previous work by Gupta et al. has shown that most blocks are not widely shared [GWM90]. Therefore, in our experiments where we did not use read snarfing, we estimated a small number of re reads for every line. It is important to know the cost of an invalidation because it is used to calculate the invalidation ratio. The invalidation ratio (R) is the ratio between the cost in bus ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In International Conference on Parallel Processing, pages (I) 312--321, 1990.


Compiler Support for the Efficient Use of Cache Coherence.. - Trung Nguyen   (Correct)

....for bus based systems. For more scalable systems that use general interconnection networks, directory based hardware schemes [2, 3, 5, 15, 35, 38] and compiler assisted software schemes [7, 18, 17, 25, 37] have been suggested. Recently, several authors have proposed dynamically tagged directories [6, 14, 22, 23, 24, 30] in which pointers to processors with a copy of a memory block are allocated only when the block is actually cached. These directories maintain a cache of pointers in each memory module. Typically, each pointer consists of an address tag to identify the block plus a bit vector to point to the ....

....the other hand, can perfectly disambiguate memory references at run time so that they invalidate only cache blocks that are actually stale. Unfortunately, directory based schemes require a large amount of memory to store the cache block sharing information. Several dynamically tagged directories [6, 14, 22, 23, 24, 30] have been proposed to reduce the memory requirements. In dynamically tagged directories, a pointer from a cache of pointers in the directory is allocated to a particular memory block only when the block is moved to the data cache. Typically, each pointer consists of an address tag and a vector of ....

[Article contains additional citation context not shown here]

A. Gupta, W. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proc. 1990 International Conference on Parallel Processing, Vol. I: Architecture, pages 312--321, August 1990.


Extending The Scalable Coherent Interface For Large-Scale.. - Johnson (1993)   (10 citations)  (Correct)

....of the given superline can reset the corresponding bit in the directory. However, Censier and Feautrier do not mention the directory's need for an owner pointer per superline, limiting their proposal to networks that allow snooping. Coarse vectors, independently proposed by Gupta et al. [GuWM90] and Brooks and Hoag [BrHo90], group storage by caches. Instead of maintaining a bit map of size proportional to the number of caches, each bit represents a group of caches. The bit is set whenever any cache in the group has a copy of the data. Gupta et al. recommend the use of limited pointers ....

....is sent to all caches such that the bitwise and of the mask and the cache pointer is the same as the bitwise and of the mask and the directory pointer. The pointer and mask represent a size 2^i superset of the caches that have a copy, where i is the number of unset bits in the mask. Gupta et al. [GuWM90] claim that the superset scheme performs poorly compared to a full bit map. 1.3.3.3. Software Support Two directory schemes reduce directory storage by invoking a trap to software for problem cases. The LimitLESS Directories of Chaiken et al. [ChKA91], implemented in the MIT Alewife, use limited ....

[Article contains additional citation context not shown here]

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," Proceedings of the 1990 International Conference on Parallel Processing (ICPP '90), August 1990, I-312-321.


Efficient Schemes for Limited Directory-Based DSMs Using.. - Dai, Panda (1996)   (Correct)

....of messages in case of directory overflow is critical to system performance. Otherwise, both network traffic and node occupancy [14] increases. This gets translated into increased write latency and overall performance degradation. Examples of some limited directory schemes are: coarse vector [18], Limitless [22], Superset [1] and Eviction [6]. These schemes use either hardware or software mechanisms to detect and manage directory overflow. It is to be noted that all the above directory schemes have been designed and evaluated with networks having only point to point (unicast) message ....

....nodes in the system. However, on systems supporting only unicast message passing, such broadcast requires a large number of message transfers in the system and it quickly leads to performance degradation. Thus, researchers have proposed non broadcast based schemes like coarse vector (Dir_i CV_r) [18]. In this scheme, when directory overflow occurs, the storage space for i entries are reorganized and used as region bits. During invalidation, such region bits help in significantly reducing the number of messages needed to be sent to nodes with possible sharers. In either of the above schemes, ....

[Article contains additional citation context not shown here]

A. Gupta, W.-D. Weber, and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In International Conference on Parallel Processing, pages I:312--321, Aug 1990.
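
The overflow handling described in the second excerpt above (i pointers that are reorganized into region bits once more than i nodes share the block) can be sketched as follows. This is an illustration under stated assumptions, not the cited scheme's implementation: `DirCVEntry`, its fields, and the use of Python sets in place of a fixed-width bit field are all hypothetical.

```python
class DirCVEntry:
    """Hypothetical limited-pointer entry that falls back to region bits on overflow."""

    def __init__(self, num_pointers: int, region_size: int) -> None:
        self.num_pointers = num_pointers   # i: exact sharer pointers available
        self.region_size = region_size     # r: processors covered by one region bit
        self.pointers: set[int] = set()
        self.regions: set[int] | None = None  # None while still in pointer mode

    def add_sharer(self, proc: int) -> None:
        if self.regions is not None:
            self.regions.add(proc // self.region_size)
            return
        self.pointers.add(proc)
        if len(self.pointers) > self.num_pointers:
            # Overflow: reuse the entry's storage as region bits.
            self.regions = {p // self.region_size for p in self.pointers}
            self.pointers.clear()

    def invalidation_targets(self, num_processors: int) -> list[int]:
        if self.regions is None:
            return sorted(self.pointers)  # exact sharers, no wasted messages
        targets: list[int] = []
        for region in sorted(self.regions):
            start = region * self.region_size
            targets.extend(range(start, min(start + self.region_size, num_processors)))
        return targets


entry = DirCVEntry(num_pointers=2, region_size=4)
for proc in (1, 9):
    entry.add_sharer(proc)
print(entry.invalidation_targets(16))  # still exact: pointer mode
entry.add_sharer(14)
print(entry.invalidation_targets(16))  # coarse: every processor in the marked regions
```

While the sharer count stays within i, invalidations are exact; after overflow the entry sends messages only to the marked regions rather than broadcasting to every node.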


Notification And Multicast Networks For Synchronization.. - Andrews, Beckmann.. (1992)   (9 citations)  (Correct)

....network allows a single message to be sent to an arbitrary set of recipient processors. Hardware in the memory modules controls the multicast network and implements notification. Each of the two designs contains the notion of a directory, similar to that of a directory based coherence scheme [19, 22, 24], to keep track of the recipient processors for each variable. The first design uses an implicit, network based directory, distributed over the reverse network switches, and the second uses an explicit, memory based directory, distributed over the memory modules. A directory eliminates the need ....

....cache coherence. Other researchers have presented notification schemes that integrate synchronization with cache coherence using bus based broadcast [9, 17] and MIN based multicast [29] Work has also been done on inexpensive directory based invalidate coherence schemes for MIN based systems [19, 22, 24], but these schemes rely on individual messages or broadcast instead of multicast for invalidation. For the algorithms discussed in this paper, the proposed hardware performs efficient synchronization and communication in MIN based systems using multicast and notification, without the requirement ....

[Article contains additional citation context not shown here]

Gupta A., Weber W-D., and Mowry T. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. Proc. 1990 International Conference on Parallel Processing. Penn State University, University Park, PA, 1990, Vol. I, pp. 312-321.


Relaxed Consistency and Synchronization in Parallel Processors - Zucker (1992)   (3 citations)  (Correct)

....that has a cached copy of that line, the corresponding bit is set. When a processor wants to write the line, an invalidation message must be sent to each processor with a copy of the line (there are a number of variations on this scheme for dealing with its excessive memory requirements [11, 31, 25, 58, 81]). Directories are often used when there is no single broadcast medium like a bus. The other architecture that I will examine uses a directory scheme to maintain cache coherence since it uses a multistage interconnection network (MIN) arranged in an Omega network to interconnect the processors to ....

....and directories in any form complicate the memory controller. Also, the hardware enforced cache coherence (HWCC) that directories provide can lead to a loss of performance because of false sharing [44]. There have been many suggestions on ways to deal with the scalability issue with directories [11, 25, 31, 58, 81]. However, these proposals still result in false sharing and at least some increased hardware complexity and directory memory. All these problems are dealt with by using software controlled cache coherence (SCCC) [2, 36, 38, 82]. In an SCCC system the hardware is much simpler than in an HWCC ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In 1990 International Conference on Parallel Processing, pages I--312--321, 1990.
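
The full bit-vector directory described in the first excerpt above (one presence bit per processor per line, with invalidations sent to every holder on a write) can be sketched roughly as follows. The class and method names are hypothetical, and a Python dict of integer bitmasks stands in for the per-line directory storage.

```python
class FullMapDirectory:
    """Hypothetical full-map directory: one presence bit per processor per line."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.presence = {}  # line address -> integer used as a bit vector

    def record_read(self, line, proc):
        # Set the bit for the processor that now caches the line.
        self.presence[line] = self.presence.get(line, 0) | (1 << proc)

    def sharers(self, line):
        bits = self.presence.get(line, 0)
        return [p for p in range(self.num_processors) if (bits >> p) & 1]

    def record_write(self, line, writer):
        # Every other processor holding a copy must be invalidated.
        to_invalidate = [p for p in self.sharers(line) if p != writer]
        self.presence[line] = 1 << writer  # the writer is the only holder left
        return to_invalidate


directory = FullMapDirectory(num_processors=8)
for proc in (1, 3, 5):
    directory.record_read(0x40, proc)
invalidated = directory.record_write(0x40, writer=3)
```

The bit vector grows linearly with the number of processors, which is the memory cost the excerpt refers to.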


Improving Memory Utilization in Cache Coherence Directories - Lilja, Yew (1993)   (3 citations)  (Correct)

....coherence schemes are summarized in Table 1. In the conventional hardware directories [3, 4, 7, 19], pointer resources are statically associated with each block in the main memory, fixing the total number of pointers to the size of the memory. Recently proposed dynamically tagged directories [9, 17, 26, 30] take advantage of the observation that only blocks that are actually cached in one or more processors need to be allocated pointers. In these tagged directories, pointers are dynamically associated with memory blocks using an address tag field only as the blocks are moved from the memory to a ....

....delayed allocation marking, is a significant extension of this idea of combining hardware and software coherence enforcement. It uses the predictive power of the compiler to delay the allocation of a coherence pointer as long as possible. This optimization requires a dynamically tagged directory [9, 17, 26, 30], but by delaying the allocation of pointers, they are in use for a shorter period of time, and thus can be reused more frequently. This reuse allows the size of the directory to be reduced by a factor of approximately 50 to 100 while increasing the average memory delay by less than a few percent. ....

[Article contains additional citation context not shown here]

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," International Conference on Parallel Processing, Vol. I: Architecture, pp. 312-321, 1990.


Interaction of Cache Coherency and Media Access Protocols in the .. - John Chu   (Correct)

....m o resets the bits of M in the OD specified by the replacement message. It also resets CB o [i] where i is the node which sent the replacement message. There has been no attempt to reduce the size and complexities of the tables for this directory based scheme since the improvements suggested in [12, 13, 14] are applicable here. 3.2 The Protocol The cache coherence protocol defines the action that is taken for the cases of a read hit, a read miss, a write hit and a write miss. In the case of a VM miss where the block is not in physical memory, it is brought from virtual memory directly to the owner. ....

A. Gupta, W.-D. Weber, and T. Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," in International Conference on Parallel Processing, pp. I--312--I--321, 1990.


Towards A Shared-Memory Massively Parallel Multiprocessor - Litaize, Mzoughi.. (1992)   (1 citation)  (Correct)

.... right direction with technology : high speed technology is needed, high integration level is needed, point to point serial links are needed [StCo91] well suited to optical fiber, any directory data coherency algorithm actually known can be used and recent works in this important area [GuWM90] [SiHo91] can be applied to SMM, latency time grows proportionally slower with the number of processors than in other networks and this point is also a major one [GHGM91], spare bandwidth makes prefetching more attractive, SMM architectures throw a bridge between multiprocessors and ....

A. Gupta, W-D Weber, T. Mowry: "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes". Proc. of the 1990 ICPP, August 1990, vol. I, pp. 312-321.


Concurrency Control in Asynchronous Computations - Williams (1993)   (9 citations)  (Correct)

....a significant bottleneck from Tang's protocol, the protocol still scales poorly since the size of the bit vector increases linearly with the number of PEs. The principal focus of the subsequent work on directory protocols has been on improving the scalability of the directory representation [Aga88, ArB84, CKA91, GWM90, Jam90, LiY90, OKN90, SiH91, Ste89, ThD91]. Although reducing the space complexity of the directory representation is an important problem, our focus is different: improving the concurrency of cache coherence protocols. For simplicity, we assume the bit vector representation proposed by Censier and Feautrier [CeF78] but delta cache ....

A. Gupta, W. Weber and T. Mowry, Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes, Proc. 1990 ICPP, August 1990, I-312-I-321.


Enhancing the Data Sharing Flexibility of Tagged Cache.. - Lilja, Ambalavanan   (Correct)

.... the shared memory to keep track of which processors have cached copies of which blocks [2, 21, 24]. These traditional static directories typically maintain this information for each block in the shared memory, which can require an inordinate amount of hardware [12]. Dynamically tagged directories [5, 10, 13, 18] have been proposed to reduce the hardware requirements of a coherence directory. These tagged directories use special purpose caches of pointers that are dynamically allocated to memory blocks to point to processors with a copy of a specific memory block only when that block is actually cached ....

....with cached copies of a block are associated with each block in the memory. Since the data caches are significantly smaller than the main memory, however, most of the memory blocks will not be cached at any given time. As a result, most of these pointer bits will be unused. The tagged directories [5, 10, 13, 18] take advantage of the fact that pointers from the directory to the processors are necessary only when a memory block is actually cached in one or more processors. The tagged directories dynamically allocate pointers to individual memory blocks when the block is moved from the memory to a data ....

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," International Conference on Parallel Processing, Vol. I: Architecture, pp. 312-321, 1990.


Integrating Fine-Grained Message Passing In Cache Coherent.. - Poulsen, Yew (1996)   (3 citations)  (Correct)

....that are non blocking to the issuing processor; regular reads block the issuing processor. Cache coherence is implemented by using a three state, directory based invalidation protocol. Directories are distributed across the main memory modules, are full mapped, and are organized as pointer caches [27] to reduce their size. 4.2 Application Codes The application codes studied in this work were selected from the Perfect Benchmarks (Table 1). The particular versions of the codes used in these experiments are parallel, Cedar Fortran versions of the applications [11] that contain parallel loops and ....

Gupta, A., Weber, W.-D., and Mowry, T. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. Proc. 1990 International Conference on Parallel Processing. 1990, pp. I-312-321.


A Stochastic Model of Cache Coherency Overhead in SCI Rings - Field, Harrison   (Correct)

....are maintained by a doubly linked sharing list; the principle is similar to that of directory based protocols [1] except that the directory is distributed over the processors' memories and caches. Directory based systems are typified by DSMM architectures such as the Stanford DASH multiprocessor [13, 4] and as first proposed by [3]. Each node contains a cache and a portion of the shared memory which is divided into blocks of 64 bytes. The caches are addressed in units of one block (called cache lines). Cache lines carry state information and two pointers to maintain the sharing list for the ....

A. Gupta, W-D Weber and T. Mowry, Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Systems


An Evaluation of Directory Protocols for Medium-Scale.. - Mukherjee, Hill (1994)   (5 citations)  (Correct)

....proposed multicast protocols Dir_i B, Tristate, and Coarse Vector. Dir_i B, 1 ≤ i ≤ N, uses i × log N bits to exactly identify up to i sharers and broadcasts otherwise [1]. Coarse Vector uses N/K bits, where a bit is set if any of the processors in a K processor group cached the block [9]. Tristate [1], also called the superset scheme by Gupta et al. [9], uses a log N digit code requiring 2 bits per digit. The jth digit of the code is 0 if the jth bit of all sharers is 0; the digit is 1 if all sharers have 1; the digit is both otherwise. On an invalidation event, Tristate sends ....

.... Dir_i B, 1 ≤ i ≤ N, uses i × log N bits to exactly identify up to i sharers and broadcasts otherwise [1]. Coarse Vector uses N/K bits, where a bit is set if any of the processors in a K processor group cached the block [9]. Tristate [1], also called the superset scheme by Gupta et al. [9], uses a log N digit code requiring 2 bits per digit. The jth digit of the code is 0 if the jth bit of all sharers is 0; the digit is 1 if all sharers have 1; the digit is both otherwise. On an invalidation event, Tristate sends invalidation messages to all processors covered by its sharing ....

[Article contains additional citation context not shown here]

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In Proceedings of the 1990 International Conference on Parallel Processing (Vol. I Architecture), pages 312--321, 1990.
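
As a rough illustration of the two imprecise encodings described in the excerpts above, the sketch below computes which processors each one would invalidate. The function names are hypothetical, processor IDs are assumed to run from 0 to N-1 with N a power of two, and Python lists stand in for the packed bit fields a real directory would use.

```python
def coarse_vector_targets(sharers, num_procs, group_size):
    """Coarse Vector: one bit per group of K processors; invalidate whole groups."""
    groups = {p // group_size for p in sharers}
    return [p for p in range(num_procs) if p // group_size in groups]


def tristate_code(sharers, num_bits):
    """Tristate: digit j is 0 or 1 if all sharers agree on bit j, else 'both'."""
    if not sharers:
        raise ValueError("tristate_code needs at least one sharer")
    code = []
    for j in range(num_bits):
        bits = {(p >> j) & 1 for p in sharers}
        code.append("both" if len(bits) == 2 else bits.pop())
    return code


def tristate_targets(code):
    """All processor IDs whose bits match the code (a superset of the sharers)."""
    num_procs = 1 << len(code)
    return [
        p for p in range(num_procs)
        if all(d == "both" or ((p >> j) & 1) == d for j, d in enumerate(code))
    ]


# Eight processors (3-bit IDs), two sharing patterns.
close_pair = tristate_targets(tristate_code({1, 3}, num_bits=3))
far_pair = tristate_targets(tristate_code({1, 6}, num_bits=3))
coarse = coarse_vector_targets({1, 6}, num_procs=8, group_size=2)
```

Sharers 1 and 3 differ in a single bit, so the Tristate code identifies them exactly; sharers 1 and 6 differ in every bit, so the code degenerates to all eight processors, while the coarse vector with groups of two covers four.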


The Potential of Compile-Time Analysis to Adapt the Cache.. - Mounes-Toussi, Lilja (1995)   (1 citation)  (Correct)

....Galactica Net [31] schemes. Furthermore, this compile time optimization does not require a sophisticated network [3, 14] to maintain cache coherence. It requires only a coherence directory to keep track of the processors with a valid copy of each block. There are several variations of directories [5, 9, 10, 16], any one of which can be used with this compiler optimization. This study uses a directory structure similar to Censier and Feautrier's [9] in which each memory module is associated with its own directory. As a result of this compiler optimization, three different types of write instructions, ....

A. Gupta, W. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. International Conference on Parallel Processing, pages 312--321, 1990.


An Evaluation of a Compiler Optimization for Improving.. - Mounes-Toussi, Lilja, Li (1994)   (Correct)

.... using a stale memory value [19]. Current solutions to the cache coherence problem for large scale multiprocessor systems interconnected with multistage networks can be classified into two main types, specifically, software controlled mechanisms [6, 7, 8, 9, 22] and hardware directory mechanisms [1, 3, 4, 16, 29]. Software-controlled mechanisms use compile time analysis to insert extra instructions into the program that force each processor to invalidate stale entries in their caches before they are referenced. Due to the limitations of compile time analysis, such as the need to predict branch outcomes, ....

....other write references are marked as i write. 2.2 Required Hardware Support In addition to the five instructions i write, n write, t write, c read, and m read, a directory is required to monitor which processors have valid copies of each block. There are several variations of directories [1, 3, 4, 16], any of which can be used with this compiler optimization. This study uses a directory structure similar to Censier and Feautrier's [3] in which each memory module is associated with its own directory. Each directory entry consists of a P bit vector and a dirty bit, where P is the number of ....

A. Gupta, W. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. International Conference on Parallel Processing, pages 312--321, 1990.


Integrating Fine-Grained Message Passing In Cache Coherent.. - Poulsen, Yew (1996)   (3 citations)  (Correct)

....that are non blocking to the issuing processor; regular reads block the issuing processor. Cache coherence is implemented by using a three state, directory based invalidation protocol. Directories are distributed across the main memory modules, are full mapped, and are organized as pointer caches [27] to reduce their size. 4.2 Application Codes The application codes studied in this work were selected from the Perfect Benchmarks (Table I). The particular versions of the codes used in these experiments are parallel, Cedar Fortran versions of the applications [11] that contain parallel ....

Gupta, A., Weber, W.-D., and Mowry, T. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. Proc. 1990 International Conference on Parallel Processing. 1990, pp. I-312-321.


A Distributed Directory Based Cache Coherence Scheme - Gupta (1994)   Self-citation (Gupta)   (Correct)

....based systems since directory based schemes are significantly more complex than snoopy schemes. Furthermore, in a bus based system, the performance of directory based schemes is not as good as that of the snoopy schemes. Many people have proposed a variety of directory based schemes [1], [2], [5], [11], [23], [29], [31]. The schemes differ in both the protocol and the hardware used. The first directory scheme was proposed by C. K. Tang in 1976 [31]. Soon after that, Censier and Feautrier [5] presented the Full Vector Scheme in 1978 which has become the most widely used directory scheme. This ....

....Vector Directory scheme [5] since it is one of the most basic and highest performing protocols. Over the years, many new directory based protocols have been proposed. The Two bit Directory Scheme [2], Limited Pointer Directory Scheme [1], Sectored Directory Scheme [23], Sparse Directory Scheme [11], Stenstrom's Scheme [29], and Scalable Cache Coherent Interface [13] are a few examples of some of the newer schemes. However, except for Stenstrom's Scheme and the Scalable Cache Coherence Interface, the schemes mentioned above simply reduce the memory overhead of storing the block states in the ....

A. Gupta, W. D. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In International Conference on Parallel Processing, pages I--312--I--321, 1990.


The DASH Prototype: Logic Overhead and Performance - Lenoski, Laudon, Joe.. (1993)   (92 citations)  Self-citation (Gupta)   (Correct)

....would reduce the directory DRAM overhead to 6.9% and 3.4% respectively, or it could allow the system to grow to 128 or 256 processors with the same 13.7% overhead. For larger systems, a more scalable directory structure [1, 8, 13] could be used to keep the directory overhead at or below the level in the prototype. The directory's overhead in SRAM could also be improved. The 128KB remote access cache (RAC) is the primary use of SRAM in the directory. The size of the RAC could be significantly reduced if the processor caches ....

A. Gupta, W.-D. Weber, and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In Proc. 1990 Int. Conf. on Parallel Processing, pages I:312-321, August 1990.


A Distributed Directory Cache Coherence Scheme and its Effects.. - Gupta, al. (1995)   (1 citation)  Self-citation (Gupta)   (Correct)

....presented the Full Vector Scheme in 1978, which has become the most widely used directory scheme. Over the years, many new directory based protocols have been proposed. The Two bit Directory Scheme [2], Limited Pointer Directory Scheme [1], Sectored Directory Scheme [16], Sparse Directory Scheme [8], Stenstrom's Scheme [19], and Scalable Coherent Interface [10] are a few examples of some of the newer schemes. However, except for Stenstrom's Scheme and the Scalable Coherence Interface, the schemes mentioned above simply reduce the memory overhead of storing the block states in the directory. ....

A. Gupta, W. D. Weber, and T. Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In International Conference on Parallel Processing, pages I--312--I--321, 1990.


Memory Latency Reduction via Data Prefetching and Data Forwarding .. - Poulsen (1994)   (Correct)

No context found.

A. Gupta, W.-D. Weber, and T. Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," Proceedings of the International Conference on Parallel Processing, pp. I-312-321, 1990.


Journal of Instruction-Level Parallelism 6 (2004) 1-23.. - Ai Access Foundation (2004)   (Correct)

No context found.

A. Gupta, W.-D. Weber, and T. Mowry. "Reducing memory and traffic requirements for scalable directory-based cache-coherence schemes." In Proc. of the Intl. Conf. on Parallel Processing I, pages 312--321, August 1990.


Constraint Graph Analysis of Multithreaded Programs - Cain, al. (2004)   (Correct)

No context found.

A. Gupta, W.-D. Weber, and T. Mowry. "Reducing memory and traffic requirements for scalable directory-based cache-coherence schemes." In Proc. of the Intl. Conf. on Parallel Processing I, pages 312--321, August 1990.


The Use of Prediction for Accelerating Upgrade Misses in.. - Multiprocessors Manuel..   (Correct)

No context found.

A. Gupta, W.-D. Weber and T. Mowry. "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes". Proc. Int'l Conference on Parallel Processing, pp. 312--321, August 1990.


The Coherence Predictor Cache: A Resource-Efficient and .. - Nilsson, Landin..   (Correct)

No context found.

A. Gupta, W.-D. Weber, and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In Proc. of ICPP'90, pages 312--321, 1990.


Random Key Predistribution Schemes for Sensor Networks - Haowen Chan Adrian (2003)   (54 citations)  (Correct)

No context found.

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proceedings of the 1990.


Shared Memory for Distributed Systems - Of The Requirements   (Correct)

No context found.

A. Gupta, W-D Weber and T. Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes", Proc. 1990 Int'l Conf. Parallel Processing, IEEE Computer Society Press, Los Alamitos, Calif., Order No. 2101, pp. 312-321.


Cache Coherence in Large-Scale Shared Memory Multiprocessors.. - Lilja (1993)   (34 citations)  (Correct)

No context found.

Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," International Conference on Parallel Processing, Vol. I: Architecture, pp. 312-321, 1990.


CiteSeer.IST - Copyright Penn State and NEC