85 citations found.
N. Jouppi, "Cache Write Policies and Performance", Intl. Symp. on Computer Architecture, pp. 191-201, 1993.

This paper is cited in the following contexts:

Understanding and Designing for Dependent Store/Load Pairs in.. - Bhargava (2000)

....and have the ability to combine several stores with contiguous addresses or the same address. Skadron and Clark discuss the issues and tradeoffs involved in such a write buffer [53]. Martonosi and Shaw did a study of the effect of compilation techniques on the performance of a write buffer [35]. Jouppi [29] and Bray [5] consider structures they call write caches with similar properties. The issues addressed in these papers and similar papers focus on reducing the number of writes that are performed off chip and sometimes on chip. It is possible that mechanisms like the write buffer can be referred ....

N. P. Jouppi. Cache write policies and performance. In Proc. 20th Intl. Sym. on Computer Architecture, pages 191-201, May 1993.


Application-Driven Synthesis of Memory-Intensive.. - Kirovski, Lee.. (1999)   (1 citation)

....high level architecture and ASIC evaluation models. For example, The Microprocessor Report presents a monthly summary of the area and performance of numerous commercial processors [25]. The dominant impact of instruction and data cache size on system performance has been thoroughly studied [16] [17]. Memory hierarchy synthesis techniques for multimedia systems have been developed [22]. Recently, a number of compiler optimization strategies have been introduced for optimizing code generated for embedded systems [1]. Compiler techniques for reduction of cache misses in such studies have ....

....frequency of the system and external bus width and clock for each system investigated. This penalty ranges between four and 20 system clock cycles. Write back is adopted in contrast to write through, since it provides superior performance in uniprocessor systems though at increased hardware cost [17]. Each of the processors considered is constrained by the Flynn limit [13] and thus is able to issue at most a single instruction per clock period. As a consequence, the caches are designed to have a single access port. We used blocking I caches since even for general purpose applications it has ....

N. P. Jouppi, "Cache write policies and performance," in Proc. Int. Symp. Computer Architecture, vol. 21, 1993, no. 2, pp. 191--201.


Power Optimization of Variable-Voltage Core-Based Systems - Hong, Kirovski, Qu.. (1999)   (78 citations)

....bus width and clock for each system investigated. This penalty ranged between four and 20 system clock cycles. Write back was adopted as opposed to write through, since it is proven to provide superior performance and especially power savings in uniprocessor systems at increased hardware cost [24]. Each of the processors considered is constrained by the Flynn's limit [13] and is able to issue at most a single instruction per clock period. Thus, caches were designed to have a single access port. Cache access delay and a power consumption model were computed for a number of organizations and ....

N. P. Jouppi, "Cache write policies and performance," in Proc. Int. Symp. Computer Architecture, 1993, pp. 191--201.


Reducing Power with Dynamic Critical Path Information - Seng, Tune, Tullsen (2001)   (10 citations)

....density ratio for a functional block which is a potential hot spot. This metric is appropriate assuming the power density of the targeted structure is a constraint on the total design (and is similar in spirit to a cache performance study that assumes the cache sets the cycle time of the processor [12]). In that scenario, the optimization which has the best performance to component power ratio, and reduces power density to acceptable levels, would represent the best design. 5. Optimization This section examines two optimizations which exploit information available from a critical path ....

N. Jouppi. Cache write policies and performance. In 20th International Symposium on Computer Architecture, May 1993.


Silent Stores for Free: Reducing the Cost of Store Verification - Lepak (2000)   (1 citation)

....a similar process occurs: the load operation allocates the line in the LSQ cache, and stores to the same line are squashed from it. We will show in Section 4.4 that a small LSQ cache is especially effective in the case of WAW dependences. The LSQ cache is similar to the write cache proposed in [12], except it contains entire cache lines as opposed to 8 byte quantities and it buffers both load allocated and store allocated lines. Also note that since issuing stores is generally not as time critical as issuing loads (because the stores can be buffered at commit) we serialize the lookup in the ....

....error detection over 64 data bits as provided by 64 bit SEC DED. If an error is detected in the L1 data cache via parity, the correct value is fetched from the ECC L2 cache. Of course, a major caveat of this approach is the additional bus traffic generated by implementing a store through L1 cache [12]. This traffic can be reduced with techniques like aggressive write combining and other buffering techniques, but special care must be taken to handle the extra L1 to L2 bandwidth requirements. Weaker consistency models allow greater freedom for store combining than stricter models. In the case of ....

Norman P. Jouppi. Cache Write Policies and Performance. In Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993


Eliminating Useless Messages in Write-Update Protocols .. - Bianchini, LeBlanc.. (1994)   (3 citations)

....may have been part of a larger program that used these values for simulation statistics. Our analysis of updates uncovered the fact that the source code distributed in the SPLASH suite contains this useless code. 5.2 Merging Updates with Coalescing Write Buffers A coalescing write buffer [Jouppi, 1993; Thacker et al. 1992] is simply a cache block wide buffer capable of merging writes to the same cache block. In the context of a WU protocol, this feature allows for a reduction in the number of updates propagated outside the processor. A coalescing write buffer is also useful for WI, since it ....

Norman P. Jouppi, "Cache Write Policies and Performance," In Proceedings of the 20th International Symposium on Computer Architecture, pages 191--201, May 1993.
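
Illustration (not from any of the cited papers): the context above describes a coalescing write buffer as a cache block wide buffer that merges writes to the same cache block. A minimal Python sketch of that merging behavior follows. The 32 byte block size, the FIFO eviction order, the capacity, and every name in the code are assumptions chosen for illustration only.

```python
from collections import OrderedDict

BLOCK_SIZE = 32  # bytes per cache block (assumed for illustration)


class CoalescingWriteBuffer:
    """Hypothetical model: one entry per cache block, writes to a block merge."""

    def __init__(self, capacity=4):
        self.capacity = capacity          # number of block-wide entries (assumed)
        self.entries = OrderedDict()      # block address -> {byte offset: value}
        self.flushed = []                 # messages sent to the next level

    def write(self, addr, value):
        block = addr - addr % BLOCK_SIZE  # align the address to its block
        offset = addr % BLOCK_SIZE
        if block not in self.entries:
            if len(self.entries) == self.capacity:
                self._evict_oldest()      # make room, FIFO order (assumed)
            self.entries[block] = {}
        # A later write to the same offset overwrites the earlier one,
        # so only the final value leaves the buffer.
        self.entries[block][offset] = value

    def _evict_oldest(self):
        block, data = self.entries.popitem(last=False)
        self.flushed.append((block, data))

    def drain(self):
        """Flush every remaining entry, oldest first."""
        while self.entries:
            self._evict_oldest()


buf = CoalescingWriteBuffer(capacity=2)
for addr, value in [(0, 1), (4, 2), (0, 3), (32, 4)]:
    buf.write(addr, value)
buf.drain()
print(len(buf.flushed))  # 2 messages for 4 writes
```

In this sketch the four writes touch only two blocks, so two messages leave the buffer instead of four; in a write-update protocol that would correspond to two update messages.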


Improving Context-Based Load Value Prediction - Burtscher (2000)

....done to address the load latency problem. As opposed to load instructions, latency is not an issue with store instructions because their (slow) memory access takes place after the execution of the store, i.e. the CPU can proceed without having to wait for the store to complete. Write buffers [Jou93] perform the actual store operation at some later time and make sure that consistency is maintained. Unfortunately, nothing similar can be done for load instructions because the fetched values are often almost instantly needed by the immediately following instructions. These instructions cannot ....

N. P. Jouppi. "Cache write policies and performance". Computer Architecture News, Proceedings of ISCA '20, 191-201. May, 1993.


A Microprocessor Survey Course: Exploring Advanced Computer.. - Skadron (2000)

....of delay slots, since they create difficulties for multi issue architectures. The UltraSPARC III's novel treatment of writes. Unlike the other processors in our survey, the UltraSPARC III uses a write through first level cache with an unusual write buffer organization called a write cache [11]. It can be read directly [20] rather than requiring some kind of flush when needed data still resides in the buffer. The write cache is unusually large and associative compared to a conventional write buffer, and follows an LRU rather than a FIFO writeback policy. Although we have not yet ....

N. P. Jouppi. Cache Write Policies and Performance. In Proc. of the 20th International Symposium on Computer Architecture, pages 191--201, May 1993.


Do Object-Oriented Languages Need Special Hardware Support? - Hölzle, Ungar (1995)   (6 citations)

.... et al. [DTM94] who have measured allocation-intensive ML programs and found very low data cache overheads for the same cache organization (write allocate, subblock placement). Similar results have also been reported by Reinhold for Scheme programs [Rei93], by Jouppi for the SPEC benchmark suite [Jou93], and by Koopman et al. for combinator graph reduction [KLS92]. Such low data cache overheads leave little room for improvement through special cache features (e.g. [PS89, WW90]). Figure 6. Instruction cache miss ratios of SELF programs (direct mapped cache, 32 byte lines) ....

....small caches, and thus there is little need for special object oriented caches. Write allocate caches with subblock placement reduce read miss ratios by up to a factor of two, and write miss ratios by a factor of ten. These findings are consistent with other work for non object oriented languages [KLS92, Jou93, Rei93, DTM94]. Instruction cache size significantly impacts performance. For example, doubling the instruction cache from 32K to 64K improves performance by 15% on a SPARCstation 2. This improvement is higher than that of any OO specific architectural feature we considered. Our results contradict (and, we ....

Norm Jouppi. Cache Write Policies and Performance. In ISCA'20 Conference Proceedings, pp. 191-201, San Diego, CA, 1993. Published as Computer Architecture News 21(2), May 1993.


Annotated Memory References: A Mechanism for Informed Cache .. - Alvin Lebeck David (1998)

....the above information on temporal locality, we augment a conventional cache organization with support for uncached accesses and with a 32 entry fully associative auxiliary buffer that uses LRU replacement. We assume a base configuration consisting of a 16KB, direct mapped, write around cache [9] with 32 byte blocks. The cache can have up to 8 outstanding misses, and we assume an 8 word write buffer with a high water mark of 5 [17]. Default cache fills and auxiliary buffer fills are satisfied in a minimum of 24 cycles, 20 for the request and 1 per word of the cache block, uncached ....

Norman P. Jouppi. Cache Write Policies and Performance. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 191--201, May 1993.


Architectural Support for Compiler-Generated Data-Parallel Programs - Klaiber (1994)   (1 citation)

....first acquires an exclusive copy of the cache line by invalidating all other copies. In the write update protocol several nodes may hold writable copies; whenever a node writes (part of) a cache line, the changed words are forwarded to all other nodes holding a copy. We assume that a write cache [Jouppi 93] is used, i.e. successive writes to the same cache line generate a single forwarding message. For both the write invalidate and write update versions, we assume that data is always fetched in cache line units. 5.2 Implementation of C Communication Primitives As noted before, the C ....

N. P. Jouppi. Cache write policies and performance. In Proceedings of 20th International Symposium on Computer Architecture, pages 191--201, 1993.


Cache Write Generate For High-Performance Processing - Wittenbrink, Somani, Chen

....Hennessy and Patterson [7] and Stone [17]. Caches differ in their size, control, and organization. Effects of organization, such as the associativity, can be easily investigated by trace analysis [7], [15], [17]. When memory read performance is decoupled from write performance, traces are adequate [9]. In multiprocessor design write performance also becomes significant. This is because when caches are large enough reads are very efficient, and writes constitute a larger percentage of the bus traffic. Writes have more variability and are differentiated in their control on hits and misses. On ....

....not the periodic congestion of write back [1]. Write updates without fetches are also provided in the RP3 processor memory element by word valid bits, a scheme similar to the Motorola 68030 aligned long word validation. Trace analysis of allocate and valid bit per word schemes was done by Jouppi [9] for a variety of cache sizes in a uniprocessor. Valid bits per word in MC68030, IBM801, and in Jouppi [9] are expensive in hardware. Moreover, following reads in the same line and cache flush operations are complicated. Smith [15] mentions alternatives for handling writes but relies primarily on ....

[Article contains additional citation context not shown here]

N. P. Jouppi, "Cache Write Policies and Performance," in 20th Annual International Symposium on Computer Architecture, San Diego, CA May 1993, pp. 191-201.


Hardware Techniques To Improve The Performance Of The.. - Burger (1998)   (10 citations)

....If the next reference to the loaded block is of lower priority (i.e. will be read farther in the future) than any block in the cache, the block bypasses the cache rather than evicting something of higher priority. Write policy: the traffic optimal write policy is write back, write validate [70]. A writeback policy will always produce less memory traffic than write through for caches that have one word blocks. A write validate policy overwrites the contents of a block and the block's associated tag, rather than fetching the block from memory and then overwriting the word, as in a ....

Norman P. Jouppi. Cache Write Policies and Performance. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 191--201, May 1993.


Cache Based Fault Recovery for Distributed Systems - Mendelson, Suri (1997)   (1 citation)

....processor and the main memory. We refer to this partitioned entire cache as the bi directional cache. It has been shown that the bi directional cache causes lower write traffic rates than the write through cache, and furthermore displays better performance features than even the write back cache [4, 13, 9]. The bi directional cache (Fig. 2) consists of two subsystems distinguished by their functions, namely: a read subsystem and a write subsystem. The read subsystem handles fetching information from the main memory and the write subsystem controls the updating of the main memory. We use a ....

....write subsystem controls the updating of the main memory. We use a write through cache as a read subsystem and disconnect its write signals. The write subsystem includes a small write back cache that uses an allocate on a write miss and a no allocate on a read miss policy. Based on earlier work, [9, 13], we locate write cache between the processor and the write buffer. This configuration allows us to add the write subsystem to existing cache based systems without the need for processor or cache enhancements. (Figure labels: CPU, cache, write cache, write buffer, memory bus) ....

[Article contains additional citation context not shown here]

Norman P. Jouppi. Cache write policies and performance. Proc. ISCA, 1993.


Fine-Grain Producer-Initiated Communication in Cache-Coherent.. - Abdel-Shafi (1997)

....that supports sequential consistency. 2.3.1 Processor Cache Subsystem We describe the necessary changes at the processor and cache hierarchy to support remote writes. To improve the performance of remote writes, we propose supporting a coalescing (or merging) line write buffer [DS95, E 95, Jou93] located between the L1 and L2 caches. Such a buffer can merge multiple remote writes to different words in the same cache line, making them appear as a single remote write message to the rest of the memory system. Dirty bits per word in the buffer indicate which words in the line have been ....

Norman P. Jouppi. Cache Write Policies and Performance. In Proceedings 20th Annual International Symposium on Computer Architecture, pages 191--201, May 1993.


Evaluating the Impact of Coherence Protocols on Parallel.. - Costa, Bianchini, Dutra (1996)   (1 citation)

....scalability than the other protocols, and performs acceptably for small numbers of processors. 5 Coalescing of Updates for Andorra I An alternative approach for reducing update traffic that has been successful for scientific applications [11] is coalescing of update messages. A coalescing buffer [17] is simply a cache block wide buffer capable of merging writes to the same cache block. In the context of a WU protocol, this feature reduces the number of update messages propagated outside the processor. We repeated the same experiments we performed for the WU and WUh2 protocols using a ....

N. P. Jouppi. Cache write policies and performance. In Proceedings of the 20th International Symposium on Computer Architecture, pages 191--201, May 1993.


Complexity/Performance Tradeoffs with Non-Blocking Loads - Farkas, Jouppi (1994)   (6 citations)  Self-citation (Jouppi)

....line into which the data is to be stored. The second method is to use write policies other than fetch on write, such as write around, which neither fetch data on a write miss nor write the new data into the cache; instead the data is written directly to the next lower level in the memory hierarchy [6]. Both of these methods do not require very complex hardware and are becoming common in microprocessors. To allow the processor to continue to access the data cache during the processing of a nonblocking load miss, a lockup free cache [7] is required. Non blocking loads have only recently appeared ....

Norman P. Jouppi. Cache Write Policies and Performance. In The 20th Intl. Symp. on Computer Architecture, pages 191-201. May, 1993.


Store Memory-Level Parallelism Optimizations for Commercial .. - Yuan Chou Lawrence (2005)

No context found.

N. Jouppi, "Cache Write Policies and Performance", Intl. Symp. on Computer Architecture, pp. 191-201, 1993.


Minerva: An Adaptive Subblock Coherence Protocol for Improved .. - Rothman, Smith

No context found.

Norman P. Jouppi. Cache Write Policies and Performance. In Proc. 20th Annual International Symposium on Computer Architecture, pages 191-201, San Diego, California, May 16-19 1993.


How Multithreading Addresses the Memory Wall - Philip Machanick School

No context found.

Jouppi, N. P. (1993). Cache write policies and performance. In Proc. 20th annual Int. Symp. on Computer Architecture, pages 191--201, San Diego, California, United States.


Avoiding Store Misses to Fully Modified Cache Blocks - Unknown

No context found.

N. Jouppi, "Cache Write Policies and Performance", in ACM SIGARCH Computer Architecture News, V.21, No.2, May 1993, pp. 191-201.


Cost Performance Optimizations of Microprocessors - Fu (2001)   (1 citation)

No context found.

N. Jouppi, "Cache write policies and performance," in Proceedings of 20th International Symposium on Computer Architecture, 1993, pp. 191-95.


L1 Cache and TLB Enhancements to the RAMpage Memory Hierarchy - Machanick, Patel

No context found.

Norman P. Jouppi. Cache write policies and performance. In Proc. 20th annual Int. Symp. on Computer Architecture, pages 191--201, San Diego, CA, 1993.


the Garbage Collection Bibliography - Richard Jones (2003)

No context found.

Norman P. Jouppi. Cache write policies and performance. In ISCA [ISCA1993], pages 191--201.


Caches As Filters: A Framework for the Analysis of Caching Systems - Weikle (2001)   (4 citations)

No context found.

JOU93 N. Jouppi, "Cache Write Policies and Performance," In Proceedings of the Twentieth Annual International Symposium on Computer Architecture (ISCA-20), May 1993.

CiteSeer.IST - Copyright Penn State and NEC