14 citations found.
Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 51--61, October 1992. Also available as U. Washington CS TR 92-06-03.


This paper is cited in the following contexts:
Dead-Block Prediction & Dead-Block Correlating Prefetchers - Lai et al. (2001)   (4 citations)

....Many architects have additionally relied on the prefetch memory access model to mitigate the shortcomings of the demand fetch model. Prefetching helps fetch data in advance to hide the memory latency by predicting future memory requests. While prefetching can be initiated in either hardware [17,5,10,3,15,4,6] or software [9,8,14,12], many researchers and vendors opt for hardware implementations for transparency and due to availability of runtime information which can significantly improve prefetching's effectiveness. Most previous proposals for hardware prefetchers target specific memory access ....

....researchers and vendors opt for hardware implementations for transparency and due to availability of runtime information which can significantly improve prefetching's effectiveness. Most previous proposals for hardware prefetchers target specific memory access patterns such as strided accesses [15,4,6] and accesses to linked data structures [17]. While effective for the targeted access patterns, these prefetchers have limited general applicability across a wide spectrum of applications. There are a number of prefetcher proposals in the literature that target generalized memory access patterns ....

[Article contains additional citation context not shown here]

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 51--61, October 1992. Also available as U. Washington CS TR 92-06-03.


How Useful Are Non-blocking Loads, Stream Buffers, and.. - Farkas, Jouppi, Chow (1994)   (8 citations)

....might be useful. Software prefetching has been most successful on numeric codes, while hardware prefetching can be used with all types of applications (including the operating system). Examples of hardware prefetch techniques include Chen and Baer's lookahead PC reference prediction method [3] and stream buffers [4, 5]. We study stream buffers as we believe them to be simpler and less invasive than the lookahead scheme. The lookahead scheme is complicated by the need for additional ports into the data cache tags when used in a superscalar processor. In such a processor, the cache tag ....

....loads with non-blocking caches using many of the Livermore loop kernels; they compared the results to blocking caches. Sohi and Franklin, on the other hand, studied non-blocking loads [8], while Callahan and Mowry have studied software prefetching for scientific codes [1, 2]. Chen and Baer [3] investigated a combination of non-blocking loads and prefetching, but used a lookahead PC reference prediction method with a production compiler, instrumented with Pixie, and rescheduled only at a basic block level. Comparing the effectiveness of the different techniques used in these studies is ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. Proceedings of the 5th ASPLOS Conference, pages 51--61, 1992.


Compiling Applications with the KarHPFn Compiler - Müller (2000)

....processors. Karpfen (without h) is the German word for carp. 1 Prefetching is not new. Previous research addresses it in the context of prefetching cache lines, non-blocking loads, scheduling techniques, and speculative execution on uniprocessors or small-scale cache-coherent multiprocessors [CB92, RL92, MLG92, CKP91, GGV90]. Prefetching is also used by software distributed shared memory systems to prefetch whole memory pages [LCD 97, BPA98]. But little is known about the effects of latency hiding applied to communication networks in massively parallel computers with distributed ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, Boston, Massachusetts, October 1992. Also available as U. Washington CS TR 92-06-03.


Complexity/Performance Tradeoffs with Non-Blocking Loads - Farkas, Jouppi (1994)   (6 citations)

....complexity required to implement non-blocking loads. Yet studies of non-blocking loads have often assumed very unrestricted models. For example, Sohi and Franklin [12] assumed an 8-way banked cache where each bank could support four outstanding fetches and several times more misses. Other studies [2, 5, 11] generally have used unrestricted models while focusing on other aspects of system performance. We investigate the performance obtainable from a number of practical non-blocking load implementations and evaluate the performance obtained in the context of the hardware complexity required. Key to ....

Tien-Fu Chen and Jean-Loup Baer. Reducing Memory Latency via Non-blocking and Prefetching Caches. In Fifth ASPLOS Conference, pages 51-61. October, 1992.


Hardware Techniques To Improve The Performance Of The.. - Burger (1998)   (10 citations)

....two major classes of techniques for reducing the impact of long memory latencies: latency reduction and latency tolerance. Latency reduction decreases the time between the issue of a memory request and the return of the needed operand. Some latency reduction techniques include hardware prefetching [21, 43, 47] (which speculatively bring in data before they are requested), increased cache block size, larger caches (improved hit ratio), and more aggressive memory hierarchies (e.g. faster buses, sub-banked caches, and lower latency DRAM cores). Latency tolerance involves performing other computation ....

Tien-Fu Chen and Jean-Loup Baer. Reducing Memory Latency via Non-blocking and Prefetching Caches. In Proceedings of the Fifth Symposium on Architectural Support for Programming Languages and Operating Systems, pages 51--61, October 1992.


The Declining Effectiveness of Dynamic Caching for.. - Douglas Burger James (1995)   (27 citations)

....16] The overheads of hardware prefetching are the cost for the additional hardware, and the limited ability of the dynamic units to perform any prefetching other than through arrays with linear strides. A different form of hardware prefetching consists of stream buffers [27, 36]. Chen and Baer [8] evaluated the effectiveness of lockup-free caches and hardware prefetching, and proposed a hybrid scheme based on a combination of these approaches. Software prefetching is much more flexible than hardware prefetching, having the advantage of compile time knowledge, but pays the price of ....

Tien-Fu Chen and Jean-Loup Baer. Reducing Memory Latency via Non-blocking and Prefetching Caches. In Proceedings of the Fifth Symposium on Architectural Support for Programming Languages and Operating Systems, pages 51--61, October 1992.


Thesis Proposal: Latency Tolerant Architectures - James Bennett

....dynamic scheduling. Much recent work has been done using prefetching to tolerate memory latency. Prefetching techniques range from pure hardware-based approaches (such as a stride prediction table [FPJ92]) to pure software-based approaches [MLG92]. Various hybrid approaches have also been suggested [CB92]. Generally, both hardware and software prefetching have been shown to be effective when memory accesses by the application are regular, as in some scientific applications. In addition, the use of dynamic scheduling to hide memory latency has been studied in [GGH92]. 3 Approach to the Problem ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In SIGPLAN Notices, pages 51--61, September 1992.


A Framework for Qualitative Performance Prediction - Hsu, Kremer (1998)

....memory accesses are usually estimated in terms of the total amount of cache misses or its miss rate [HP96]. Recent advances introduces the non-blocking cache [Kro81, SD91] as a way to reduce the penalty associated with each cache miss. While the performance of the non-blocking cache is studied [CB92, FJ94, Bal94, WO95], we find in the experiments that the new model accounting for the effect of non blockness of the cache is needed. In the report, a simple model based on the traffic pressure is presented and demonstrated to be better suited than the conventional basis of cache miss counts. It ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, October 1992. Cs-TR-92-06-03.


Comparing Static and Dynamic Scheduling on Superscalar Processors - Lo (1995)

....of load instructions. Each level of the memory hierarchy introduces a new set of cache hit and miss latencies. Unfortunately, the compiler is unable to determine detailed cache behavior at compile time. Although software solutions, such as balanced scheduling [KE93] and software prefetching [MLG92, CB92] try to address this issue, by making more intelligent static scheduling decisions, they have not yet eliminated this problem. The memory system also inhibits the effectiveness of a static scheduler in a second way, memory aliasing. Chang et al. [CMCWH91] have found that when loads are allowed to ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, Boston, Massachusetts, 1992.


Relaxed Consistency and Synchronization in Parallel Processors - Zucker (1992)   (3 citations)

....scheduling of the code will be impractical most of the time). For a program like Relax a compiler can produce the same optimizations which I produced by hand [26]. However, Relax is a very regular program, i.e. one which is relatively easy to analyze. Although there is some research in this area [32], there is clearly a need for more. Two Cycle Load and Branch Delays Due to limitations of the simulator, my initial studies used a load and branch delay of four cycles. It can be argued that this is overly long (although superpipelined machines may have long delays as well). I duplicated the ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, 1992.


Latency Hiding in Parallel Systems: A Quantitative Approach - Warschko et al. (1994)

....factored in. If, however, the parallel machine is capable of performing communication and computation concurrently, then the loss in efficiency can be reduced by overlapping communication and computation. The basic concept of hiding latency can be used with a great variety of policies [GHG 91, CB92, RL92, CKP91, GGV90]. Little is known about the effects of latency hiding applied to communication networks in massively parallel computers with distributed memory. This paper reports on simulation experiments that quantify the effects of latency hiding on real programs, namely parallel versions ....

....another level in the memory hierarchy with an extremely large access time. There is a large body of related work discussing various techniques, such as prefetching cache lines, non-blocking loads, scheduling techniques, and speculative executions on uniprocessors or small-scale multiprocessors [CB92, RL92, MLG92, CKP91, GGV90]. The basic idea of latency hiding by thread switching is to deactivate a thread stalled by communication and to switch to another thread. When the communication operation has finished, the system can continue processing the original thread. The cause for a stall can ....

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In The 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, October 1992.


Exploring Cache Performance in Multithreaded Processors - Lioupis, Milios

....Caches go a long way to reduce the effect of this speed difference, but as it keeps growing larger (up to 30 cycles) the miss penalty becomes a major issue, since it introduces latency of several tens of clock cycles. Multithreading can be used in a single processor system with non-blocking cache [25] to hide this latency, thus increasing system performance. Using multithreading in a single processor system, however, introduces changes in the memory referencing pattern, which affects the locality of reference, thus reducing the effectiveness of the cache [12]. In this paper we examine the ....

Tien-Fu Chen, J-L. Baer, "Reducing Memory Latency via Non-blocking and Prefetching Caches" Tech. Rep. 92-06-03, Dept. of Computer Science & Engineering, Univ. of Washington, June 1992.


Data Prefetching: A Cost/Performance Analysis - Metcalf (1993)   (1 citation)

....unlike a compiler; and even when it can look ahead, on some problems it will be bounded by the size of hardware tables as to how much prefetching it can perform. Additionally, hardware prefetching is expensive in terms of design time and chip area. 5.1 Architectural Model Chen and Baer [2, 6, 5] propose what is essentially a vector stride prefetching scheme, and spell out the details of how to identify vector strides in cache. They break down accesses for each instruction into four categories, as shown in Table 2; we will assume for the sake of example that the instruction is nested in ....

....table misprediction, we flush the buffered preload requests. Similarly, note that we need to check the ORL for ordinary reads and writes as well for prefetches. 5.2 Results Chen and Baer use a trace-driven simulation, using pixie on a DECstation 5000 to trace a set of benchmarks from SPEC (in [6]) and Perfect Club and a few others (in [2]). Cache warm start was simulated by ignoring the first 500,000 references. They use direct-mapped, 16 byte line caches for all the various caches and tables. Various different memory models are used: nonoverlapped (10 cycles line) overlapped (20 ....

[Article contains additional citation context not shown here]

Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via nonblocking and prefetching caches. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, October 1992. Also available as U. Washington CS TR 92-06-03.


Virtual Memory on Data Diffusion Architectures - Buenabad-Chávez (1998)

No context found.

Tien-Fu Chen and Jean-Loup Baer. Reducing Memory Latency via Non-blocking and Prefetching Caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-V, pages 51--61, 1992.

