E. H. Gornish, E. D. Granston, and A. V. Veidenbaum, "Compiler-directed data prefetching in multiprocessors with memory hierarchies," Proceedings of the International Conference on Supercomputing, pp. 354-368, 1990.
....intuitively see that there is a significant amount of idle time in the blocking code due to synchronization and communication. Non-blocking reduces this idle time using overlapping techniques and looser synchronization. The effect that we are obtaining is somewhat similar to block prefetching [26] in shared-memory architectures. The important difference here is that blocks are obtained from irregularly distributed data arrays and cannot be determined at compile time. Therefore, we need to combine the ideas of runtime compilation and block prefetching. 3.5 PILAR Implementation PILAR ....
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. Technical Report 996, CSRD, May 1990.
....by the effectiveness of its compiler support. The CCDP scheme relies on the compiler to identify potentially stale and nonstale data references, and to generate and schedule the appropriate prefetch operations. Several compiler techniques have been developed for software initiated data prefetching [2, 12, 13, 23, 24]. However, as these data prefetching schemes are used solely for memory latency hiding, the data prefetching operations are determined based on data locality considerations alone. Since these techniques do not distinguish between potentially stale and nonstale references, they cannot be applied ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings of the 1990.
....into the final Fortran 90 code replacing any calls to communication libraries. Prefetching is not new; most research addresses it in the context of prefetching cache lines, non-blocking loads, scheduling techniques, and speculative execution on uniprocessors or small-scale multiprocessors [5, 11, 9, 4, 6]. Prefetching is also used by software distributed shared memory systems to prefetch whole memory pages [8, 1]. But little is known about the effects of latency hiding applied to communication networks in massively ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Fourth International Conference on Supercomputing, pages 354--368, Amsterdam, June 1990.
....it is a one-way communication (producer forwards to all consumers), whereas prefetching is a two-way communication (consumers request data and producers respond). An important issue in prefetch forwarding is to determine the optimal prefetch point. Prefetch techniques have been studied previously [10], albeit not in the context of OpenMP applications for Software DSM systems. We expect prefetch forwarding .... [Figure: Normalized Execution Times of Serial and Parallel Regions]
E. H. Gornish, E. D. Granston, and A. V. Veidenbaum. Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies. Proceedings of ICS'90, Amsterdam, The Netherlands, 1:342--353, June 1990.
....been proposed. Coherent caches [3, 4, 18, 30] allow shared read-write data to be cached and significantly reduce the memory latency seen by the processors. Relaxed memory consistency models [1, 5, 8] hide latency by allowing buffering and pipelining of memory references. Prefetching techniques [11, 16, 21, 23] hide the latency by bringing data close to the processor before it is actually needed. Multiple contexts [3, 12, 13, 26, 29] allow a processor to hide latency by switching from one context to another when a high-latency operation is encountered. Our primary objective in this paper is to ....
....when a high latency operation is encountered. Our primary objective in this paper is to characterize the benefits and costs of these four latency hiding techniques in a systematic and consistent manner. Although one can find papers that focus on the performance of the individual techniques [7, 11, 29], it is not possible to use these papers to perform a comparative evaluation, since the benchmark programs differ, or the architectural assumptions differ, or both. We believe that a consistent comparative evaluation is essential to understanding the tradeoffs in the use of the different ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Int. Conf. Supercomputing, pages 354--368, 1990.
....processor. Prefetching in DASH is non-binding in the sense that prefetched data remains visible to the cache coherence protocol [18] to keep it consistent until the processor actually reads the value through a binding access (e.g. a register load operation). In contrast, with binding prefetching [9, 14], the value of a later reference is bound (e.g. a processor register is loaded) at the time the prefetch completes. As a result, there are restrictions placed on when a binding prefetch can be issued, since the prefetched value may become stale if another processor modifies the same location. For ....
.... controlled by the hardware, for example through instruction look-ahead in [16], software control allows the prefetching to be done selectively (thus reducing overhead) and extends the possible interval between the issue of prefetch and the actual use of that data (thus increasing effectiveness) [9, 21]. The disadvantage, of course, is that programmer or software intervention is required. The benefits due to prefetching come from several sources. The most obvious benefit occurs when a prefetch is issued early enough that the line is already in the cache by the time it is referenced. However, ....
[Article contains additional citation context not shown here]
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In International Conference on Supercomputing, 1990.
....of their time stalled for memory accesses. 1.2 Memory Hierarchy Optimizations Various hardware and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special prefetch instruction. The software uses this instruction to inform the hardware of its intent to use a particular data item; if the data is not currently in the cache, the ....
....into effective scientific engines. 1.3 An Overview This paper proposes a compiler algorithm to insert prefetch instructions in scientific code. In particular, we focus on those numerical algorithms that operate on dense matrices. Various algorithms have previously been proposed for this problem [13, 16, 23]. In this work, we improve upon previous algorithms and evaluate our algorithm in the context of a full optimizing compiler. We also study the interaction of prefetching with other data locality optimizations such as cache blocking. There are a few important concepts useful for developing ....
[Article contains additional citation context not shown here]
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In International Conference on Supercomputing, 1990.
....[10, 5, 1, 9]. In software prefetching, it is the programmer or compiler who is responsible for deciding when and what is going to be brought to the cache or to a register. Most research on software prefetching has been devoted to regular access patterns such as those found in numerical applications [7, 2, 11, 8], but lately there has also been research that tries to detect and prefetch recursive data structures [13, 17], which appear in non-numerical applications. Software prefetching can be classified as non-binding or binding, depending on whether the data is brought to L1 or to the register file. ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. International Conference on Supercomputing, June 1990.
....load instructions) and then writes the data to a pre-allocated portion of cache. Meadows describes a similar scheme for the PGI i860 compiler [Mea92] and Loshin and Budge give a general description of the technique [Los92]. Traditional caching and cache-based software prefetching techniques [Cal91, Che92, Gor90, Kla91] may also be considered schemes. The compiler detects streams (if stream detection is performed at all), the compiler determines the order of the memory accesses (stream elements are generally accessed a cache line at a time), and the compiler decides where in the instruction stream the accesses ....
E.H. Gornish, E.D. Granston, and A.V. Veidenbaum, "Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies", Proceedings of the ACM/IEEE International Conference on Supercomputing, pages 354-368, June 1990.
....h) is the German word for carp. 1 Prefetching is not new. Previous research addresses it in the context of prefetching cache lines, non-blocking loads, scheduling techniques, and speculative execution on uniprocessors or small-scale cache-coherent multiprocessors [CB92, RL92, MLG92, CKP91, GGV90]. Prefetching is also used by software distributed shared memory systems to prefetch whole memory pages [LCD 97, BPA98]. But little is known about the effects of latency hiding applied to communication networks in massively parallel computers with distributed memory. And to our knowledge, ....
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings 1990 International Conference on Supercomputing, pages 354--368, Amsterdam, June 11--15 1990.
....with distributed memory. Prefetching is the method considered in this paper. Prefetching is not new, but most research addresses it in the context of prefetching cache lines, non-blocking loads, scheduling techniques, and speculative execution on uniprocessors or small-scale multiprocessors [3, 13, 11, 2, 6]. It is also used by software distributed shared memory systems to prefetch whole memory pages ....
Gornish, E., Granston, E., and Veidenbaum, A. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings 1990 International Conference on Supercomputing (1990), pp. 354--368.
....processor to refetch the most recent value. Unlike binding prefetch, a non-binding prefetch does not change the program semantics and thus can be freely inserted by the compiler, affecting only program performance. Several studies have since documented the efficacy of non-binding prefetch [GGV90, CKP91, MLA92]. Thirdly, DASH introduced a remote access cache (RAC) to allow remote accesses to be combined and buffered within the individual nodes. This cache, also called a cluster cache, stored remote data that was recently accessed; if the data was requested by a processor in node, it could be retrieved ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In International Conference on Supercomputing, 1990.
....cache line. That is, while they cluster the cache misses to fit into the same superscalar instruction window, we perform static scheduling to hide the latencies. Another approach to improve the cache hit ratio used in general-purpose processors is data prefetching. Software prefetching [2], [7], [17] inserts prefetch instructions into the code, to bring data into the cache early, and improve the probability it will result in a hit. Hardware prefetching [14], [20] uses hardware stream buffers to feed the cache with data from the main memory. On a cache miss, the prefetch buffers provide ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In ICS, 1990.
....while a memory request is being serviced, so that the memory latency for that request is partially or completely hidden. Some common latency tolerance techniques include software prefetching [18, 22, 31, 54, 124], dynamic scheduling [123] (allowing instructions ahead of a load in the dynamic instruction stream to execute), decoupling [108, 109] (allowing the memory unit to run (or slip) ahead of the execute unit), and multithreading [1, 107, 125] (switching to other threads during long latency ....
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In Proceedings of the 1990 International Conference on Supercomputing, pages 354--368, June 1990.
....proposes techniques that prefetch information into the cache at run time based on compile-time information and partition iteration space, thereby circumventing the cache size constraint problem. A lot of work has been done on hardware-based [4-6] and software-directed prefetching techniques [1, 2, 10-12]. Hardware-based prefetching, requiring some support unit connected to the cache, detects accesses with regular ....
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proc. 1990 Intl. Conf. on Supercomputing, pages 354-368, 1990.
....loop distribution and tiling (page indexing). It should be emphasized that in principle, our file layout determination scheme can be applied for optimizing the performance of the VM as well (by changing tile sizes to take the page size into account). Recently Malkawi et al. [22] and Gornish et al. [12] have proposed compiler-based techniques to obtain good performance from memory hierarchy. In [9] the functionality of a ViC, a compiler-like preprocessor for out-of-core C, is described. Several compiler methods for out-of-core HPF programs are presented in [30] and [4]. In [27] compiler ....
E. Gornish, E. Granston, and A. V. Veidenbaum. Compiler Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In Proc. ACM Int'l Conf. on Supercomputing, pp 354--368, Amsterdam, The Netherlands, 1990.
....to rely on a memory hierarchy to solve the memory access problem. However, the cost of cache misses is becoming prohibitive. RAMBUS and synchronous DRAMs utilize a form of on-chip caching and even more aggressive approaches have been proposed [3]. Latency hiding techniques, primarily prefetching [6, 7, 8], also have been utilized to help solve the problem. Projected advances in VLSI and packaging technology are expected to make the problem much worse in the near future. The number of gates on a chip is projected to reach 20M in 10 years, while a DRAM can contain 2GBytes of data. This increase ....
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In International Conference on Supercomputing, pp. 354-368, 1990.
....For multiprocessor systems without cache coherence hardware, such as BBN Butterfly [25], IBM RP3 [24], IBM SP 2 [27], and Cray T3D [18], a software-controlled cache coherence management is the only viable solution. Data prefetching has been proposed as a mechanism to hide a large memory latency [3, 12, 14, 15, 21, 23, 26]. In data prefetching, future data accesses are predicted, and data is moved to the upper levels of a memory hierarchy before it is referenced, thereby hiding memory access latency. Both hardware and software prefetching methods have been proposed. Several of the above references demonstrated the ....
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In International Conference on Supercomputing, pages 354--368, June 1990.
....future data accesses and move the data to the upper levels of the memory hierarchy before it is referenced. This either eliminates or reduces the memory latency. Both hardware and software prefetching approaches have been studied extensively. Software prefetching, which has been investigated in [3, 9, 13, 15], uses a compiler to analyze memory references and insert prefetch instructions accordingly. The execution of these instructions initiates data prefetching, and as a result, the data is fetched into caches or prefetch buffers. Software prefetching can, in theory, prefetch data with any access ....
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In International Conference on Supercomputing, pages 354--368, June 1990.
No context found.
E. H. Gornish, E. D. Granston, and A. V. Veidenbaum, "Compiler-directed data prefetching in multiprocessors with memory hierarchies," Proceedings of the International Conference on Supercomputing, pp. 354-368, 1990.
No context found.
E. Gornish, E. Granston, A. Veidenbaum. "Compiler-directed data prefetching in multiprocessors with memory hierarchies". Proc. ICS-90, 1990: 354-368.
No context found.
Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum, "Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies," Proceedings of ICS'90, Amsterdam, The Netherlands, vol. 1, pp. 342--353, June 1990.
No context found.
E. H. Gornish, E. D. Granston, and A. V. Veidenbaum, "Compiler-directed data prefetching in multiprocessors with memory hierarchies," in Proceedings of the Fourth ACM International Conference on Supercomputing, Amsterdam, The Netherlands, May 1990, pp. 354--368.
No context found.
E.H. Gornish, E.D. Granston and A.V. Veidenbaum, "Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies," Proc. 1990.
No context found.
Gornish, E.H., E.D. Granston and A.V. Veidenbaum, "Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies," Proc. International Conference on Supercomputing, Amsterdam, Netherlands, June 1990, p. 354-68.