Design and Evaluation of a Compiler Algorithm for Prefetching (1992)
Download Links
- [www.eecg.toronto.edu]
- [www-2.cs.cmu.edu]
- [www.cs.cmu.edu]
- [www.ece.cmu.edu]
- [www.cs.uiuc.edu]
Other Repositories/Bibliography
- DBLP
Venue: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
Citations: 501 (20 self)
Citations
804 | A data locality optimizing algorithm
- Wolf, Lam
- 1991
Citation Context: ...ices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. Other useful transformations include unimodular loop transforms such as interchange, skewing and reversal [29]. Since these optimizations improve the code's data locality, they not only reduce the effective memory access time but also reduce the memory bandwidth requirement. Memory hierarchy optimizations suc...
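The blocking idea referred to in this context can be illustrated with a short sketch; the matrix size N, tile size B, and the transpose kernel below are illustrative choices, not values or code from the paper:

```c
/* Illustrative sketch only: blocking (tiling) a matrix traversal so that the
 * data touched by the inner loops fits in the cache and is reused.
 * N and B are assumed example values, not parameters from the paper. */
#define N 1024
#define B 32            /* block (tile) size, chosen to fit in the cache */

void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int ii = 0; ii < N; ii += B)
            /* the two inner loops touch only a B x B tile of each matrix */
            for (int j = jj; j < jj + B; j++)
                for (int i = ii; i < ii + B; i++)
                    dst[j][i] = src[i][j];
}
```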
759 | SPLASH: Stanford parallel applications for shared-memory
- Singh, Weber, et al.
- 1992
Citation Context: ...for a collection of scientific programs drawn from several benchmark suites. This collection includes NASA7 and TOMCATV from the SPEC benchmarks [27], OCEAN, a uniprocessor version of a SPLASH benchmark [25], and CG (conjugate gradient), EP (embarrassingly parallel), IS (integer sort), MG (multigrid) from the NAS Parallel Benchmarks [3]. Since the NASA7 benchmark really consists of 7 independent kernels, w...
694 | The NAS Parallel Benchmarks
- Bailey, Barszcz, et al.
- 1991
Citation Context: ...PEC benchmarks [27], OCEAN, a uniprocessor version of a SPLASH benchmark [25], and CG (conjugate gradient), EP (embarrassingly parallel), IS (integer sort), MG (multigrid) from the NAS Parallel Benchmarks [3]. Since the NASA7 benchmark really consists of 7 independent kernels, we study each kernel separately (MXM, CFFT2D, CHOLSKY, BTRIX, GMTRY, EMIT and VPENTA). The performance of the benchmarks was simul...
581 | Software Pipelining: An Effective Scheduling Technique for VLIW Machines - Lam - 1988 |
573 | The cache performance and optimizations of blocked algorithms
- Lam, Rothberg, et al.
- 1991
Citation Context: ...accessing data in the same matrix with a constant stride. Such conflicts can be predicted, and can even be avoided by embedding the matrix in a larger matrix with dimensions that are less problematic [19]. We have not implemented this optimization in our compiler. Since such interference can greatly disturb our simulation results, we manually changed the size of some of the matrices in the benchmarks...
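As a rough illustration of the padding workaround mentioned here (not an implementation from the paper; N and PAD are made-up values), a matrix can be embedded in a slightly wider array so that a column walk no longer maps repeatedly onto the same cache sets:

```c
/* Sketch of the padding idea: the logical N x N matrix occupies columns
 * 0..N-1 of a wider array, breaking the problematic power-of-two row stride.
 * N and PAD are illustrative values, not ones used in the paper. */
#define N   1024
#define PAD 8

static double a[N][N + PAD];

double column_sum(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++)     /* the stride is now (N + PAD) doubles,  */
        s += a[i][j];               /* so successive rows fall in different cache sets */
    return s;
}
```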
365 | Lockup-free instruction fetch/prefetch cache organization.
- Kroft
- 1981
Citation Context: ...e uses this instruction to inform the hardware of its intent to use a particular data item; if the data is not currently in the cache, the data is fetched in from memory. The cache must be lockup-free [17]; that is, the cache must allow multiple outstanding misses. While the memory services the data miss, the program can continue to execute as long as it does not need the requested data. While prefetch...
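A minimal sketch of this style of prefetching, using the GCC/Clang __builtin_prefetch intrinsic as a stand-in for the special prefetch instruction described above (the loop and the prefetch distance are illustrative, not taken from the paper):

```c
/* Minimal software-controlled prefetching sketch. PREFETCH_DIST is an assumed
 * prefetch distance; it must be large enough to cover the cache miss latency. */
#include <stddef.h>

#define PREFETCH_DIST 16   /* iterations ahead (illustrative) */

double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST]);  /* non-binding hint */
        s += a[i];   /* execution continues while the miss is serviced,
                        provided the cache is lockup-free */
    }
    return s;
}
```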
302 | Tolerating Latency Through Software-Controlled Data Prefetching.
- Mowry
- 1993
Citation Context: ...e and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special "prefetch" instruction. The software uses this instruction to inform the hardwa...
273 | Software Prefetching.
- Kennedy, Porterfield
- 1991
Citation Context: ...f the advantages of having implemented the prefetching schemes in the compiler is that we can quantify this instruction overhead. Previous studies have only been able to estimate instruction overhead [4]. Table 4 shows the number of instructions required to issue each ...
267 | Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. - Rau, Glaeser - 1981 |
255 | An effective on-chip preloading scheme to reduce data access penalty.
- Baer, Chen
- 1991
Citation Context: ...d scheme for prefetching in a multiprocessor where all shared data is uncacheable. He found that the effectiveness of the scheme was limited by branch prediction and by synchronization. Baer and Chen [2] proposed a scheme that uses a history buffer to detect strides. In their scheme, a "look ahead PC" speculatively walks through the program ahead of the normal PC using branch prediction. When the loo...
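A toy software model can make the history-buffer idea concrete; this is only a sketch of per-PC stride detection, not Baer and Chen's actual hardware design, and all sizes and field names are assumptions:

```c
/* Toy model of a reference prediction (history) table: for each load PC,
 * remember the last address and stride, and predict the next address once
 * the same stride has been observed twice in a row. */
#include <stdint.h>

#define TABLE_SIZE 256   /* assumed table size */

struct rpt_entry {
    uintptr_t pc;         /* load instruction address */
    uintptr_t last_addr;  /* last data address seen for this PC */
    intptr_t  stride;     /* last observed stride */
    int       confirmed;  /* stride seen at least twice in a row */
};

static struct rpt_entry rpt[TABLE_SIZE];

/* Returns a predicted prefetch address, or 0 if no stable stride yet. */
uintptr_t rpt_lookup(uintptr_t pc, uintptr_t addr) {
    struct rpt_entry *e = &rpt[(pc >> 2) % TABLE_SIZE];
    uintptr_t predicted = 0;
    if (e->pc == pc) {
        intptr_t stride = (intptr_t)(addr - e->last_addr);
        e->confirmed = (stride == e->stride);
        e->stride = stride;
        if (e->confirmed && stride != 0)
            predicted = addr + (uintptr_t)stride;   /* prefetch candidate */
    } else {
        e->pc = pc;          /* new PC: start tracking it */
        e->stride = 0;
        e->confirmed = 0;
    }
    e->last_addr = addr;
    return predicted;
}
```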
254 | Strategies for cache and local memory management by global program transformation.
- Gannon, Jalby, et al.
- 1988
Citation Context: ...y references, Gannon et al. observe that data reuse is exploitable only if the references are uniformly generated; that is, references whose array index expressions differ in at most the constant term [11]. For example, references B[3][0] and B[3+1][0] are uniformly generated; references C[s] and C[3] are not. Pairs of uniformly generated references can be analyzed in a similar fashion [29]....
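A small loop nest makes the distinction concrete; the arrays, index variables, and loop bounds below are illustrative, not taken from the cited work:

```c
/* In the first loop, B[j][0] and B[j+1][0] have index expressions that differ
 * only in the constant term, so they are uniformly generated: the value loaded
 * as B[j+1][0] in one iteration is reused as B[j][0] in the next.
 * A[i][j] and A[j][i] are not uniformly generated, so their reuse is much
 * harder for a compiler to exploit. */
void example(int n, double B[n][n], double A[n][n], double *out) {
    for (int j = 0; j + 1 < n; j++)
        out[j] = B[j][0] + B[j + 1][0];      /* group reuse across iterations */

    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += A[i][j] * A[j][i];          /* not uniformly generated */
    out[0] += s;
}
```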
200 | A VLIW Architecture for a Trace Scheduling Compiler
- Colwell
- 1988
Citation Context: ...or general-purpose code, their effectiveness for scientific applications has not. One manifestation of this is that several of the scalar machines designed for scientific computation do not use caches [6, 7]. 1.1 Cache Performance on Scientific Code To illustrate the need for improving the cache performance of microprocessor-based systems, we present results below for a set of scientific programs. For th...
173 | An architecture for software-controlled data prefetching.
- Klaiber, Levy
- 1991
Citation Context: ...e and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special "prefetch" instruction. The software uses this instruction to inform the hardwa...
137 | Tracing with Pixie.
- Smith
- 1991
Citation Context: ...endent kernels, we study each kernel separately (MXM, CFFT2D, CHOLSKY, BTRIX, GMTRY, EMIT and VPENTA). The performance of the benchmarks was simulated by instrumenting the MIPS object code using pixie [26] and piping the resulting trace into our cache simulator. Figure 1 breaks down the total program execution time into instruction execution and stalls due to memory accesses. We observe that many of th...
136 | Software Methods for Improvement of Cache Performance on Supercomputer Applications.
- Porterfield
- 1989
Citation Context: ...e and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special "prefetch" instruction. The software uses this instruction to inform the hardwa...
124 | Overlapped loop support in the Cydra 5
- Dehnert, Hsu, et al.
- 1989
Citation Context: ...or general-purpose code, their effectiveness for scientific applications has not. One manifestation of this is that several of the scalar machines designed for scientific computation do not use caches [6, 7]. 1.1 Cache Performance on Scientific Code To illustrate the need for improving the cache performance of microprocessor-based systems, we present results below for a set of scientific programs. For th...
112 | Comparative evaluation of latency reducing and tolerating techniques
- Gupta, Hennessy, et al.
- 1991
Citation Context: ...pend more than half of their time stalled for memory accesses. 1.2 Memory Hierarchy Optimizations Various hardware and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware an...
106 | On estimating and enhancing cache effectiveness.
- Ferrante, Sarkar, et al.
- 1991
Citation Context: ...es not exceed the cache size. We estimate the amount of data used for each level of loop nesting, using the reuse vector information. Our algorithm is a simplified version of those proposed previously [8, 11, 23]. We assume loop iteration counts that cannot be determined at compile time to be small; this tends to minimize the number of prefetches. (Later, in Section 4.2, we present results where unknown loop i...
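A very rough sketch of such a footprint test is shown below; the data structures, cache size, and per-iteration byte counts are illustrative assumptions rather than the algorithm actually used in the paper:

```c
/* Estimate how much data the references in a loop touch at a given nesting
 * level and compare it with the cache size; if the footprint fits, reuse at
 * that level is considered exploitable. */
#include <stddef.h>

#define CACHE_SIZE (8 * 1024)   /* assumed cache capacity in bytes */

struct ref_group {              /* one group of uniformly generated references */
    size_t bytes_per_iter;      /* new data touched per iteration of this loop */
};

/* Returns 1 if the data touched by `count` iterations, summed over all
 * reference groups, still fits in the cache. A compiler that cannot determine
 * `count` at compile time would assume a small value, as described above. */
int locality_at_level(const struct ref_group *groups, size_t ngroups,
                      size_t count) {
    size_t footprint = 0;
    for (size_t g = 0; g < ngroups; g++)
        footprint += groups[g].bytes_per_iter * count;
    return footprint <= CACHE_SIZE;
}
```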
92 | Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies
- Gornish, Granston, et al.
- 1990
Citation Context: ...e and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special "prefetch" instruction. The software uses this instruction to inform the hardwa...
79 | Sharlit--A Tool for Building Optimizers
- Hennessy, Tjiang
- 1992
Citation Context: ...our prefetch algorithm in the SUIF (Stanford University Intermediate Form) compiler. The SUIF compiler includes many of the standard optimizations and generates code competitive with the MIPS compiler [28]. Using this compiler system, we have been able to generate fully functional and optimized code with prefetching. (For the sake of simulation, prefetch instructions are encoded as loads to R0.) By sim...
70 | Data access microarchitecture for superscalar processors with compiler-assisted data prefetching.
- Chen, Mahlke, et al.
- 1991
Citation Context: ...e and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special "prefetch" instruction. The software uses this instruction to inform the hardwa...
69 | Impact of hierarchical memory systems on linear algebra algorithm design, - Gallivan, Jalby, et al. - 1988 |
26 | Automatic Program Transformations for Virtual Memory Computers - Abu-Sufah, Lawrie - 1979 |
26 | The effectiveness of caches and data prefetch buffers in large-scale shared memory multiprocessors, Ph.D. thesis
- Lee
- 1987
Citation Context: ...uated several cacheline-based hardware prefetching schemes. In some cases they were quite effective at reducing miss rates, but at the same time they often increased memory traffic substantially. Lee [20] proposed an elaborate lookahead scheme for prefetching in a multiprocessor where all shared data is uncacheable. He found that the effectiveness of the scheme was limited by branch prediction and by ...
17 | The organization of matrices and matrix operations in a paged multiprogramming environment - McKellar, Coffman - 1969 |
16 | The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. In The Characteristics of Parallel Algorithms - Gannon, Jalby - 1987 |
12 | Compile time analysis for data prefetching
- Gornish
- 1989
Citation Context: ...strated that prefetching directly into the cache can provide impressive speedups, and without the disadvantage of sacrificing cache size to accommodate a fetch buffer. Gornish, Granston and Veidenbaum [13, 14] presented an algorithm for determining the earliest time when it is safe to prefetch shared data in a multiprocessor with software-controlled cache coherency. This work is targeted for a block prefet...
4 | The SPEC Benchmark Report. Waterside Associates
- SPEC
- 1990
Citation Context: ...it in the primary instruction cache. We present results for a collection of scientific programs drawn from several benchmark suites. This collection includes NASA7 and TOMCATV from the SPEC benchmarks [27], OCEAN, a uniprocessor version of a SPLASH benchmark [25], and CG (conjugate gradient), EP (embarrassingly parallel), IS (integer sort), MG (multigrid) from the NAS Parallel Benchmarks [3]. Since the NAS...