9 citations found.
D. Gannon and W. Jalby, "The influence of memory hierarchy on algorithm organization: Programming FFT's on a vector multiprocessor," in The Characteristics of Parallel Algorithms, L. Jamieson, D. Gannon, and R. Douglass, Eds. Cambridge, MA: MIT Press, 1987, ch. 11, pp. 277--301.


This paper is cited in the following contexts:
Maximizing Memory Bandwidth for Streamed Computations - McKee (1995)   (7 citations)

....locality of reference makes caching less effective than it might be for other parts of the program. In addition to traditional caching, other proposed solutions to the memory bandwidth problem range from software prefetching [Cal91,Kla91,Mow92] and iteration space tiling [Car89,Gal87,Gan87,Lam91,Por89,Wol89], to prefetching or non-blocking caches [Bae91,Che92,Soh91], unusual memory systems [Bud71,Gao93,Rau91,Val92,Yan92], and address transformations [Har87,Har89]. The following chapters discuss the merits and limitations of each of these in the context of streaming, but all these solutions overlook ....

....accessed in the computation's natural order, even when loop unrolling is applied. Note that the effectiveness of naive ordering decreases rapidly as vector stride increases. 2.3.1.2 Block Prefetching Blocking or tiling changes a computation so that sub-blocks of data are repeatedly manipulated [And92,Gal87,Gan87,Lam91,Por89,Wol89]. This technique reduces average access latency by reusing data at faster levels of the memory hierarchy, and may be applied to registers, cache, TLB, and even virtual memory. For example, multiplication of matrices can be blocked to reuse cached data. Figure 2.5 illustrates the data access ....

[Article contains additional citation context not shown here]

D. Gannon, and W. Jalby, "The Influence of Memory Hierarchy on Algorithm Organization: Programming FFTs on a Vector Multiprocessor", in The Characteristics of Parallel Algorithms, MIT Press, 1987.


Hyperblocking: A Data Reorganization Method to Eliminate.. - Moon, Saavedra (1998)   (4 citations)

....In this section, we briefly review these approaches. Using the cache model presented in Section 2.2, we also show why tile selection heuristics are not effective. [Technical Report USC CS 98 671] 2.1 Related Work To cope with cache conflicts of tiled loops, copy optimization has been suggested [4, 6, 10]. In copy optimization, non-contiguous blocks of data to be reused are copied into a contiguous area in memory. With copy optimization, self-interference within the data block is eliminated since each element is mapped into a different cache frame. However, the overhead of copying can be ....

Dennis Gannon and William Jalby. The influence of memory hierarchy on algorithm organization: Programming fft on a vector multiprocessor. In The Characteristics of Parallel Algorithms. MIT Press, 1987.


Access Order and Memory-Conscious Cache Utilization - McKee, Wulf (1995)   (2 citations)

....Processor speeds are increasing much faster than memory speeds, thus memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly scientific computations. Proposed solutions range from software prefetching [4, 16, 27] and iteration space tiling [5, 8, 9, 18, 32, 38], to address transformations [12, 13], unusual memory systems [3, 10, 33, 36], and prefetching or non-blocking caches [1, 6, 34]. Here we take one technique, access ordering, and examine it in depth by analyzing the performance of five different access ordering schemes. Our techniques for ....

....are accessed in the computation's natural order, even when loop unrolling is applied. Note that the effectiveness of naive ordering decreases rapidly as vector stride increases. 2.2 Block prefetching Blocking or tiling changes a computation so that subblocks of data are repeatedly manipulated [8, 9, 18, 32, 38]. A familiar example is multiplication of matrices stored in row major order:

    for i = 1 to n do
        for j = 1 to n do
            load A[i,j] into register r
            for k = 1 to n do
                C[i,k] ← C[i,k] + r × B[j,k]

Unless the cache is large enough to hold at least one of the matrices, the elements of B in the inner loop ....
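The loop nest quoted above, and a tiled variant of it, can be sketched in Python as follows. This is only an illustration of the general blocking idea, not code from McKee and Wulf or from Gannon and Jalby: the function names, the `tile` parameter, and the choice to tile the `j` and `k` loops are assumptions made for the example.

```python
def matmul_naive(A, B, n):
    """Row-major product in the loop order of the quoted excerpt."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = A[i][j]                      # "load A[i,j] into register r"
            for k in range(n):
                C[i][k] += r * B[j][k]       # sweeps a whole row of B
    return C


def matmul_blocked(A, B, n, tile=2):
    """Same product, restricted to one tile-by-tile sub-block of B at a time.

    The tile size is a hypothetical tuning parameter; in practice it would
    be chosen so that a sub-block of B fits in the cache.
    """
    C = [[0] * n for _ in range(n)]
    for jj in range(0, n, tile):             # rows of the current B sub-block
        for kk in range(0, n, tile):         # columns of the current B sub-block
            for i in range(n):
                for j in range(jj, min(jj + tile, n)):
                    r = A[i][j]
                    for k in range(kk, min(kk + tile, n)):
                        C[i][k] += r * B[j][k]
    return C


if __name__ == "__main__":
    n = 5
    A = [[i + j for j in range(n)] for i in range(n)]
    B = [[i * j + 1 for j in range(n)] for i in range(n)]
    print(matmul_blocked(A, B, n, tile=2) == matmul_naive(A, B, n))
```

Both functions accumulate each C[i][k] in the same order of j, so they produce identical results; only the order in which elements of B are touched changes.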

[Article contains additional citation context not shown here]

Gannon, D., and Jalby, W., "The Influence of Memory Hierarchy on Algorithm Organization: Programming FFTs on a Vector Multiprocessor", in The Characteristics of Parallel Algorithms. MIT Press, 1987.


The Uniform Memory Hierarchy Model of Computation - Alpern, Carter, Feig, Selker (1992)   (45 citations)

....U) The time to bring the first data items to $M_0$ (essentially the time to move $N^{1/2}$ blocks down the topmost bus plus the time to move one block down each subsequent bus) is $O(N^{3/4} \log N)$. It is not at all clear this program would be competitive on real computers. Some authors [GJ87, B90] try to avoid bit reversal permutations like the plague; in practice, a single transpose is more efficient than a bit reversal. The penalty of recursive transposes is not large: $O(\log \log N)$. Bailey [B90] reports good results for the Four Step program for $N$ from $2^8$ to $2^{20}$ on Cray ....

Gannon, D., and W. Jalby, "The Influence of Memory Hierarchy on Algorithm Organization: Programming FFTs on a Vector Multiprocessor," The Characteristics of Parallel Algorithms, Jamieson, Gannon, and Douglass, ed., MIT Press, 1987.


Data Relocation And Prefetching For Programs With Large Data Sets - Yamada (1995)   (32 citations)  Self-citation (Programs)

....CHAPTER 1 INTRODUCTION 1.1 Overview Numerical applications frequently contain nested loop structures that process large arrays. The execution of these loop structures has been shown to produce memory reference patterns that poorly utilize data caches [3][4]. At least three problems have been identified as the cause of poor cache utilization. The first problem involves an insufficient capacity of the cache: The data accessed by each loop may exceed the cache size, resulting in cache misses. Limited associativity of the cache leads to a second ....

....of the access is larger than one (example: A[0] A[2] A[4] A[6]), or when different arrays are accessed in turn (example: A[0] B[0] C[0] A[1] B[1] C[1]). These characteristics of numerical programs introduce some inefficiency in the cache, as shown in Sections 1.2.1-1.2.3.

[Figure: cache contents and access trace]

    Access A[0]: miss, purge A[4] A[5]
    Access A[1]: hit
    Access A[2]: miss, purge A[6] A[7]
    Access A[3]: hit
    Access A[4]: miss, purge A[0] A[1]
    Access A[5]: hit
    Access A[6]: miss, purge A[2] A[3]
    Access A[7]: hit
    Access A[0]: miss
    Access A[1]: hit
    Access A[2]: miss
    Access A[3]: hit
....
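The miss/hit pattern in the access trace above is what a small direct-mapped cache would produce. The sketch below is an assumption-laden illustration rather than Yamada's model: the geometry (two lines of two words each, initially holding A[4] through A[7]) is inferred from the trace itself, and the names `simulate`, `num_lines`, `line_words`, and `initial_blocks` are hypothetical.

```python
def simulate(accesses, num_lines=2, line_words=2, initial_blocks=None):
    """Replay word addresses through a direct-mapped cache.

    Returns a list of (address, "hit" | "miss", evicted_block) tuples, where
    evicted_block is the block number purged on a miss (None if the line was
    empty or the access hit).
    """
    lines = list(initial_blocks) if initial_blocks else [None] * num_lines
    trace = []
    for addr in accesses:
        block = addr // line_words           # which memory block holds addr
        idx = block % num_lines              # the only line that block may use
        if lines[idx] == block:
            trace.append((addr, "hit", None))
        else:
            evicted = lines[idx]
            lines[idx] = block               # purge the old block, load the new
            trace.append((addr, "miss", evicted))
    return trace


if __name__ == "__main__":
    # A[0]..A[7] in order, then A[0]..A[3] again, as in the quoted trace.
    accesses = list(range(8)) + list(range(4))
    # Assumed starting state: line 0 holds A[4],A[5]; line 1 holds A[6],A[7].
    for addr, outcome, evicted in simulate(accesses, initial_blocks=[2, 3]):
        purged = "" if evicted is None else (
            f", purge A[{evicted * 2}] A[{evicted * 2 + 1}]")
        print(f"Access A[{addr}]: {outcome}{purged}")
```

Under these assumptions the eight-element array is twice the cache capacity, so every block is purged before it is reused and half of all accesses miss.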

[Article contains additional citation context not shown here]

D. Gannon and W. Jalby, The characteristics of parallel programs, ch. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. MIT press, 1987.


Data Relocation and Prefetching for Programs with Large.. - Yamada, Gyllenhaal.. (1994)   (32 citations)  Self-citation (Programs)

....data relocation, program optimization, software prefetching. 1 Introduction Numerical applications frequently contain nested loop structures that process large arrays. The execution of these loop structures has been shown to produce memory reference patterns that poorly utilize data caches [3][4]. The first of three problems involves an insufficient capacity of the cache: The data accessed by each loop may exceed the cache size, resulting in cache misses. Limited associativity of the cache presents a second problem: accesses to different arrays, or even to different elements of a ....

D. Gannon and W. Jalby, The characteristics of parallel programs, ch. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. MIT press, 1987.


A Low-Power, High-Performance, 1024-Point FFT Processor - Baas (1999)

No context found.

D. Gannon and W. Jalby, "The influence of memory hierarchy on algorithm organization: Programming FFT's on a vector multiprocessor," in The Characteristics of Parallel Algorithms, L. Jamieson, D. Gannon, and R. Douglass, Eds. Cambridge, MA: MIT Press, 1987, ch. 11, pp. 277--301.


A Blocked All-Pairs Shortest-Paths Algorithm - Gayathri Venkataraman Sartaj

No context found.

D. Gannon and W. Jalby. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. In The Characteristics of Parallel Algorithms, MIT Press, Cambridge, 1987.


Design and Evaluation of a Compiler Algorithm for Prefetching - Mowry, Lam, Gupta (1992)   (320 citations)

No context found.

D. Gannon and W. Jalby. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. In The Characteristics of Parallel Algorithms. MIT Press, 1987.


CiteSeer.IST - Copyright Penn State and NEC