## Cache-oblivious algorithms (1999)

Citations: 85 (1 self)

### Citations

10589 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context ...rts them recursively and then merges the two sorted subsequences into one sorted sequence. The following pseudocode is the standard description of mergesort and can be found in a variety of textbooks [16, 34]. MERGESORT(A, p, r): 1 if p < r; 2 then $q \leftarrow \lfloor (p + r)/2 \rfloor$; 3 MERGESORT(A, p, q); 4 MERGESORT(A, q + 1, r); 5 MERGE(A, p, q, r). We assume that the input array A of length n is stored in $O(n)$ contiguous ... |

4929 | Computer Architecture - A Quantitative Approach, 2nd Edition - Hennessy, Patterson - 1996 |

2889 | The Design and Analysis of Computer Algorithms - Aho, Hopcroft, et al. - 1974 |

2192 | Randomized Algorithms - Motwani, Raghavan - 1995 |

823 | Amortized efficiency of list update and paging rules
- Sleator, Tarjan
- 1985
Citation Context ...ses on a problem of size n using a (Z, L) ideal cache. Then, the same algorithm incurs $Q(n; Z, L) \le 2Q^*(n; Z/2, L)$ cache misses on a (Z, L) cache that uses LRU replacement. Proof. Sleator and Tarjan [37] have shown that the cache misses on a (Z, L) cache using LRU replacement are $(Z/(Z - Z^* + 1))$-competitive with optimal replacement on a $(Z^*, L)$ ideal cache if both caches start with an empty cache. It foll... |

809 | Online Computation and Competitive Analysis
- Borodin, El-Yaniv
- 1998
Citation Context ... the same for LRU and optimal replacement. Proof. This corollary follows directly from Lemma 18 and the regularity condition. □ The same argument extends to a variety of other replacement strategies [11], including: flush when full: Whenever there is a cache miss and there is no space left in the cache, evict all lines currently in the cache (call this action a "flush"). clock replacement: An approxi... |

602 | FFTW: An Adaptive Software Architecture for the FFT
- Frigo, Johnson
- 1998
Citation Context ...mposition, but with pivoting. For $n \times n$ matrices, Toledo's algorithm uses $\Theta(n^3)$ work and incurs $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ cache misses. More recently, our group has produced an FFT library called FFTW [20], which in its most recent incarnation [19], employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances mem... |

601 | A study of replacement algorithms for virtual storage
- Belady
- 1966
Citation Context ...d anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is farthest in the future [7], and thus it exploits temporal locality perfectly. An algorithm with an input of size n is measured in the ideal-cache model in terms of its work complexity W(n), its conventional running time in a RA... |

599 | The input/output complexity of sorting and related problems
- Aggarwal, Vitter
- 1988
Citation Context ...h it is cache oblivious, algorithms like familiar two-way merge sort (see Section 7.3) are not asymptotically optimal with respect to cache misses. The Z-way mergesort mentioned by Aggarwal and Vitter [3] is optimal in terms of cache complexity, but it is cache aware. This section describes a cache-oblivious sorting algorithm called "funnelsort." This algorithm has an asymptotically optimal work compl... |

515 | Algorithms in C
- Sedgewick
- 1998
Citation Context ...rts them recursively and then merges the two sorted subsequences into one sorted sequence. The following pseudocode is the standard description of mergesort and can be found in a variety of textbooks [16, 34]. MERGESORT(A, p, r): 1 if p < r; 2 then $q \leftarrow \lfloor (p + r)/2 \rfloor$; 3 MERGESORT(A, p, q); 4 MERGESORT(A, q + 1, r); 5 MERGE(A, p, q, r). We assume that the input array A of length n is stored in $O(n)$ contiguous ... |

350 | External memory algorithms and data structures: dealing with massive data
- Vitter
Citation Context ...introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [43] and several parallel hierarchical memory models [44]. Vitter [41] provides a comprehensive survey of external-memory algorithms. We ourselves stand disappointed and look on, dismayed, at the curtain closed and all questions open. [...] Honored audience, go on, find for yourself ... |

235 | Algorithms for parallel memory I: Two-level memories.
- Vitter, Shriver
- 1994
Citation Context .../O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [43] and several parallel hierarchical memory models [44]. Vitter [41] provides a comprehensive survey of external-memory algorithms. We ourselves stand disappointed and look on, dismayed, at the curtain closed and ... |

199 | A fast Fourier transform compiler.
- Frigo
- 1999
Citation Context ...rices, Toledo's algorithm uses $\Theta(n^3)$ work and incurs $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ cache misses. More recently, our group has produced an FFT library called FFTW [20], which in its most recent incarnation [19], employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time... |

192 | An Algorithm for the Machine Computation of Complex Fourier Series
- Cooley, Tukey
- 1965
Citation Context ...ting the discrete Fourier transform of a complex array of n elements, where n is an exact power of 2. The basic algorithm is the well-known "six-step" variant [6, 44] of the Cooley-Tukey FFT algorithm [15]. By using the cache-oblivious transposition algorithm, however, we can make the FFT cache oblivious, and its performance matches the lower bound by Hong and Kung [25]. Recall that the discrete Fourie... |

188 | I/O complexity: The red-blue pebble game
- Hong, Kung
- 1981
Citation Context ...$+\, n^2/L + n^3/(L\sqrt{Z}))$ of the cache-oblivious matrix multiplication algorithm is the same as the cache complexity of the cache-aware BLOCK-MULT algorithm and also matches the lower bound by Hong and Kung [25]. This lower bound holds for all algorithms that execute the $\Theta(n^3)$ operations given by the definition of matrix multiplication $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$. No tight lower bounds for the general problem of matri... |

153 | FFTs in external or hierarchical memory - Bailey - 1990 |

139 | Fast Fourier transforms: a tutorial review and a state of the art
- Duhamel, Vetterli
- 1990
Citation Context ...rray Y given by $Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$ (3.1), where $\omega_n = e^{2\pi\sqrt{-1}/n}$ is a primitive nth root of unity, and $0 \le i < n$. Many known algorithms evaluate Equation (3.1) in time $O(n \lg n)$ for all integers n [17]. In this thesis, however, we assume that n is an exact power of 2, and compute Equation (3.1) according to the Cooley-Tukey algorithm, which works recursively as follows. In the base case where n = 0... |

131 | A model for hierarchical memory
- Aggarwal, Alpern, et al.
- 1987
Citation Context ...ally no weaker than ours. Specifically, we prove (with only minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the hierarchical memory model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model [5, 42]. Section 9 discusses related work, and Section 10 offers some concluding remarks. Many of the results in this thesis are based on a joi... |

131 | Sorting and Searching
- Knuth
- 1998
Citation Context ...es? Although I do not yet know the answer to this question, I have been able to devise a cache-oblivious layout for static binary search trees that is $O(1)$-competitive with the performance of B-Trees [28], which are used in file systems and other out-of-core applications because of their low cache complexity. Figure 10-2 shows the cache-oblivious layout for a complete binary search tree of height 4. L... |

102 | Hierarchical memory with block transfer
- Aggarwal, Chandra, et al.
- 1987
Citation Context ...oblems. The hierarchical memory model (HMM) by Aggarwal et al. [1] treats memory as a linear array, where the cost of an access to element at location x is given by a cost function f(x). The BT model [2] extends HMM to support block transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows I/O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, ... |

98 | An analysis of dag-consistent distributed shared-memory algorithms
- Blumofe, Frigo, et al.
- 1996
Citation Context ...ing $n \times n$ matrices, which uses $\Theta(n^{\lg 7})$ work, incurs $\Theta(1 + n^2/L + n^{\lg 7}/(L\sqrt{Z}))$ cache misses. The following algorithm extends the optimal divide-and-conquer algorithm for square matrices described in [9] to rectangular matrices. To multiply an $m \times n$ matrix A by an $n \times p$ matrix B, the algorithm halves the largest of the three dimensions and recurs according to one of the following three cases: AB = A1... |

90 | Locality of reference in LU decomposition with partial pivoting.
- Toledo
- 1997
Citation Context ... 1997. This matrix-multiplication algorithm, as well as a cache-oblivious algorithm for LU-decomposition without pivoting, eventually appeared in [9]. Shortly after leaving our research group, Toledo [40] independently proposed a cache-oblivious algorithm for LU-decomposition, but with pivoting. For $n \times n$ matrices, Toledo's algorithm uses $\Theta(n^3)$ work and incurs $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ cache misses. Mo... |

86 | Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code.
- Frens, Wise
- 1997
Citation Context ... but no tuning parameter need be set, since submatrices of size $\Theta(\sqrt{L} \times \sqrt{L})$ are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [18] and [12, 13]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly. For square matrices, the cache complexity $Q(n) = \Theta(n + n^2/L +$... |

76 | Nonlinear Array Layouts for Hierarchical Memory Systems.
- Chatterjee, Jain, et al.
- 1999
Citation Context ...uning parameter need be set, since submatrices of size $\Theta(\sqrt{L} \times \sqrt{L})$ are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [18] and [12, 13]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly. For square matrices, the cache complexity $Q(n) = \Theta(n + n^2/L +$... |

67 | Algorithms for parallel memory II: Hierarchical multilevel memories
- Vitter, Shriver
- 1994
Citation Context ... optimal cache-oblivious algorithm for transposing an $m \times n$ matrix. The algorithm uses $\Theta(mn)$ work and incurs $\Theta(1 + mn/L)$ cache misses. Using matrix transposition as a subroutine, we convert a variant [44] of the "six-step" fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses $O(n \lg n)$ work and incurs $O(1 + (n/L)(1 + \log_Z n))$ cache misses. The pro... |

65 | Writing Efficient Programs
- Bentley
- 1982
Citation Context ...ms. Most algorithms given in this thesis are divide-and-conquer algorithms. Conventional wisdom says that recursive procedures should be converted into iterative loops in order to improve performance [8]. While this strategy was effective ten years ago, many recursive programs now actually run faster than their iterative counterparts. So far most of the work by architects and compiler writers is conc... |

64 | Gaussian elimination is not optimal. Numerische Mathematik
- Strassen
- 1969
Citation Context ...ese results require the tall-cache assumption (1.1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen's algorithm [38] for multiplying $n \times n$ matrices, which uses $\Theta(n^{\lg 7})$ work, incurs $\Theta(1 + n^2/L + n^{\lg 7}/(L\sqrt{Z}))$ cache misses. The following algorithm extends the optimal divide-and-conquer algorithm for square matrices ... |

63 | An Algorithm for Computing the Mixed Radix Fast Fourier Transform
- Singleton
- 1969
Citation Context ... employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time [36]. Previous theoretical work on understanding hierarchical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an automatic replacement strategy. Hong and Kung ... |

62 | Dag-consistent distributed shared memory - Blumofe, Frigo, et al. - 1996 |

54 | Recursive Array Layouts and Fast Parallel Matrix Multiplication.
- Chatterjee, Lebeck, et al.
- 1999
Citation Context ...uning parameter need be set, since submatrices of size $\Theta(\sqrt{L} \times \sqrt{L})$ are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [18] and [12, 13]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly. For square matrices, the cache complexity $Q(n) = \Theta(n + n^2/L +$... |

54 | Automatic Parallelization of Divide and Conquer Algorithms
- Rugina, Rinard
- 2000
Citation Context ... are actually executed in parallel, a Cilk program is processor oblivious. It can be effectively executed on many processors, as long as the problem has enough inherent parallelism. Rugina and Rinard [32] have experimented with automatic parallelization from C to Cilk and achieved good speedups on divide-and-conquer programs. Recursive calls can often be replaced by recursive spawns, which allow the c... |

50 | Deterministic distribution sort in shared and distributed memory multiprocessors
- Nodine, Vitter
- 1993
Citation Context ...ion-sorting algorithm uses $O(n \lg n)$ work to sort n elements, and it incurs $\Theta(1 + (n/L)(1 + \log_Z n))$ cache misses if the cache is tall. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 30, 42, 44], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a "bucket-splitting" technique to select pivots incrementally during the dist... |

44 | Uniform memory hierarchies
- Alpern, Carter, et al.
- 1990
Citation Context ...minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the hierarchical memory model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model [5, 42]. Section 9 discusses related work, and Section 10 offers some concluding remarks. Many of the results in this thesis are based on a joint paper [21] coauthored by Matteo Frigo, Charles E. Leiserson, ... |

32 | Extending the Hong-Kung model to memory hierarchies
- Savage
- 1995
Citation Context ...ve lower bounds on the I/O-complexity of matrix multiplication, FFT, and other problems. The red-blue pebble game models temporal locality using two levels of memory. The model was extended by Savage [33] for deeper memory hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated a two-level memory in which a block of P contiguous items can be transferred in one step. They o... |

25 | Large-scale sorting in uniform memory hierarchies
- Vitter, Nodine
Citation Context ...minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the hierarchical memory model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model [5, 42]. Section 9 discusses related work, and Section 10 offers some concluding remarks. Many of the results in this thesis are based on a joint paper [21] coauthored by Matteo Frigo, Charles E. Leiserson, ... |

15 | Cache-oblivious algorithms. Extended abstract submitted for publication
- Frigo, Leiserson, et al.
- 1999
Citation Context ...he serial uniform memory hierarchy (SUMH) model [5, 42]. Section 9 discusses related work, and Section 10 offers some concluding remarks. Many of the results in this thesis are based on a joint paper [21] coauthored by Matteo Frigo, Charles E. Leiserson, and Sridhar Ramachandran. Section 2: Matrix multiplication. This section describes and analyzes an algorithm for multiplying an $m \times n$ matrix by an n... |

8 | to the future: Time to return to some long standing problems in computer systems? Plenary talk at FCRC'99
- Hennessy
Citation Context ...ingle user are starting to appear [35]. These machines will become more common over the next few years, and it is expected that we will see a shared-memory multiprocessor-on-a-chip within a few years [23, 27]. Writing efficient parallel programs is considered hard. Caching problems are more pronounced in these machines than they are in single-processor machines. Memory hierarchies will be bigger and steep... |

7 | The algebraic complexity of functions - Winograd - 1970 |

1 | Uniform memory hierarchies. Pro - Alpern, Carter, et al. - 1990 |

1 | Future investment in information technology research: Report of the president's information technology advisory committee. Plenary talk at FCRC'99
- Kennedy
Citation Context ...ingle user are starting to appear [35]. These machines will become more common over the next few years, and it is expected that we will see a shared-memory multiprocessor-on-a-chip within a few years [23, 27]. Writing efficient parallel programs is considered hard. Caching problems are more pronounced in these machines than they are in single-processor machines. Memory hierarchies will be bigger and steep... |

1 | Gaussian elimination is not optimal. Numerische Mathematik 13 - Strassen, V. - 1969 |

1 | http://ftp.digital.com/pub/Digital/info/semiconductor/ ... literature/dsc-library.html
- Compaq
Citation Context ...pronounced in these machines than they are in single-processor machines. Memory hierarchies will be bigger and steeper in the future, and cache misses will be more expensive. The new Alpha 21264 chip [14], for example, can deliver 2 words from L1-cache in one cycle, but it takes around 100 cycles to fetch from main memory. Divide-and-conquer seems to provide a way to write processor- and cache-oblivio... |

1 | http://www.pc.ibm.com/us/netfinity/index.html
- IBM
Citation Context ... algorithms given in this thesis can be proven to satisfy all three optimality requirements. Small shared-memory multiprocessors are readily available: A 4-processor machine costs less than $20,000 [26]. Most of these machines are designed to be servers, but workstations intended to be used by a single user are starting to appear [35]. These machines will become more common over the next few years, ... |
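
The MERGESORT pseudocode quoted in the citation contexts for *Introduction to Algorithms* and *Algorithms in C* above can be sketched in Python roughly as follows. This is only an illustrative sketch of the standard two-way mergesort, not code from the cited thesis or textbooks: the names `merge_sort` and `merge`, the use of inclusive bounds `p..r`, and the temporary-copy merge are assumptions chosen to mirror the pseudocode.

```python
def merge(a, p, q, r):
    """Merge the sorted runs a[p..q] and a[q+1..r] (inclusive bounds) in place.

    Hypothetical helper standing in for MERGE(A, p, q, r) in the pseudocode.
    """
    left = a[p:q + 1]        # copy of the first sorted run
    right = a[q + 1:r + 1]   # copy of the second sorted run
    i = j = 0
    k = p
    # Repeatedly take the smaller head element of the two runs.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            a[k] = left[i]
            i += 1
        else:
            a[k] = right[j]
            j += 1
        k += 1
    # At most one of the runs still has elements; copy them over.
    a[k:r + 1] = left[i:] + right[j:]


def merge_sort(a, p, r):
    """Sort a[p..r] (inclusive bounds), following MERGESORT(A, p, r)."""
    if p < r:
        q = (p + r) // 2          # q = floor((p + r) / 2)
        merge_sort(a, p, q)       # sort the first half
        merge_sort(a, q + 1, r)   # sort the second half
        merge(a, p, q, r)         # merge the two sorted halves


if __name__ == "__main__":
    data = [5, 2, 4, 7, 1, 3, 2, 6]
    merge_sort(data, 0, len(data) - 1)
    print(data)
```

The recursion never refers to the cache size Z or line length L, which is what makes it cache oblivious; as the citation context for Aggarwal and Vitter notes, such a two-way mergesort is nonetheless not asymptotically optimal with respect to cache misses.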