Results 1  10
of
85
Cacheoblivious Btrees
, 2005
"... This paper presents two dynamic search trees attaining nearoptimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the blocktransfer size at each level, and the relative speeds of memory leve ..."
Abstract

Cited by 149 (20 self)
 Add to MetaCart
This paper presents two dynamic search trees attaining nearoptimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the blocktransfer size at each level, and the relative speeds of memory levels. The performance is analyzed in terms of the number of memory transfers between two memory levels with an arbitrary blocktransfer size of B; this analysis can then be applied to every adjacent pair of levels in a multilevel memory hierarchy. Both search trees match the optimal search bound of Θ(1+logB+1 N) memory transfers. This bound is also achieved by the classic Btree data structure on a twolevel memory hierarchy with a known blocktransfer size B. The first search tree supports insertions and deletions in Θ(1 + logB+1 N) amortized memory transfers, which matches the Btree’s worstcase bounds. The second search tree supports scanning S consecutive elements optimally in Θ(1 + S/B) memory transfers and supports insertions and deletions in Θ(1 + logB+1 N + log2 N) amortized memory transfers, matching the performance of the Btree for B = B Ω(log N log log N).
Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings
 International Journal of Parallel Programming
, 2001
"... The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically underutilize multilevel memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to i ..."
Abstract

Cited by 104 (2 self)
 Add to MetaCart
(Show Context)
The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically underutilize multilevel memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on spacefilling curves and we introduce a new architectureindependent multilevel blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
A localitypreserving cacheoblivious dynamic dictionary
, 2002
"... This paper presents a simple dictionary structure designed for a hierarchical memory. The proposed data structure is cache oblivious and locality preserving. A cacheoblivious data structure has memory performance optimized for all levels of the memory hierarchy even though it has no memoryhierar ..."
Abstract

Cited by 63 (18 self)
 Add to MetaCart
This paper presents a simple dictionary structure designed for a hierarchical memory. The proposed data structure is cache oblivious and locality preserving. A cacheoblivious data structure has memory performance optimized for all levels of the memory hierarchy even though it has no memoryhierarchyspecific parameterization. A localitypreserving dictionary maintains elements of similar key values stored close together for fast access to ranges of data with consecutive keys. The data structure presented here is a simplification of the cacheoblivious Btree of Bender, Demaine, and FarachColton. Like the cacheoblivious Btree, this structure supports search operations using only O(logB N) block operations at a level of the memory hierarchy with block size B. Insertion and deletion operations use O(logB N + log2 N=B) amortized block transfers. Finally, the data structure returns all k data items in a given search range using O(logB N + kB) block operations. This data structure was implemented and its performance was evaluated on a simulated memory hierarchy. This paper presents the results of this simulation for various combinations of block and memory sizes.
Cache Oblivious Search Trees via Binary Trees of Small Height
"... We propose a version of cache oblivious search trees which is simpler than the previous proposal of Bender, Demaine and FarachColton and has the same complexity bounds, in particular, our data structure avoids the use of weight balanced Btrees, and can be implemented as just a single array of data ..."
Abstract

Cited by 59 (7 self)
 Add to MetaCart
(Show Context)
We propose a version of cache oblivious search trees which is simpler than the previous proposal of Bender, Demaine and FarachColton and has the same complexity bounds, in particular, our data structure avoids the use of weight balanced Btrees, and can be implemented as just a single array of data elements, without the use of pointers. The structure also improves space utilization. For storing n elements, our proposal uses (1 + e)n times the element size of memory, and performs searches in worst case O(log B n) memory transfers, updates in amortized O((log 2 n)/(eB)) memory transfers, and range queries in worst case O(log B n + k/B) memory transfers, where k is the size of the output. The basic idea of our data structure is to maintain a dynamic binary tree of height log nlO(1) using existing methods, embed this tree in a static binary tree, which in turn is embedded in an array in a cache oblivious fashion, using the van Emde Boas layout of Prokop. We also investigate the practicality of cache obliviousness in the area of search trees, by providing an empirical comparison of different methods for laying out a search tree in memory.
Telescoping languages: A strategy for automatic generation of scientific problemsolving systems from annotated libraries
, 2001
"... As machines and programs have become more complex, the process of programming applications that can exploit the power of highperformance systems has become more difficult and correspondingly more laborintensive. This has substantially widened the software gap the discrepancy between the need for n ..."
Abstract

Cited by 51 (7 self)
 Add to MetaCart
(Show Context)
As machines and programs have become more complex, the process of programming applications that can exploit the power of highperformance systems has become more difficult and correspondingly more laborintensive. This has substantially widened the software gap the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it. This problem has been compounded by the slow growth of programming productivity, especially for highperformance programs, over the past two decades. One way to bridge this gap is to make it possible for end users to develop programs in highlevel domainspecific programming systems. In the past, a major impediment to the acceptance of such systems has been the poor performance of the resulting applications. To address this problem, we are developing a new compilerbased infrastructure, called
Implicit and explicit optimizations for stencil computations
 In MSPC ’06: Proceedings of the 2006 workshop on Memory system performance and correctness
, 2006
"... Stencilbased kernels constitute the core of many scientific applications on blockstructured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the convention ..."
Abstract

Cited by 49 (11 self)
 Add to MetaCart
(Show Context)
Stencilbased kernels constitute the core of many scientific applications on blockstructured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cachebased memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cacheaware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitlymanaged memory hierarchy, the Cell processor. Overall, results show that a cacheaware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster. 1.
The Pochoir Stencil Compiler
"... A stencil computation repeatedly updates each point of a ddimensionalgridasafunctionofitselfanditsnearneighbors. Parallel cacheefficient stencil algorithms based on “trapezoidal decompositions” are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a prog ..."
Abstract

Cited by 43 (2 self)
 Add to MetaCart
(Show Context)
A stencil computation repeatedly updates each point of a ddimensionalgridasafunctionofitselfanditsnearneighbors. Parallel cacheefficient stencil algorithms based on “trapezoidal decompositions” are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domainspecific stencil language embedded in C++ which the Pochoir compiler then translates into highperforming Cilk code that employs an efficient parallel cacheoblivious algorithm. Pochoir supports general ddimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user’s stencilspecificationtobeexecuteddirectlyinC++withoutthePochoir compiler(albeitmoreslowly),whichsimplifiesuserdebuggingand greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallelloop implementations, typically running 2–10 times faster. The algorithm behind Pochoir improves on prior cacheefficient algorithms on multidimensional grids by making “hyperspace ” cuts, which yield asymptotically more parallelism for the same cache efficiency. Categories andSubjectDescriptors
Cacheoblivious algorithms and data structures
 IN LECTURE NOTES FROM THE EEF SUMMER SCHOOL ON MASSIVE DATA SETS
, 2002
"... A recent direction in the design of cacheefficient and diskefficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cacheoblivious algorithms perform well on a multilevel memory hierarchy without knowing any pa ..."
Abstract

Cited by 42 (2 self)
 Add to MetaCart
A recent direction in the design of cacheefficient and diskefficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cacheoblivious algorithms perform well on a multilevel memory hierarchy without knowing any parameters of the hierarchy, only knowing the existence of a hierarchy. Equivalently, a single cacheoblivious algorithm is efficient on all memory hierarchies simultaneously. While such results might seem impossible, a recent body of work has developed cacheoblivious algorithms and data structures that perform as well or nearly as well as standard externalmemory structures which require knowledge of the cache/memory size and block transfer size. Here we describe several of these results with the intent of elucidating the techniques behind their design. Perhaps the most exciting of these results are the data structures, which form general building blocks immediately
Cache oblivious distribution sweeping
 IN PROC. 29TH INTERNATIONAL COLLOQUIUM ON AUTOMATA, LANGUAGES, AND PROGRAMMING (ICALP), VOLUME 2380 OF LNCS
, 2002
"... We adapt the distribution sweeping method to the cache oblivious model. Distribution sweeping is the name used for a general approach for divideandconquer algorithms where the combination of solved subproblems can be viewed as a merging process of streams. We demonstrate by a series of algorithms ..."
Abstract

Cited by 40 (11 self)
 Add to MetaCart
(Show Context)
We adapt the distribution sweeping method to the cache oblivious model. Distribution sweeping is the name used for a general approach for divideandconquer algorithms where the combination of solved subproblems can be viewed as a merging process of streams. We demonstrate by a series of algorithms for specific problems the feasibility of the method in a cache oblivious setting. The problems all come from computational geometry, and are: orthogonal line segment intersection reporting, the all nearest neighbors problem, the 3D maxima problem, computing the measure of a set of axisparallel rectangles, computing the visibility of a set of line segments from a point, batched orthogonal range queries, and reporting pairwise intersections of axisparallel rectangles. Our basic building block is a simplified version of the cache oblivious sorting algorithm Funnelsort of Frigo et al., which is of independent interest.
On the limits of cacheobliviousness
 IN PROC. 35TH ANNUAL ACM SYMPOSIUM ON THEORY OF COMPUTING
, 2003
"... In this paper, we present lower bounds for permuting and sorting in the cacheoblivious model. We prove that (1) I/O optimal cacheoblivious comparison based sorting is not possible without a tall cache assumption, and (2) there does not exist an I/O optimalcacheoblivious algorithm for permuting, ..."
Abstract

Cited by 38 (7 self)
 Add to MetaCart
(Show Context)
In this paper, we present lower bounds for permuting and sorting in the cacheoblivious model. We prove that (1) I/O optimal cacheoblivious comparison based sorting is not possible without a tall cache assumption, and (2) there does not exist an I/O optimalcacheoblivious algorithm for permuting, not even in the presence of a tall cache assumption.Our results for sorting show the existence of an inherent tradeoff in the cacheoblivious model between the strength of the tall cache assumption and the overhead for the case M >> B, and show that Funnelsort and recursive binary mergesort are optimal algorithms in the sense that they attain this tradeoff.