S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the 11th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.

This paper is cited in the following contexts:

Portable High-Performance Programs - Frigo (1992)   (1 citation)  (Correct)

....bit-interleaved layout (Figure 3-2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size $\Theta(\sqrt{L} \times \sqrt{L})$ are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [53] and [35, 36]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on today's conventional microprocessors can be costly. For square matrices, the cache complexity $Q(n) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ of the cache-oblivious matrix multiplication algorithm matches the lower bound by ....

S. CHATTERJEE, A. R. LEBECK, P. K. PATNALA, AND M. THOTTETHODI, Recursive array layouts and fast parallel matrix multiplication, in Proceedings of the Eleventh ACM SIGPLAN Symposium on Parallel Algorithms and Architectures, June 1999.


Supporting Multidimensional Arrays in Java - Moreira, Midkiff, Gupta (2001)   (Correct)

....a standard it is likely, and desirable, that third-party implementations will be developed. Because the Array package does not specify how the elements of an Array are laid out (in the spirit of Java arrays), it is possible to implement layouts based on space-filling curves and recursive blocking [5, 10, 11]. Some researchers advocate a specific layout for the elements of a multiarray, with the argument that such specification would facilitate the development of performance-portable code. We note that exposing internal object layouts is against Java's philosophy. It also prevents future optimizations ....

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the 11th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


An Efficient Semi-Hierarchical Array Layout - Drakenberg, Lundevall, Lisper (2001)   (3 citations)  (Correct)

.... as C order, U order, Hilbert order and Z or Morton [17] order (an example of transposed Morton, or $Z^T$ layout, is provided by the right-hand side of Figure 1). Hierarchical array layouts have been developed and used for various special purposes, such as in computational subroutine libraries [5, 4], load balancing of parallel computations [13, 18] and in image processing [9, 28]. Despite a non-negligible volume of results on hierarchical storage layouts, such results have typically not become widely known. Several authors seem to have reinvented such storage layouts plus associated ....

....obtain complete information on the cache behavior. 4. Related Work. Array layout has received much attention in the context of automatic array alignment and distribution for distributed-memory machines [3, 12, 14]. In the context of uniprocessor memory hierarchies, the work of Chatterjee et al. [4, 5] is the most similar to ours. They investigate and evaluate an array layout which is essentially identical to the HAT layout, but which uses a set of smallest tile sizes within which linear layouts are used. Significant performance gains are reported for some hand-tailored tiled algorithms using their ....

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proc. Eleventh ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Cache-Oblivious Algorithms (Extended Abstract) - Frigo, al.   (Correct)

....bit-interleaved layout (Figure 2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size $O(\sqrt{L}) \times O(\sqrt{L})$ are cache-obliviously stored on $O(1)$ cache lines. The advantages of bit-interleaved and related layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor architects will remedy. For square matrices, the cache complexity $Q(n) = \Theta(n + n^2/L + n^3/(L\sqrt{Z}))$ of the REC-MULT ....

....As the figure shows, the average time used per integer multiplication in the recursive algorithm is almost constant, which, for large matrices, is less than 50% of the time used by the iterative variant. A similar study for Jacobi multipass filters can be found in [26]. Several researchers [12, 16] have also observed that recursive algorithms exhibit performance advantages over iterative algorithms for computers with caches. A comprehensive empirical study has yet to be done, however. Do cache-oblivious algorithms perform nearly as well as cache-aware algorithms in practice, where constant ....

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.
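
The recursive multiplication contrasted with the iterative variant in the context above can be sketched as follows. This is a minimal illustration of one common divide-and-conquer formulation (halve the largest of the three dimensions until a base case is reached), not code taken from the citing paper. The names `rec_mult` and `multiply` and the 1 x 1 x 1 base case are assumptions made for the example; a practical version would stop at a larger block.

```python
def rec_mult(A, B, C, rows, cols, inner):
    """Accumulate A[rows, inner] * B[inner, cols] into C[rows, cols].

    rows, cols and inner are half-open (start, stop) index ranges.
    Illustrative sketch only; the base case size is an assumption.
    """
    (i0, i1), (j0, j1), (k0, k1) = rows, cols, inner
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if di <= 0 or dj <= 0 or dk <= 0:
        return  # empty sub-problem
    if di == dj == dk == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
        return
    if di >= dj and di >= dk:
        # Split the rows of A (and of C).
        mid = i0 + di // 2
        rec_mult(A, B, C, (i0, mid), cols, inner)
        rec_mult(A, B, C, (mid, i1), cols, inner)
    elif dj >= dk:
        # Split the columns of B (and of C).
        mid = j0 + dj // 2
        rec_mult(A, B, C, rows, (j0, mid), inner)
        rec_mult(A, B, C, rows, (mid, j1), inner)
    else:
        # Split the shared inner dimension; both halves add into the same C block.
        mid = k0 + dk // 2
        rec_mult(A, B, C, rows, cols, (k0, mid))
        rec_mult(A, B, C, rows, cols, (mid, k1))


def multiply(A, B):
    """Return A * B for list-of-lists matrices using rec_mult."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, (0, n), (0, p), (0, m))
    return C
```

No cache size or line length appears anywhere in the sketch, which is the sense in which such a recursion needs no tuning parameter.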


Morton-order Matrices Deserve Compilers' Support - Wise, Frens (1999)   (3 citations)  (Correct)

.... on dilated integers that provide the common strength reductions on cartesian indexing to Morton-order matrices [41, 15]. The new results are BLAS3 performance from Morton order and divide-and-conquer recursion, which lend themselves nicely to parallelism although they are not explored here [21, 12], as well as a review of transformations available to compilers to support this representation. Together, they suggest new formulations for old algorithms and, perhaps even, new ones. They offer high locality to solutions for many matrix problems, regardless of the memory parameters of the target ....

.... cache need never be specified, especially at compile time [23]. This perspective contrasts with blocking that is dependent on page size [35, 18, 37] or cache size [3, 10, 11, 32, 29, 47]. Indeed, much promising work on dense matrix computation using quadtrees and Morton order has appeared recently [12, 13, 21, 23, 28, 33, 43]. It follows earlier work on linked quadtrees that was aimed at sparse problems [6, 27]. This all follows its impact on graphics, and both geographic and spatial databases [40]. The beautiful features of Morton ordering are being rediscovered and redispersed because they have not yet been absorbed ....

[Article contains additional citation context not shown here]

S. Chatterjee, A. R. Lebeck, P. K. Patnala, & M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. Proc. 11th ACM Symp. Parallel Algorithms and Architectures, 222--231. http://www.acm.org/pubs/citations/proceedings/spaa/305619/p222-chatterjee/
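
The "dilated integers" mentioned in the first context above can be illustrated with a short sketch. It shows one widely used formulation: each index is spread over alternating bit positions, so that stepping to a neighbouring row or column is a mask, an add and a mask, with no need to recompute the interleaving. The names `dilate`, `morton_index`, `next_col` and `next_row`, the 16-bit index limit, and the choice of even bits for columns and odd bits for rows are all assumptions made for the example, not details taken from the citing paper.

```python
EVEN_BITS = 0x55555555  # assumed convention: dilated column index lives here
ODD_BITS = 0xAAAAAAAA   # assumed convention: dilated row index lives here


def dilate(x: int) -> int:
    """Spread the low 16 bits of x so that bit k moves to position 2k."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def morton_index(row: int, col: int) -> int:
    """Interleave row (odd bits) and col (even bits) into one offset."""
    return (dilate(row) << 1) | dilate(col)


def next_col(index: int) -> int:
    """Offset of (row, col + 1), computed directly on the dilated form."""
    # Filling the odd bits with ones lets the carry skip over them.
    col = ((index | ODD_BITS) + 1) & EVEN_BITS
    return (index & ODD_BITS) | col


def next_row(index: int) -> int:
    """Offset of (row + 1, col), computed directly on the dilated form."""
    row = ((index | EVEN_BITS) + 2) & ODD_BITS
    return (index & EVEN_BITS) | row
```

For instance, `next_col(morton_index(5, 7))` equals `morton_index(5, 8)` without un-dilating either index.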


Portable High-Performance Programs - Frigo (1999)   (1 citation)  (Correct)

....bit-interleaved layout (Figure 3-2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size $\Theta(\sqrt{L} \times \sqrt{L})$ are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [53] and [35, 36]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on today's conventional microprocessors can be costly. For square matrices, the cache complexity $Q(n) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ of the cache-oblivious matrix multiplication algorithm matches the ....

S. CHATTERJEE, A. R. LEBECK, P. K. PATNALA, AND M. THOTTETHODI, Recursive array layouts and fast parallel matrix multiplication, in Proceedings of the Eleventh ACM SIGPLAN Symposium on Parallel Algorithms and Architectures, June 1999.


Towards a Theory of Cache-Efficient Algorithms (Extended Abstract) - Sen, al.   (Correct)

....influence running times. As argued in the beginning, these factors may be easier to tackle at the level of implementation than algorithm design. Some of the cache problems we observe can be traced to the simple array layout schemes used in current programming languages. It has been shown elsewhere [9, 10, 27] that nonlinear array layout schemes based on quadrant-based decomposition are better suited for hierarchical memory systems. Further study of such array layouts is a promising direction for future research. Acknowledgments We are grateful to Alvin Lebeck for valuable discussions related to ....

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Cache-Oblivious Algorithms - Prokop (1999)   (3 citations)  (Correct)

....layout (Figure 2-1(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size $\Theta(\sqrt{L} \times \sqrt{L})$ are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [18] and [12, 13]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly. For square matrices, the cache complexity $Q(n) = \Theta(n + n^2/L + n^3/(L\sqrt{Z}))$ of the cache-oblivious matrix multiplication algorithm is the same as the cache ....

....cache, the line length must be known. The bit-interleaved layout (Figure 2-1(d)), however, is cache-oblivious and has the same asymptotic behavior as the blocked layout for matrix multiplication. Other cache-oblivious layouts for matrices exist, like the Morton or Hilbert layouts discussed in [12, 13, 18]. Different data layouts can greatly affect the asymptotic behavior of an algorithm. For cache-optimal matrix multiplication, as discussed in Section 2, the tall-cache requirement can be relaxed if matrices are stored in blocked (Figure 2-1(c)) or bit-interleaved order (Figure 2-1(d)). In Section ....

CHATTERJEE, S., LEBECK, A. R., PATNALA, P. K., AND THOTTETHODI, M. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA) (Saint-Malo, France, June 1999).


Recursive Array Layouts and Fast Matrix Multiplication - Chatterjee, Lebeck.. (1999)   (13 citations)  Self-citation (Chatterjee Lebeck Patnala Thottethodi)   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Nonlinear Array Layouts for Hierarchical Memory Systems - Chatterjee, Jain.. (1999)   (50 citations)  Self-citation (Chatterjee Lebeck Thottethodi)   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, Saint-Malo, France, June 1999. To appear.


The Combinatorics of Cache Misses during Matrix Multiplication - Philip Hanlon Dean (2000)   (1 citation)  Self-citation (Chatterjee Lebeck)   (Correct)

....of reference for their performance. In this paper, we focus on an analysis of matrix multiplication, the workhorse of modern linear algebraic algorithms. Our previous studies demonstrated an intimate relationship between the layout of the arrays in memory and the performance of the routine [1, 2]. This early work experimentally showed the benefits of using array layout functions based on interleaving the bits in the binary expansions of the row and column indices of arrays. This paper complements our earlier empirical studies by providing an analytical framework for analyzing the cache ....

....generated in addition to being affine in the LCVs. These two conditions keep everything within the polyhedral model [3], which has been well studied and for which counting algorithms are well known [9]. It is at this point that our work diverges from previous work. Prior empirical evidence [4, 1, 2] suggests that alternative array layout functions such as Morton order [2] provide better cache behavior than canonical layout functions for many dense linear algebra codes. Such layout functions are described in terms of interleavings of the bits in the binary expansions of the array coordinates ....

[Article contains additional citation context not shown here]

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Towards a Theory of Cache-Efficient Algorithms - Sen, Chatterjee, Dumir (1999)   (13 citations)  Self-citation (Chatterjee)   (Correct)

....influence running times. As argued earlier, these factors are more appropriate to tackle at the level of implementation than algorithm design. Several of the cache problems we observe can be traced to the simple array layout schemes used in current programming languages. It has been shown elsewhere [10, 11, 31] that nonlinear array layout schemes based on quadrant-based decomposition are better suited for hierarchical memory systems. Further study of such array layouts is a promising direction for future research. Acknowledgments We are grateful to Alvin Lebeck for valuable discussions related to ....

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Cache-Efficient Matrix Transposition - Chatterjee, Sen (2000)   (5 citations)  Self-citation (Chatterjee)   (Correct)

....$= t_R \cdot t_C \cdot M(t_i, t_j)$, where $M(i, j)$ is the integer whose binary representation is the bitwise interleaving of the binary representations of $i$ and $j$. Then, $L_{MO}(i, j; m, n, t_R, t_C) = t_R \cdot t_C \cdot M(t_i, t_j) + L_{CM}(f_i, f_j; t_R, t_C)$ (2). See Chatterjee et al. [11, 12] for further details and implementation issues for this layout. Like the cache-oblivious algorithm, this algorithm also uses recursion to divide the problem into smaller subproblems until it reaches an architecture-specific tile size, where it performs the exchanges. The code is shown in Figure ....

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.
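
The layout function quoted in the context above (a tile offset obtained by bit interleaving, plus an offset inside the tile) can be sketched as follows. This is an illustration under stated assumptions, not the cited implementation: the names `interleave_bits` and `morton_tiled_offset` are hypothetical, the row index is assumed to occupy the more significant bit of each interleaved pair, elements inside a tile are assumed to be stored column-major, and the matrix dimensions m and n are left out on the assumption that the matrix has already been padded to a whole number of tiles.

```python
def interleave_bits(i: int, j: int) -> int:
    """M(i, j): the integer whose binary form interleaves the bits of i and j."""
    result = 0
    bit = 0
    while (i >> bit) or (j >> bit):
        result |= ((j >> bit) & 1) << (2 * bit)      # bits of j go to even positions
        result |= ((i >> bit) & 1) << (2 * bit + 1)  # bits of i go to odd positions
        bit += 1
    return result


def morton_tiled_offset(i: int, j: int, t_r: int, t_c: int) -> int:
    """Offset of element (i, j) with t_r x t_c tiles arranged in Morton order."""
    t_i, f_i = divmod(i, t_r)  # tile row, row offset inside the tile
    t_j, f_j = divmod(j, t_c)  # tile column, column offset inside the tile
    within_tile = f_j * t_r + f_i  # assumed column-major order inside a tile
    return t_r * t_c * interleave_bits(t_i, t_j) + within_tile
```

With 2 x 2 tiles on a 4 x 4 matrix, the sixteen elements map to sixteen distinct offsets in the range 0 to 15, so each tile occupies four consecutive memory locations.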


Design and Evaluation of a - Linear Algebra Package   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the 11th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Improving the Performance of Morton Layout by Array - Alignment And Loop   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In SPAA '99: Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, New York, June 1999.


Partitioning and Dynamic Load Balancing for the.. - Teresco, Devine..   (Correct)

No context found.

Chatterjee, S., Lebeck, A. R., Patnala, P. K., and Thottethodi, M.: Recursive array layouts and fast parallel matrix multiplication. In ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, (1999)


The Opie Compiler: from Row-major Source to Morton-ordered.. - Gabriel, Wise (2004)   (Correct)

No context found.

Chatterjee, S., Lebeck, A.R., Patnala, P.K., Thottethodi, M.: Recursive array layouts and fast parallel matrix multiplication. IEEE Trans. Parallel Distrib. Syst.


Semi-structured Portable Library for Multiprocessor Servers - Tsilikas, Fleury   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In ACM Symposium on Parallel Algorithms and Architectures, 1999.


Matrix Multiplication Performance on Commodity Shared-Memory .. - Tsilikas, Fleury   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, 1999.


Improving the Performance of Morton Layout by Array.. - Thiyagalingam.. (2003)   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In SPAA '99: Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, New York, June 1999.


Towards a Theory of Cache-Efficient Algorithms - Sen, Chatterjee (1999)   (13 citations)  (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, Saint-Malo, France, June 1999.


Matrix Factorization Using a Block-Recursive Structure and.. - Frens   (Correct)

No context found.

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proc. 11th ACM Symp. on Parallel Algorithms and Architectures, pages 222--231. ACM Press, New York, 1999.
