J. Navarro, A. Juan and T. Lang. MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations. Proceedings of the ACM International Conference on Supercomputing, 1994.
....of incomplete last tiles. Actual codes generated using the presented framework will of course have to use appropriate min/max functions for the loop bounds of intra-tile loops to correctly handle incomplete tiles. If the sizes of all arrays are larger than the cache size #, it can be shown [35] that, for this particular permutation of the tiling loops ##, ##, and ##, the solution to minimize the cost is ## # #, # # # # # # # # # # ### #. Since # is typically much larger than 1, for all practical purposes we can approximate # # #. Of course, the cache capacity constraint has to be ....
J. Navarro, A. Juan and T. Lang. MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations. Proceedings of the ACM International Conference on Supercomputing, 1994.
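The context above notes that generated tiled code needs min/max functions in the bounds of the intra-tile loops so that incomplete last tiles are handled correctly. A minimal Python sketch of that point follows, assuming a square matrix product and a single tile size for all three loops. The names `matmul_tiled`, `matmul_reference`, `n`, and `tile` are illustrative and do not come from the cited papers, and the loop permutation shown is just one possible choice.

```python
def matmul_reference(A, B, n):
    """Untiled C = A * B for n x n lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, n, tile):
    """C = A * B with one level of tiling (hypothetical sketch)."""
    C = [[0.0] * n for _ in range(n)]
    # Tiling loops: step through the iteration space one tile at a time.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Intra-tile loops: min() clips each upper bound so that an
                # incomplete last tile (n not a multiple of tile) stays
                # inside the iteration space.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

With `n = 5` and `tile = 2`, the last tile in each dimension holds a single iteration, and the clipped bounds keep the result identical to the untiled loop nest.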
....is done in a single step using non-unimodular matrices. We also use an order to perform index set splitting that guarantees that each loop in the nest has to be processed only once and also avoids code explosion. 2 Transformation Steps We use the Multilevel Orthogonal Block (MOB) forms [7] to compute the tiles in each level of the memory hierarchy. The MOB forms provide maximum reuse of data in all levels simultaneously and their orthogonality property provides a simple method to optimize the form and the size of the tiles at each level. We will assume that the loops to be ....
J.J Navarro, T. Juan, T. Lang. MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations. Int. Conf. on Supercomputing, July 1994, pp. 354-363
....data structures don't fit in the memory levels near to the processor. Nevertheless, numerical codes usually have spatial and temporal locality properties. Block and multilevel block algorithms have been studied to increase data reuse and to improve the performance of the memory hierarchy [5][15][14]. In order to use computer architectures with several memory levels, it is important to hide the details of the architecture from the user, making the memory hierarchy transparent to the user. Therefore several code transformation techniques have been developed. These techniques aim at exploiting ....
J.J Navarro, T. Juan, T. Lang. MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations. International Conference on Supercomputing, July 1994, pp. 354-363
.... of locality in inner loops, for example by means of windows [10] to drive heuristic strategies for applying blocking, permutation and fusion of loops [10, 8, 7, 11]. Recently, sophisticated multilevel blocking algorithms have been devised that take into account registers and multilevel caches [15]. Other approaches try to construct unimodular loop transformations. In the context of NUMA architectures, Li and Pingali [14] construct a transformation for access normalization in order to localize array references; in this approach (affine) index functions are mapped onto target loop indices ....
J.J. Navarro, A. Juan, and T. Lang. MOB forms: A class of multilevel block algorithms for dense linear algebra operations. In Proc. ICS, 1994.
....loop nests and found that conflict misses comprise half of all cache misses and are the most significant sources of intranest misses. In addition, some work shows that cache interferences can vary wildly with slight variations in problem size and base addresses [Lam et al. 1991; Bacon et al. 1994; Navarro et al. 1994]. However, cache conflicts are difficult to predict and estimate, as they require a detailed analysis of the data mapping in the cache and the data referencing patterns. In fact, there are relatively few studies on analyzing and reducing interference misses. More generally, there has been no work on ....
Navarro, J. J., Juan, T., and Lang, T. 1994. Mob forms: A class of multilevel block algorithms for dense linear algebra operations. In Proceedings of the 1994 International Conference on Supercomputing.
....nests and found that conflict misses comprise half of all cache misses and are the most significant sources of intra-nest misses. In addition, some work shows that cache interferences can vary wildly with slight variations in problem size and base addresses [Lam et al. 1991; Bacon et al. 1994; Navarro et al. 1994]. However, cache conflicts are difficult to predict and estimate, as they require a detailed analysis of the data mapping in cache and the data referencing patterns. In fact, there are relatively few studies on analyzing and reducing interference misses. More generally, there has been no work on ....
Navarro, J. J., Juan, T., and Lang, T. 1994. Mob forms: A class of multilevel block algorithms for dense linear algebra operations. In Proc. Int'l Conf. on Supercomputing (June).
....and conclude in Section 7. Appendix A contains details of tile shape derivation for an example benchmark program. 2 Previous Work Several techniques for exploiting the data cache through loop tiling have been proposed in the past. Loop restructuring techniques to enable tiling are reported in [1, 8, 13, 14], all of which do not address conflict misses that occur in real caches. Lam et al. [6] reported the first work modeling interferences in direct-mapped caches with a study of the cache performance of a matrix multiplication program for different tile sizes. They present an algorithm (LRW) for ....
J. J. Navarro, T. Juan, and T. Lang, "Mob forms: A class of multilevel block algorithms for dense linear algebra operations," Proceedings of the 1994 ACM International Conference on Supercomputing, Manchester, England, June 1994.
....transformations [23][9]. Strip mining partitions one dimension of the iteration space into strips and the interchange determines the order in which the iterations inside the tiles are traversed. To implement Multilevel Tiling several researchers propose applying tiling level by level [7][20][17]. In this paper we present a new way to implement Multilevel Tiling that deals with all levels simultaneously. The goal of our work is to reduce the complexity of computing the final loop bounds in the multilevel tiled code. Our algorithm computes exact loop bounds, that is, loops in the generated ....
....that have steps equal to 1 and therefore they define a convex iteration space. To compute the exact bounds, the Fourier-Motzkin Elimination algorithm is used when the interchange unimodular matrix is applied [2][13]. Multilevel Tiling has been implemented by applying tiling level by level [7][20][17], going from the outermost to the innermost level. In Fig. 1 another level of tiling can be performed by applying tiling again to loops J and I of the resulting code. Let n be the number of loops in the original loop nest and let m (m n) be the number of loops in the code after Multilevel Tiling. ....
J.J Navarro, T. Juan, T. Lang. MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations. International Conference on Supercomputing, July 1994, pp. 354-363
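The contexts above describe multilevel tiling applied level by level, from the outermost to the innermost level, as repeated strip mining plus interchange. A minimal Python sketch of two tiling levels over a two-loop (J, I) nest is given here, assuming rectangular tiles with the same size in both dimensions at each level. The function name `tiled_order` and the parameters `outer` and `inner` are hypothetical, and the sketch only records the visiting order rather than doing any computation.

```python
def tiled_order(n_j, n_i, outer, inner):
    """Return the (j, i) iterations in two-level tiled order.

    outer: tile size of the outer level (e.g. the larger memory level).
    inner: tile size of the inner level (e.g. the smaller memory level).
    """
    visited = []
    # Level-2 tiling loops (outermost level is tiled first).
    for jj2 in range(0, n_j, outer):
        for ii2 in range(0, n_i, outer):
            # Level-1 tiling loops: tile again inside each outer tile.
            for jj1 in range(jj2, min(jj2 + outer, n_j), inner):
                for ii1 in range(ii2, min(ii2 + outer, n_i), inner):
                    # Element loops: bounds are clipped by the inner tile,
                    # the enclosing outer tile, and the iteration space.
                    for j in range(jj1, min(jj1 + inner, jj2 + outer, n_j)):
                        for i in range(ii1, min(ii1 + inner, ii2 + outer, n_i)):
                            visited.append((j, i))
    return visited
```

In this sketch the original nest has n = 2 loops and the tiled code has m = 6 loops. Each added level contributes one more `min` term to the element-loop bounds, which illustrates why computing the final bounds becomes more involved as levels are added.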
.... of locality in inner loops, for example by means of windows [10] to drive heuristic strategies for applying blocking, permutation and fusion of loops [10, 8, 7, 11]. Recently, sophisticated multilevel blocking algorithms have been devised that take into account registers and multilevel caches [15]. Other approaches try to construct unimodular loop transformations. In the context of NUMA architectures, Li and Pingali [14] construct a transformation for access normalization in order to localize array references; (affine) index functions are mapped onto target loop indices in such a way ....
J.J. Navarro, A. Juan, and T. Lang. MOB forms: A class of multilevel block algorithms for dense linear algebra operations. In Proc. ICS, 1994.
....such as low associativity and cache line size on the cache performance of tiled nests. Because of these factors, performance for a given problem size can vary wildly with tile size [LRW91]. In addition, performance can vary wildly when the same tile sizes are used on very similar problem sizes [LRW91, NJL94]. These results occur because low associativity causes interference misses in addition to capacity misses. [Footnote: Original in SIGPLAN 95: Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995. This version contains corrections to the algorithm.] In this paper, we focus on ....
J. J. Navarro, T. Juan, and T. Lang. Mob forms: A class of multilevel block algorithms for dense linear algebra operations. In Proceedings of the 1994 ACM International Conference on Supercomputing, pages 354--363, Manchester, England, June 1994.
....of superscalar operation can therefore be regarded as the most important directions when exploiting the corresponding features of matrix multiplication algorithms. Many advances have been made in both directions, especially through the application of techniques such as blocking ([GaPS90] [LaRW91] [NaJL94] [Nava94]) and software pipelining ([Lam88] [RaST92] [AiNi88]). Blocking reduces data cache misses. However, this reduction is not enough to obtain optimal performance. Despite using blocking techniques, the processor is stalled during a considerable amount of time waiting for data to come from ....
J.J. Navarro, A. Juan and T. Lang, MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations, to appear in Proceedings of the ACM International Conference on Supercomputing, 1994.
No context found.
J.J. Navarro, A. Juan and T. Lang, MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Computations. ACM Int. Conf. Supercomputing, 1994, pp. 354--363.
....the family that we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design. Then we consider cache interferences and, as usually done, to reduce their effect some data blocks are pre-copied into continuous regions of memory [LaRW91] [TeGJ93]. [Footnote 1: A preliminary presentation of MOB forms has been recently published in [NaJV93] and a more complete version in [NaJL94].] In Chapter 5 we evaluate a processor with a real memory system, formed by registers, on-chip cache, and off-chip cache. We also include the effect of TLB misses and page faults. We show the behavior of MOB ....
J.J. Navarro, A. Juan and T. Lang, MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations, to appear in Proceedings of the ACM International Conference on Supercomputing, 1994.
....extracting the maximum performance from linear algebra operations. Many advances have been made in both directions, especially through the application of techniques such as Software Pipelining ([Lam88] [AiNi88] [RaST92]) and Blocking (Tiling) plus Data Precopying ([CaCK90] [GaPS90] [LaRW91] [NaJL94]). We analyze matrix multiplication algorithms for large matrices on a workstation with 2 levels of cache. Although Blocking and data precopying reduce data cache misses, this reduction is not enough to obtain the best attainable performance. Many misses still appear, especially for small, on-chip ....
....multiplication algorithm has three embedded loops defining a three-dimensional iteration space which we represent graphically by means of a DCD. A DCD is a rectangular parallelepiped which represents the iteration space, with the operations inside and the data on the faces. For further details see [NaJL94]. We use the DCD to • show by arrows the order in which the data is accessed and the operations performed, • show by small boxes inside the parallelepiped the resulting subproblems in a block algorithm. Figure 1a shows the DCD for the jki form. 2.2 The Target Architecture All the ....
J.J. Navarro, A. Juan and T. Lang, MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations, Proc. ICS'94, 1994.
.... level (BRL) is a code transformation technique consisting of the application of strip mining [AbuS79] to one or more loops, loop interchange [AlKe84], full loop unrolling [DoHi79], and the elimination of redundant load and store operations [AhSU86]. [Footnote 1: Blocking at the register level [NaJL94] is also known as unroll and jam [CaCK87].] The loops chosen to be strip mined and the final loop ordering define the dimensionality and type of BRL, and the number of iterations of the loops that are fully unrolled defines the block size. In this way, BRL creates a new innermost loop that has fewer ....
....of level m. The notation in this case can be simplified with respect to the notation of non-orthogonal blocking. In this case it can be simplified to ji(k: s m m f m (f ) m. [Figure 2.6: Optimized MOB form.] A detailed description of the MOB forms and the optimization process can be found in [NaJL94]. 2.6 Basic algorithms for sparse matrices In this section we show the most important features and results of the ijk forms for SpMxM operation. The results shown here are those shown in [NaJL94]. The potential locality of $C = C + A \times B$ is worse than that of the dense case. This is because ....
J.J. Navarro, A. Juan and T. Lang, MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations, to appear in Proceedings of the ACM International Conference on Supercomputing, 1994.
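The context above describes blocking at the register level (BRL, also known as unroll and jam) as strip mining, loop interchange, full unrolling, and removal of redundant loads and stores. A minimal Python sketch of a 2 x 2 register block for a dense matrix product follows. It assumes an even matrix order and uses local scalars as stand-ins for registers; the function name `matmul_register_blocked` and the choice of the i and j loops as the strip-mined loops are illustrative assumptions, not the block shape chosen in the cited work.

```python
def matmul_register_blocked(A, B, n):
    """C = A * B with a 2 x 2 register block (sketch; assumes n is even)."""
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even n")
    C = [[0.0] * n for _ in range(n)]
    # The i and j loops are strip mined by 2; their 2-iteration strips are
    # fully unrolled and jammed into the body of the k loop.
    for i in range(0, n, 2):
        for j in range(0, n, 2):
            # Four accumulators play the role of registers holding the block.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Each operand is loaded once and reused twice, which removes
                # the redundant loads of the non-unrolled loop body.
                a0 = A[i][k]
                a1 = A[i + 1][k]
                b0 = B[k][j]
                b1 = B[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            # One store per element of the block, after the k loop.
            C[i][j] = c00
            C[i][j + 1] = c01
            C[i + 1][j] = c10
            C[i + 1][j + 1] = c11
    return C
```

The number of unrolled iterations (2 in each strip-mined loop) is the block size in the sense of the quoted passage. The new innermost k loop issues four loads for four multiply-adds, instead of two loads per multiply-add in the original body.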