W. Li and K. Pingali, "Access Normalization: Loop Restructuring for NUMA Compilers," Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 285--295, 1992.
....model [7] is used to generate code. Each processor runs the same program but accesses different parts of the data. Research in the past years has focused on finding a matrix theory for program transformations to reveal program parallelism [8, 9] or exploit data locality and block transfers [10, 11]. From the specification of the source loop nest and the transformation matrix, a target loop nest is generated with more opportunities to exploit parallelism or data reuse. This step has been solved when unimodular matrices are used [8, 9] and in general, when non-unimodular matrices are ....
....d in the loop. This means that each transformed dependence has to be lexicographically positive in the target IS. 3.1 Data Access Matrix. A linear transformation has to be found in order to match the structure of the loop nests in a program with the reaching data decomposition functions. [11] proposes a representation for array subscripts named Data Access Matrix and its use as a starting point to obtain the transformation matrix. The data access matrix A is an $n \times n$ matrix such that the product yields a vector of n subscripts from array references in the loop. The subscripts and the ....
Li W. and Pingali K., Access Normalization: Loop Restructuring for NUMA Compilers, in Proceedings of the Fifth Int. Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.
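As an illustrative sketch of the data access matrix mentioned in the context above: the snippet below builds the matrix for a hypothetical 3-deep loop nest whose array references use the subscripts j - i, j + k and i. The nest, the constant `ACCESS_MATRIX` and the helper names are assumptions made for illustration and are not taken from the citing paper. Multiplying the matrix by the iteration vector (i, j, k) yields the vector of subscripts.

```python
# Rows of the data access matrix: one row per subscript expression,
# written as coefficients of the loop indices (i, j, k).
# Hypothetical subscripts chosen for illustration: j - i, j + k, i.
ACCESS_MATRIX = [
    [-1, 1, 0],  # j - i
    [0, 1, 1],   # j + k
    [1, 0, 0],   # i
]


def subscripts(access_matrix, iteration):
    """Return the matrix-vector product access_matrix * iteration."""
    return [sum(c * x for c, x in zip(row, iteration)) for row in access_matrix]


def det3(m):
    """Determinant of a 3x3 integer matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


if __name__ == "__main__":
    # Iteration (i, j, k) = (2, 5, 1) touches subscripts j - i = 3, j + k = 6, i = 2.
    print(subscripts(ACCESS_MATRIX, (2, 5, 1)))
    print(det3(ACCESS_MATRIX))
```

In this particular example the determinant is 1, so the matrix is unimodular and invertible over the integers; whether it can be used directly as a transformation still depends on the loop's dependences.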
....4.1 Previous Work. Given a set of T tasks and a set of P processors, the scheduling problem can be informally stated as deciding which processor executes which task when. One solution is to statically partition the work among the processors (Static Scheduling) at compile time [Wol96, Pol88, LP93]. Such schemes have been implemented on NOWs [CLZ95, CR92, Gea94]. Another solution is to hand out tasks one at a time to a free processor requesting work (Simple (Dynamic) Scheduling) [Pol88, ML94]. More complicated dynamic strategies have also been proposed (in some cases application specific) ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. ACM Trans. on Computer Systems, 11(4), November 1993.
....model [7] is used to generate code. Each processor runs the same program but accesses different parts of the data. Research in the past years has focused on finding a matrix theory for program transformations to reveal program parallelism [8, 9] or exploit data locality and block transfers [10, 11]. From the specification of the source loop nest and the transformation matrix, a target loop nest is generated with more opportunities to exploit parallelism or data reuse. This step has been solved when unimodular matrices are used [8, 9] and in general, when non-unimodular matrices are ....
....d in the loop. This means that each transformed dependence has to be lexicographically positive in the target IS. 3.1 Data Access Matrix. A linear transformation has to be found in order to match the structure of the loop nests in a program with the reaching data decomposition functions. [11] proposes a representation for array subscripts named Data Access Matrix and its use as a starting point to obtain the transformation matrix. The data access matrix A is an $n \times n$ matrix such that the product yields a vector of n subscripts from array references in the loop. The subscripts and the order ....
Li W. and Pingali K., Access Normalization: Loop Restructuring for NUMA Compilers, in Proceedings of the Fifth Int. Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.
....1.1: The uniform memory access (UMA) model of a multiprocessor ($P_i$: processor, $MMU_i$: memory management unit, $C_i$: cache memory, $M_i$: memory). In the non-uniform memory access model (NUMA) the shared memory is physically distributed among the N processing nodes as shown in figure 1.2 [96] [102]. The physical memories in each processor together form the logical shared memory. The interconnection network connects processing elements containing processors and memory banks among themselves so that any processor may access a remote memory physically located within another element through the ....
Wei Li and Keshav Pingali. Access Normalization: Loop Restructuring for NUMA Compilers. ACM SIGPLAN NOTICES, 27(9):285--295, September 1992. Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V).
....way the loop nest according to the required unimodular transformation. These methods give the exact bounds for all the loops in the nest. Recently, some loop transformations whose objective is to reduce communication or to use the memory hierarchy efficiently are non-unimodular transformations [8] [14]. In the case of a non-unimodular transformation, the techniques proposed in [2] and [23] do not compute the exact bounds of the loops and some conditional statements must be included to determine if a given iteration of the loop nest must be executed [10]. These conditional statements ....
....non-unimodular matrix may be required for this objective. The access ordering to the elements of a vector or matrix can be modified through a linear transformation to optimize the use of the memory hierarchy. This type of code transformation has been called access normalization by Li and Pingali [8]. In some cases, the required transformation matrix has to be non-unimodular. Some temporal transformations such as slow down and retiming are useful for systolic algorithm partitioning. These transformations can be represented through non-unimodular matrices [21]. Some authors have studied the ....
W. Li and K. Pingali, "Access Normalization: Loop Restructuring for NUMA Compilers". Technical Report TR92-1278, Department of Computer Science, Cornell University, April 1992.
....The lexicographical order $\prec$ on $\mathbb{Z}^d$ is defined in terms of this relation: $x \prec y \Leftrightarrow \exists\, 1 \le k \le d : x \prec_k y$. We have $x \preceq y$ if either $x \prec y$ or $x = y$. The relations $\succ_k$, $\succ$ and $\succeq$ are defined similarly. In this paper, we will use the framework of unimodular transformations [3, 9, 13]. Given a perfectly nested loop with stride 1 DO loops, index vector $I = (I_1, \ldots, I_d)$ and iteration space $IS \subseteq \mathbb{Z}^d$, any combination of loop interchanging, loop skewing and loop reversal (see e.g. [17, 20, 21]) that transforms this loop into another loop with index vector $I' =$ ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proc. 5th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pages 285--295, 1992.
....occurring in the original loop are represented by a set of distance vectors $D \subset \mathbb{Z}^d$. We have $d \succ 0$ for all $d \in D$, and application of a transformation defined by $U$ is valid if $Ud \succ 0$ still holds for all $d \in D$. A detailed presentation of unimodular transformations can be found in [2, 3, 16, 19, 24]. 3.2 Objective of Reshaping. Let the subscripts of an occurrence of a 2-dimensional array in a perfectly nested loop with index vector $I = (I_1, \ldots, I_d)^T$ be represented by the following affine transformation $F : \mathbb{Z}^d \to \mathbb{Z}^2$, where $v$ is an integer vector and $W$ a $2 \times d$ ....
Wei Li and Keshav Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 285--295, 1992.
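The validity condition quoted in the context above (a transformation U is valid if U d remains lexicographically positive for every distance vector d) can be sketched in a few lines of Python. The function names and the sample matrices below are illustrative assumptions, not code from any of the cited papers.

```python
def lex_positive(vec):
    """True if the first nonzero entry of vec is positive."""
    for x in vec:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive


def transform(matrix, vec):
    """Matrix-vector product, returned as a tuple."""
    return tuple(sum(c * x for c, x in zip(row, vec)) for row in matrix)


def is_valid(U, distances):
    """U is valid if every transformed distance vector stays lexicographically positive."""
    return all(lex_positive(transform(U, d)) for d in distances)


# Two classic 2-D unimodular transformations (both have determinant +1 or -1).
INTERCHANGE = [[0, 1], [1, 0]]
SKEW = [[1, 0], [1, 1]]

if __name__ == "__main__":
    print(is_valid(INTERCHANGE, [(1, -1)]))        # (1, -1) maps to (-1, 1)
    print(is_valid(SKEW, [(1, -1)]))               # (1, -1) maps to (1, 0)
    print(is_valid(INTERCHANGE, [(1, 0), (0, 1)]))
```

With the distance vector (1, -1), interchange reverses the dependence and is rejected, while skewing the inner loop by the outer one keeps it positive.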
....have focussed on improving the cache performance of numerical programs. Most of the previous work explores the techniques to reduce capacity misses in scientific loops [Gannon et al. 1988; Irigoin and Triolet 1988; Wolfe 1989; Eisenbeis et al. 1990; Wolf and Lam 1991; Carr and Kennedy 1992; Li and Pingali 1992; Banerjee 1993; McKinley et al. 1996; Carr and Lehoucq 1995] For example, several researchers have described the popular technique of loop tiling to reduce capacity misses [Wolfe 1989; Carr and Kennedy 1992; Carr and Lehoucq 1995; Irigoin and Triolet 1988; Eisenbeis et al. 1990; Kodukula et al. ....
Li, W. and Pingali, K. 1992. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems.
....are possible if the same chunk is mapped to the same PE each time [389, 555]. This is a special case of affinity scheduling, where the affinity is between successive executions of the loop. Likewise, performance improvements are possible if iterations are reordered in the interest of locality [366]. It is debatable whether allocating a number of threads at a time can be applied to scheduling from a global queue by an operating system, because there is no reason to expect the different threads to be homogeneous. However, it is possible to apply this optimization in the special case when a ....
W. Li and K. Pingali, "Access normalization: loop restructuring for NUMA compilers". In 5th Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., pp. 285--295, Oct 1992.
....optimization studies, the inherent data locality characteristics of programs and our ability to exploit them. Our work is applicable to a wider range of programs because we do not require perfect nests or nests that can be made perfect with conditionals [Ferrante et al. 1991; Gannon et al. 1988; Li and Pingali 1992; Wolf and Lam 1991]. It is also quicker, both in the expected and worst case. Previous research focused on evaluating data locality when given a loop permutation [Ferrante et al. 1991; Gannon et al. 1988]. Since they must evaluate a given permutation, they may consider up to n loop permutations ....
....than they do. In addition, their cache optimizations degrade six programs/routines, in one case by 20%. We degrade only one program by a slight 2%: Applu from the NAS Benchmarks. In Wolf and Lam's experiments, skewing was never needed, and reversal was seldom applied [Wolf 1992]. We therefore chose not to include skewing, even though (1) it is implemented in our system [Kennedy et al. 1993] and (2) our model can drive it. We did integrate reversal, but it did not help to improve locality. Li and Pingali [1992] use linear transformations (any linear mapping from one loop ....
Li, W. and Pingali, K. 1992. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York.
....mechanisms are available. In this work, the authors characterize recurrences by their inherent parallelism. This property can be used to choose the recurrence that has to be embedded by the transformation matrix and other dependences that have to be eliminated by means of the alignment components. [16] presents a method to obtain the transformation matrix T from user-specified data distributions, in a language like FORTRAN D, in order to exploit both locality and block transfers. The method presented in this paper can also be used to extend the work developed in [16] to obtain the ....
....components. [16] presents a method to obtain the transformation matrix T from user-specified data distributions, in a language like FORTRAN D, in order to exploit both locality and block transfers. The method presented in this paper can also be used to extend the work developed in [16] to obtain the alignment components from the user-specified alignment directives. 7. Acknowledgements. This work has been supported by the Ministry of Education of Spain under contract TIC 880 92, by the ESPRIT Basic Research Action 6634 APPARC and by the CEPBA (European Center for Parallelism of ....
Li W. and Pingali K., Access Normalization: Loop Restructuring for NUMA Compilers, in Proceedings of the Fifth Int. Conference on Architectural Support for Programming Languages and Operating Systems, Boston
....data accesses. Techniques to improve data cache performance typically target and model locality characteristics found in loop nests. For example, software and hardware prefetching exploit the spatial locality of regular accesses in loop nests [CB95, CKP91, Dra95, KL91, MLG92] Many researchers [LP92, MCT96, MLG92, WL91] model data locality by distinguishing four categories of locality which they use to drive loop optimizations: spatial reuse of adjacent locations in a cache block; temporal reuse of the same location; self reuse from the same data reference; and group reuse from ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 285--295, Boston, MA, October 1992.
....capacity misses and interference (or conflict) misses. Capacity misses occur when cache space is insufficient to store all data to be reused. Interference misses occur when two data are mapped to the same cache location (in direct mapped caches). The object of data locality optimizing algorithms [2, 3, 4, 5, 6] is to reduce capacity misses only. While they are mostly used for cache performance optimization, most of them are designed for local memories, in the sense that they ignore cache interferences due to the mapping function used in caches. Cache awareness in blocking techniques is generally ....
W. Li and K. Pingali. Access Normalization: Loop Restructuring for NUMA Compilers. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 285--295, September 1992.
....performance as a result of the transformations. In both cases, the speedup is due to improved cache behavior; both programs show significant reductions in both L1 and L2 misses overall. 5 Related work A number of researchers have developed compiler techniques useful for improving cache behavior [1, 6, 8, 17, 20]. Almost all of these techniques apply to individual loop nests, however, and are not designed to detect or exploit cross loop reuse. Two exceptions are loop fusion and affinity regions. Kennedy and McKinley have proposed using loop fusion to improve locality and cache behavior [16] In a ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), Boston, MA, October 1992.
....and scaling. The major advantage of our approach is that there is a simple completion algorithm which, given a partial transformation matrix, produces a complete transformation matrix respecting all program dependences. We are using this approach to transform loops for locality on NUMA machines [18], as well as to restructure loops for execution on long instruction word machines. We are applying these techniques to benchmark programs from the molecular simulation community including Professor Keith Gubbins of the Chemical Engineering Department and Professor Michael Teeter of the Applied ....
[Figure 1: Pnuma System Overview. Labels: Analysis; Code Generation; BBN Butterfly; KSR1; Transformations; Access Normalization; Conventional Optimizations; Lambda Toolkit] ....opportunities for parallel execution and for block transfers, while keeping data accesses local wherever possible [18]. Loop restructuring is followed by a code generation phase that generates parallel code and makes use of block transfers [17]. Compiling for distributed memory machines is the goal of a number of projects [2, 13, 39, 32, 16, 15, 25, 14]. Existing work focuses on code generation techniques, not on ....
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. In Proc. 5th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.
....as a result of the transformations. In both cases, the improvement is due to improved cache behavior; both programs show significant reductions in both L1 and L2 misses overall. 6.5 Related work. A number of researchers have developed compiler techniques useful for improving cache behavior [8, 16, 19, 66, 98]. Almost all of these techniques apply to individual loop nests, however, and are not designed to detect or exploit cross loop reuse. Two exceptions are loop fusion and affinity regions. [Table: Program; loops; fusion (fused, candidates); reversal (reversed, candidates). applu: 168, 2, 4, 20, 111. apsi: 298, 1, 2, 5, 150 ....]
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), Boston, MA, October 1992.
.... However, relatively little attention has been given to array restructuring; most research effort so far has focused on loop restructuring techniques, including both individual transformations and, more recently, unifying analysis frameworks (such as one using nonsingular matrices [Li and Pingali 1993c]). To facilitate comparison with our work, we defer the discussion of these contributions to Chapter 9, which surveys the extensive, long standing work on loop restructuring and some recent progress in array restructuring. This dissertation fills the void in the understanding of array ....
....we must first represent the abstract notion of access pattern in a concrete way that allows formal analysis. For this purpose, we use a linear algebraic framework that others have found useful for analyzing certain loop transformations and their effects on array accesses [Li and Pingali 1993c; Wolf and Lam 1991b] A tutorial is given here for completeness. Using the illustrative example in Figure 2.1, we first discuss the representations of iterations and array elements and finally those of array accesses. Consider an n deep perfect loop nest such as the one at the top of Figure 2.1, ....
Wei Li and Keshav Pingali. "Access Normalization: Loop Restructuring for NUMA Compilers." ACM Transactions on Computer Systems 11(4):353-75 (November 1993).
....SPEC92 benchmarks miss rates were no more than 3%, many practically zero. Table 7.1: Loops for Experiments (columns: Loop; Description; Related Studies). MATMUL: Simple dense matrix multiply. Its innermost loop computes one element of the result matrix. Related studies: Cierniak and Li 1995; Kennedy and McKinley 1992; Li and Pingali 1992; Wolf and Lam 1991a. SYR2K: Symmetric rank 2k update for banded matrices. It computes .... Related studies: Li 1993; Li 1995; Li and Pingali 1992. MXM: Hand tuned matrix multiply. Its outermost loop is unrolled four times and jammed. Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992. GMTRY: Gaussian elimination. It sets ....
....Table 7.1 (columns: Loop; Description; Related Studies). MATMUL: Simple dense matrix multiply. Its innermost loop computes one element of the result matrix. Related studies: Cierniak and Li 1995; Kennedy and McKinley 1992; Li and Pingali 1992; Wolf and Lam 1991a. SYR2K: Symmetric rank 2k update for banded matrices. It computes .... Related studies: Li 1993; Li 1995; Li and Pingali 1992. MXM: Hand tuned matrix multiply. Its outermost loop is unrolled four times and jammed. Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992. GMTRY: Gaussian elimination. It sets up a linear system for a vortex method solution and inverts the resulting matrix using Gaussian elimination without ....
Wei Li and Keshav Pingali. "Access Normalization: Loop Restructuring for NUMA Compilers." In Proceedings of Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, (Boston, MA, October), 285-95. New York: ACM Press, 1992.
....various memories and then assign loop iterations to individual machines such that most memory references are local. Accessing data not on that machine is a remote access. These must be kept to a minimum in order to achieve good performance. Data access normalization as described by Li and Pingali [16, 17] is a transformation that reorganizes the iteration space of a loop nest in order to reduce the number of remote accesses required to fetch data. It also allows individual fetches to be combined into a single larger transfer whenever possible. Figure 1 shows an example of a loop before and after ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. Technical Report CTC92TR99, Cornell Theory Center, July 1992.
....various memories and then assign loop iterations to individual machines such that most memory references are local. Accessing data not on that machine is a remote access. These must be kept to a minimum in order to achieve good performance. Data access normalization as described by Li and Pingali [16, 17] is a transformation that reorganizes the iteration space of a loop nest in order to reduce the number of remote accesses required to fetch data. It also allows individual fetches to be combined into a single larger transfer whenever possible. Figure 1 shows an example of a loop before and after ....
(a) Un-normalized loop: .... j = i+1, i+jb; DO k = 1, N2; B(i, j-i) = B(i, j-i) + A(i, j+k); ENDDO; ENDDO; ENDDO; end. (b) Normalized loop: .... do v = u+2, u+n1+n2; do w = max(1, v-u-n2), min(n1, v-u-1); b(w, u) = b(w, u) + a(w, v); enddo; enddo; enddo; end. Figure 1: Case related to Figure 1 in Li and Pingali [16]. Shown is a loop nest (a) before and (b) after data access normalization. We have corrected the generated loop bounds of the innermost loop to correspond to Lambda Toolkit correct results. Assume that the outermost loops are distributed but that the innermost loop is executed by a single ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. ASPLOS '92, 1992.
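The relationship between the two loop nests in the figure quoted above can be checked by brute force. The sketch below assumes the substitution u = j - i, v = j + k, w = i and 1-based bounds; the outermost ranges (i from 1 to n1, u from 1 to jb) are truncated in the excerpt and are assumptions here, as are all function names. Under those assumptions, the inner bounds w = max(1, v - u - n2) to min(n1, v - u - 1) make the normalized nest touch exactly the same (B element, A element) pairs as the original.

```python
from collections import Counter


def unnormalized(n1, n2, jb):
    """Accesses of nest (a). The range i = 1..n1 is an assumption."""
    seen = Counter()
    for i in range(1, n1 + 1):
        for j in range(i + 1, i + jb + 1):
            for k in range(1, n2 + 1):
                # Record (B subscripts, A subscripts) for this iteration.
                seen[((i, j - i), (i, j + k))] += 1
    return seen


def normalized(n1, n2, jb):
    """Accesses of nest (b). The range u = 1..jb is an assumption."""
    seen = Counter()
    for u in range(1, jb + 1):
        for v in range(u + 2, u + n1 + n2 + 1):
            lo = max(1, v - u - n2)
            hi = min(n1, v - u - 1)
            for w in range(lo, hi + 1):
                seen[((w, u), (w, v))] += 1
    return seen


def same_accesses(n1, n2, jb):
    """True if both nests perform the same multiset of array accesses."""
    return unnormalized(n1, n2, jb) == normalized(n1, n2, jb)


if __name__ == "__main__":
    print(same_accesses(4, 3, 2))
```

The bounds follow from inverting the substitution: k = v - u - w must lie in 1..n2, which gives v - u - n2 <= w <= v - u - 1, intersected with 1 <= w <= n1.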
....on a variety of architectures and machines the compiler itself should be as portable as possible. Instead of generating native code, we agree with [10] and recommend to generate C augmented with machine dependent macros. The experiences of problem oriented compilation for parallel machines [8, 9, 13] show that many important optimizations rely on high level restructuring techniques. To name but a few, these optimizations include automatic data and computation alignment, automatic data distribution, and automatic synchronization barrier elimination (see Section 5 for details) The desired ....
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. In 5th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.
....lexicographical order $\prec$ on $\mathbb{Z}^d$ is defined in terms of this relation: $x \prec y \Leftrightarrow \exists\, 1 \le k \le d : x \prec_k y$. We have $x \preceq y$ if either $x \prec y$ or $x = y$. The relations $\succ_k$, $\succ$ and $\succeq$ are defined similarly. In this paper, we will use the framework of unimodular transformations [3, 9, 13]. Given a perfectly nested loop with stride 1 DO loops, index vector $I = (I_1, \ldots, I_d)$ and iteration space $IS \subseteq \mathbb{Z}^d$, any combination of loop interchanging, loop skewing and loop reversal (see e.g. [17, 20, 21]) that transforms this loop into another loop with index vector $I' =$ ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proc. 5th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pages 285--295, 1992.
W. Li and K. Pingali, "Access normalization: Loop restructuring for NUMA compilers," ACM Trans. Comput. Syst., 1993.
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
W. Li and K. Pingali. Access normalization: loop restructuring for NUMA compilers. Technical Report TR92-1278, Department of Computer Science, Cornell University, 1992.
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
Li, W., and Pingali, K. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems 11, 4 (Nov. 1993), 353--375.
W. Li and K. Pingali. Access normalization: loop restructuring for NUMA compilers. Technical Report TR92-1278, Department of Computer Science, Cornell University, 1992.
....compile time model and decision methodology (Section 4) and describe the hybrid compile and run time system (Section 5). Finally, we present the modeling and experimental results (Section 6) and our conclusions (Section 7). 2. RELATED WORK. Compile time static loop scheduling has been well studied [9, 14]. Static scheduling for heterogeneous NOWs was proposed in [4, 5, 7]. The task queue model for dynamic loop scheduling has targeted shared memory machines [11, 14] while the diffusion model has been used for distributed-memory machines [10]. A method for task level scheduling in heterogeneous ....
Li, W., and Pingali, K. Access normalization: Loop restructuring for NUMA compilers. ACM Trans. Comput. Systems 11, 4 (Nov. 1993).
....been proposed for choosing tile sizes [5, 8, 9, 15, 21] Tiling changes the order in which loop iterations are performed, so it is not always legal to tile a loop nest. If tiling is not legal, it may be possible to perform linear loop transformations like skewing and reversal to enable tiling [2, 4, 16, 23, 26]. This technology has been incorporated into production compilers such as the SGI MIPSPro compiler, enabling these compilers to produce good code for perfectly nested loops. In real programs though, many loop nests are imperfectly nested (that is, one or more assignment statements are contained ....
....execution order. The loops are grouped into layers and loops within each layer are fully permutable within the layer and can be tiled. Dependence information for this loop nest can be summarized using directions and distances, and standard techniques for locality enhancement like height reduction [16] can be applied. After this, redundant dimensions are eliminated, fully permutable loops are tiled, and code is generated using well understood techniques [2, 12] 3.3.1 Picking Good Embedding Functions As far as tiling is concerned, any solution to the system S created by the algorithm in the q ....
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
....been proposed for choosing tile sizes [5, 7, 8, 15, 21]. Tiling changes the order in which loop iterations are performed, so it is not always legal to tile a loop nest. If tiling is not legal, it may be possible to perform linear loop transformations such as skewing and reversal to enable tiling [2, 4, 16, 23, 26]. This technology has been incorporated into production compilers such as the SGI MIPSPro compiler, enabling these compilers to produce good code for perfectly nested loops. In real programs though, many loop nests are imperfectly nested (that is, one or more assignment statements are contained in ....
....execution order. The loops are grouped into layers and loops within each layer are fully permutable within the layer and can be tiled. Dependence information for this loop nest can be summarized using directions and distances, and standard techniques for locality enhancement like height reduction [16] can be applied. After this, redundant dimensions are eliminated, fully permutable loops are tiled, and code is generated using well understood techniques [2, 12] 3.3.1 Picking good embedding functions As far as tiling is concerned, any solution to the system created by the algorithm in the ....
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
....that have been used in the literature for locality enhancement. Furthermore, the product space itself can be viewed as a perfectly nested loop nest (although one which has many redundant dimensions as we discuss later in this paper) so locality enhancement techniques such as height reduction [16] can be applied to the resulting loop nest; when possible, this loop nest can also be made fully permutable, enabling it to be tiled. Finally, code is generated after projecting out redundant dimensions, using standard techniques from polyhedral algebra; this code generation process may produce ....
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
....particularly interesting in isolation, but combined with the other transformations, it lets us do wholesale loop restructuring for NUMA architectures. The algorithm for generating a restructured program starting from a loop nest and an invertible mapping is given in the associated technical report [24]. This algorithm is non-trivial since the new loop nest must traverse points in the new iteration space in lexicographic order, and the starting point, ending point and step size of a loop in the restructured loop nest can depend on only the loop indices of outer loops (for instance, these values ....
....Definition 5.1 The basis matrix of a data access matrix A is the first row basis of A. The algorithm described informally above is simple, but it is expensive to keep checking rows for independence. A more efficient algorithm is obtained by using a variation of computing the Hermite normal form [24]. A detailed understanding of this algorithm is not important for reading the rest of the paper, so we give an informal description of what it does. Given a data access matrix, Algorithm BasisMatrix returns a permutation matrix P, and the rank d of the data access matrix (the number of linearly ....
W. Li and K. Pingali. Access normalization: loop restructuring for NUMA compilers. Technical Report TR92-1278, Department of Computer Science, Cornell University, 1992.
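The simple (and, as the context above notes, comparatively expensive) way to find a first row basis is to scan the rows top to bottom and keep each row that is linearly independent of the rows already kept. The sketch below illustrates that reading of the definition using exact rational arithmetic; it is not the Hermite-normal-form variant the paper refers to, and the function name and sample matrix are assumptions for illustration.

```python
from fractions import Fraction


def first_row_basis(matrix):
    """Return indices of the rows kept when scanning top to bottom and
    keeping every row independent of the rows kept so far."""
    reduced = []  # list of (pivot_column, reduced_row) for kept rows
    kept = []
    for index, row in enumerate(matrix):
        r = [Fraction(x) for x in row]
        # Eliminate the pivot column of every previously kept row, in order.
        for pivot_col, basis_row in reduced:
            if r[pivot_col] != 0:
                factor = r[pivot_col] / basis_row[pivot_col]
                r = [x - factor * y for x, y in zip(r, basis_row)]
        # Anything left over means the row is independent of the kept rows.
        pivot = next((c for c, x in enumerate(r) if x != 0), None)
        if pivot is not None:
            reduced.append((pivot, r))
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Hypothetical 5x3 data access matrix with two dependent rows.
    A = [
        [1, 0, 0],
        [2, 0, 0],  # multiple of row 0
        [0, 1, 1],
        [1, 1, 1],  # row 0 + row 2
        [0, 0, 1],
    ]
    kept = first_row_basis(A)
    print(kept, "rank =", len(kept))
```

The number of kept rows is the rank of the matrix, which matches the rank d that the excerpt says Algorithm BasisMatrix returns.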
....and array data are mapped to a hyperplane. The loop iterations are then partitioned into optimal hyperparallelepiped tiles that minimize the interprocessor communication. This scheme, however, only considers a single loop nest. Li and Pingali take a different approach to the assignment problem [LP93]. Access patterns of arrays in each loop nest are summarized in a data access matrix. After calculating or estimating an invertible matrix of the data access matrix, they use it as a basis for transforming and normalizing the loop nest. This access normalization process transforms the loop nest ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. ACM Trans. on Computer Systems, 11(4), November 1993.
....compiler for NUMA machines being developed at Cornell. Pnuma incorporates a novel loop restructuring strategy called access normalization which restructures loop nests to expose opportunities for parallel execution and for block transfers, while keeping data accesses local wherever possible [8]. Loop restructuring is followed by a code generation phase that generates parallel code and makes use of block transfers [7] Compiling for distributed memory machines is the goal of a number of projects [2, 3, 14, 12, 6, 5, 11, 4] Existing work focuses on code generation techniques, not on loop ....
....the data access matrix. The transformation matrix needs to be both non-singular and satisfy the data dependencies. In general, the data access matrix does not qualify as the transformation matrix. The algorithm for deriving the transformation matrix from the data access matrix is presented in [8]. (a) Original Program: for i = 0, N1 - 1; for j = i, i + b - 1; for k = 0, N2 - 1; B[i, j - i] = B[i, j - i] + A[i, j + k]. (b) Transformed Program: for u = 0, b - 1; for v = u, u + N1 + N2 - 2; for w = 0, N1 - 1; B[w, u] = B[w, u] + A[w, v]. 0 1 0 0 -1 1 0 0 1 1 1 A 0 ....
[Article contains additional citation context not shown here]
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. In Proc. 5th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.
....programs into programs that exhibit good locality. A well developed theory exists for perfectly nested loops (loop nests in which all assignment statements are contained in the innermost loop) and this theory recommends the use of linear loop transformations like permutation, followed by tiling [1, 3, 6, 9, 11, 13]. Locality enhancement for imperfectly nested loops is less well understood. Faced with imperfectly nested loops, compilers like the SGI MIPSPro try to apply other .... (Footnote 0: This work was supported by NSF grants CCR 9720211, EIA9726388 and ACI 9870687. Corresponding author: pingali@cs.cornell.edu)
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
No context found.
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA computers. ACM Trans. on Computer Systems, 11(4), November 1993.
....and by showing that efficient code can be generated through the use of the Hermite normal form decomposition. A paper describing these results won the best paper prize at ASPLOS 1992 [65]; this paper was also selected for publication in a special issue of ACM Transactions on Computer Systems [66]. Pingali and his students have used these techniques to restructure programs to expose parallelism for coarse grain architectures, and to enhance locality of reference in programs running on machines with caches. These ideas are being incorporated into a production FORTRAN compiler by Peter ....
....in their series on Lecture Notes in Computer Science. The following publications acknowledge support from one or both of the NSF grants listed above, and are listed in reverse chronological order. This list omits a few papers Pingali wishes he had not published. [9], [58], [68], [96], [84], [66], [59], [52], [65], [67], [56], [10], [55]. 7.2.2 Development of Human Resources. During the current funding period, the following graduate students received doctoral degrees: Anne Rogers (assistant professor, Princeton University), Micah Beck (assistant professor, University of Tennessee at ....
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 11(4):353--375, November 1993.
....eliminate unnecessary computations, and then chooses an appropriate representation for the sparse matrices. Since the input to the compiler is a dense matrix program, the ability to analyze and restructure programs, for example by using the Lambda loop transformation tool kit developed at Cornell [65], is retained. However, the transformation to sparse code requires information about the sparsity structure of arrays. To get this information, we are integrating our compiler with a discretization system being written by Rich Zippel. This system uses Paul Chew's mesh generator, and incorporates ....
....these techniques by showing that general non-singular matrices can be used to model these loop transformations, and by showing that efficient code can be generated through the use of the Hermite normal form decomposition. A paper describing these results won the best paper prize at ASPLOS 1992 [65]; this paper was also selected for publication in a special issue of ACM Transactions on Computer Systems [66]. Pingali and his students have used these techniques to restructure programs to expose parallelism for coarse grain architectures, and to enhance locality of reference in programs ....
[Article contains additional citation context not shown here]
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Fifth Architectural Support for Programming Languages and Operating Systems, pages 285--295. ACM Press, 1992.
....through transformations provide an attractive alternative. The restructuring compiler community has devoted much attention to the development of such technology. The most important transformation is Mike Wolfe's iteration space tiling [24] preceded by linear loop transformations if necessary [4, 17, 23]. This approach is restricted to perfectly nested loops, although it can be extended to imperfectly nested loops if they are first transformed into perfectly nested loops. A loop in a loop nest is said to carry reuse if the same data is touched by multiple iterations of that loop for fixed outer ....
....result is useful to determine how far to carry the process of taking Cartesian products. We assume that all array access functions are linear functions of loop variables (if the functions are affine, we drop the constant terms); if so, they can be written as $F I$ where $F$ is the data access matrix [17] and $I$ is the vector of iteration space variables of loops surrounding this data reference. Theorem 2. For a given statement $S$, let $F_1, \ldots, F_n$ be the access matrices for the shackled data references in this statement. Let $F_{n+1}$ be the access matrix for an unshackled reference in $S$. Assume ....
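As an illustration of the product of an access matrix with the iteration vector described in the excerpt above, the following minimal sketch computes the subscripts of one array reference. The function name `apply_access_matrix` and the reference `X[i + j, j]` are hypothetical and chosen only for this example.

```python
def apply_access_matrix(F, I):
    """Return the subscript vector F * I for one array reference."""
    return [sum(coeff * var for coeff, var in zip(row, I)) for row in F]


# Hypothetical reference X[i + j, j] inside loops (i, j).
# Each row of F holds the loop-variable coefficients of one subscript.
# An affine reference such as X[i + j + 1, j] would give the same F,
# because the constant term is dropped.
F = [[1, 1],
     [0, 1]]

subscripts = apply_access_matrix(F, [2, 3])  # iteration i = 2, j = 3
```

For the iteration (i, j) = (2, 3) this evaluates the two subscripts i + j and j.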
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
....stored. To generate the relational query for computing the set of sparse loop iterations, it is useful to define the following vectors and matrices. $H = \begin{pmatrix} I \\ F_0 \\ \vdots \\ F_{n-1} \end{pmatrix}$, $a = \begin{pmatrix} i \\ a_0 \\ \vdots \\ a_{n-1} \end{pmatrix}$, $f = \begin{pmatrix} 0 \\ f_0 \\ \vdots \\ f_{n-1} \end{pmatrix}$ (3). Following [9], the matrix $H$ is called a data access matrix. Notice that the following data access equation holds: $a = f + H i$ (4). Furthermore, we view the arrays $A_k$ as relations with the following attributes: $a_k$, which stands for the vector of array indices; $v_k$, which is the value of $A_k(a_k)$. In ....
Wei Li and Keshav Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 11(4):353--375, November 1993.
....data are mapped to a hyperplane. The loop iterations are then partitioned into optimal hyperparallelepiped tiles that minimize the interprocessor communication. This scheme, however, only considers a single loop nest. Li and Pingali take a different approach to the data and task allocation problem [9]. Access patterns of arrays in each loop nest are summarized in a data access matrix. After calculating or estimating an invertible matrix of the data access matrix, they use it as a basis for transforming and normalizing the loop nest. This access normalization process transforms the loop nest ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA computers. ACM Trans. on Computer Systems, 11(4), November 1993.
....In Section 6, we present simulation results and summarize this work. 2 Related Work. This work targets shared memory multiprocessors, which is different from previous works by J. Li and M. Chen [20] and by U. Kremer et al. [16] which target distributed memory multicomputers. W. Li and K. Pingali [21] consider loop transformations to enhance data locality, whereas we consider data allocation in conjunction with loop scheduling. Agarwal et al. [1] consider a single loop nest and we consider the whole program. Our work shares several aspects with the works done by the Stanford SUIF group: We ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA computers. ACM Trans. on Computer Systems, 11(4), November 1993.
.... model and decision methodology (section 4) and describe the hybrid compile and run time system (section 5). Finally, we present the modeling and experimental results (section 6) and our conclusions (section 7). 2 Related Work. Compile time static loop scheduling has been well studied [14, 9]. Static scheduling for heterogeneous NOWs was proposed in [5, 4, 7]. The task queue model for dynamic loop scheduling has targeted shared memory machines [14, 11] while the diffusion model has been used for distributed-memory machines [10]. A method for task level scheduling in heterogeneous ....
W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. ACM Trans. on Computer Systems, 11(4), Nov. 1993.
No context found.
W. Li and K. Pingali, "Access Normalization: Loop Restructuring for NUMA Compilers, " Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 285--295, 1992.
No context found.
W. Li and K. Pingali. Access normalization: Loop restructuring for numa compilers. In ACM TOCS, 1993.
No context found.
Li W, Pingali K. Access normalization: Loop restructuring for NUMA compilers. Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 1992; 285--295.
No context found.
W. Li, K. Pingali. "Access Normalization: Loop Restructuring for NUMA Compilers". TR 92-1278, Dept. of Computer Science, Cornell University, 1992.
No context found.
W. Li, and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, November 1993.