Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In Proceedings of the 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993.
....of computation and/or communication requirements. This problem can be reduced to the computation of the integer volume of a polyhedron, but this is in general too complex. Fortunately, in most practical cases we only need an estimate of this volume, and heuristics like the one proposed in [1] are sufficient. We have a preliminary 2-D implementation to estimate the integer area, which extends the work in [1] by handling affine loop bounds. The integer area is approximated by the real area plus half the points on the boundary. [Footnote: Thanks to Brian L. Evans at Georgia Tech for providing the ....]
....volume of a polyhedron, but this is in general too complex. Fortunately, in most practical cases we only need an estimate of this volume, and heuristics like the one proposed in [1] are sufficient. We have a preliminary 2-D implementation to estimate the integer area, which extends the work in [1] by handling affine loop bounds. The integer area is approximated by the real area plus half the points on the boundary. Using a Mathematica package [footnote: Thanks to Brian L. Evans at Georgia Tech for providing the LatticeTheory package], we obtain all the vertices of the polyhedron from the ....
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In Proceedings of the 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993.
....the partitioning and scheduling algorithms need to consider the communication cost among processors. For example, Agarwal et al. present a theoretical framework to derive the shapes of the iteration space partitions of do loops to minimize the traffic in multiprocessors with local memory [2]. But they assume that iterations can be executed in parallel and that the local memory of each processor is large enough for each processor's computation share. The pattern of references among arrays on nested loops is analyzed in [5]; duplicate or nonduplicate approaches are used to distribute the ....
A. Agarwal, D. Kranz and V. Natarajan, "Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors," Proceedings of
....i.e. $T = \mathrm{diag}(x_1, x_2, \ldots, x_K)$ $I_K$ $[x_1, x_2, \ldots, x_K]^T = I_K x$. 2 Input communication costs. To analyse the input communication costs, we consider the uses of L, i.e. the variable references corresponding to data that is read within the loop body. As in all existing approaches [1, 2, 3, 4, 8, 9, 10, 11], we assume that the array subscripts are affine functions of the surrounding loop indices. Consequently, the generic form of a reference to an $m$-dimensional array $a$, m 0 is $a[Gi + v]$ where $G = [g_{jk}] = [g_1, g_2, \ldots, g_K]$ is an $m \times K$ matrix with integer elements, i = i_1, i_2, ....
....element of a is modified by L (i.e. when a is a pure input array) we compute the size of the footprint of the array reference for a generic tile T generated by the tiling matrix T, where $\mathrm{footprint}(a[Gi + v], T) = \{\, a[Gi + v] \mid i \in T \,\}$. This definition of a data footprint is due to Agarwal et al. [1, 2], who also proposed a framework for computing the size of data footprints, and used it to optimise cache coherency traffic on shared-memory parallel computers. The framework in [1, 2] computes the size of a data footprint by case analysis on the matrix G. The authors identify no fewer than four ....
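The footprint definition quoted above can be illustrated by brute-force enumeration. The sketch below is not the case analysis on G that the excerpt attributes to Agarwal et al.; it simply collects the distinct subscripts G i + v over a tile. The function name `footprint`, the list-of-rows encoding of G, and the assumption of a rectangular tile anchored at the origin are all illustrative choices, not taken from the cited papers.

```python
from itertools import product

def footprint(G, v, tile_extents):
    """Return the set of subscripts touched by a[G i + v] over a tile.

    G is a list of m rows, each with K integer coefficients (hypothetical
    encoding); v is a list of m integer offsets; tile_extents gives the
    K side lengths of a rectangular tile whose iterations start at 0.
    """
    cells = set()
    for i in product(*(range(x) for x in tile_extents)):
        # One subscript per array dimension: row of G dotted with i, plus offset.
        subscript = tuple(
            sum(g * ik for g, ik in zip(row, i)) + off
            for row, off in zip(G, v)
        )
        cells.add(subscript)
    return cells

# Reference a[i + j] on a 4 x 3 tile touches subscripts 0..5.
print(len(footprint([[1, 1]], [0], (4, 3))))
# Reference a[i, j] on the same tile touches one element per iteration.
print(len(footprint([[1, 0], [0, 1]], [0, 0], (4, 3))))
```

Enumeration costs time proportional to the tile volume, which is presumably why a closed-form analysis of G is preferred inside a compiler; the sketch is only useful for checking such formulas on small tiles.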
[Article contains additional citation context not shown here]
A. Agarwal et al. Automatic partitioning of parallel loops for cache-coherent multiprocessors. Tech. Rep. TM-481, Lab. for Comp. Sc., MIT, Dec. 1992.
....information about the nature of each axis. [Footnote 2: In fact, it was my own experience with data layout [31, 33] that initially led me to consider a full-fledged analysis of the data to be laid out.] 8.1.10 Code Layout. Analogous arguments apply to code layout. Traditional code layout algorithms [1, 2] naively take iterations in the source as the code to be distributed. By isolating the distinct expansions within a single source loop, we provide the code layout phase the option of distributing these distinct expansions differently. In the absence of this isolation, distinct expansions are ....
A. Agarwal, D. Kranz and V. Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE.
....affinity can be supplied by the compiler, providing some semantic information that would otherwise be unavailable or too expensive to compute at runtime. Data access analysis has already been used to optimize cache performance in uniprocessors [18] and to partition parallel loops in multiprocessors [2]. Similar analysis may be used to help determine which loop iterations share the greatest amount of data, both in the cache and in ....
[Figure 2: Task T1 has the highest affinity for physical processor P. The tasks on the edges have lower affinity, while the tasks in the corners have the lowest.]
Agarwal, A., Kranz, D., and Natarajan, V. Automatic partitioning of parallel loops for cache-coherent multiprocessors. In Proceedings of the 1993 International Conference on Parallel Processing (1993), pp. 2--11.
....estimates of computation and/or communication requirements. Good estimates require the computation of the integer volume of a polyhedron, but this is in general too complex. Fortunately, in most practical cases we only need an estimate of this volume, and heuristics like the one proposed in [1] are sufficient. We have a preliminary 2-D implementation to estimate the integer area, which extends the work in [1] by handling affine loop bounds. The integer area is approximated by the real area plus half the points on the boundary. Using a Mathematica package [footnote 4], we obtain all the ....
....volume of a polyhedron, but this is in general too complex. Fortunately, in most practical cases we only need an estimate of this volume, and heuristics like the one proposed in [1] are sufficient. We have a preliminary 2-D implementation to estimate the integer area, which extends the work in [1] by handling affine loop bounds. The integer area is approximated by the real area plus half the points on the boundary. Using a Mathematica package [footnote 4], we obtain all the vertices of the polyhedron from the loop inequalities. Then, using built-in functionality, the vertices are ordered ....
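The approximation described in these excerpts (real area plus half the boundary points) can be sketched in a few lines. This is not the Mathematica implementation the excerpt refers to. It assumes the polygon's vertices are already known, have integer coordinates, and are listed in order around a simple polygon; the function name `estimate_integer_area` is hypothetical. Under those assumptions Pick's theorem says the exact number of lattice points is the estimate plus one.

```python
from math import gcd

def estimate_integer_area(vertices):
    """Approximate the lattice-point count of a polygon.

    vertices: integer (x, y) pairs in order around a simple polygon
    (an assumption of this sketch; affine loop bounds can also give
    non-integer vertices, which this sketch does not handle).
    """
    n = len(vertices)
    twice_area = 0
    boundary_points = 0
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]
        # Shoelace term for the real area.
        twice_area += x0 * y1 - x1 * y0
        # Lattice points on this edge, counting one endpoint.
        boundary_points += gcd(abs(x1 - x0), abs(y1 - y0))
    real_area = abs(twice_area) / 2
    return real_area + boundary_points / 2

# Iteration space of: for i in 0..4, for j in 0..i (a triangle).
triangle = [(0, 0), (4, 0), (4, 4)]
estimate = estimate_integer_area(triangle)
exact = sum(1 for i in range(5) for j in range(i + 1))
print(estimate, exact)
```

For this triangle the estimate differs from the exact iteration count by one, which is consistent with the excerpt's remark that an estimate is sufficient in most practical cases.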
Agarwal, A., Kranz, D., and Natarajan, V. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In Proceedings of the 22nd International Conference on Parallel Processing (St. Charles, IL, Aug. 1993), pp. I:2--11.
....phases. By converting the code to natural shapes and natural expansion categories, it improves the input to the data partitioner and the scheduler by providing them with additional flexibility. 1 Introduction. Although there are many sophisticated algorithms for data partitioning and scheduling [1, 2, 3, 4, 10, 11, 13, 16], the inputs to these algorithms, that is, the objects to be partitioned and the code fragments to be scheduled, are usually determined naively. Generally the objects to be partitioned are the named arrays as declared by the user and the code to be scheduled [Footnote: A version of this paper will also ....]
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE.
....machines. Iteration space tiling was also used for purposes other than optimizing locality. In [12] an iteration space partitioning technique based on hyperplanes is introduced. In [17] the problem of compiling perfectly nested loops for distributed-memory message-passing machines is addressed. In [2] a solution to the problem of determining loop and data partitions automatically for programs with multiple loops and arrays is presented. Our work also bears similarity to that of Abu-Sufah et al. [1], which deals with optimizations to enhance the locality properties of programs in a virtual memory ....
A. Agarwal, D. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993.
....are designed for message-passing machines and are not directly useful for our machine model. Automatic partitioning techniques for regular parallel loops on cache-coherent processors which minimize coherency traffic have been developed by Hudak and Abraham [12] and by Agarwal, Kranz and Natarajan [1]. These techniques find optimal partitions for programs with linear array subscript expressions. However, the mathematical approaches used do not extend to indirect array references common in irregular code. We use domain decomposition algorithms to partition irregular codes containing indirect ....
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic partitioning of parallel loops for cache-coherent multiprocessors. In Proceedings of the International Conference on Parallel Processing, volume 1, pages 2--11, 1993.
....we draw on in our research. In addition to these projects there are some smaller groups that have also done work relevant to our own. Automatic data partitioning algorithms which minimize coherency traffic have also been developed by Hudak and Abraham [22, 1] and by Agarwal, Kranz and Natarajan [2]. Hudak and Abraham have developed automatic partitioning techniques for regular data-parallel loops with array accesses that have unit-coefficient linear subscripts. Agarwal, Kranz and Natarajan generate optimal block partitions for cache-coherent multiprocessors. They generalize the program ....
....affine functions of loop indices. The data footprints for array references are calculated and combined to determine the cache usage. A partition is chosen which minimizes that footprint in the cache. An approximation is used to combine data footprints for references having different strides. Like [2], we support array index expressions that are affine functions of loop indices; in addition, we support block-cyclic data partitions, which they do not handle. Heuristic techniques for automatic data partitioning have been developed as part of the PARADIGM compiler. The compiler calculates ....
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic partitioning of parallel loops for cache-coherent multiprocessors. In Proceedings of the International Conference on Parallel Processing, volume 1, pages 2--11, 1993.
....phases. By converting the code to natural shapes and natural expansion categories, it improves the input to the data partitioner and the scheduler by providing them with additional flexibility. 1 Introduction. Although there are many sophisticated algorithms for data partitioning and scheduling [1, 2, 3, 4, 10, 11, 13, 16], the inputs to these algorithms, that is, the objects to be partitioned and the code fragments to be scheduled, are usually determined naively. Generally the objects [Footnote: An earlier version of this paper appeared in the Workshop on Automatic Data Layout and Performance Prediction, Rice University, ....]
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE.
....the partitioning and scheduling algorithms need to consider the communication cost among processors. For example, Agarwal et al. present a theoretical framework to derive the shapes of the iteration space partitions of do loops to minimize the traffic in multiprocessors with local memory [2]. But they assume that iterations can be executed in parallel and that the local memory of each processor is large enough for each processor's computation share. Work by Abraham and Hudak also assumes that the local memory size is sufficient [1]. The pattern of references among arrays on nested loop ....
A. Agarwal, D. Kranz and V. Natarajan, "Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors," Proceedings of 1993 International Conference on Parallel Processing, pp. 2--11, 1993.
.... Arrays in Distributed Memory Multiprocessors the Software Virtual Memory Approach. Rajeev Barua, MIT Laboratory for Computer Science, Cambridge, Massachusetts 02139. 1 Introduction. Loop and data partitioning for shared distributed memory multiprocessors has been studied by many researchers [1]. Loop partitioning distributes iterations in nested loops accessing data arrays among processors to get maximum cache data reuse, keeping good load balance. For NUMA machines, data partitioning tries to place data where it is likely to be accessed locally. This tiles the data space among memory ....
....a new method, software virtual memory (section 4), that combines hardware virtual memory's efficiency with software address computation's flexibility. We have implemented the new scheme for MIT Alewife, a globally cache-coherent distributed-memory multiprocessor, using the partitioning scheme in [1]. 2 Software Address Computation. Addressing an element is finding its physical address, specified by a processor number and offset in that processor's memory. In a shared-memory machine this information is contained in one global address. Here we discuss software address computation, a current ....
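The idea of addressing an element by a processor number and an offset in that processor's memory can be sketched for the simplest case. The excerpt does not say which data partition is used, so the sketch below assumes a one-dimensional array split into contiguous, equal-sized blocks; the function name `block_address` and its parameters are hypothetical and are not taken from the Alewife compiler or runtime.

```python
def block_address(index, n_elements, n_procs):
    """Map a global element index to (processor, local offset).

    Assumes a 1-D array of n_elements distributed in contiguous blocks
    of ceil(n_elements / n_procs) elements per processor.
    """
    block = -(-n_elements // n_procs)  # ceiling division
    return index // block, index % block

# 16 elements over 4 processors: element 10 is the third element
# (offset 2) held by processor 2.
print(block_address(10, 16, 4))
```

The division and remainder on every access are the kind of per-reference overhead that software address computation introduces.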
A. Agarwal, D. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd Intl Conf on Parallel Processing, August 1993.
....method can be classified into the LPGS category. The communication between bands is fulfilled by FIFO queues. Cache coherency traffic is considered in some later work such as [1, 8]. The proposed technique works with rectangular, non-regular hexagonal shape partitions for any stencil [22]. In [2], the partitions of parallelepiped tiles with emphasis on rectangular shapes are discussed. The purpose of partitioning is to solve the given problem by utilizing the limited resources regardless of the actual size of the problem. However, without fully considering the communication between ....
A. Agarwal, D. Kranz and V. Natarajan, "Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors," Proceedings of 1993 International Conference on Parallel Processing, pp. 2-11, 1993.
....This method can be classified into the LPGS category. The communication between bands is fulfilled by FIFO queues. Cache coherency traffic is considered in some later work such as [1, 5]. The proposed technique works with rectangular, nonregular hexagonal shape partitions for any stencil [11]. In [2], the partitions of parallelepiped tiles with emphasis on rectangular shapes are discussed. The purpose of partitioning is to solve the given problem by utilizing the limited resources regardless of the actual size of the problem. However, without fully considering the communication between ....
A. Agarwal, D. Kranz and V. Natarajan, "Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors," Proceedings of 1993 International Conference on Parallel Processing, pp. 2--11, 1993.
....iterative search through the space of loop and data partitions commencing from the initial partitioning. The iterative solution is seeded with an initial partitioning of each individual loop nest that disregards data locality. This initial loop partitioning is found using the method described in [2]. The iterative solution is also seeded with an initial data partition. This initial partitioning of each array is chosen to match the partitioning of the largest loop that accesses that array. Thus, by first partitioning each loop for cache locality, the initial seeding favors cache locality over ....
....Abraham and Hudak [1] look at the problem of automatic loop partitioning for cache locality only for the case when array accesses have simple index expressions. Their method uses only a local per-loop analysis. A more general framework for loop partitioning was presented by Agarwal et al. [2] for optimizing for cache locality. That framework handled fully general affine access functions, i.e. accesses of the form A[2i+j, j] and A[100-i, j] were handled. However, that work found local minima for each loop independently, giving possibly conflicting data partitioning requests across loops ....
[Article contains additional citation context not shown here]
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE. To appear in IEEE TPDS.
....memory. Keywords: multiprocessors, compilers, addressing, data partitioning, loop partitioning, pages, virtual memory, locality. 1 Introduction. The problem of loop and data partitioning for distributed-memory multiprocessors with global address spaces has been studied by many researchers [1, 3, 6, 18, 9, 8, 7, 13]. The goal of loop partitioning for applications with nested loops that access data arrays is to divide the iteration space among the processors to get maximum reuse of data in the cache, subject to the constraint ....
[Footnote: Authors' e-mail: {barua,kranz,agarwal}@lcs.mit.edu. Authors' phone: (617) 253-8569.]
....offer because hardware virtual memory is not supported. We have implemented the software virtual memory scheme in the compiler and runtime system for the Alewife machine [2], a globally cache-coherent distributed-memory multiprocessor. We use the method of loop and data partitioning described in [3]. In this paper we demonstrate that: The overhead of software virtual memory is small in general. Furthermore, if rectangular data partitions can be used, simple compiler transformations can eliminate almost all of the overhead. Software virtual memory can use page sizes as small as 32 bytes ....
[Article contains additional citation context not shown here]
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE. To appear in IEEE TPDS.
....for a parallel version of ANSI C and a parallel version of LISP called Mul-T [15]. For parallel C, Alewife supports the library from Argonne National Laboratory as well as parallel loops and distributed arrays. Automatic partitioning can be used when a program uses parallel loops and arrays [1]. Parallelism in Mul-T is specified with the construct. Low thread creation overhead is achieved using lazy task creation [24], a method for dynamic partitioning and load balancing. The Alewife run-time system includes a parallel stop-and-copy garbage collector. D. Alewife Debugging and Tuning ....
A. Agarwal, D. Kranz, and V. Natarajan, "Automatic partitioning of parallel loops for cache-coherent multiprocessors," in Proc. 22nd Int. Conf. Parallel Processing, Aug. 1993, pp. 943--962.
....as parallel loops and distributed arrays. Automatic partitioning can be used when a program uses parallel loops and arrays. By analyzing the array reference expressions, the compiler decides which loop iterations to execute on each processor to maximize reuse of data in the cache, as described in [1]. Mul-T: Parallel programming in Mul-T is done using the future construct. Low thread creation overhead is achieved using lazy task creation [30], a method for dynamic partitioning and load balancing. The Alewife runtime system includes a parallel stop-and-copy garbage collector. 2.4 Alewife ....
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE. To appear in IEEE TPDS.
....virtual memory. Keywords: multiprocessors, compilers, addressing, data partitioning, loop partitioning, pages, virtual memory, locality. 1 Introduction. The problem of loop and data partitioning for distributed-memory multiprocessors with global address spaces has been studied by many researchers [1, 3, 6, 18, 9, 8, 7, 13]. The goal of loop partitioning for applications with nested loops that access data arrays is to divide the iteration space among the processors to get maximum reuse of data in the cache, subject to the constraint ....
[Footnote: Authors' e-mail: {barua,kranz,agarwal}@lcs.mit.edu. Authors' phone: (617) 253-8569.]
....offer because hardware virtual memory is not supported. We have implemented the software virtual memory scheme in the compiler and runtime system for the Alewife machine [2], a globally cache-coherent distributed-memory multiprocessor. We use the method of loop and data partitioning described in [3]. In this paper we demonstrate that: • The overhead of software virtual memory is small in general. Furthermore, if rectangular data partitions can be used, simple compiler transformations can eliminate almost all of the overhead. • Software virtual memory can use page sizes as small as 32 ....
[Article contains additional citation context not shown here]
Anant Agarwal, David Kranz, and Venkat Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993. IEEE. To appear in IEEE TPDS.
....for a parallel version of ANSI C and a parallel version of LISP called Mul-T [15]. For parallel C, Alewife supports the p4 library from Argonne National Laboratory as well as parallel loops and distributed arrays. Automatic partitioning can be used when a program uses parallel loops and arrays [1]. Parallelism in Mul-T is specified with the future construct. Low thread creation overhead is achieved using lazy task creation [24], a method for dynamic partitioning and load balancing. The Alewife runtime system includes a parallel stop-and-copy garbage collector. 2.4 Alewife debugging and ....
A. Agarwal, D. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In The 22nd International Conference on Parallel Processing, August 1993.
....for a parallel version of ANSI C and a parallel version of LISP called Mul-T [13]. For parallel C, Alewife supports the p4 library from Argonne National Laboratory as well as parallel loops and distributed arrays. Automatic partitioning can be used when a program uses parallel loops and arrays [1]. Parallelism in Mul-T is specified with the future construct. Low thread creation overhead is achieved using lazy task creation [22], a method for dynamic partitioning and load balancing. The Alewife runtime system includes a parallel stop-and-copy garbage collector. 2.4 Alewife debugging and ....
A. Agarwal, D. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In The 22nd International Conference on Parallel Processing, August 1993.