S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating local addresses and communication sets for data-parallel programs. In ACM PPOPP, pages 149--158, 1993.
....parallelization tools. Moreover, the application of an fsm definition seems not to have been considered at all in the field of parallel image processing. In related research areas of parallel computation, however, fsm definitions have been applied before. For example, Chatterjee et al. [2] apply a finite state machine for the generation of optimal communication sets in distributed memory implementations of data parallel languages such as HPF. As in our case, results indicate that the fsm approach requires very little runtime overhead. For ad hoc optimization of specific algorithms ....
S. Chatterjee et al. Generating Local Addresses and Communication Sets for Data Parallel Programs. J. Par. Dist. Comp., 26(1):72--84, 1995.
....for proper MI execution. The complex nature of SDA methods and the fact that the internal data structures can be distributed over multiple nodes make argument marshaling a nontrivial task that may involve data redistribution. We make use of existing redistribution algorithms or libraries [43, 59, 162] for this purpose (cf. Section 5.2.4). In implementing the ORS, we were careful to avoid the underlying message passing system when communicating between two SDAs co-located on the same nodes, and used direct memory copies in these situations. Due to the conditional execution of methods it is ....
....or lightweight threads. This fact gives rise to some nontrivial coordination and synchronization problems. First of all we have to facilitate the exchange of possibly distributed data between SDAs, which may well require data redistribution. As already noted, existing algorithms or libraries [43, 59, 162] are employed for this task. However, in order to make use of these libraries, we have to ensure that: 1. the parameterized processes/threads on all the nodes of both the caller and the callee are ready to exchange the data, and 2. the callee knows the actual layout of the input data in order to ....
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating Local Addresses and Communication Sets for Data-Parallel Programs. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993.
....for the iteration sets of the local loops. This transformation is called mask absorption. For block and cyclic distributions this transformation is relatively simple, but for block cyclic distributions this transformation is more complicated and a number of different solutions have been proposed [2, 12, 13, 14, 15]. Less attention has been paid to the equally important efficient absorption of multiple masks as needed in the derivation of communication sets and dependencies between loop iterators [2, 16]. The three compilers differ in their optimization techniques in two ways: i) the derivation of the local ....
S. Chatterjee, J.R. Gilbert, F.J.E. Long, R. Schreiber, and S-H. Teng, "Generating Local Addresses and Communication Sets for Data-Parallel Programs," Journal of Parallel and Distributed Computing, Vol. 26, no. 1, pp.72-84, April 1995.
....are briefly described below. Iterations. In general, the implementation of iterations depends on the applied global-to-local mapping scheme. An excellent review of such schemes is given in [7]. For our framework, we have chosen the pattern cyclic enumeration scheme developed by Chatterjee et al. [8]. In this scheme, consecutive local indices of each dimension of an array section are produced by a finite state machine (FSM). An iteration (2) is then implemented with a nested loop whose depth depends on the rank of the component arrays. Execution of this loop is controlled by an iteration ....
S. Chatterjee, J.R. Gilbert, F.J.E. Long, R. Schreiber, and S.-H. Teng. Generating Local Addresses and Communication Sets for Data-Parallel Programs. Journal of Parallel and Distributed Computing, 26:72--84, 1995.
....engage only part of the processors. This causes an imbalance in the load. To alleviate this, other distribution schemes have been developed. Figure 2: An example of an array distribution (after Chatterjee et al. [6]). An array of 39 elements is aligned to a template through the function f_al(i) = 3 · i + 7. The template is distributed cyclic(4) over 4 processors. The resulting distribution requires 8 rows. In terms of the variables in this paper (see Section 2), n_i = 39, n_p = 4, m = 4, a = 3, b = 7, ....
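The layout described in this snippet can be checked with a small sketch. It assumes zero-based array indices and template cells, an alignment of the form a*i + b with a = 3 and b = 7, and a cyclic(m) distribution over p processors with m = 4 and p = 4. The function name `layout` and the tuple it returns are illustrative choices, not notation from the citing paper.

```python
def layout(i, a=3, b=7, m=4, p=4):
    """Map array index i to (processor, row, offset).

    Assumptions (hypothetical, for illustration only):
    - alignment maps element i to template cell a*i + b
    - the template is dealt out in blocks of m cells, round-robin over p processors
    - a "row" is one full round of p blocks
    """
    t = a * i + b            # template cell holding element i
    block = t // m           # index of the block containing that cell
    return block % p, block // p, t % m

# Number of rows needed for the 39-element array of the example.
rows = 1 + max(layout(i)[1] for i in range(39))
```

Under these assumptions the last element (i = 38) lands in template cell 121, which sits in row 7, so rows 0 through 7 are occupied. That agrees with the 8 rows quoted in the snippet.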
....has been explored for some time in the context of various data parallel languages [7, 8, 9, 10, 11, 12]. The recent definition of HPF [1] has added some new data alignment and data distribution features for which no efficient solutions existed. As a consequence, new results have been reported in [13, 6, 14, 15, 16, 17, 18, 19, 20, 21] and, more recently and concurrent with this paper, [22, 23, 20, 24, 25, 26]. Early optimization techniques only consider non-aligned arrays. The first optimizations were reported by Callahan and Kennedy [7] and Gerndt [8]. They considered non-aligned block(m) distributions with linear array ....
[Article contains additional citation context not shown here]
S. Chatterjee, J. R. Gilbert, F. J.E. Long, R. Schreiber, and Shang-Hua Teng, "Generating local addresses and communication sets for data-parallel programs", Journal of Parallel and Distributed Computing, vol. 26, no. 1, pp. 72--84, April 1995, first presented at PPoPP'93.
....and Fortran 90D. Thus our approach is similar to that taken by the High Performance Fortran Forum [8], although we do not explicitly assume Fortran as our input language. The work of automatically deriving parallelism from sequential code is beyond the scope of this paper. Chatterjee et al. [4] present a similar framework for compiling array assignment statements, in terms of constructing a finite state machine. Chatterjee's approach accesses data in a manner that is more friendly than our approach to a data cache, especially in the case of block cyclic data distributions. However, ....
S. Chatterjee, J. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data-parallel programs. In Proc. of PPoPP, pages 149--158, San Diego, CA, May 1993.
....are developed for the index sets for block and cyclically distributed arrays. These closed forms are then used in the virtual processor domain for efficient enumeration of the communication and local index sets. The problem of local index set identification was addressed by Chatterjee et al. [2] using a finite state machine (FSM) to traverse the local index space. Stichnoth et al. [11] address the problem of index set and processor set identification. The formulation proposed has similarities to an instance of the virtual processor approach. The implementation of the Fortran D compiler ....
....processor set identification. The formulation proposed has similarities to an instance of the virtual processor approach. The implementation of the Fortran D compiler at Rice University is being extended to handle arrays with block cyclic distributions [6]. An approach similar to the FSM approach [2] for determining the local memory access sequence is used and efficient algorithms for computing the FSM for frequently occurring cases are presented. A linear time algorithm for constructing the FSM which improves the asymptotic complexity of the algorithm presented in [6] was presented in [8] ....
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data parallel programs. In Proc. of ACM Symposium on Principles and Practices of Parallel Programming, pages 149--158, May 1993.
....increases the compilation time 7 to 27, averaging on 19 for both block and cyclic distributions. 7 Related Work Several papers have addressed the problem of generating local address and communication sets for HPF programs where arrays are distributed using the general block cyclic distributions [7, 14, 24, 33, 34, 46, 47]. Of these, Ancourt et al. [7] use a linear algebra framework; this renders their approach general. The rest of the approaches are very efficient for a restricted class of mappings. Considering the lack of generality of these approaches, their use in the communication optimizations of the kind ....
S. CHATTERJEE, J. GILBERT, F. LONG, R. SCHREIBER, and S. TENG. Generating local addresses and communication sets for data-parallel programs. Journal of Parallel and Distributed Computing, 26(1):72--84, April 1995.
....proposed in [11]. There has also been some research on the closely related problem of determining the local addresses and communication sets for array assignment statements like A(l1 : h1 : s1) = B(l2 : h2 : s2) where A and B have different cyclic(m) distributions. Chatterjee et al. [1] present an approach to calculate the sequence of local memory addresses that a given processor must access while doing a computation involving the regular array section A(l : h : s) when the array A has a cyclic(k) distribution. They show that the local memory access sequence is characterized by ....
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating Local Addresses and Communication Sets for Data Parallel Programs. In Proceedings of Principles and Practices of Parallel Programming (PPoPP) '93, pages 149--158, May 1993.
....since a local to global and a global to local translation will be needed for each active element in order to determine the destination processor. The problem of active index set identification for array statements involving block cyclically distributed arrays was addressed by Chatterjee et al. [4] using a finite state machine (FSM) to traverse the local index space of each processor. If all arrays in an array statement have the same block cyclic distribution and access stride, the order of access of the local elements with the FSM approach turns out to be the same as the access order when ....
....it appears that after determination of the active local indices of the r.h.s. section using a FSM, an explicit local to global translation corresponding to the r.h.s. section and a global to local translation corresponding to the l.h.s. section will need to be performed for each active element. In [4], restricted cases involving arrays with different strides are treated, but even with these, the generation of communication sets requires explicit index translation for each active element. With the virtual processor approach explicit local to global and global to local translation is not needed, ....
[Article contains additional citation context not shown here]
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data parallel programs. In Proc. of ACM Symposium on Principles and Practices of Parallel Programming, pages 149--158, May 1993.
....of basis. Analysis of the properties of communications leads to a tiling of the local memory addresses that provides maximal message vectorization. 1 Introduction Static analysis of data parallel programs, for the generation of distributed code, has been proposed by many authors, for instance [7], [4], [9], [5], [13]. Static analysis aims to improve performance over run time resolution [2], which includes a lot of pure overhead in the form of guards and tests. Many static compilation schemes have been considered; they differ in important points such as interleaving computation and communication as in ....
....efficient closed forms of the previous sets for the most general block cyclic distribution is an open problem. [7] gives a general compiling scheme under the weakest assumptions, but provides closed forms only when indices are independent: for instance, T[j; i] but not T[2i j; i - j]. [4] uses a finite state machine approach, allowing optimal memory utilization, but restricts references to array sections and uses integer divides. [9] solves the same problem with a virtualization method. Other special cases have been solved, for unit strides in [13], for one dimensional arrays in ....
S. Chatterjee, J.R. Gilbert, F. Long, R. Schreiber, and S-H. Teng. Generating local addresses and communication sets for data-parallel programs. In Symp on Principles and Practice of Programming Languages 93. ACM, 93.
....a full block distribution (CYCLIC(⌈n/p⌉), where n is the array size and p the number of processors). Recently, however, several algorithms have been published that handle general block cyclic CYCLIC(k) distributions. Sophisticated techniques involve finite state machines (see Chatterjee et al. [3]), set theoretic methods (see Gupta et al. [8]), Diophantine equations (see Kennedy et al. [11, 12]), Hermite forms and lattices (see Thirumalai and Ramanujam [18]), or linear programming (see Ancourt et al. [1]). A comparative survey of these algorithms can be found in Wang et al. [22], where it ....
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data-parallel programs. Journal of Parallel and Distributed Computing, 26(1):72--84, 1995.
.... implemented with success in commercial HPF compilers, however few commercial HPF compilers have so far provided efficient support for block cyclic distribution [6]. The lack of commercial compiler support for HPF programs with block cyclic distributions has recently prompted many research efforts [4, 5, 6, 8, 9] in the hope of leading to efficient schemes for supporting this issue. While we think the recent research efforts will greatly help the implementation of block cyclic distributions, we feel the lack of support in commercial compilers for block cyclic distributions is in part due to the lack of a ....
....to calculate the communication set for distributed arrays with block or cyclic distribution. However, arrays with block cyclic distribution were not considered. Later, the problem of dealing with array statements distributed with block cyclic distribution patterns was addressed by Chatterjee et al. [4]. It described a method for the enumeration of local indices in increasing order based on a finite state machine. In [6, 18] linear algorithms for the general case are given based on an integer lattice method [6, 18]. Their work mainly focused on the local set enumeration and did not discuss ....
[Article contains additional citation context not shown here]
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating Local Addresses and Communication Sets for Data Parallel Programs. Proc. of Fourth ACM SIGPLAN Conference on Principles and Practice of Parallel Programming, pages 149--158, May 1993.
.... timing results obtained on different machines (namely the Cray T3D and the Intel Paragon). 2 Related work For a long time, redistribution was considered very difficult in the general case, and most implementations are restricting the possible distributions to block or cyclic distributions [3, 2, 11, 1, 5], or in some implementations all block sizes had to be multiples of each other to ease some memory access operations. Some recent works show that it can be done at compile time in the general case [8, 10] or describe the access of array elements with different strides [9]. But ....
....in a column, P col and few others to determine, when a sub matrix is used, which element of the global matrix is the starting point and which processor it belongs to. [Figure 1: The block cyclic data distribution of a 2D array on a 2 × 3 grid of processors. Panels: Block Matrix; Grid of Processors; Blocks owned by the processors.] In SCALAPACK, ....
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S. H. Teng. Generating local addresses and communication sets for data-parallel programs. In Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993. ACM SIGPLAN.
....lattice, and involves the derivation of a suitable set of basis vectors for the lattice. Given the lattice basis, we enumerate the lattice by using loop nests; this allows us to generate efficient code that incurs negligible runtime overhead in determining the access pattern. Chatterjee et al. [3] presented an O(k log k) algorithm (where k is the block size; see Chapter 2 for definition) for this problem; the algorithm proposed in this thesis is an O(k) algorithm. Recently, Kennedy et al. [10] have also presented an O(k) algorithm. Our algorithm is two to three times faster than the ....
....how to determine address sequences by lattice enumeration, and explain the use of lattice basis vectors to generate a loop nest that determines the address sequence. Finally we demonstrate the efficacy of our approach using experimental results comparing our solution to those of Chatterjee et al. [3] and Kennedy et al. [10]. Chapter 4 presents efficient solutions for the compile time derivation of the best basis vectors. Chapter 5 presents algorithms for deriving communication sets for the different processors. Here we present models for packing the data to be sent and unpacking the data that ....
[Article contains additional citation context not shown here]
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data parallel programs. In Proc. of ACM Symposium on Principles and Practices of Parallel Programming, pages 149--158, May 1993.
....memory vectors, unless otherwise stated. Vectors are represented by the tuple V = <B; S; L>, where V.B is the base address, V.S is the sequence stride, and V.L is the sequence length. V[i] is the i-th element in the vector V. For example, vector V = <A; 4; 3> designates elements A[0], A[4], and A[8], where V[0] = A[0], V[1] = A[4], etc. Let M be the number of memory banks, such that M = 2^m. Let N be the number of consecutive words of memory exported by each of the M banks, i.e. the bank interleave factor, such that N = 2^n. We refer to these N consecutive words as a ....
....Vectors are represented by the tuple V = <B; S; L>, where V.B is the base address, V.S is the sequence stride, and V.L is the sequence length. V[i] is the i-th element in the vector V. For example, vector V = <A; 4; 3> designates elements A[0], A[4], and A[8], where V[0] = A[0], V[1] = A[4], etc. Let M be the number of memory banks, such that M = 2^m. Let N be the number of consecutive words of memory exported by each of the M banks, i.e. the bank interleave factor, such that N = 2^n. We refer to these N consecutive words as a block. DecodeBank(addr) returns the bank ....
[Article contains additional citation context not shown here]
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data-parallel programs. Journal of Parallel and Distributed Computing, 26(1):72--84, Apr. 1995.
....array A onto p = 4 processors with k = 4 and l = 0: HPF TEMPLATE T(200) HPF PROCESSORS PROCS(4) HPF ALIGN A(j) WITH T(j) HPF DISTRIBUTE T(CYCLIC(4)) ONTO PROCS do i = 0, 47 A(3*i) = ... enddo. The array layout for Example 1.2 is shown in Figure 1.4. Chatterjee et al. [4] identified a repeating access pattern and characterized it as a finite state machine. They have shown that both the starting location and the table of local memory gaps can be found by solving a set of k linear diophantine equations. Let m be the processor which holds the array element A(i) A(i) ....
....l where km ≤ (l + is) mod pk ≤ k(m + 1) - 1. This is equivalent to finding an integer c such that is - cpk = λ, where mk - l ≤ λ ≤ mk - l + k - 1, which corresponds to solving k linear diophantine equations. Integer solutions exist if gcd(s, pk) divides λ. Chatterjee et al. [4] record the smallest non-negative solution for i for each value of λ. The minimum of all these solutions will correspond to the first array element A(l + is) on processor m. Sorting of all the solutions will determine the local memory gaps between consecutive accesses. The complexity involved in ....
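The starting location and the table of local memory gaps mentioned in this snippet can be illustrated by brute-force enumeration. The sketch below does not solve the diophantine equations or build the finite state machine of the cited paper. It simply walks the loop of the example (p = 4, k = 4, l = 0, stride s = 3, 48 iterations), keeps the elements owned by one processor, and records their local addresses. It assumes zero-based indices and an identity alignment of A with the template. The function names `local_addresses` and `gaps` are hypothetical.

```python
def local_addresses(proc, p=4, k=4, l=0, s=3, n_iter=48):
    """Local addresses touched on processor `proc` by A(l + s*i), i = 0..n_iter-1.

    Assumes a cyclic(k) distribution over p processors with identity alignment:
    global index g lives on processor (g // k) % p, at local address
    (g // (p*k)) * k + g % k.
    """
    addrs = []
    for i in range(n_iter):
        g = l + s * i                        # global index accessed in iteration i
        if (g // k) % p == proc:             # keep only elements owned by proc
            addrs.append((g // (p * k)) * k + g % k)
    return addrs


def gaps(addrs):
    """Differences between consecutive local addresses (the memory gap sequence)."""
    return [b - a for a, b in zip(addrs, addrs[1:])]
```

For these particular parameters, `local_addresses(0)` starts at local address 0 and every gap is 3. Other strides or block sizes generally give a non-uniform gap sequence that repeats, which is the pattern the finite state machine encodes.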
[Article contains additional citation context not shown here]
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating local addresses and communication sets for data parallel programs. Journal of Parallel and Distributed Computing, 26(1):72--84, 1995.
....[Figure 1: Collective communication costs on the CM-5. (a) All-to-all communication. (b) Shift communication. Horizontal axis of (b): shift distance.] A similar linear algebraic formulation is available for distribution [2, 7]. We do not explicitly mention it here as distribution is beyond the scope of this paper. 1.2 A formal statement of the data layout problem Given an array parallel program and a target number of processors, our goal is to determine the quantities R, L, and f for each array and template at each ....
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data-parallel programs. Journal of Parallel and Distributed Computing, 26(1):72--84, Apr. 1995.
No context found.
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating local addresses and communication sets for data-parallel programs. In ACM PPOPP, pages 149--158, 1993.
No context found.
S. Chatterjee et al. Generating Local Addresses and Communication Sets for Data Parallel Programs. J. Par. Dist. Comp., 26(1):72--84, 1995.
No context found.
S. Chatterjee et al. Generating Local Addresses and Communication Sets for Data Parallel Programs. J. Par. Dist. Comp., 26(1):72--84, 1995.
No context found.
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S.-H. Teng. Generating local addresses and communication sets for data-parallel programs. Journal of Parallel and Distributed Computing, 26(1):72--84, Apr. 1995.
No context found.
S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating Local Addresses and Communication Sets for Data Parallel Programs. Journal of Parallel and Distributed Computing, 26(1):72-84, April 1995.
No context found.
S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S. H. Teng, "Generating Local Addresses and Communication Sets for Data-Parallel Programs," in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, San Diego, CA, May 1993, pp. 149--158.
No context found.
S. CHATTERJEE, J. GILBERT, F. LONG, R. SCHREIBER, AND S. TENG, Generating local addresses and communication sets for data-parallel programs, in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993.