26 citations found.
Ken Kennedy, Nenad Nedeljković, and Ajay Sethi. Efficient address generation for block-cyclic distributions. In ACM International Conference on Supercomputing, pages 180--184, June 1995.

This paper is cited in the following contexts:
A Linear Algebra Framework for Static HPF Code.. - Ancourt, Coelho.. (1995)   (63 citations)  (Correct)

....available. Moreover, accesses to an auxiliary data structure, the fsm transition map, add to the overhead. Note that the code generated in Figure 11 may be used to compute the fsm. In fact the lower iteration of the innermost loop is computed by the algorithm that builds the fsm. Kennedy et al. [55, 56, 48, 46] and others [79] have suggested improvements to this technique, essentially to compute faster at run time the automaton transition map. Also multi dimensional cases need many transition maps to be handled. Papers by Stichnoth et al. [78, 77] on the one hand and Gupta et al. [43, 44, 52] on the ....

Ken Kennedy, Nenad Nedeljković, and Ajay Sethi. Efficient address generation for block-cyclic distributions. In ACM International Conference on Supercomputing, pages 180--184, June 1995.


Finding performance bugs with the TNO HPF benchmark suite - Denissen, Sips (2002)   (Correct)

....for the iteration sets of the local loops. This transformation is called mask absorption. For block and cyclic distributions this transformation is relatively simple, but for block cyclic distributions this transformation is more complicated and a number of different solutions have been proposed [2, 12, 13, 14, 15]. Less attention has been paid to the equally important efficient absorption of multiple masks as needed in the derivation of communication sets and dependencies between loop iterators [2, 16]. The three compilers differ in their optimization techniques in two ways: i) the derivation of the local ....

K. Kennedy, N. Nedeljković, and A. Sethi, "Efficient Address Generation for Block-Cyclic Distributions," Proceedings of the International Conference on Supercomputing, ACM, pp. 180-184, June 1995.


An implementation framework for HPF distributed.. - van Reeuwijk.. (1996)   (4 citations)  (Correct)

....9, 10, 11, 12]. The recent definition of HPF [1] has added some new data alignment and data distribution features for which no efficient solutions existed. As a consequence, new results have been reported in [13, 6, 14, 15, 16, 17, 18, 19, 20, 21] and, more recently and concurrent with this paper, [22, 23, 20, 24, 25, 26]. Early optimization techniques only consider non aligned arrays. The first optimizations were reported by Callahan and Kennedy [7] and Gerndt [8]. They considered non aligned block(m) distributions with linear array access functions. Gerndt also showed how overlap can be handled. In Paalvast et ....

....in an in order sequence. From this sequence, a finite state machine (FSM) is constructed which is used to successively access each element. In the original paper of Chatterjee et al. the construction of the FSM requires a full sorting operation. Recent papers describe more efficient methods [21, 22, 23, 26]. In [21] a linear algorithm for constructing the FSM for two special cases is given. Linear algorithms for the general case are given in [22, 23, 26]. Kennedy et al. also showed [26] that their method can be used without a table, using a demand driven evaluation scheme, at the expense of some ....

[Article contains additional citation context not shown here]

K. Kennedy, N. Nedeljkovic, and A. Sethi, "Efficient address generation for block-cyclic distributions", in Proceedings of the Intl. Conf. on Supercomputing, June 1995, pp. 180--184.
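
The FSM-based schemes described in the context above rest on one observation: under a CYCLIC(k) distribution, the gaps between successive local addresses touched by a regular section repeat periodically. The following is a minimal brute-force sketch of that observation. It is not the linear-time construction of Kennedy et al., nor the sorting-based construction of Chatterjee et al. The function names, the zero-based indexing, and the packed local layout are assumptions made for illustration.

```python
def owner(i, p, k):
    """Processor owning global index i under CYCLIC(k) on p processors (assumed zero-based)."""
    return (i // k) % p


def local_address(i, p, k):
    """Offset of global index i in its owner's local memory, assuming blocks are packed contiguously."""
    return (i // (p * k)) * k + i % k


def local_access_sequence(proc, p, k, lower, upper, stride):
    """Local addresses that proc touches for the section lower:upper:stride (stride > 0)."""
    return [
        local_address(i, p, k)
        for i in range(lower, upper + 1, stride)
        if owner(i, p, k) == proc
    ]


def memory_gaps(addresses):
    """Differences between consecutive local addresses (the entries a gap table would hold)."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


if __name__ == "__main__":
    seq = local_access_sequence(proc=0, p=2, k=4, lower=0, upper=47, stride=3)
    print(seq)               # [0, 3, 5, 10, 12, 15, 17, 22]
    print(memory_gaps(seq))  # [3, 2, 5, 2, 3, 2, 5]
```

With p = 2, k = 4 and the section 0:47:3, processor 0 sees the gap pattern 3, 2, 5, 2 repeat. A table-driven method would store one period of such gaps rather than enumerate every index as this sketch does.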


A Basic-Cycle Calculation Technique for Efficient Dynamic.. - Chung, Hsu, Bai (1998)   (2 citations)  (Correct)

....for generating communication sets by computing the intersections of index sets corresponding to the LHS and RHS of array statements was presented. The intersections were computed by a scanning approach that exploited the repetitive pattern of the intersection of two index sets. Kennedy et al. [15] also presented algorithms to compute the local memory access sequence for array statements with BLOCK CYCLIC(c) distribution. In [23] the CYCLIC(k) distribution was viewed as a union of k CYCLIC distributions. Since the communication sets for CYCLIC distributions are easy to determine, ....

K. Kennedy, N. Nedeljkovic, and A. Sethi, "Efficient Address Generation for BLOCK-CYCLIC Distribution," Proc. Supercomputing '95, Barcelona, pp. 180-184, July 1995.
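
One reading of the remark above, that CYCLIC(k) can be viewed as a union of k CYCLIC distributions, is that the indices sharing a given offset within their block form a stride-k subsequence that is dealt out to processors round-robin. The sketch below checks that reading by brute force. It is an illustration under assumed zero-based indexing with hypothetical function names, not the algorithm of the cited work.

```python
def owner_block_cyclic(i, p, k):
    """Owner of global index i under CYCLIC(k) on p processors."""
    return (i // k) % p


def owner_cyclic(m, p):
    """Owner of position m under a plain CYCLIC (round-robin) distribution."""
    return m % p


def offset_classes(n, k):
    """Split indices 0..n-1 into k classes; class j holds j, j + k, j + 2k, ..."""
    return [list(range(j, n, k)) for j in range(k)]


def is_union_of_cyclic(n, p, k):
    """True if every offset class, taken on its own, is distributed CYCLIC over p processors."""
    for cls in offset_classes(n, k):
        for m, i in enumerate(cls):
            if owner_block_cyclic(i, p, k) != owner_cyclic(m, p):
                return False
    return True


if __name__ == "__main__":
    print(offset_classes(10, 4))          # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
    print(is_union_of_cyclic(100, 4, 8))  # True
```

The check succeeds because an index i = j + k*m with 0 <= j < k lies in block m, so its owner is m mod p regardless of j.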


Scheduling Block-Cyclic Array Redistribution - Desprez, Dongarra, Petitet.. (1997)   (8 citations)  (Correct)

....Recently, however, several algorithms have been published that handle general block cyclic CYCLIC(k) distributions. Sophisticated techniques involve finite state machines (see Chatterjee et al. [3]), set theoretic methods (see Gupta et al. [8]), Diophantine equations (see Kennedy et al. [11, 12]), Hermite forms and lattices (see Thirumalai and Ramanujam [18]), or linear programming (see Ancourt et al. [1]). A comparative survey of these algorithms can be found in Wang et al. [22], where it is reported that the most powerful algorithms can handle block cyclic distributions as efficiently ....

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In 1995 ACM/IEEE Supercomputing Conference. http://www.supercomp.org/sc95/proceedings, 1995.


Efficient Address and Communication Generation for.. - Venkatachar (1996)   (Correct)

....addresses of these global elements on each processor. Figure 4.1(c) shows the local addresses which need to be generated by this SPMD code. This chapter deals with generating this set of elements in lexicographic order, using runtime approaches. 4.2 RELATED WORK Kennedy, Nedeljkovic and Sethi [13] discuss the issue of address generation for the case of MIV. They view the problem as an integer lattice and use their approach discussed in their earlier work [12] to generate memory gap patterns. Since there is a pattern of repetition both at the outer and inner loop level, they construct ....

....O(k) time by solving k linear diophantine equations. The solution by Kennedy, Nedeljkovic and Sethi [12] also incurs O(k) overhead for finding the start element. Their method is very similar to the method of Chatterjee et al. [4]. However they also present new methods of finding start element [13]. These methods are divided mainly into two parts, the first part computes start element for the case s k and is a constant time algorithm. The second case is when s k for which the worst case complexity is O(pk). If the stride is not greater than k and if first is the first element accessed ....

[Article contains additional citation context not shown here]

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proc. ACM International Conference on Supercomputing, Madrid, Spain, July 1995.


Scheduling Block-Cyclic Array Redistribution - Desprez, Dongarra, Petitet.. (1997)   (8 citations)  (Correct)

....Recently, however, several algorithms have been published that handle general block cyclic CYCLIC(k) distributions. Sophisticated techniques involve finite state machines (see Chatterjee et al. [3]), set theoretic methods (see Gupta et al. [8]), Diophantine equations (see Kennedy et al. [10, 11]), Hermite forms and lattices (see Thirumalai and Ramanujam [17]), or linear programming (see Ancourt et al. [1]). A comparative survey of these algorithms can be found in Wang et al. [21], where it is reported that the most powerful algorithms can handle block cyclic distributions as efficiently ....

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In 1995 ACM/IEEE Supercomputing Conference. http://www.supercomp.org/sc95/proceedings, 1995.


Opus: A Coordination Language for Multidisciplinary.. - Chapman, Zima.. (1997)   (21 citations)  (Correct)

....can be accessed locally by the leader. Determining the communication schedule, i.e. what elements of an array are to be sent or received from which thread, is a complex task. Several groups have been studying algorithms and heuristics to determine the most efficient schedule [2, 11, 26, 16, 22, 27, 31]. We have adopted (and augmented) the finite state machine (FSM) method for local address set calculation developed by Chatterjee et al. [11] in our current prototype. The FSM method exploits the repeating patterns of local array indices to determine the elements of a distributed array that each ....

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proceedings of the International Conference on Supercomputing, pages 180--184, Barcelona, Spain, July 1995. ACM Press.


Page-level Affinity Scheduling for Eliminating False Sharing - Bodin, Granston, Montaut   (Correct)

....before relinquishing the page. This transformation is simpler to implement and can be applied in more cases than the transformation described here, but yields a smaller performance improvement and only when the amount of parallelism is moderate. Other researchers, for example [CGL 93, ACIK93, KNS94, AFMP95] have looked at using a block cyclic owner computes rule to compile data parallel languages such as HPF [KLS 94]. Some of these techniques are also based on generating and solving sets of inequalities. However, none of them have considered the approach of precomputing solutions to ....

Ken Kennedy, Nenad Nedeljkovic, and Ajay Sethi. Efficient Address Generation for Block-Cyclic Distributions. Technical report, Center for Research on Parallel Computation, Rice University, Technical Report No. CRPC-TR94487-S, Houston, Texas, December 1994.


Advanced Compilation Techniques for HPF - Ramanujam, Dutta, Venkatachar.. (1998)   (Correct)

....node code on distributed memory machines is important. For array sections, node code generation must exploit the repetitive access pattern exhibited by the accesses to distributed arrays. Several techniques for the efficient enumeration of the access pattern already exist. But only one paper [15] so far addresses the effect of the data structures used in representing the access sequence on the execution time. In [6, 7] we present several new data structures along with node code that is suitable for both DO loops and FORALL constructs. The methods, namely strip mining and table ....

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proc. ACM International Conference on Supercomputing, Madrid, Spain, pages 180--184, July 1995.


SUPPLE: an Efficient Run-Time Support for Non-Uniform Parallel .. - Orlando, Perego (1996)   (2 citations)  (Correct)

....support array redistribution directives and thus, the communication schedules used to assign a cyclically distributed array to a block distributed one, and vice versa, may not be fully optimized. The performance results reported in Fig. 12(b) might be different if optimized array redistribution [21, 22] were supported.

Table 3: Data volumes transferred over the interconnection network

IMPLEMENTATION             DATA VOLUMES
SUPPLE                     from 2.5 MB to 19.4 MB
CRAFT (BLOCK distrib.)     2.3 MB
CRAFT (CYCLIC distrib.)    320 MB
CRAFT (redistribution)     159.8 MB

5 Related work The parallel loop scheduling problem has ....

K. Kennedy, N. Nedeljković, and A. Sethi, "Efficient address generation for block-cyclic distributions," in Proc. of the 1995 ACM Int. Conf. on Supercomputing, July 1995, pp. 180--184.


State of the Art in Compiling HPF - Coelho, Germain (1996)   (7 citations)  (Correct)

....extension to alignment strides simply applies this algorithm twice. The generalization to multidimensional arrays is straightforward. The technique handles simple regular communications, such as shifts. The initial algorithm was improved to generate the transition tables faster [23, 28, 40]. In [29] the method is extended to enumerate affine and coupled subscript accesses for general cyclic distributions. Such extensions typically require several tables per array dimension to handle the various indexes appearing in an affine subscript. These automaton based techniques are fast at enumerating ....

Ken Kennedy, Nenad Nedeljković, and Ajay Sethi. Efficient address generation for block-cyclic distributions. In ACM International Conference on Supercomputing, pages 180--184, June 1995.


Loop Transformations to Prevent False Sharing - Bodin, Granston, Montaut (1995)   (4 citations)  (Correct)

....according to an owner computes rule: the processor that owns the data on the left side of statement performs the computation. The problem of partitioning computation according to a general block cyclic owner computes rule has been studied by several groups of researchers [CGL 93, ACIK93, KNS94, AFMP95]. The solution that we propose transforms the problem of eliminating false sharing into a similar problem: we assign pages to processors in a block cyclic fashion and then partition computation accordingly. However, there are several significant differences that arise primarily from our ....

Ken Kennedy, Nenad Nedeljkovic, and Ajay Sethi. Efficient Address Generation for Block-Cyclic Distributions. Technical report, Center for Research on Parallel Computation, Rice University, Technical Report No. CRPC-TR94487-S, Houston, Texas, December 1994.


Efficient Address Generation for Affine Subscripts in.. - Shih, Sheu (1998)   (Correct)

....machine (FSM) approach is proposed to traverse the local memory access sequence of each processor [3]. The method is a table based approach. The table construction needs to solve k linear Diophantine equations and incurs a sorting operation. The work improving the FSM approach [3] is proposed in [10, 11, 20]. Efficient FSM table generation is proposed. The improved work enumerates the local memory access sequences by viewing the accessed elements as an integer lattice. The sorting step in [3] is avoided in the improved work. In [7] the authors use the virtual processors to generate communication sets ....

....FSM approach or virtual processor approach except some modifications. However, most of them consider the simple array subscript. That is, the array subscripts contain only one induction variable. Recently, several efforts on compiling array references with affine array subscripts are proposed [1, 10, 11, 15, 17, 22]. Affine array subscript means the array subscript is a linear combination of multiple induction variables (MIVs). In [1] the authors use a linear algebra framework to generate communication sets for affine array subscripts. Complex loop bounds and local array subscripts of the generated code ....

[Article contains additional citation context not shown here]

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proceedings of ACM International Conference on Supercomputing, pages 180--184, July 1995.


Code Generation for Complex Subscripts in Data-Parallel Programs - Ramanujam Swaroop (1997)   (3 citations)  (Correct)

....generating the access sequences for MIV based on problem parameters. With coupled subscripts, we present two construction techniques, namely searching and hashing which minimize the time needed to construct the tables. Extensive experiments were conducted and the results were then compared with [8] to indicate the efficiency of our approach. 1 Introduction Languages such as High Performance Fortran (HPF) [6] and Vienna Fortran [2] are being used to program massively parallel distributed memory machines. The compiler directives in these languages allow a programmer to specify the details ....

....burden on the compiler by rendering address generation of the memory accesses involved expensive. Therefore the need arises for time and space efficient techniques to generate these addresses. Several earlier papers deal with address generation for CYCLIC(k) distribution for simple subscripts [1, 3, 5, 7, 8, 10, 12, 13, 14, 15, 16, 17]. The array references in general may involve complex subscript functions. This introduces an additional burden on the compiler to generate efficient run time SPMD code. The two particularly interesting types of affine subscripts that occur commonly in such programs are the multiple induction ....

[Article contains additional citation context not shown here]

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proc. ACM International Conference on Supercomputing, Madrid, Spain, July 1995.


The Sparse Cyclic Distribution against its Dense.. - Bandera, Ujaldon.. (1997)   (Correct)

....such as those addressed in our present work. In the data parallel paradigm, many language and compiler features have been proposed for extending the HPF standard through a successful parallelization of non regular applications [9, 5]. For CYCLIC distributions, Benkner [4] and Nedeljkovic et al. [7] have proposed different translation schemes for dense arrays. For sparse distributions, MRD was developed and implemented by Ujaldon [11], whereas BRS is still on the way. 6. Conclusions In this paper, we have shown methods for integrating and dealing jointly with dense and sparse matrix ....

N. Nedeljkovic, K. Kennedy, A. Sethi, Efficient Address Generation for Block-Cyclic Distributions, Proceedings 9th ACM Int'l Conf. on Supercomputing, Barcelona (Spain), pp. 180-184, July 1995.


A Linear Algebra Framework for Static HPF Code.. - Ancourt, Coelho.. (1995)   (63 citations)  (Correct)

....Moreover, accesses to an auxiliary data structure, the fsm transition map, add to the overhead. Note that the code generated in Figure 9 may be used to compute the fsm. In fact the lower iteration of the innermost loop is computed by the algorithm that constructs the fsm. Kennedy et al. [43, 42, 38, 36] have suggested improvements to this technique, essentially to compute faster the automaton transition map. Papers by Stichnoth et al. [61, 60] on the one hand and Gupta et al. [34, 35] on the other present two similar methods to solve the same problem. They use array sections but compute some of ....

Ken Kennedy, Nenad Nedeljković, and Ajay Sethi. Efficient address generation for block-cyclic distributions. CRPC-TR 94497-S, Center for Research on Parallel Computation, Rice University, December 1994. Submitted to ICS'95.


Algorithmic Redistribution Methods for Block Cyclic Decompositions - Petitet (1996)   (9 citations)  (Correct)

....algorithms are available [65] to solve these equations. This method is thus very general and relatively inexpensive in terms of time. Still, this approach is the most powerful and expensive method in terms of memory requirements. Dynamic storage facilities are needed for the quadruplet solutions [63]. It can be adapted to accommodate variations of the block cyclic distributions that are supported by the HPF language. 2.5 LCM Tables Definition 2.5.1 The k-diagonal of a matrix is the set of entries a_ij such that i - j = k. Remark. With this definition the 0-diagonal is the main ....

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient Address Generation For Block-Cyclic Distributions. Technical Report CRPC-TR94485-S, Center for Research on Parallel Computation, 1994.


Runtime Performance of Parallel Array Assignment: An Empirical.. - Lei Wang (1996)   (11 citations)  (Correct)

....required manual intervention, which we did not consider feasible for use in an experimental setting. 2.3 Table driven methods The next family of algorithms are the ones we consider table driven. The three algorithms we consider in this category are the RIACS algorithm [2], the Rice algorithm [6, 7, 8], and the LSU algorithm [14]. The basic idea in this family of algorithms is to capture the regularities in the local data access patterns for a given array assignment statement using a table of size O(k). The original presentation of the RIACS algorithm was couched in terms of finite state ....

K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proceedings of the 1995 International Conference on Supercomputing, pages 180--184, Barcelona, Spain, July 1995.


Compiler Techniques for Determining Data Distribution and.. - Lee, Chen (1995)   (1 citation)  (Correct)

....sets. Chatterjee et al. enumerated the local memory access sequence of communication sets based on a finite state machine [3]. Kennedy et al. also presented algorithms, which were based on a finite state machine and an integer lattice method, for computing the local memory access sequence [8] [13] [14]. They also noticed that the data access patterns in the communication sets appeared periodically. They calculated communication sets based on a scanning technique similar to the merge sort for computing the intersection of two reference patterns corresponding to the left hand side and the ....

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proc. of ACM International Conf. on Supercomputing, pages 180--184, Barcelona, Spain, July 1995.


Efficient Index Generation for Compiling Two-Level Mappings.. - Shih, Sheu, Huang   (Correct)

....The construction of state table involves solving k linear Diophantine equations and a sorting operation. Moreover, the FSM approach is a runtime technique. High runtime overhead to enumerate local memory access sequences will be involved. The work improving the FSM approach [4] is proposed in [13, 14]. Efficient FSM table generation is proposed. The improved work enumerates the local memory access sequences by viewing the accessed elements as an integer lattice. The sorting step in [4] is avoided in the improved work. However, runtime resolution of Diophantine equations is also required. In ....

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proceedings of ACM International Conference on Supercomputing, pages 180--184, July 1995.


Table-Lookup Approach for Compiling Two-Level Data-Processor.. - Kuei-Ping Shih (1997)   (Correct)

....The construction of state table involves solving k linear Diophantine equations and a sorting operation. Moreover, the FSM approach is a runtime technique. High runtime overhead to enumerate local memory access sequences will be involved. The work improving the FSM approach [4] is proposed in [14, 13]. Efficient FSM table generation is proposed. The improved work enumerates the local memory access sequences by viewing the accessed elements as an integer lattice. The sorting step in [4] is avoidable in the improved work. However, runtime resolution of Diophantine equations is also required. The ....

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proceedings of ACM International Conference on Supercomputing, pages 180--184, July 1995.


SUPPLE: an Efficient Run-Time Support for Non-Uniform Parallel .. - Orlando, Perego (1996)   (2 citations)  (Correct)

....support array redistribution directives and thus, the communication schedules used to assign a cyclically distributed array to a block distributed one, and vice versa, may not be fully optimized. The performance results reported in Fig. 10(b) might be different if optimized array redistribution [18, 19] were supported. 5 Related work The parallel loop scheduling problem has been investigated in depth by researchers working on shared memory multiprocessors. Most proposals address the efficient implementation of loops by defining dynamic Self Scheduling policies which reduce synchronizations ....

K. Kennedy, N. Nedeljković, and A. Sethi, "Efficient address generation for block-cyclic distributions," in Proc. of the 1995 ACM Int. Conf. on Supercomputing, July 1995, pp. 180--184.


Communication Generation for Data-Parallel Languages - Sethi (1996)   (1 citation)  Self-citation (Sethi)   (Correct)

....
k = 64     s = 25     4.4   5.2   5.1   1093.9
           s = 100    4.9   5.7   5.7   1083.5
k = 256    s = 3      4.3   6.5   4.9   1079.3
           s = 25     4.4   6.7   5.2   1091.5
           s = 100    5.0   7.2   5.7   1094.2

Table 7.2 Execution times in milliseconds for different versions of loops with the MIV subscript.

locations on a demand driven basis can be applied [61]. Both of the two approaches, the one using the ΔM table (typically much smaller than ΔS or ΔG table) in the inner loop, and the other without any table space overhead, perform only slightly worse than the address generation based on using both the ΔG and ΔM tables, and ....

.... or ΔG table) in the inner loop, and the other without any table space overhead, perform only slightly worse than the address generation based on using both the ΔG and ΔM tables, and therefore should be methods of choice if memory overhead needs to be reduced or completely eliminated [61]. 7.1.3 Coupled subscripts All address generation methods presented so far deal only with one dimensional arrays. Chatterjee et al. have shown that for multidimensional regular array sections (corresponding to array references with independent subscripts) the memory access problem reduces to ....

[Article contains additional citation context not shown here]

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proceedings of the 1995 ACM International Conference on Supercomputing, Barcelona, Spain, July 1995.


Generating Global Name-Space Communication Sets for Array.. - Lee, Chen   (3 citations)  (Correct)

No context found.

K. Kennedy, N. Nedeljković, and A. Sethi. Efficient address generation for block-cyclic distributions. In Proc. of ACM International Conf. on Supercomputing, pages 180--184, Barcelona, Spain, July 1995.

CiteSeer.IST - Copyright Penn State and NEC