M. Gupta, S. Midkiff, E. Schonberg, et al. An HPF compiler for the IBM SP2. In Supercomputing 1995.
....standardize the syntax. The initial definition of the new language, HPF, was frozen in May 1993, and corrections were added in November 1994 [36]. Prototype compilers incorporating some HPF features are available [18, 19, 26, 81, 88, 14]. Commercial compilers from APR [64, 65], DEC [71, 17], IBM [42], and PGI [34, 68] are also being developed or are already available. These compilers implement part or all of the HPF Subset, which only allows static distribution of data and prohibits dynamic redistributions. This paper deals with this HPF static subset and shows how changes of basis and affine ....
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, Ko-Yang Wang, Wai-Mee Ching, and Ton Ngo. An HPF Compiler for the IBM SP2. In Workshop on Compilers for Parallel Computers, Malaga, pages 22--39, June 1995.
....hand-coded stencil optimization described in Section 3.2 balanced manner. Other than this issue, the implementation is extremely true to the F90 MPI version and includes its hand-coded stencil optimizations. Future work should consider how these 4D arrays might be avoided in other HPF compilers [20, 2] or using HPF extensions for sparse or irregular problems [22, 35]. CAF. The CAF implementation was written using the F90 MPI implementation as a starting point. Since both of these languages use a local per-processor view and Fortran 90 as their base language, the implementation simply involved ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Y. Wang, W. M. Ching, and T. Ngo. An HPF compiler for the IBM SP-2. In Proceedings of Supercomputing `95, December 1995.
....an analogous problem occurs in the context of automatic parallelization for message passing architectures. Detecting the potential use of high-level primitives such as broadcasts or reductions as opposed to low-level sends and receives can result in substantial savings in the cost of communication [11]. There are four different MATLAB compilers we are aware of: Falcon [5, 6, 7], a research compiler developed at Illinois, compiles MATLAB into FORTRAN 90; MCC [17], from the MathWorks, compiles into C; MATCOM [14], from the Israel Institute of Technology, compiles into C++; and MATCH [22] ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Supercomputing, December 1995.
....data flow variable called SAFE which can be used in a similar manner as our predicate P(i). Kennedy and Sethi [35, 36, 37] do not use a linear algebra framework; later work from the dHPF project at Rice [2, 3] includes the use of the Omega library for message optimizations. The IBM pHPF compiler [13, 23] achieves both redundancy elimination and message combining globally. But message combining is feasible only if the messages have identical patterns, or one pattern is a subset of another. The general block-cyclic distributions, however, can lead to complicated data access patterns and ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP-2. In Proc. Supercomputing 95, San Diego, CA, December 1995.
....programming. It is useful both for the application programmer and for the implementation of parallel languages. Several programming systems (e.g., SR [1], MPI [11]) provide multicasting to the programmer. Languages like Orca, HPF, and Jade use multicasting in their implementation (runtime system) [3, 5, 13]. In particular, multicast is much more suitable than unicast (i.e., point-to-point) communication for implementing replicated global information [14]. Although the designers of software systems have recognized the importance of multicast, modern network technology often does not support it in ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF Compiler for the IBM SP2. In Supercomputing '95, San Diego, CA, December 1995.
.... data distribution (via DISTRIBUTE) and parallelization (via INDEPENDENT) decisions [7]. The HPF standard does not include annotations to identify computations that may be pipelined, but Gupta et al. indicate that the IBM xlHPF compiler for the IBM SP-2 automatically recognizes and optimizes them [6]. Some forms of task-level pipelining are supported by HPF2, but no commercial compilers support the new standard. Furthermore, a representation of this form would look more like an MPI code, thus sacrificing the benefits of HPF. ZPL is a data parallel array programming language [13]. It ....
....This realizes a crude form of fine-grain pipelining when arrays happen to be traversed in the right way. Despite this, the inner loop communication prevents this code from being competitive with a true pipelined implementation. XLHPF. A published report indicates that IBM xlHPF performs pipelining [6]. The compiler does not provide an option for viewing the intermediate message passing code and the parallelization summary excludes this information, so we experimentally confirm that the compiler does indeed perform pipelining. Specifically, we observe that an HPF wavefront computation has ....
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, Ko-Yang Wang, Wai-Mee Ching, and Ton Ngo. An HPF compiler for the IBM SP2. In Proceedings of the 1995 ACM/IEEE Supercomputing Conference (CD-ROM), 1995.
....to access non-local data from remote processors on distributed memory architectures is commonly orders of magnitude higher than the cost of accessing local data. As a consequence, a key problem to effectively use distributed memory architectures is centered around efforts to optimize communication [5, 17, 3, 21, 16, 23, 2, 29, 30] which includes: message vectorization (hoisting communication outside of loops), message coalescing (removing redundant communication based on the same array), communication aggregation (combining messages based on different arrays), collective communication, communication latency hiding and ....
....time. For small problem sizes we achieve a performance gain varying from 15 to 32.5 for different machine sizes ranging from 4 to 16 processors. Only modest performance improvement is achieved for larger problem sizes. 6 Related Work and Discussion. Communication optimization [4, 30, 5, 17, 3, 21, 16, 23, 2, 29] has been extensively researched. Most existing compilers employ a combination of message aggregation and coalescing and collective communication which is based on single loop analysis. Pipelined communication [21, 16, 23, 2] has been introduced to achieve a fine-grain communication latency ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December 1995.
....was the DEC HPF compiler version V2.0 1 running on a cluster of eight DEC Sables with 250 MHz Alpha 20064 processors and 256 MB of memory. The Sables operated under DEC Unix version 3.2D and were connected by a 155 Mbit/sec ATM switch. The second HPF compiler we evaluated was the IBM HPF compiler [18] running on an IBM SP-2 with 66 MHz RS/6000 Power2 processors operating AIX 4.1 connected by a 120 Mbit/sec Omega switch. To evaluate the prototype Fortran D compiler on these machines, messages generated by the compiler were translated with a run-time library to Message Passing Interface (MPI) ....
....slowdown means speedups are reduced when calculated from the sequential Fortran 90 version of the program. The DEC HPF compiler also handles pipelined programs less well, achieving low speedups for SOR and Implicit. In comparison, the current HPF compiler performs well for all the programs tested [18]. 4.3.2 Analysis of Results. Our results indicate current HPF compilers achieve much better performance on message passing machines than early commercial efforts such as the CM Fortran compiler. The HPF compilers achieved good speedups for almost all applications, because they were designed (like ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, November 1995.
....to access non local data from remote processors on distributed memory architectures is commonly orders of magnitude higher than the cost of accessing local data. As a consequence, a key problem to effectively use distributed memory architectures is centered around efforts to optimize communication [4, 11, 2, 15, 10, 17, 1, 16] which includes: message vectorization, message coalescing, collective communication, communication latency hiding and pipelined communication. The effect of these optimizations is limited by the fact that most of the analysis in current parallelizing compilers is performed for a single loop nest ....
.... becomes superior to C2. 5 Related work and discussion. Communication optimization has been extensively researched. Most existing compilers employ a combination of message vectorization and coalescing and collective communication which is based on single loop analysis. Pipelined communication [15, 10, 17, 1] has been introduced to achieve a fine-grain communication latency hiding. Data flow analysis based communication generation has been addressed by several researchers both to eliminate redundant communication within a loop nest [1] and across loops [11, 18, 17, 12]. Von Hanxleden and Kennedy [12] ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December 1995.
....the same loop. A couple of scenarios have been considered. In one scenario, in order to compare our scheme with the ones used by commercial compilers, we have distributed all the arrays and vectors in a blocked manner. As Figure 5(a) shows, Pilar (Int) performs better than PGI hpf [7] and IBM hpf [8]. In the second scenario, we have distributed the sparse matrices using multiple recursive decomposition. The comparative results are shown in Figure 5(b). In this case, too, Pilar (Int) performs better than Pilar (Enu) and, as expected, the run times are better than the block distributed case. ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Y. Wang, W. M. Ching and T. Ngo, An HPF compiler for the IBM SP2, Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....structure is not compatible with the change in structure generated by compile-time schemes as highlighted in Figure 2(e, f). 2.3 Commercial HPF compilers. We have studied two commercial HPF compilers, the Portland Group's (PGI) HPF compiler (version 2.2) [16] and IBM's HPF compiler (version 1.1.0.0) [17], in order to observe the amount of communication, since reduction of communication is one focus of our work. We present here a simple code fragment and the amount of communication (in bytes) generated by the PGI HPF compiler and our scheme. More details of our experiments can be found in ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Y. Wang, W. M. Ching and T. Ngo, An HPF compiler for the IBM SP-2, Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....value of) a data element is computed by the processor owning that element. This rule provides fairly robust performance for relatively uniform loop nests in data parallel codes, as long as special patterns like reductions and computations involving privatizable variables are handled differently [21, 19, 16, 10]. Nevertheless, the rule is not optimal in general [14]. In particular, for loop nests with complex patterns of data reuse (due to data dependences), the owner-computes rule does not perform well and more sophisticated CP selection is required [1]. HPF provides a somewhat more general ON ....
....or to insert communication within the callee (causing expensive fine-grain communication). The compiler currently uses the latter. 5 Related Work. With few exceptions, research and commercial HPF compilers almost exclusively use the owner-computes rule [25] to assign CPs to statements [8, 21, 19, 15, 16, 9, 10, 28]. In these compilers, the only exceptions to this rule are for reductions and (in the IBM pHPF compiler [16]) for assignments to privatizable variables. Two compilers that use non-owner-computes CPs in more general cases are decHPF and SUIF. The decHPF compiler will in some cases compute the right ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....distributions, and computation partitionings. Program analysis and implementation techniques in current compilers, however, do not appear flexible enough to support a general class of computation partitionings and program optimizations. Most research and commercial data parallel compilers to date [23, 6, 22, 28, 17, 10, 11, 13, 7, 12, 31] (including the Rice Fortran D compiler) perform communication analysis and code generation using pattern matching. While such approaches can provide excellent performance where they apply, they may provide poor performance for patterns they cannot handle. More importantly, pattern based compilers ....
....with sender-side buffering that prevents overlap of communication and computation at run time. 6 Related Work. As explained in the Introduction, most research and commercial data parallel compilers to date use pattern-based approaches for implementing basic communication and iteration set analysis [23, 6, 22, 28, 30, 31, 10, 11, 13, 7, 12]. This is a fundamentally different approach from that taken in this paper, and its strengths and weaknesses have been discussed in the Introduction. There is also a large body of work on techniques to enumerate communication sets and iteration sets in the presence of cyclic(k) distributions ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....The first and most natural idea for managing local elements is to adapt the initial dimension by dimension Cartesian allocation in order to remove the regular holes coming from the cyclic distribution and alignment. For block-distributed dimensions, this may be done without any address translation [21]. [Figure 25: Regular compression for local allocation (row compression addressing and column compression addressing)] For simple shift accesses the ....
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, Ko-Yang Wang, Wai-Mee Ching, and Ton Ngo. An HPF Compiler for the IBM SP2. In Supercomputing, December 1995.
....performance against data parallel (HPF) and message passing (MPI) versions of each program. High Performance Fortran (HPF) applications were created by manually translating from Fortran to Fortran 90, with HPF data decompositions added for each array. On the IBM SP-2 we used the IBM HPF compiler [11] with the O2 flag. On the DEC cluster we used the DEC HPF compiler f90 version 2.0 1 with the O2 wsf fast flags. Message passing versions of each program were created using calls to communication routines specified under Message Passing Interface (MPI). On the IBM SP-2 we used the MPL version 2 ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, November 1995.
....the data locality and reuse benefits of fusion. Conversely, since our algorithm strives for maximum beneficial fusion, array contraction analysis could subsequently be performed on each context partition. Such a strategy would likely be just as successful in contracting arrays. IBM's pHPF compiler [11] will avoid the problem of over-fusing loops since it attempts loop fusion after SPMD loop generation. However, this requires that their compiler perform sophisticated symbolic analysis of the parameterized SPMD loop bounds to identify conformable loops. This is likely to limit their success of ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....sequential or shared memory parallel programs that are annotated with directives specifying data decomposition. The compilers for these languages are responsible for partitioning the computation and generating the communication necessary to fetch values of non-local data referenced by a processor [72, 123, 20, 5, 21, 68]. Accessing remote data is usually orders of magnitude slower than accessing local data, for the following reasons. It is getting increasingly cost-effective to build multiprocessors from commodity hardware components and system software. Most current generation CPUs are well beyond the 100 ....
....an interdependent and global manner. The algorithm achieves both redundancy elimination and message combining globally, and is able to reduce the number of messages to an extent that is not achievable with any previous approach. Our algorithm has been implemented in the IBM pHPF prototype compiler [68]. We report results from a preliminary study of some well known HPF programs. The performance gains are impressive. Reduction in static message count can be up to a factor of almost nine. Time spent in communication is reduced in many cases by a factor of two or more. We believe that these are ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, Dec. 1995.
....performance against data parallel (HPF) and message passing (MPI) versions of each program. High Performance Fortran (HPF) applications were created by manually translating from Fortran to Fortran 90, with HPF data decompositions added for each array. On the IBM SP-2 we used the IBM HPF compiler [10] with the O2 flag. On the DEC cluster we used the DEC HPF compiler f90 version 2.0 1 with the O2 wsf fast flags. Message passing versions of each program were created using calls to communication routines specified under Message Passing Interface (MPI). On the IBM SP-2 we used the MPL version 2 ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, Nov. 1995.
....portable, abstract programming model applicable to a wide variety of parallel systems. To achieve wide acceptance, a data parallel language requires parallelizing compilers that can provide consistently high performance for a broad class of applications. Current commercial compilers for HPF [8, 12, 14] typically yield This work has been supported in part by DARPA Contract DABT63 92 C 0038, the Texas Advanced Technology Program Grant TATP 003604 017, an NSF Research Instrumentation Award CDA9617383, and sponsored by DARPA and Rome Laboratory, Air Force Materiel Command, USAF, under agreement ....
....for different applications, even among regular data parallel applications on message passing systems. Regular applications are those with statically analyzable data access patterns and, usually, well balanced computational costs. Most research and commercial data parallel compilers to date [22, 7, 21, 29, 16, 11, 12, 14, 8, 13, 32, 33] (including the Rice Fortran 77D compiler) perform communication analysis and code generation by considering specific combinations of the form of references, data layouts and computation partitionings. Such case based analysis has been the principal implementation technique for data parallel ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, Dec. 1995.
....by an affine mapping function and the array sections accessed by each array reference can be computed symbolically at compile time. Even within this class of applications, state-of-the-art commercial and research compilers do not consistently achieve performance competitive with hand-written code [16, 24]. Although many important optimizations for such systems have been proposed by previous researchers, current compilers implement only a small fraction of these optimizations, generally focusing on the most fundamental ones such as static loop partitioning based on the owner-computes rule [39] ....
....ones such as static loop partitioning based on the owner-computes rule [39], moving messages out of loops, reducing the number of data copies, and exploiting collective communication. Furthermore, even for these optimizations, most research and commercial data parallel compilers to date [7, 10, 15, 16, 17, 19, 24, 32, 33, 35, 42, 45, 46] (including the Rice Fortran 77D compiler [24]) perform communication analysis and code generation for specific combinations of the form of references, data layouts and computation partitionings. While such case-based approaches can provide excellent performance where they apply, they will ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....optimizations depend not only on the dependence and data flow information, but also on the resource constraints. It describes how communication optimizations are performed in the presence of not only arbitrary control flow but also resource constraints. 4. Current state-of-the-art HPF compilers [38, 3, 14] support techniques for simple regular distributions with a limited set of subscript patterns and provide either no or inefficient support for the general regular distribution (specifically, the cyclic(k) distribution; it is defined in Section 2.1.1). This dissertation describes the first ....
....computation partitioning. This phase is responsible for dividing work among processors. It determines the set of processors that execute every statement in the program. This phase can select a different set of processors to execute different statements in a loop body. Unlike other HPF compilers [50, 38, 3, 14] that use either only the owner-computes rule or the same computation partitioning for all the statements in a loop body, we evaluate the cost of various computation partitioning options and select the best option for each statement. The traditional computation partitioning rule, called owner computes ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....distributions, and computation partitionings. Program analysis and implementation techniques in current compilers, however, do not appear flexible enough to support a general class of computation partitionings and program optimizations. Most research and commercial data parallel compilers to date [21, 5, 20, 26, 17, 9, 10, 13, 6, 11, 29] (including the Rice Fortran D compiler) perform communication analysis and code generation using pattern matching. While such approaches can provide excellent performance where they apply, they may provide poor performance for patterns they cannot handle. More importantly, pattern based compilers ....
....with sender-side buffering that prevents overlap of communication and computation at run time. 6 Related Work. As explained in the Introduction, most research and commercial data parallel compilers to date use pattern-based approaches for implementing basic communication and iteration set analysis [21, 5, 20, 26, 28, 29, 9, 10, 13, 6, 11]. This is a fundamentally different approach from that taken in this paper, and its strengths and weaknesses have been discussed in the Introduction. There is also a large body of work on techniques to enumerate communication sets and iteration sets in the presence of cyclic(k) distributions ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....code into equivalent sequential Fortran77D HPF code, and then a Fortran77D compiler which performs program analysis, optimization, and code generation. See Figure 2.3 for an outline of a compiler which utilizes this model. This strategy is used by many of the HPF compilers for MIMD architectures [81, 85, 19], and it is not limited to compilers for distributed memory machines [20]. The advantages of this model are fairly clear. By exploiting an existing Fortran77D compiler, a Fortran90D compiler can be created in a much shorter time span. In addition, the Fortran90D compiler gains from the years of ....
....It simplifies the compiler, and when it works it works surprisingly well. But when it cannot match a pattern, the code produced is mediocre at best. An example of the vast differences in code quality produced by such compilers is presented later in this dissertation. xlhpf. IBM's xlhpf compiler [81] is naturally classified as a scalarizing compiler as defined in Section 2.4.1. This is due to the fact that the first action taken after the intermediate representation is created is the scalarization of the array language into Fortran77 scalar form. Data dependence analysis is then performed to ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....performs all the data movement for single-statement stencils written using shift intrinsics. This strategy is shared by many Fortran90 HPF compilers that focus on handling scalarized code. As with the CM 2 stencil compiler, our methodology is a strict superset of this strategy. Gupta, et al. [12], in describing IBM's xlhpf compiler, state that they are able to reduce the number of messages for multi-dimensional shifts by exploiting methods similar to ours. However, they do not describe their algorithm for accomplishing this, and it is unknown whether they would be able to eliminate the ....
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
....the reduce operators in the ZPL source (a) min and max . factors preserve the source code's high-level semantics. In contrast to this high-level approach, the IBM HPF compiler may lose semantic information because it scalarizes Fortran 90 array structures early in the compilation process [12]. The ZPL compiler thus employs standard compilation concepts and techniques, but extends them to exploit the language's abstractions and to treat arrays atomically. Our presentation of the ZPL compiler will assume an understanding of scalar compilation and concentrate only on areas of difference. ....
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, Ko-Yang Wang, Wai-Mee Ching, and Ton Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, December 1995.
....sequential or shared memory parallel programs that are annotated with directives specifying data decomposition. The compilers for these languages are responsible for partitioning the computation, and generating the communication necessary to fetch values of non local data referenced by a processor [15, 30, 4, 3, 5, 12]. Accessing remote data is usually orders of magnitude slower than accessing local data. This gap is growing Computer Science Division, U.C. Berkeley, CA 94720. Partly supported by ARPA DOD (DABT63 92 C 0026) DOE (DE FG0394ER25206) and NSF (CCR 9210260, CDA 8722788 and CDA9401156) Part of ....
....extensively researched, from local single loop nest to global and even interprocedural optimizations. The earliest and most commonly used optimizations include message vectorization [15, 30] using collective communication [11, 20] message coalescing [15] and exploiting pipelined communication [15, 12], all within the scope of a single loop nest. Local analysis of array accesses based on dependence testing alone often retains redundant communication. Naturally, the next step was the use of dataflow analysis, e.g. using precise array dataflow analysis to detect redundant communication within a ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, Dec. 1995.
....with directives specifying data decomposition. The compilers for these languages are responsible for partitioning the computation, and generating the communication necessary to fetch values of non local data referenced by a processor. A number of such prototype compilers have been developed [18, 33, 23, 26, 22, 25, 3, 15, 28]. Since the cost of interprocessor communication is usually orders of magnitude higher than the cost of accessing local data, it is extremely important for the compilers to optimize communication. The most common optimizations include message vectorization [18, 33] using collective communication ....
....vectorization is accounted for by the computation of ANTLOC for an entire interval, as it characterizes the communication that can be moved outside a loop. Since message vectorization is a well understood optimization implemented by most distributed memory compilers based on data dependence [33, 26, 18, 23, 22, 15], we shall focus on other important optimizations that require the generality of data flow analysis. Both of the equations for determining INSERT and REDUND inherently capture the elimination of redundant communication. When communication is moved and inserted at some other place, the available ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K.Y. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December 1995.
....merely annotated with directives specifying the distribution of data across processors. The compilers for these languages generate parallel programs in single program multiple data (SPMD) form, and generate the communication necessary to fetch values of non-local data referenced by each processor [9, 19, 15, 10, 6]. The knowledge of data mapping and computation partitioning at compile time offers many opportunities for reducing synchronization costs. Tseng [18] and O'Boyle and Bodin [13] have recently presented techniques that exploit these opportunities on shared memory and distributed shared memory (DSM) ....
....techniques to reduce the cost of synchronization. Section 5 describes the results of a preliminary study on the effectiveness of our analysis for HPF programs. Finally, Section 6 presents conclusions. 2 HPF Compilation Framework We illustrate our ideas by discussing them in the context of pHPF [6], a prototype compiler for HPF, that currently generates communication on IBM SP1 and SP2 machines using send receive as the basic primitive. We describe how communication generation could instead be done, in a future version, using get or put, which are being implemented in software on SP2 as ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K.Y. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December 1995.
....processor which owns the data being modified in that computation. The owner-computes rule and its generalized variants (where the computation may be assigned to processors that own some other data not being modified) are followed by most compilers for languages like High Performance Fortran (HPF) [13, 11, 19, 14, 2, 9, 1]. Many of these compilers have not paid adequate attention to the problem of mapping privatizable scalar and array variables. This paper presents a framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We describe different alternatives ....
....of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to execution of control flow statements as well. The ideas presented in this work have been implemented in the pHPF prototype compiler for HPF [9]. Our preliminary results are very encouraging. 2 Mapping of Privatized Scalars 2.1 Alignment Choices We shall first illustrate the need for scalar privatization and for different kinds of alignment using an example shown in Figure 1. It is necessary to privatize each of the variables m, x, y, ....
[Article contains additional citation context not shown here]
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December 1995.
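The owner-computes rule described in the context above can be sketched as follows. This is only an illustration under stated assumptions, not the pHPF implementation: the block distribution, the stencil statement, and the names block_owner and owner_computes_partition are hypothetical. Each iteration of a[i] = b[i-1] + b[i+1] is assigned to the processor that owns the element a[i] being modified.

```python
def block_owner(i, n, p):
    """Owner of element i under an assumed block distribution of
    n elements over p processors."""
    block = -(-n // p)  # ceiling division: block size per processor
    return i // block


def owner_computes_partition(n, p):
    """Map each processor to the iterations i of
    `a[i] = b[i-1] + b[i+1]` (1 <= i <= n-2) that it executes,
    i.e. the iterations whose left-hand side a[i] it owns."""
    work = {proc: [] for proc in range(p)}
    for i in range(1, n - 1):
        work[block_owner(i, n, p)].append(i)
    return work
```

With n = 8 and p = 2 under this assumed layout, processor 0 executes iterations 1 to 3 and processor 1 executes iterations 4 to 6; the references b[4] in iteration 3 and b[3] in iteration 4 are the nonlocal reads that would require communication.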
....Therefore, we believe that the suite of APR benchmarks is not well suited for evaluating HPF compilers in general. Similarly, papers by vendors describing their individual HPF compilers typically show some performance numbers; however, it remains difficult to make comparisons across compilers [8, 12, 13]. Lin et al. used the APR benchmark suite to compare the performance of ZPL versions of the programs against the corresponding HPF performance published by APR and found that ZPL generally outperforms HPF [17]. However, without access to the APR compiler at the time, detailed analysis was not ....
....pattern that requires communication is simply a shift by a constant, which results in a simple neighbor exchange in the processor grid. All compilers (ZPL and HPF) recognize this pattern well and employ optimizations such as message vectorization and storage preallocation for the nonlocal data [3, 8, 9, 12]. Therefore, although the benchmark is rather complex, the initial indication is that both HPF and ZPL should be able to produce efficient parallel programs. The benchmark is a V cycle multigrid algorithm for computing an approximate solution to the discrete Poisson problem: $\nabla^2 u = v$, where $\nabla^2$ is the Laplacian ....
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, KoYang Wang, Wai-Mee Ching, and Ton Ngo. An HPF compiler for the IBM SP2. In Supercomputing 1995.
....Therefore, we believe that the suite of APR benchmarks is not well suited for evaluating HPF compilers in general. Similarly, papers by vendors describing their individual HPF compilers typically show some performance numbers; however, it remains difficult to make comparisons across compilers [8, 12, 13]. Lin et al. used the APR benchmark suite to compare the performance of ZPL versions of the programs against the corresponding HPF performance published by APR and found that ZPL generally outperforms HPF [17]. However, without access to the APR compiler at the time, detailed analysis was not ....
....pattern that requires communication is simply a shift by a constant, which results in a simple neighbor exchange in the processor grid. All compilers (ZPL and HPF) recognize this pattern well and employ optimizations such as message vectorization and storage preallocation for the nonlocal data [3, 8, 9, 12]. Therefore, although the benchmark is rather complex, the initial indication is that both HPF and ZPL should be able to produce efficient parallel programs. The benchmark is a V cycle multigrid algorithm for computing an approximate solution to the discrete Poisson problem: $\nabla^2 u = v$ where $\nabla$ ....
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, KoYang Wang, Wai-Mee Ching, and Ton Ngo. An HPF compiler for the IBM SP2. In Supercomputing 1995, San Diego, December 1995. IBM T. J. Watson Research Center, IEEE.
No context found.
M. Gupta, S. Midkiff, E. Schonberg, et al. An HPF compiler for the IBM SP2. In Supercomputing 1995.
No context found.
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM), page 71. ACM Press, 1995.
No context found.
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP-2. Proc. of Supercomputing '95, San Diego, California. ACM Press, New York, 1995.
No context found.
Manish Gupta, Sam Midkiff, Edith Schonberg, Ven Seshadri, David Shields, KoYang Wang, Wai-Mee Ching, and Ton Ngo. An HPF compiler for the IBM SP2. In Proceedings of the
No context found.
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF Compiler for the IBM SP2. December 1995.
No context found.
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
No context found.
M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP-2. In Proc. Supercomputing '95, San Diego, California. ACM Press, New York, 1995.