| Palermo D., Su E. et al. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. Proceedings of the 1994. |
....is performed through calls to libraries such as MPI or OpenMP. The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: tile shape optimization [4] and tile size optimization [2, 8, 12, 14]. By its very nature, such a two-step approach may not be globally optimal, but is often used in order to make the problem tractable (some authors also attempt to resolve both problems under some simplifying assumptions [15, 16, 9]). The tile .... [Footnote: Supported by a grant of Région Nord-Pas-de-Calais.]
D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed memory multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994. IEEE.
....this is a hard, discrete, nonlinear optimization problem, and there is currently no solution. However, optimal solutions can be found analytically under certain restrictions. The problem is usually decomposed into two subproblems: tile shape optimization [5] and tile size optimization [3, 11, 19, 25]. By its very nature, such a two-step approach is not globally optimal, but often makes the problem tractable. Some authors simultaneously resolve both problems under certain simplifying assumptions [26, 27, 12]. Whether tiling is used for locality enhancement (i.e. optimizing the performance of ....
....of optimizing a much more realistic cost measure, namely the running time of a tiled program on a parallel machine. For 2-D orthogonal tiling, King et al. [20] consider the case of square tiles (i.e. the height and width of the tiles are both equal). Hiranandani et al. [11] and Palermo et al. [25] consider rectangular tiles, but only for a block distribution of tile rows to processors (i.e. the tile height is fixed to be s = p ) under slightly different machine models. The latter two results were in the context of prototype compiler implementations, and although the compiler handles ....
[Article contains additional citation context not shown here]
D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed memory multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994. IEEE.
....has been posted so that processors never need to wait for data, AIMS probes show that in many instances processors are actually waiting for messages from their predecessors in the pipeline. A more realistic model, such as that proposed by Adve et al. [3], Hiranandani et al. [4] and Palermo et al. [5], ascribes certain communication overheads to the processors in the parallel machine, rather than to the communication network. More specifically, they assume that a fixed amount of CPU time is spent by a processor that receives and processes ("handles") a message. Since the first processor never ....
....appearance of a synchronized pipeline result with a slightly modified subtask duration. It should be noted that padding by itself is usually not the best strategy for optimizing fine-grain pipelines. Grouping several subtasks together to increase granularity (see, for example, references [4] and [5]), if possible, is generally more effective, because it reduces the communication overhead and the frequency of interrupts. Grouping and padding can also be combined to obtain an optimally performing pipeline [8]. This is especially effective if the optimal grain size is still relatively fine, ....
D.J. Palermo, E. Su, J.A. Chandy, P. Banerjee, Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers, International Conference on Parallel Processing, St. Charles, IL, August 1994
....performed through calls to libraries such as MPI or OpenMP. The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: tile shape optimization [BDRR94] and tile size optimization [AR97, HKT94, KCN90a, PSCB94] (some authors also attempt to resolve both problems under some simplifying assumptions [RS91, SD90, ES96]). By its very nature, such a two-step approach may not be globally optimal, but is often used in order to make the problem tractable. The tile size problem seeks, for a given tile shape, to ....
D. Palermo, E. Su, A. Chandy, and P. Banerjee. Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994.
....times as predicted by the model compare favorably to actually measured performance. A number of techniques such as aggregating messages and increasing granularity (grouping) are routinely used to optimize the performance of pipelines, as discussed, for example, by Hiranandani and Palermo et al. [4, 7]. Here we demonstrate that in addition to the above optimizations, a significant part of the delay of fine grain software pipelines implemented on MIMD distributed memory parallel computers can be eliminated by removing dynamic load imbalances created by interrupts. As shown in Section 5, a number ....
....inset) as does the amount of time spent waiting between subtasks. This variation shows because AIMS does not explicitly monitor system level operations on each processor. A clear fan out of message transfer lines is visible between processor 1 and processor 2 (which was also observed by Palermo [7]) and to a lesser extent between 2 and 3. This implies that some subtasks take longer to complete on a certain processor than on its predecessor in the pipeline. But there are also phases in the pipeline algorithm during which message transfer lines are parallel, indicating that communicating ....
[Article contains additional citation context not shown here]
Palermo, D.J., Su, E., Chandy, J.A. and Banerjee, P. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. Proc. Int. Conf. Parallel Processing, St. Charles, IL, 1994
....Synchronization overhead can be reduced as shown in Figure 7.10a by combining multiple synchronizations into one. This increases the dependence latency but reduces synchronization overhead. The discussion of combining synchronization in this section presents minor extensions to previous work [PSCB94] but is discussed in some detail here for completeness. The example in Figure 7.10a shows two regions of straight line code being overlapped in different Istreams, where each Istream executes on one processor. If the two regions happen to be loops, the principle is the same. Synchronization may ....
....of every iteration. This reduces the synchronization overhead, but increases the dependence latency as shown in Figure 7.10b. There are two ways of synchronizing every SI iterations. One is to test indices of enclosing loops for the right modulus. Another method, called coarse-grained pipelining [PSCB94, HKT92], applies strip mining: the synchronized loop is converted to a nested loop, where .... [Fig. 7.9. Drift in dependence latency for worst-case iteration distance. In (a) the dependence latency shifts from positive to negative, the opposite of (b).] ....
[Article contains additional citation context not shown here]
Daniel J. Palermo, Ernesto Su, John A. Chandy, and Prithviraj Banerjee. Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In International Conference on Parallel Processing, St. Charles, IL, 1994.
....of every iteration. This reduces the synchronization overhead, but increases the dependence latency as shown in Figure 5b. There are two ways of synchronizing every SI iterations. One is to test indices of enclosing loops for the right modulus. Another method, called coarse grained pipelining [PSCB94, HKT92] applies strip mining: the synchronized loop is converted to a nested loop, where the inner loop executes SI iterations and the outer loop contains the synchronization. The minimum synchronization interval SI is computed for each nesting level. The computation is done for inner loops ....
....derivative of the symbolic expression for the overall latency with respect to SI, setting it to zero and solving for SI. Pedigree uses these SI values to select the most effective parallelization of a program. The technique used for selecting SI is essentially the same as in Palermo et al. [PSCB94], except that we compute SI for each parallelized nesting level and they consider parallelization at only a single nesting level. Table 2 shows sample results from some of the SDIO benchmarks [Nic91] for the frequency of synchronization among iterations as a function of synchronization overhead. ....
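The strip-mining step described in the excerpts above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from Pedigree or the PARADIGM compiler: the names `run_strip_mined`, `body`, `sync`, and `count_syncs` are hypothetical, and the synchronization is modelled as a plain callback rather than a real inter-processor operation.

```python
def run_strip_mined(n_iters, si, body, sync):
    """Run body(i) for i in 0..n_iters-1, synchronizing once per strip.

    `si` plays the role of the synchronization interval SI from the excerpt;
    `body` and `sync` are hypothetical callbacks standing in for the loop
    body and the synchronization operation.
    """
    if si < 1:
        raise ValueError("synchronization interval must be at least 1")
    for start in range(0, n_iters, si):       # outer loop: one pass per strip
        stop = min(start + si, n_iters)       # the last strip may be shorter
        for i in range(start, stop):          # inner loop: at most SI iterations
            body(i)
        sync(start // si)                     # synchronization sits in the outer loop


def count_syncs(n_iters, si):
    """Count the synchronizations performed by the strip-mined loop."""
    syncs = []
    run_strip_mined(n_iters, si, body=lambda i: None, sync=syncs.append)
    return len(syncs)
```

With 10 iterations, `count_syncs(10, 1)` performs one synchronization per iteration, while `count_syncs(10, 4)` performs only three, which is the overhead reduction the excerpt trades against a longer dependence latency.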
Daniel J. Palermo, Ernesto Su, John A. Chandy, and Prithviraj Banerjee. Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In ICPP, 1994.
....access pattern descriptors that are generated for the vectors are { [ E ) E ) O ) O ) O ) O ) }. The communication cost is bounded above by C = p [ T startup T transmit (100 p). We assume that the HPF compiler performs message vectorisation and message coalescing [29]. [Footnote: On PVM on a network of workstations.] Algorithm Axis Align. Input: A distribution block. Output: An axis alignment. Generate a set of ordered lists of vectors for the distribution block. C total communication cost for the distribution block. For each array reference in the distribution ....
D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Communication optimizations used in the paradigm compiler for distributed-memory multicomputers. In Intl. Conf. on Parallel Processing, August 1994.
....the tile shape was given. Schreiber and Dongarra [21] gave heuristics to choose the tile shape and size in an optimal manner. This problem has also been tackled by Ramanujam and Sadayappan [20], Boulet et al. [6] and recently by Calland and Risset [7]. Hiranandani et al. [11] and Palermo et al. [19] have also developed techniques for tile size optimization which are incorporated into the Fortran D and Paradigm compilers at Rice and Illinois respectively. To properly understand these methods, let us first recall some notation, mostly drawn from the lucid exposition by Boulet et al. [6]. A ....
....D compiler developed at Rice University, goes to the other extreme by assuming that communication cost is constant, independent of message size [11]. This gives reasonable results, but they are not optimal. Palermo et al. have shown [19] that accounting for the message size yields improvements. Furthermore, many other factors come into play, such as the nature of the final code (number of special boundary conditions, etc.), ease of automatic code generation, etc. In practice, there are a fairly small number of possible tile .... [Running header: Irisa, Optimal Tiling of Two Dimensional Uniform Recurrences]
[Article contains additional citation context not shown here]
D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed memory multicomputers. In International Conference on Parallel Processing, pages xx--yy, St. Charles, IL, August 1994. IEEE.
....prediction of a pipelined phase execution is more complicated than in the loosely synchronous case due to the structure of the underlying critical execution path. Execution models that can estimate pipelines of different granularity have been discussed in the literature, for instance in [26, 32]. For a pipelined phase, Fortran RED uses the innermost level that carries a true dependence to determine the granularity of the pipeline. This level is referred to as the pipeline level. The performance estimate for a single pipeline stage is the predicted computation and memory access cost for a ....
D. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. In Proceedings of the 1994 International Conference on Parallel Processing, 1994.
....available parallelism with the costs of communications when a nested loop program is executed in SPMD (Single Program Multiple Data) fashion on a DMM is the iteration space tiling (also called supernode partitioning) [27, 14, 1, 22]. It may be used as a technique in parallelizing compilers (see [13, 20], where it is called coarse-grain pipelining) as well as in performance tuning of parallel codes by hand (see also [16, 23, 24, 17]). A tile in the iteration space is a collection of iterations to be executed as a single unit with the following protocol: all the (non-local) data required for each ....
....tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: choosing a good tile shape [14, 23, 22, 6, 7] and finding the best tile size [18, 23, 15, 4, 13, 20]. For the former problem, the communication cost is approximated by the number of dependency vectors crossing a tile boundary. The latter problem assumes that the tile shape is first given and then seeks to minimize the total execution time. In general, this appears to be a hard integer nonlinear .... [Running header: Andonov, Yanev and Bourzoufi, Three Dimensional Orthogonal Tile Sizing Problem]
[Article contains additional citation context not shown here]
D. Palermo, E. Su, A. Chandy, and P. Banerjee. Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994.
....added up to determine the overall phase performance. Performance prediction of a pipelined phase execution is more complicated than in the loosely synchronous case. Execution models that can estimate pipelines of different granularity have been discussed in the literature, for instance in [MCAK94, PSCB94]. The actual choice of a particular pipeline model will depend on the desired accuracy. A detailed discussion of the execution model used in the prototype implementation of our layout assistant tool can be found in Section 5.2.1. Machine Model: The actual costs of communication operations and ....
D. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. In Proceedings of the 1994 International Conference on Parallel Processing, St. Charles, IL, August 1994.
....arrays. 1. Introduction. Tiling the iteration space [7, 12, 15] is a common method for improving the performance of parallel loop programs executed in SPMD (Single Program Multiple Data) fashion on a DMM (distributed memory machine). It may be used as a technique in parallelizing compilers (see [6, 11], where it is also called coarse-grain pipelining) as well as in performance tuning of parallel codes by hand (see also [9, 13, 14]). A tile in the iteration space is a collection of iterations to be executed as a single unit with the following protocol: all the (non-local) data required for each ....
....no calls to any communication routine. The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: choosing an optimal tile shape [3, 4] and finding an optimal tile size [1, 6, 8, 11] (some authors attempt to resolve both problems under some simplifying assumptions [12, 13]). For the former problem, the communication cost is approximated by the number of dependency vectors crossing a tile boundary. The latter problem assumes that the tile shape is first given and then seeks to ....
[Article contains additional citation context not shown here]
D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed memory multicomputers. In International Conference on Parallel Processing, pages xx--yy, St. Charles, IL, August 1994. IEEE.
....approach to solve the computation communication alternative when a nested loop program is executed in SPMD (Single Program Multiple Data) fashion on DMM is the iteration space tiling (also called supernode partitioning) [25, 13, 22]. It may be used as a technique in parallelizing compilers (see [11, 19], where it is called coarse-grain pipelining) as well as in performance tuning of parallel codes by hand [15, 23, 24]. A canonical tile is an n-dimensional parallelepiped box in an n-dimensional .... [Footnotes: Corresponding author. Currently visiting professor in LIMAV. Supported by NATO Research Grant ....]
....techniques to this model. 2.2 Relaxing the systolic model: execution time on DMM. We now develop an expression for the running time of such a tiled program on a distributed memory machine (DMM) which is valid for the range of all possible values of x_i, using the standard communication model [19, 17, 10, 9]. The code executed for a tile is the standard loop (the receive call is a blocking one to ensure synchronization): repeat receive(v1); receive(v2); ...; receive(vn); compute(body); send(v1); send(v2); ...; send(vn); end, where we denote by vi the message transmitted along the ith axis. We use a ....
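The receive, compute, send loop quoted in the excerpt above can be sketched as a small simulation. This is an illustrative sketch only, assuming a single pipeline axis (one message per tile instead of n) and using Python threads and queues in place of a real message-passing library; `run_pipeline`, `compute`, and the seeding of boundary values with zeros are hypothetical choices, not part of the cited work.

```python
import queue
import threading


def run_pipeline(num_procs, num_tiles, compute):
    """Simulate the per-tile loop on a one-dimensional chain of processors.

    chans[p] carries messages from processor p-1 to processor p.
    `compute(p, t, incoming)` is a hypothetical stand-in for compute(body).
    """
    chans = [queue.Queue() for _ in range(num_procs + 1)]
    results = [[] for _ in range(num_procs)]

    def worker(p):
        for t in range(num_tiles):
            incoming = chans[p].get()         # blocking receive ensures synchronization
            value = compute(p, t, incoming)   # compute the tile body
            results[p].append(value)
            chans[p + 1].put(value)           # send to the next processor

    # Assumption: the first processor reads boundary values, seeded here with zeros.
    for _ in range(num_tiles):
        chans[0].put(0)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_procs)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

For example, `run_pipeline(3, 4, lambda p, t, x: x + 1)` passes each tile's value down a chain of three simulated processors, each adding one, so processor p ends up holding p + 1 for every tile.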
[Article contains additional citation context not shown here]
D. Palermo, E. Su, A. Chandy, and P. Banerjee. Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994.
....of arrays are analyzed. This is used to generate required communication and to partition the computation among the PEs. And as communication operations are so expensive, an attempt is usually made to optimize them using methods such as message vectorization [21, 73, 91], message aggregation [120, 131, 152], and the exploitation of collective communication operations [120]. The major advantage of this model is its simplicity. The compiler takes the parallelism that is explicitly stated by the programmer and maps it to the parallel .... [Figure: F90D HPF Source, Parser, Array Analysis, Communication Generation ....]
....efficient code for distributed memory machines. These optimizations fall into several categories: Reducing Communication: Here we perform optimizations that attempt to reduce the amount of communication. These include message vectorization [21, 73], message coalescing [95], message aggregation [120, 131], redundant communication elimination [17] and the exploitation of collective communication [120]. Hiding Communication: These transformations attempt to hide the cost of communication by overlapping communication and computation. Examples of such optimizations are communication placement [49, ....
D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the Paradigm compiler for distributed-memory multicomputers. In Proceedings of the 1994 International Conference on Parallel Processing, St. Charles, IL, August 1994.
....for message passing on distributed memory machines is the setup time required for sending a message. Typically, this cost is equivalent to the sending cost of hundreds of bytes. Vectorization combines messages for the same source and destination into a single message to reduce this overhead [17, 61]. Since in Fortran 90D HPF we are only parallelizing array assignments and forall loops, there is no data dependency between different loop iterations. Thus, all the required communication can be performed before or after the execution of a loop on each of the processors involved as shown in ....
....[Figure 6.18: Sample Overlap Shift Optimization] 6.5 Message Aggregation. The communication library routines try to aggregate messages [37, 17, 61] (corresponding to several array sections) into a single larger message, possibly at the expense of extra copying into a continuous buffer. A communication routine first calculates the largest possible array section from this processor to the rest. These may indicate several continuous blocks of ....
D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication Optimizations Used in the Paradigm Compiler For Distributed-Memory Multicomputers. International Conference on Parallel Processing, 1994.
....effective. 1 Introduction. There exists a rich body of work in optimizing communication for array languages and in parallelizing compilers [1, 2, 10, 15, 13, 21]. There are fewer studies empirically evaluating communication optimizations in the context of a specific compiler and target machine [4, 14, 19]. Moreover, detailed performance evaluations of communication optimizations for non-kernel applications are virtually non-existent, particularly with respect to optimizations that are performed in a machine-independent manner. This .... [Footnote: This research was supported by ARPA Grant N00014 92 J1824.]
Daniel J. Palermo, Ernesto Su, John A. Chandy, and Prithviraj Banerjee. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. In International Conference on Parallel Processing, pages II:1--10, August 1994.
....representation for regular distributions that facilitates determining the processor sets for data redistribution. PITFALLS robustly handles arbitrary source and target processor configurations and arbitrary number of data array dimensions. PITFALLS is being developed for inclusion in the PARADIGM [14] compiler project at the University of Illinois. The research presented in [10, 11, 12, 13] focuses on the efficiency of computing send/receive processor sets rather than on the efficiency of the actual data exchange portion of redistribution. We have found that the latter operation can be several ....
D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee, "Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers," in Proceedings of the 1994 International Conference on Parallel Processing, vol. 2, pp. 1--10, Aug. 1994.
....reference are computed at run time. Since each processor must execute the entire iteration space to compute ownership, this method results in large amounts of overhead. Communication for resolution programs is also very inefficient as it involves transmission of a large number of small messages [40]. Instead we considered the message vectorized version with loop bounds reduction as the base version. Since most of the compilers for message passing architectures apply some kind of message vectorization, we felt that it would be unfair to compare our method against run time resolution without ....
D. J. PALERMO, E. SU, J. A. CHANDY, and P. BANERJEE. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. In Proc. International Conference on Parallel Processing, St. Charles, IL, August 1994.
No context found.
Palermo D., Su E. et al. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. Proceedings of the 1994.
No context found.
D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Communication optimizations used in the paradigm compiler for distributed-memory multicomputers. In Intl. Conf. on Parallel Processing, August 1994.
No context found.
D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Communication optimizations used in the paradigm compiler for distributed-memory multicomputers. In Intl. Conf. on Parallel Processing, August 1994.