| GUPTA, M., SCHONBERG, E., AND SRINIVASAN, H. 1996. A unified framework for optimizing communication in data-parallel programs. IEEE Trans. Parallel Distrib. Syst. 7,7, 689--704. |
....communication can perform optimizations across communication patterns at the software level, the protocol level, and the hardware level. It is more aggressive than traditional communication optimization techniques, which are either performed in the library [3, 8, 30, 26] or in the compiler [1, 2, 7, 12, 15, 28]. Compiled communication offers many advantages over the traditional communication method. First, by managing network resources at compile time, some runtime communication overheads such as group management can be eliminated. Second, compiled communication can use long lived connections for ....
....messaging layer [3, 8, 30, 26]. CC MPI is different from these systems in that it allows users to select the most effective method based on message sizes and network conditions. Many parallel compiler projects also try to improve communication performance by generating efficient communication code [1, 2, 7, 12, 15, 28]. The communication optimizations performed by these compilers focus on reducing the number and the volume of communications and are architecture independent. While CC MPI is not directly related to compiler optimization techniques, compiler techniques developed, however, can enable effective ....
M. Gupta, E. Schonberg, and H. Srinivasan. A Unified Framework for Optimizing Communication in Data-Parallel Programs. IEEE trans. on Parallel and Distributed Systems, 7(7):689--704, July 1996.
....passing communication in order to prefetch the data read locally. Such work emphasizes fine grain parallelism and regular data layout in the context of loop-level parallelism. Many communication optimization techniques have been proposed to improve performance for the distributed memory machines [4 6,53,66]. These techniques work well for regular codes, or for computations where data accesses can be symbolically computed at compile time. For irregular codes work on run time techniques for automatic parallelization considers array accesses of loop indices or indirect references as functions of the ....
....mesh) data and generic user data can be expressed in a common format. The motivation for defining such a mechanism lies in enabling the system to implement the distributed object model presented so far and many of the optimizations currently used in the loop level parallelism frameworks [4, 6, 38, 53, 66, 67]. We analyze our system from two perspectives. From the usability point of view we want to show how simple it is for a user to learn and use our system to write data parallel scientific applications. From the efficiency point of view we want to show that our approach is scalable and efficient. We ....
Manish Gupta, Edith Schonberg, and Harini Srinivasan. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems, 7(7):689--704, 1996.
.... of macroservers can draw in part on a rich body of experience with university prototypes as well as industrial software systems that were developed around the languages mentioned above, in particular implementations of HPF-like languages for distributed-memory systems (see, for example [11, 16, 5]). The last of these papers contains an extensive set of references to related compilation work. 19 INTEGER NN = number of nodes( number of nodes available for this application NODES P(NN) abstract node array MACROSERVER CLASS sparse template INTEGER : u REAL , SPARSE ....
M. Gupta, E. Schonberg, and H. Srinivasan. A Unified Framework for Optimizing Communication in Data-Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 7, pp. 689-704, July 1996.
....in previous phases of the compilation. In this section, we focus on a subset of these issues, namely sequencing tasks and determining their communication requirements. In a multiprocessor system, communication analysis is a set of techniques used to track the flow of data between processors [KN95][GSS96]. We view each FPGA CCU as a single processor, but we must also go a step further and treat each task running on the same FPGA as a separate process. By so doing, we can avoid returning computed results to local memories if they are just going to subsequently be accessed by another FPGA ....
M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data parallel programs. Technical Report RC 19872(87937) 12/14/94, IBM Research. To appear in IEEE Transactions on Parallel and Distributed Systems.
....acknowledgements in distributed algorithms. These optimizations can be performed with other optimizations performed in the communication libraries to improve the communication latency. Compilers have been enhanced to perform static analysis of distributed programs to optimize communication. Gupta [28, 29] provided a framework based on data flow analysis to optimize communication by (i) vectorizing communication, (ii) moving communication earlier to hide latency, (iii) reducing the amount of data to be communicated, and (iv) eliminating communication that is redundant on any control flow path. The ....
....optimizations that reduce the communication operations by aggregating contiguous data and sending them as a single unit instead of several send primitives. Parallelizing compilers use loop unrolling technique and data distribution technique to perform aggregation to improve message granularity [29]. 5.2.2 Application Kernel Level Dynamic Aggregation The aggregation of messages can be performed in the communication manager module of distributed applications. Large distributed applications have clearly defined modules that perform different functionalities of a distributed system. The ....
Gupta, M., Schonberg, E., and Srinivasan, H. Unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems 7, 7 (July 1996), 689--704.
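The aggregation of contiguous data described in the context above can be illustrated with a minimal Python sketch. It is not taken from the cited or citing papers; the function name `aggregate_contiguous` and the `(start, length)` run representation are assumptions made for illustration. It groups the element indices a process needs into maximal contiguous runs, so that each run could be sent as a single unit instead of one send per element.

```python
def aggregate_contiguous(indices):
    """Group element indices into maximal contiguous runs.

    Returns a list of (start, length) pairs. Each pair stands for one
    aggregated message (hypothetical representation, for illustration).
    """
    runs = []
    for i in sorted(set(indices)):
        if runs and i == runs[-1][0] + runs[-1][1]:
            # Index i directly follows the last run: extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Gap before i: start a new run (a new message).
            runs.append((i, 1))
    return runs


# Hypothetical request: elements 0..3 and 10..11 of a remote array.
requested = [0, 1, 2, 3, 10, 11]
messages = aggregate_contiguous(requested)
print(len(requested), "element sends become", len(messages), "aggregated sends")
```

Sorting and de-duplicating first keeps the sketch independent of the order in which accesses were discovered; a compiler would typically derive such runs symbolically rather than by enumerating indices.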
.... Partial Redundancy Elimination (PRE) [5, 9, 22, 44, 47, 54] and related optimizations [6, 16, 39, 53]; Code Hoisting and Strength Reduction [15, 17, 21, 26, 29, 30, 36]; Live Range Characterisation and Register Assignment [14, 41]; Dead Code Elimination [4, 38]; Communication Placement [25, 27]; Shrink Wrapping of Procedure Calls [8]; Redundant Array Bound Checks Elimination [40]; Compilation for Distributed Memory [1]. Data flow analysis for some of the above optimizations involves a sequence of separate flows in forward and backward directions while for others it involves ....
M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems, 7(7):689--704, 1996.
....of which each contains one data element is reduced into the overhead for only one message through the message vectorization. Although this results in significant performance improvement, the communication cost is still expensive even in the case of only one message for each destination. Some work [6, 7] hides the communication latency by placing the independent computation between SEND and RECV. But, the work does not provide any solution in lack of independent computation. We propose a scheme which can be applied to latency hiding even in the case that no independent computation exists in the ....
M. Gupta, E. Schonberg, and H. Srinivasan, "A Unified Framework for Optimizing Communication in Data-parallel Programs," IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 7, pp. 689-704, July 1996.
....in previous phases of the compilation. In this section, we focus on a subset of these issues, namely sequencing tasks and determining their communication requirements. In a multiprocessor system, communication analysis is a set of techniques used to track the flow of data between processors [KN95][GSS96]. We view each CCU as a single processor, but we must also go a step further and treat each task running on the same CCU as a separate process .... [Figure 5. Data placement] ....
M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data parallel programs. Technical Report RC 19872(87937) 12/14/94, IBM Research. To appear in IEEE Transactions on Parallel and Distributed Systems.
....into either an existing or a new high level language, and mapped to the macroserver model. This will also require significant new compiler and runtime technology. While many of the ideas developed for the compilation and runtime system optimization of data parallel languages in the past decade [61, 22, 8] will be useful for dealing with a subset of the problem, the much larger design space associated with the macroserver model will necessitate the development of new techniques. We expect that an important line of research will be based on feedback directed and dynamic compilation technology, as ....
M. Gupta, E. Schonberg, and H. Srinivasan. A Unified Framework for Optimizing Communication in Data-Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, Vol. 7(7), pp. 689-704, July 1996.
....to access non local data from remote processors on distributed memory architectures is commonly orders of magnitude higher than the cost of accessing local data. As a consequence, a key problem to effectively use distributed memory architectures is centered around efforts to optimize communication [5, 17, 3, 21, 16, 23, 2, 29, 30] which includes: message vectorization (hoisting communication outside of loops), message coalescing (removing redundant communication based on the same array), communication aggregation (combine messages based on different arrays), collective communication, communication latency hiding and ....
....The effect of these optimizations is limited by the fact that most of the analysis in current parallelizing compilers is performed for a single loop nest at a time, and very few research efforts have been started to optimize communication globally across arbitrary control flow. Most approaches [17, 24, 23, 18] for global scheduling of communication commonly rely on data flow analysis which is used to place SENDs as early and RECVs as late as possible in order to maximize communication This research is partially supported by the Austrian Science Fund as part of Aurora Project Tools under contract ....
M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems, 7(7):689--704, July 1996.
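Message vectorization, glossed in the context above as hoisting communication outside of loops, can be made concrete with a small Python sketch. This is only an illustration under stated assumptions: the `Channel` class is a hypothetical stand-in for a message-passing layer that merely counts messages, and neither it nor the two functions come from the cited work.

```python
class Channel:
    """Toy stand-in for a message-passing layer (hypothetical).

    It holds the 'remote' array and counts how many messages are issued.
    """

    def __init__(self, remote):
        self.remote = remote
        self.messages = 0

    def fetch(self, lo, hi):
        # One call models one message carrying elements lo..hi-1.
        self.messages += 1
        return self.remote[lo:hi]


def sum_naive(chan, n):
    """One message per loop iteration (communication inside the loop)."""
    total = 0
    for i in range(n):
        total += chan.fetch(i, i + 1)[0]
    return total


def sum_vectorized(chan, n):
    """Communication hoisted out of the loop: one message for the whole section."""
    block = chan.fetch(0, n)
    total = 0
    for i in range(n):
        total += block[i]
    return total


data = list(range(100))
naive, vectorized = Channel(data), Channel(data)
assert sum_naive(naive, 100) == sum_vectorized(vectorized, 100)
print("naive messages:", naive.messages, "vectorized messages:", vectorized.messages)
```

Both versions compute the same result; only the number of messages differs, which is the quantity the optimization targets. The hoist is legal here because nothing inside the loop modifies the remote section.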
....access non local data from remote processors on distributed memory architectures is commonly orders of magnitude higher than the costs of accessing local data. As a consequence, a key problem to effectively use distributed memory architectures is centered around efforts to optimize communication [3, 7, 9, 8, 10, 1] which includes: message vectorization (hoisting communication outside of loops), message coalescing (removing redundant communication based on the same array), communication aggregation (combining messages based on different arrays), collective communication, communication latency hiding, and ....
....globally across arbitrary control flow. Note that reducing the number of messages can also be a valuable optimization for message passing programs that are executed on shared memory architectures, since fewer messages commonly translate into less synchronization points [11]. Most approaches [8, 10, 9] for global scheduling of communication commonly place SENDs as early and RECVs as late as possible in order to maximize communication latency hiding. Hardly any communication optimization considers communication buffer constraints, although it has been shown [10] that latency hiding can ....
M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems, 7(7):689--704, July 1996.
....communication and perform other optimizations. In [24] a data flow framework which can integrate a number of communication optimizations is presented. However, the method can only apply to a very small subset of programs which are constrained in the forms of loop nests and array indices. In [32] a unified framework which uses global array data flow analysis for communication optimizations is described. Since only a very simplified version of the analysis algorithm is implemented, it is not clear whether this approach is practical for large programs. In [14, 41] methods that combine ....
....are based upon a variant of Tarjan's intervals [75]. The optimizations require that there are no critical edges, which are edges that connect a node with multiple outgoing edges to a node with multiple incoming edges. The critical edges can be eliminated by edge splitting transformation [32]. Figure 5.2 shows an example code and its corresponding interval flow graph. ALIGN (i, j) with VPROCS(i, j) x, y, z ALIGN (i, j) with VPROCS(2 j, i 1) w (s1) do i = 1, 100 (s2) do j = 1, 100 (s3) x(i,j) s4) enddo (s5) enddo (s6) do i = 1, 100 (s7) do j = 1, 100 (s8) ....
M. Gupta, E. Schonberg and H. Srinivasan "A Unified Framework for Optimizing Communication in Data-parallel Programs." In IEEE trans. on Parallel and Distributed Systems, Vol. 7, No. 7, pages 689-704, July 1996.
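The edge splitting transformation mentioned in the context above can be sketched in Python, using the definition given there: an edge is critical when its source has multiple outgoing edges and its target has multiple incoming edges. The graph representation (a dict from node to successor list), the function name, and the naming of inserted nodes are assumptions for illustration, and the sketch assumes the graph has no parallel edges.

```python
def split_critical_edges(succ):
    """Return a new successor map in which every critical edge is split.

    succ maps each node to a list of its successors. A critical edge
    u -> v (u has several successors, v has several predecessors) is
    replaced by u -> mid -> v, where mid is a fresh empty node.
    Assumes no parallel edges (hypothetical simplification).
    """
    # Count incoming edges for every node.
    pred_count = {}
    for u, targets in succ.items():
        pred_count.setdefault(u, 0)
        for v in targets:
            pred_count[v] = pred_count.get(v, 0) + 1

    new_succ = {node: [] for node in pred_count}
    for u, targets in succ.items():
        for v in targets:
            if len(targets) > 1 and pred_count[v] > 1:
                mid = f"{u}->{v}"  # name of the inserted node
                new_succ[u].append(mid)
                new_succ[mid] = [v]
            else:
                new_succ[u].append(v)
    return new_succ


# A branches to B and C, and B falls through to C: the edge A -> C is critical.
cfg = {"A": ["B", "C"], "B": ["C"], "C": []}
print(split_critical_edges(cfg))
```

After splitting, code (for example a communication call) that must execute only along the path from A to C has a node of its own to live in, without affecting the path through B.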
....to access non local data from remote processors on distributed memory architectures is commonly orders of magnitude higher than the cost of accessing local data. As a consequence, a key problem to effectively use distributed memory architectures is centered around efforts to optimize communication [4, 11, 2, 15, 10, 17, 1, 16] which includes: message vectorization, message coalescing, collective communication, communication latency hiding and pipelined communication. The effect of these optimizations is limited by the fact that most of the analysis in current parallelizing compilers is performed for a single loop nest ....
....The effect of these optimizations is limited by the fact that most of the analysis in current parallelizing compilers is performed for a single loop nest at a time, and very few research efforts have been started to optimize communication globally across arbitrary control flow. Most approaches [11, 18, 17, 12] for global scheduling of communication commonly rely on data flow analysis which is based on array sections and data dependence analysis. Commonly data flow analysis is used to place SENDs as early and RECVs as late as possible in order to maximize communication latency hiding. In addition ....
M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems, 7(7):689--704, July 1996.
....like FORALL [Hig93]. In this paper, we describe a dataflow framework for optimizing communication in out of core problems. We focus on communication optimization within a single out of core FORALL construct. Unlike the available dataflow frameworks for optimizing inter processor communication [KN94, KS95, GSS95], our framework takes an unified approach for placing I/O and communication calls while preserving characteristics of these calls. All the current frameworks focus on improving communication performance by vectorizing messages, eliminating redundant communication and overlapping communication with ....
Manish Gupta, Edith Schonberg, and Harini Srinivasan. A Unified Framework for Optimizing Communication in Data-Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, 1995.
....estimator P 3 T [7, 8] is an integrated tool of VFCS which assists users in performance tuning of regular programs at compile time. P 3 T is based on a single profile run to obtain characteristic data for branching probabilities, statement and loop execution counts. It is well known [9, 10, 11, 12] that the overhead to access non local data from remote processors on distributed memory architectures is commonly orders of magnitude higher than the cost of accessing local data. Communication overhead is, therefore, one of the most important metrics in choosing an appropriate data ....
M. Gupta, E. Schonberg, and H. Srinivasan, "A unified framework for optimizing communication in data-parallel programs," IEEE Transactions on Parallel and Distributed Systems, 7(7):689--704, July 1996.
....to execute on distributed memory systems. Traditionally, data dependence analysis has been used to perform communication optimizations within a single loop nest [2, 10, 14]. Recently, data flow analysis techniques have been developed to obtain information for global communication optimizations [4, 7, 9, 12, 13]. One approach, which will be referred to as the array dependence approach, refines data flow analysis for scalar with data dependence analysis [4, 12, 13]. Another approach, which we will refer to as the array dataflow approach, performs global array data flow analysis [7, 9]. The array dataflow ....
.... [4, 7, 9, 12, 13]. One approach, which will be referred to as the array dependence approach, refines data flow analysis for scalar with data dependence analysis [4, 12, 13]. Another approach, which we will refer to as the array dataflow approach, performs global array data flow analysis [7, 9]. The array dataflow approach can obtain more accurate data flow information at a higher analysis cost than the array dependence approach. The high analysis cost in the array dataflow approach results from the complexity of the data flow descriptor [7, 9] and that operations on the descriptors ....
M. Gupta, E. Schonberg and H. Srinivasan "A Unified Framework for Optimizing Communication in Data-parallel Programs." In IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 7, pages 689-704, July 1996.
....this approach is beneficial for all types of communications, it does not expose architectural dependent optimization opportunities to the compiler. Many parallel compiler projects tried to improve communication performance by generating efficient communication code for distributed memory machines [1, 2, 7, 10, 12, 21]. To simplify the compilation, these compilers use the dynamic communication model and do not exploit the potential of compiled communication. Communication analysis and optimization has been applied in parallel compilers. Early approaches optimize communications in a single loop nest using data ....
....analysis and optimization has been applied in parallel compilers. Early approaches optimize communications in a single loop nest using data dependence information [1, 12]. Later, data flow analysis techniques have been developed to obtain information for global communication optimizations [7, 10, 15, 24]. However, the analysis only obtains logical communication information, which is insufficient for compiled communication. The interaction between other components, such as program partitioning and data mapping, and the communication sub system, has also been investigated [22] which offers another ....
M. Gupta, E. Schonberg and H. Srinivasan "A Unified Framework for Optimizing Communication in Data-parallel Programs." In IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 7, pages 689-704, July 1996.
....that global communication optimizations can greatly reduce communication costs [5, 14]. Two different approaches, one based on data dependence analysis [14] and the other using data flow analysis [11, 8], have been proposed. [Footnote: This research is supported in part by NSF award CCR9157371 and by AFOSR award F49620 93 1 0023DEF.] While the data dependence approach is more efficient in terms of its analysis cost, data flow analysis technique has the advantage of better precision. However, the dataflow frameworks [11, 8] typically propagate information represented in some form of array section ....
....part by NSF award CCR9157371 and by AFOSR award F49620 93 1 0023DEF. flow analysis[11, 8] have been proposed. While the data dependence approach is more efficient in terms of its analysis cost, data flow analysis technique has the advantage of better precision. However, the dataflow frameworks [11, 8] typically propagate information represented in some form of array section descriptor. Due to the complexity of the array section descriptors, the propagation of data flow information can be expensive both in time and space. Furthermore, in traditional data flow approaches, obtaining data flow ....
M. Gupta, E. Schonberg and H. Srinivasan "A Unified Framework for Optimizing Communication in Data-parallel Programs." In IEEE trans. on Parallel and Distributed Systems, Vol. 7, No. 7, pages 689-704, July 1996.
....the communication requirement of a program and manages network resources statically to support efficient communications for the program. A number of compiler issues must be addressed in order to apply the compiled communication technique. First, traditional communication analysis techniques [4, 6, 8] represent the communications in logical forms (logical communications) such as Available Section Descriptor (ASD) [4], Section Communication Descriptor (SCD) [8] and a linear algebra framework [6]. While these descriptors con.... [Footnote: This work was performed when the author was at the Department of ....]
....for the program. A number of compiler issues must be addressed in order to apply the compiled communication technique. First, traditional communication analysis techniques [4, 6, 8] represent the communications in logical forms (logical communications) such as Available Section Descriptor (ASD) [4], Section Communication Descriptor (SCD) [8] and a linear algebra framework [6]. While these descriptors con.... [Footnote: This work was performed when the author was at the Department of Computer Science, University of Pittsburgh.] [Figure labels: Communication phase analysis; Communication analysis; a HPF like program] ....
M. Gupta, E. Schonberg and H. Srinivasan "A Unified Framework for Optimizing Communication in Data-parallel Programs." In IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 7, pages 689-704, July 1996.
....optimizations such as message vectorization and redundant communication elimination. Program analysis must be performed to obtain information required for the optimizations. Two different analysis approaches, one based on data dependence analysis [10] and the other using array data flow analysis [7,6], have been proposed. While the data dependence approach is more efficient in terms of the analysis cost, the array data flow analysis approach has the advantage of better precision. Array data flow analysis propagates some form of array section descriptor [6,7]. Due to the complexity of the ....
....other using array data flow analysis [7,6] have been proposed. While the data dependence approach is more efficient in terms of the analysis cost, the array data flow analysis approach has the advantage of better precision. Array data flow analysis propagates some form of array section descriptor [6,7]. Due to the complexity of the array section descriptor, the propagation of data flow information can be expensive both in time and space. [Footnote: This research is supported in part by NSF award CCR 9157371 and by AFOSR award F4962093 1 0023DEF.] Furthermore, in traditional data flow analysis ....
M. Gupta, E. Schonberg and H. Srinivasan, A unified framework for optimizing communication in data-parallel programs, IEEE Trans. on Parallel and Distributed Systems 7 (1996) 689--704.