M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.
....to optimize the mapping and execution of the program. It is imperative that good performance be achievable with modest effort and the highest levels of performance be available with reasonable tuning effort. 3.1 Automatic techniques. Techniques that exploit aggressive interprocedural analysis [48, 49, 47, 18, 30, 29, 11, 1], profile data, and run time statistics to optimize program implementation choices are essential to the programmability of the machine and accessibility of high performance. Aggressive compiler analysis has been essential to high performance computing based on vector, shared memory, and ....
Hall, M. W., Kennedy, K., and McKinley, K. S. Interprocedural transformations for parallel code generation. In Proceedings of the 4th Annual Conference on High-Performance Computing (Supercomputing '91) (Nov. 1991), pp. 424--434.
....if RPTS and related techniques are to be applied to these codes. In our current interprocedural framework, we assume indices of the spatial loops are only used locally. For cases that spatial loop indices are used globally or passed at call sites, selective inlining substitution or loop embedding [12] is used. Prismatic time skewing can be applied w.r.t. multiple time step loops as long as they are not imperfectly nested. For time step loops that are imperfectly nested, only the innermost is transformed. Figure 6 shows a code segment with a time step loop and five point stencil computation in ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.
....analysis can be used as a base for performing selective inline expansion of subprograms for code generation. The information derived from performing this analysis can also be used to guide selective modification of subprograms using such techniques as cloning, loop embedding, and loop extraction [10]. All of the programs that we have tested were inlined successfully by Polaris. Some constructs are not easily expressible in Fortran after inline expansion. The constructs which are not fully supported involve the need for expressing an equivalence between nonconforming formal and actual ....
Mary W. Hall, Ken Kennedy, and Kathryn S. McKinley. Interprocedural transformations for parallel code generation. Supercomputing'91, pages 423--434, 1991.
....performance and to change the program's structure in a way that enables some other transformation. The former class of optimizations are relatively straightforward. They include inline substitution, cross procedural register allocation [9, 28] and limited forms of interprocedural code motion [17]. The latter situation is more complex; the difficulty here is deciding when and where to apply an optimization. Inline substitution and procedure cloning can fall in this category. Procedure cloning is an unusual case. It replaces a single copy of the procedure with two or more copies and ....
M. W. Hall, K. Kennedy, and K. McKinley. Interprocedural transformation for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.
....transformations is the creation of access regions and the aggressive exploitation of access region properties by subsequent optimizations. The lifting of access region is conceptually similar to moving loops across procedure boundaries and lifting and blocking of communication in parallel Fortran [19, 22]. In our case, the possibility of deadlock requires atomic primitives and more extensive analysis. Our register allocation scheme is based on that of Chow and Hennessy [11] adapted for lazy state saving. The problem of register allocation in the presence of synchronization points has been studied ....
Mary W. Hall, Ken Kennedy, and Kathryn S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of the 4th Annual Conference on High-Performance Computing (Supercomputing '91), pages 424--434, November 1991.
....transformations such as loop unswitching, loop distribution, loop fusion, loop coalescing, loop peeling, and loop unrolling [1], software pipelining [3], speculative execution [4], code motion, redundant code elimination, and interprocedural optimizations such as inlining, cloning, etc. [5]. The idea of using program execution profiles to guide performance optimizing transformations has been proposed in [6] and [7]. In most of these approaches, scheduling succeeds transformation application. We demonstrate that, in high level synthesis, scheduling information can guide the ....
M. W. Hall, K. Kennedy, and K. S. McKinley, "Interprocedural transformations for parallel code generation," in Proc. Supercomputing, pp. 424--434, Dec. 1991.
....transformations can significantly improve parallelization and has called for precise interprocedural analysis information. In particular, we feel that loop distribution and array privatization would benefit greatly from our analysis. Other interprocedural transformations have also been suggested [27]. If a precise form of analysis is required to perform these transformations, the efficiency of such an analysis is paramount. Due to its demand driven implementation, fida is reasonably efficient in the context of automatic parallelization. 5 Related Work In previous work [50] 11] 13, 14, ....
Mary W. Hall, Ken Kennedy, and Kathryn S. McKinley. Interprocedural Transformations for Parallel Code Generation. In Supercomputing '91, pages 424--434, November 1991.
....may also inhibit optimization in our system. For example, linpackd and matrix300 are written in a modular style with singly nested loops enclosing function calls to routines which also contain singly nested loops. To improve programs written in this style requires interprocedural optimization [10][13]; these optimizations are not currently implemented in our translator. Many loop nests (60%) in the original programs are already in memory order, and even more (66%) have the loop carrying the most reuse in the innermost position. This result indicates that scientific programmers often pay ....
M.W. Hall, K. Kennedy, and K. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
....location by a single array reference or by multiple array references. Without loss of generality, we assume Fortran's column major storage. Augmented Call Graph. We use an augmented call graph G_ac to describe the calling relationships among procedures and loop nest structures in the program [20]. This flow insensitive call graph contains procedure nodes and call nodes. For each procedure p that makes a procedure call at site s, an edge connects node p to node s. For each call site s to procedure q, an edge connects node s to node q. The G_ac also adds loop nodes for every loop and edges ....
....array section analysis that enables interprocedural optimization. This analysis is part of dependence testing in ParaScope, and is computed before optimization [22]. We include this description as technical background. We use section analysis to analyze interprocedural side effects to arrays [8, 20, 21, 22]. Sections represent the most commonly occurring array access patterns: single elements, rows, columns, grids, and their higher dimensional analogs. The various approaches to interprocedural array side effect analysis must make tradeoffs between precision and efficiency [8, 11, 22, 30, 45] ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, pages 424--434, Albuquerque, NM, November 1991.
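The augmented call graph described in the excerpt above (procedure nodes, call nodes, and loop nodes) can be sketched roughly as follows. This is a minimal illustration in Python, not the cited authors' data structure: the class and method names are hypothetical, and because the excerpt is truncated before it specifies the loop edges, the nesting edges used here (enclosing procedure or loop to nested loop or call site) are an assumption.

```python
from collections import defaultdict


class AugmentedCallGraph:
    """Toy augmented call graph with procedure, call-site, and loop nodes."""

    def __init__(self):
        self.kind = {}                  # node name -> "proc" | "call" | "loop"
        self.succ = defaultdict(list)   # node name -> successor node names

    def _add_node(self, name, kind):
        self.kind.setdefault(name, kind)

    def add_procedure(self, p):
        self._add_node(p, "proc")

    def add_loop(self, parent, loop):
        # Assumed nesting edge: enclosing procedure or loop -> nested loop.
        self._add_node(loop, "loop")
        self.succ[parent].append(loop)

    def add_call(self, parent, site, callee):
        # parent is the calling procedure or (assumed) the loop enclosing the site.
        self._add_node(site, "call")
        self._add_node(callee, "proc")
        self.succ[parent].append(site)   # edge p -> s
        self.succ[site].append(callee)   # edge s -> q

    def loops_enclosing_calls_to(self, callee):
        """Loop nodes that directly enclose a call site targeting callee."""
        return [n for n, kind in self.kind.items()
                if kind == "loop"
                and any(self.kind[s] == "call" and callee in self.succ[s]
                        for s in self.succ[n])]


# Hypothetical program: main contains a loop that calls daxpy,
# and daxpy itself contains a single loop.
g = AugmentedCallGraph()
g.add_procedure("main")
g.add_loop("main", "main.L1")
g.add_call("main.L1", "main.s1", "daxpy")
g.add_loop("daxpy", "daxpy.L1")

print(g.loops_enclosing_calls_to("daxpy"))
```

A query such as `loops_enclosing_calls_to` shows why loop nodes are useful: it exposes loop nests that span a call, which is the situation the interprocedural transformations target.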
.... in source form, the potential for exponential growth in program size, and the inability of modern optimizing compilers to effectively compile large programs after complete inline expansion ( 5] 4 sparked the development of interprocedural data flow analysis techniques, which are surveyed in [9], 12] and [19] Most of the early interprocedural analysis techniques merge information into a flow independent summary form, which saves time or space but loses the precision needed to perform techniques such as array privatization ( 20] More recent interprocedural analysis techniques, such as ....
Mary W. Hall, Ken Kennedy, and Kathryn S. McKinley. Interprocedural transformations for parallel code generation. In Proc. Supercomputing '91, pages 424--434, 1991.
.... data distributions are discussed in [11]. Techniques for summarizing interprocedural side effects of array accesses using regular section descriptors are detailed in [13]. Program transformations that need to be performed for optimizing interprocedural redistributions are presented in [12]. These techniques have been implemented in the Fortran D compiler and in the FIAT system [10] at Rice University. However, these works assume static distributions and do not deal with the problem of determining dynamic data decompositions in the presence of procedure calls. Palermo et al. have ....
M. Hall, K. Kennedy, and K. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing'91, November 1991.
....we will assume that the compiler performs procedure cloning for every distinct pattern of entry and exit decomposition schemes. To simplify our discussion we will assume that programs have only acyclic call graphs. The augmented call graph is used to identify phases across procedure boundaries [HKM91]. Subsequently, the call graph is traversed in reverse topological order. For each procedure P the single source shortest paths problem is solved on its phase control flow graph using the hierarchical approach of algorithm DECOMP in Figure 6. Each call site of P in procedure Q is represented by a ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
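The reverse topological traversal of an acyclic call graph mentioned in the excerpt above can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The sketch assumes that "reverse topological order" means each callee is visited before its callers. The call graph, the `local_cost` values, and the additive summary are hypothetical stand-ins; the per-procedure shortest-paths step (algorithm DECOMP) is not reproduced here.

```python
from graphlib import TopologicalSorter

# Hypothetical acyclic call graph: caller -> set of callees.
calls = {
    "main": {"solve", "init"},
    "solve": {"sweep"},
    "init": set(),
    "sweep": set(),
}

# TopologicalSorter treats the mapped values as nodes that must come first,
# so passing caller -> callees yields callees before their callers.
order = list(TopologicalSorter(calls).static_order())

# Placeholder per-procedure work: a bottom-up summary that combines a
# procedure's own (made-up) cost with the summaries of its callees.
local_cost = {"main": 1, "solve": 2, "init": 3, "sweep": 4}
summary = {}
for proc in order:
    summary[proc] = local_cost[proc] + sum(summary[q] for q in calls[proc])

print(order)
print(summary["main"])
```

Because every callee is processed first, the summary of each call site is already available when its caller is visited, which is what allows a call site to be represented by a precomputed result.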
....information or move code across procedure boundaries. Many researchers have demonstrated significant performance improvements on shared memory multiprocessors by manually applying interprocedural analysis and transformation techniques to enhance parallelism or memory hierarchy utilization [5, 29, 49]. Techniques that have proven useful for this purpose include scalar and array side effect analysis [31, 32, 39, 51], interprocedural constant propagation [23, 44], array KILL analysis [5, 26, 49], and transformation to expose loop nests to parallelization [29]. In addition to automatic ....
.... memory hierarchy utilization [5, 29, 49]. Techniques that have proven useful for this purpose include scalar and array side effect analysis [31, 32, 39, 51], interprocedural constant propagation [23, 44], array KILL analysis [5, 26, 49], and transformation to expose loop nests to parallelization [29]. In addition to automatic parallelization, many problem domains in high performance computing can greatly benefit from exploiting interprocedural information. For example, compiling for distributed memory architectures and automatic detection of data races in shared memory codes each involve ....
[Article contains additional citation context not shown here]
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.
....architectures, and instrumenting code for run time detection of race conditions in shared memory parallel programs. To date, we have effectively employed cloning in experiments with interprocedural constant propagation [3, 12] and interprocedural transformations for parallel code generation [13] through hand optimization and a partial implementation of the algorithm. ParaScope is devoted to high performance Fortran programming, but the need for cloning arises in many other contexts such as those discussed in Section 2. Experimentation is needed to verify that the assumptions used in our ....
Hall, M.W., Kennedy, K., and McKinley, K.S. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91. IEEE Computer Society, November 1991.
....architectures, and instrumenting code for run time detection of race conditions in shared memory parallel programs. To date, we have effectively employed cloning in experiments with interprocedural constant propagation [3, 12] and interprocedural transformations for parallel code generation [13]. ParaScope is devoted to high performance Fortran programming, but the need for cloning arises in many other contexts. For example, in languages with implicit typing, cloning enables separate calls to a procedure to be customized according to the types of the input parameters. Similar problems ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, November 1991.
....each procedure is edited, even if the program is compiled multiple times or if the procedure is part of several programs. 2. Interprocedural Propagation. The compiler collects local summary information from each procedure in the program to build an augmented call graph containing loop information [19]. It then propagates the initial information on the call graph to compute interprocedural solutions. 3. Interprocedural Code Generation. The compiler directs compilation of all procedures in the program based on the results of interprocedural analysis. Another important aspect of the compilation ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
....optimization in our system. For example, Linpackd and Matrix300 are written in a modular style with singly nested loops enclosing function calls to routines which also contain singly nested loops. To improve programs written in this style requires interprocedural optimization [Cooper et al. 1993; Hall et al. 1991]; these optimizations are not currently implemented in our translator. Many loop nests (69%) in the original programs are already in memory order, and even more (74%) have the loop carrying the most reuse in the innermost position. This result indicates that scientific programmers often pay ....
Hall, M. W., Kennedy, K., and McKinley, K. S. 1991. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91. IEEE, New York.
....the kernel algorithm to be applied across procedure calls. In particular, loops containing calls can be parallelized and nests spanning calls optimized. The interprocedural transformations, loop embedding, loop extraction, and procedure cloning are used only when they enable loop transformations [9, 18]. These components appeared previously in the literature and for the algorithmic details the reader should refer to the appropriate articles [9, 11, 12, 18]. Section 3.4 however extends and integrates them for the first time into a single code generation algorithm. To illuminate the algorithm and ....
.... The interprocedural transformations, loop embedding, loop extraction, and procedure cloning are used only when they enable loop transformations [9, 18]. These components appeared previously in the literature and for the algorithmic details the reader should refer to the appropriate articles [9, 11, 12, 18]. Section 3.4 however extends and integrates them for the first time into a single code generation algorithm. To illuminate the algorithm and experimental results, we summarize its components below. 3.1 Optimize: Data Locality and Parallelism. The most effective and essential component of our ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
.... source to source parallelizer for shared memory multiprocessors [15]. Using the information produced by the program compiler and a performance estimator, it will apply a combination of interprocedural transformations and parallelism enhancing transformations to provide an initial parallel program [16, 17, 18]. The next two subsections describe how the program compiler computes interprocedural data flow information and how it applies interprocedural transformations. Recompilation analysis is handled using methods described by Burke and Torczon [19]. 2.3 Interprocedural Analysis. Interprocedural ....
....merge the name spaces. Inline substitution is a simple form of interprocedural code motion [36]; it replaces a procedure call with a copy of the code for the called procedure. Loop extraction and loop embedding are two other forms of interprocedural code motion that the program compiler will use [16]. Loop extraction pulls an outermost enclosing loop from a procedure body into a calling procedure; it is a form of partial inlining. Loop embedding is the inverse operation; it pushes a loop surrounding a call site into the called procedure. Procedure cloning lets the compiler produce multiple ....
M. W. Hall, K. Kennedy, and K. S. McKinley, "Interprocedural transformations for parallel code generation," in Proceedings of Supercomputing '91, Nov. 1991.
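Loop embedding and loop extraction, as characterized in the excerpt above, can be illustrated by hand on a toy example. The sketch below is written in Python only for readability (the cited work targets Fortran), and all function names are hypothetical. It shows the before and after shapes of the code, not a compiler implementation.

```python
# --- Loop embedding: push a loop surrounding a call site into the callee ---

def update(a, i):
    a[i] = 2 * a[i] + 1


def caller_original(a):
    for i in range(len(a)):      # loop surrounds the call site
        update(a, i)


def update_embedded(a):
    for i in range(len(a)):      # the loop now lives inside the callee
        a[i] = 2 * a[i] + 1


def caller_embedded(a):
    update_embedded(a)           # a single call, no per-iteration call overhead


# --- Loop extraction: pull the callee's outermost loop into the caller ---

def relax(a):
    for i in range(1, len(a)):   # outermost loop inside the callee
        a[i] = a[i] + a[i - 1]


def relax_body(a, i):
    a[i] = a[i] + a[i - 1]       # what is left of the callee after extraction


def caller_extracted(a):
    for i in range(1, len(a)):   # the loop is now visible in the caller
        relax_body(a, i)


x, y = [1, 2, 3], [1, 2, 3]
caller_original(x)
caller_embedded(y)
print(x, y)

u, v = [1, 2, 3], [1, 2, 3]
relax(u)
caller_extracted(v)
print(u, v)
```

In both cases the transformed version computes the same result. Embedding removes per-iteration calls, while extraction adds them but makes the loop available to loop transformations in the caller.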
....our partitioning problem must satisfy all of the following properties: • If vertices i and j are in the same partition then all the vertices on the directed paths in G from i to j are also in the same partition. (Footnote 2: Fusion works across procedures too if good interprocedural analysis is available [HKM91]. Footnote 3: This constraint is also referred to as the convexity constraint.) • Two nodes connected by a fusion preventing dependence may not be in the same partition. • Any two nodes are in the same partition only if they have a path (without any fusion preventing edges) in the fusion graph between ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
....also inhibit optimization in our system. For example, Linpackd and Matrix300 are written in a modular style with singly nested loops enclosing function calls to routines which also contain singly nested loops. To improve programs written in this style requires interprocedural optimization [CHK93, HKM91]; these optimizations are not currently implemented in our translator. Many loop nests (69%) in the original programs are already in memory order, and even more (74%) have the loop carrying the most reuse in the innermost position. This result indicates that scientific programmers often pay ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
....as separate units, there is increasing evidence that optimization across procedure boundaries can yield significant improvements in program execution times. Interprocedural analysis and optimization have proven to be important to automatic parallelization of loops containing procedure calls [7, 11, 12, 14, 22] and compiling for distributed memory multiprocessors [10]. The above research focuses on analyzing languages used by scientific programmers, usually Fortran. However, interprocedural optimization is perhaps even more important for functional languages, where functions are small and calls occur ....
Hall, M.W., Kennedy, K., and McKinley, K.S. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, pages 424--434. IEEE Computer Society, November 1991.
....improved parallel performance by 26%. The original nests were connected by data dependence (i.e. contained reuse) and accounted for a significant portion of the total execution time. These nests were fused across procedure boundaries using loop extraction to place the nests in the same procedure [17, 22]. Loop extraction pulls a loop out of the called routine and into the caller, actually increasing procedure call overhead. The increased reuse and decreased parallel loop synchronization resulting from fusion more than overcame the additional call overhead. Ocean is a 3664 non-comment line program ....
....parallel loop synchronization resulting from fusion more than overcame the additional call overhead. Ocean is a 3664 non-comment line program from the Perfect benchmark suite [9]. Fusion improved parallel performance by 32% on Ocean. Thirty-one nests benefit from fusion across procedure boundaries [17, 22]. Some of the candidates were exposed after constant propagation and dead code elimination. Loop extraction enabled fusion. Again, extraction's only effect is to increase total execution time because of increased call overhead. The fused nests consisted of between two and four parallel loops ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.
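The excerpts above describe fusing loop nests across procedure boundaries by first extracting the loops into a common caller. A hand-worked sketch of that idea follows, in Python for readability. The procedures `scale` and `shift` and their bodies are hypothetical, and the example assumes the fusion is legal because iteration i of the second loop reads only the value that iteration i of the first loop wrote.

```python
# Original: two procedures, each containing its own loop over the same range.

def scale(a, b):
    for i in range(len(a)):
        b[i] = 2.0 * a[i]


def shift(b, c):
    for i in range(len(b)):
        c[i] = b[i] + 1.0


def caller_original(a):
    n = len(a)
    b, c = [0.0] * n, [0.0] * n
    scale(a, b)                  # first loop nest, hidden in a callee
    shift(b, c)                  # second loop nest, hidden in another callee
    return c


# After loop extraction, only the loop bodies remain in the callees.

def scale_body(a, b, i):
    b[i] = 2.0 * a[i]


def shift_body(b, c, i):
    c[i] = b[i] + 1.0


def caller_fused(a):
    n = len(a)
    b, c = [0.0] * n, [0.0] * n
    for i in range(n):           # both extracted loops fused into one
        scale_body(a, b, i)      # b[i] is produced ...
        shift_body(b, c, i)      # ... and reused in the same iteration
    return c


print(caller_original([1.0, 2.0]))
print(caller_fused([1.0, 2.0]))
```

The fused version makes two calls per iteration instead of two calls in total, which mirrors the excerpt's remark that extraction by itself only adds call overhead. The benefit comes from the reuse of `b[i]` within one iteration and from having a single loop to synchronize.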
....information or move code across procedure boundaries. Many researchers have demonstrated significant performance improvements on shared memory multiprocessors by manually applying interprocedural analysis and transformation techniques to enhance parallelism or memory hierarchy utilization [5, 29, 49]. Techniques that have proven useful for this purpose include scalar and array side effect analysis [31, 32, 39, 51], interprocedural constant propagation [23, 44], array Kill analysis [5, 26, 49], and transformation to expose loop nests to parallelization [29]. In addition to automatic ....
.... memory hierarchy utilization [5, 29, 49]. Techniques that have proven useful for this purpose include scalar and array side effect analysis [31, 32, 39, 51], interprocedural constant propagation [23, 44], array Kill analysis [5, 26, 49], and transformation to expose loop nests to parallelization [29]. In addition to automatic parallelization, many problem domains in high performance computing can greatly benefit from exploiting interprocedural information. For example, compiling for distributed memory architectures and automatic detection of data races in shared memory codes each involve ....
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.
No context found.
M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, Nov. 1991.