Results 1 - 10
of
97
Compiler Transformations for High-Performance Computing
- ACM Computing Surveys
, 1994
"... In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimization for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. ..."
Abstract
-
Cited by 332 (4 self)
- Add to MetaCart
In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimization for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimization for
The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization
, 1995
"... Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively e ..."
Abstract
-
Cited by 185 (36 self)
- Add to MetaCart
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall, and apply a fully parallel data dependence test to determine if it had any cross--iteration dependences; if the test fails, then the loop is re--executed serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run--time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run--time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods. The methods presented in this paper differ from and extend our previous work on several important points. First, instead of distributing the loop into inspector and executor loops (the approach taken in all previous work on run-- time parallelization) we advocate the use of run--time tests to validate the execution of a loop that is speculatively executed in parallel. Second, in addition to array privatization, the new techniques are capa...
Beyond Induction Variables: Detecting and Classifying Sequences Using a Demand-driven SSA Form
- ACM Transactions on Programming Languages and Systems
, 1995
"... this paper we present a practical technique for detecting a broader class of linear induction variables than is usually recognized, as well as several other sequence forms, including periodic, polynomial, geometric, monotonic, and wrap-around variables. Our method is based on Factored Use-Def (FUD) ..."
Abstract
-
Cited by 99 (5 self)
- Add to MetaCart
this paper we present a practical technique for detecting a broader class of linear induction variables than is usually recognized, as well as several other sequence forms, including periodic, polynomial, geometric, monotonic, and wrap-around variables. Our method is based on Factored Use-Def (FUD) chains, a demand-driven representation of the popular Static Single Assignment form. In this form, strongly connected components of the associated SSA graph correspond to sequences in the source program: we describe a simple yet efficient algorithm for detecting and classifying these sequences. We have implemented this algorithm in Nascent, our restructuring Fortran 90+ compiler, and we present some results showing the effectiveness of our approach.
Automatic Program Parallelization
, 1993
"... This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last s ..."
Abstract
-
Cited by 97 (8 self)
- Add to MetaCart
This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last section of the paper surveys several experimental studies on the effectiveness of parallelizing compilers.
Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler
, 1995
"... This paper presents an extensive empirical evaluation of an interprocedural parallelizing compiler, developed as part of the Stanford SUIF compiler system. The system incorporates a comprehensive and integrated collection of analyses, including privatization and reduction recognition for both array ..."
Abstract
-
Cited by 85 (21 self)
- Add to MetaCart
This paper presents an extensive empirical evaluation of an interprocedural parallelizing compiler, developed as part of the Stanford SUIF compiler system. The system incorporates a comprehensive and integrated collection of analyses, including privatization and reduction recognition for both array and scalar variables, and symbolic analysis of array subscripts. The interprocedural analysis framework is designed to provide analysis results nearly as precise as full inlining but without its associated costs. Experimentation with this system shows that it is capable of detecting coarser granularity of parallelism than previously possible. Specifically, it can parallelize loops that span numerous procedures and hundreds of lines of codes, frequently requiring modifications to array data structures such as privatization and reduction transformations. Measurements from several standard benchmark suites demonstrate that an integrated combination of interprocedural analyses can substantially ...
Compiler Optimizations for Eliminating Barrier Synchronization
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
"... This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the singleprogram, multiple data (SPMD) model. By ..."
Abstract
-
Cited by 75 (13 self)
- Add to MetaCart
This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the singleprogram, multiple data (SPMD) model. By exploiting compiletime computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations has been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced 29% on averageand by several orders of magnitude for certain programs. 1 Introduction Parallel machines with shared address spaces and coherent caches provide an attracti...
Gated SSA-Based Demand-Driven Symbolic Analysis for Parallelizing Compilers
, 1995
"... In this paper, we present a GSA-based technique that performs more efficient and more precise symbolic analysis of predicated assignments, recurrences and index arrays. The efficiency is improved by using a backward substitution scheme that performs resolution of assertions on-demand and uses heuris ..."
Abstract
-
Cited by 66 (10 self)
- Add to MetaCart
In this paper, we present a GSA-based technique that performs more efficient and more precise symbolic analysis of predicated assignments, recurrences and index arrays. The efficiency is improved by using a backward substitution scheme that performs resolution of assertions on-demand and uses heuristics to limit the number of substitution. The precision is increased by utilizing the gating predicate information embedded in the GSA and the control dependence information in the program flow graph. Examples from array privatization are used to illustrate how the technique aids loop parallelization.
Interprocedural array regions analyses
, 1995
"... In order to perform powerful program optimizations, an exact interprocedural analysis of array data ow is needed. For that purpose, two new types of array region are introduced. IN and OUT regions represent the sets of array elements, the values of which are imported to or exported from the current ..."
Abstract
-
Cited by 64 (7 self)
- Add to MetaCart
In order to perform powerful program optimizations, an exact interprocedural analysis of array data ow is needed. For that purpose, two new types of array region are introduced. IN and OUT regions represent the sets of array elements, the values of which are imported to or exported from the current statement or procedure. Among the various applications are: compilation of communications for message-passing machines, array privatization, compile-time optimization of local memory or cache behavior in hierarchical memory machines.
Interprocedural Symbolic Analysis
, 1994
"... Compiling for efficient execution on advanced computer architectures requires extensive program analysis and transformation. Most compilers limit their analysis to simple phenomena within single procedures, limiting effective optimization of modular codes and making the programmer's job harder. We p ..."
Abstract
-
Cited by 48 (1 self)
- Add to MetaCart
Compiling for efficient execution on advanced computer architectures requires extensive program analysis and transformation. Most compilers limit their analysis to simple phenomena within single procedures, limiting effective optimization of modular codes and making the programmer's job harder. We present methods for analyzing array side effects and for comparing nonconstant values computed in the same and different procedures. Regular sections, described by rectangular bounds and stride, prove as effective in describing array side effects in Linpack as more complicated summary techniques. On a set of six programs, regular section analysis of array side effects gives 0 to 39 percent reductions in array dependences at call sites, with 10 to 25 percent increases in analysis time. Symbolic analysis is essential to data dependence testing, array section analysis, and other high-level program manipulations. We give methods for building symb...
Advanced Program Restructuring for High-Performance Computers with Polaris
- IEEE COMPUTER
, 1996
"... Multiprocessor computers are rapidly becoming the norm. Parallel workstations are widely available today, and it is likely that most PCs in the near future will contain multiple processors. To accommodate these changes, some classes of applications must be developed in explicitly parallel form. Ye ..."
Abstract
-
Cited by 46 (26 self)
- Add to MetaCart
Multiprocessor computers are rapidly becoming the norm. Parallel workstations are widely available today, and it is likely that most PCs in the near future will contain multiple processors. To accommodate these changes, some classes of applications must be developed in explicitly parallel form. Yet, in order to avoid a substantial increase in software development costs, compilers that translate conventional programs into efficient parallel form clearly will be necessary. In the ideal case, multiprocessor parallelism should be as transparent to programmers as instruction level parallelism is to programmers of today's superscalar machines. However, compiling for multiprocessors is substantially more complex than compiling for functional unit parallelism, in part because successful parallelization often requires a very accurate analysis of long sections of code. This article discusses recent experience at Illinois on the automatic parallelization of scientific codes using the Polaris restructurer. We also present several new analysis techniques that we have developed in recent years based on an extensive analysis of the characteristics of real Fortran codes. These techniques, based on both static and dynamic parallelization strategies, have been incorporated into the Polaris restructurer. Preliminary results on parallel workstations are encouraging, and once the implementation of the new techniques is complete, we expect that Polaris will be able to obtain good speedups for many real scientific codes on parallel workstations.

