Results 1 - 10 of 329
The program dependence graph and its use in optimization - ACM Transactions on Programming Languages and Systems, 1987
"... In this paper we present an intermediate program representation, called the program dependence graph (PDG), that makes explicit both the data and control dependence5 for each operation in a program. Data dependences have been used to represent only the relevant data flow relationships of a program. ..."
Abstract
-
Cited by 996 (3 self)
- Add to MetaCart
(Show Context)
In this paper we present an intermediate program representation, called the program dependence graph (PDG), that makes explicit both the data and control dependences for each operation in a program. Data dependences have been used to represent only the relevant data flow relationships of a program. Control dependences are introduced to analogously represent only the essential control flow relationships of a program. Control dependences are derived from the usual control flow graph. Many traditional optimizations operate more efficiently on the PDG. Since dependences in the PDG connect computationally related parts of the program, a single walk of these dependences is sufficient to perform many optimizations. The PDG allows transformations such as vectorization, that previously required special treatment of control dependence, to be performed in a manner that is uniform for both control and data dependences. Program transformations that require interaction of the two dependence types can also be easily handled with our representation. As an example, an incremental approach to modifying data dependences resulting from branch deletion or loop unrolling is introduced. The PDG supports incremental optimization, permitting transformations to be triggered by one another and applied only to affected dependences.
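A minimal sketch (not the paper's construction algorithm) of what a PDG might look like as a data structure: statements connected by separate data and control dependence edges, with a single walk of the dependences finding everything affected by a node. The example fragment and statement names are invented for illustration.

```python
# Hypothetical sketch: a program dependence graph for the fragment
#   S1: x = a + b
#   S2: if x > 0:
#   S3:     y = x * 2
#   S4: z = y + 1
# Data dependences connect definitions to uses; control dependences
# connect predicates to the statements whose execution they govern.
from collections import defaultdict

class PDG:
    def __init__(self):
        self.data_deps = defaultdict(set)     # node -> nodes depending on it via data
        self.control_deps = defaultdict(set)  # predicate -> nodes it controls

    def add_data_dep(self, src, dst):
        self.data_deps[src].add(dst)

    def add_control_dep(self, pred, dst):
        self.control_deps[pred].add(dst)

    def affected_by(self, node):
        """Single walk over outgoing dependences: everything reachable from `node`."""
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            for succ in self.data_deps[n] | self.control_deps[n]:
                if succ not in seen:
                    seen.add(succ)
                    stack.append(succ)
        return seen

pdg = PDG()
pdg.add_data_dep("S1", "S2")      # x defined in S1, used in the predicate S2
pdg.add_data_dep("S1", "S3")      # x used in S3
pdg.add_data_dep("S3", "S4")      # y defined in S3, used in S4
pdg.add_control_dep("S2", "S3")   # S3 executes only if S2's predicate holds

print(pdg.affected_by("S1"))      # {'S2', 'S3', 'S4'}
```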
The Omega Test: a fast and practical integer programming algorithm for dependence analysis - Communications of the ACM, 1992
"... The Omega testi s ani nteger programmi ng algori thm that can determi ne whether a dependence exi sts between two array references, and i so, under what condi7: ns. Conventi nalwi[A m holds thati nteger programmiB techni:36 are far too expensi e to be used for dependence analysi6 except as a method ..."
Abstract
-
Cited by 522 (15 self)
- Add to MetaCart
The Omega test is an integer programming algorithm that can determine whether a dependence exists between two array references, and if so, under what conditions. Conventional wisdom holds that integer programming techniques are far too expensive to be used for dependence analysis except as a method of last resort for situations that cannot be decided by simpler methods. We present evidence that suggests this wisdom is wrong, and that the Omega test is competitive with approximate algorithms used in practice and suitable for use in production compilers. Experiments suggest that, for almost all programs, the average time required by the Omega test to determine the direction vectors for an array pair is less than 500 µsecs on a 12 MIPS workstation. The Omega test is based on an extension of Fourier-Motzkin variable elimination (a linear programming method) to integer programming, and has worst-case exponential time complexity. However, we show that for many situations in which ...
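A toy sketch of the question the Omega test decides: do integer iteration values within the loop bounds exist for which two array references touch the same element? The brute-force enumeration below is for illustration only; the Omega test answers the same question symbolically via extended Fourier-Motzkin elimination. The loop, bounds, and subscripts are invented.

```python
# Hypothetical sketch: exact (brute-force) dependence test for
#   for i in 1..N:   A[2*i]     = ...   (write)
#                    ...        = A[f(i)]  (read)
# A dependence exists iff integers i_w, i_r in [1, N] make the subscripts equal.
def dependence_exists(n, write_index, read_index):
    for i_w in range(1, n + 1):
        for i_r in range(1, n + 1):
            if write_index(i_w) == read_index(i_r):
                return (i_w, i_r)   # witness iterations
    return None                      # no integer solution: independent

print(dependence_exists(100, lambda i: 2 * i, lambda i: 2 * i - 1))  # None (even vs. odd)
print(dependence_exists(100, lambda i: 2 * i, lambda i: 2 * i + 4))  # (3, 1)
```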
Interprocedural Compilation of Fortran D for MIMD Distributed-Memory Machines - Communications of the ACM, 1992
"... Algorithms exist for compiling Fortran D for MIMD distributed-memory machines, but are significantly restricted in the presence of procedure calls. This paper presents interprocedural analysis, optimization, and code generation algorithms for Fortran D that limit compilation to only one pass over ea ..."
Abstract
-
Cited by 333 (49 self)
- Add to MetaCart
Algorithms exist for compiling Fortran D for MIMD distributed-memory machines, but are significantly restricted in the presence of procedure calls. This paper presents interprocedural analysis, optimization, and code generation algorithms for Fortran D that limit compilation to only one pass over each procedure. This is accomplished by collecting summary information after edits, then compiling procedures in reverse topological order to propagate necessary information. Delaying instantiation of the computation partition, communication, and dynamic data decomposition is key to enabling interprocedural optimization. Recompilation analysis preserves the benefits of separate compilation. Empirical results show that interprocedural optimization is crucial in achieving acceptable performance for a common application.
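A small sketch of ordering procedures over a call graph so that each procedure is compiled only once, with summary information available when needed. Here callees are ordered before their callers, which is one reading of "reverse topological order"; the Fortran D compiler's actual propagation scheme may differ, and the procedure names are invented.

```python
# Hypothetical sketch: compile procedures in reverse topological order of the
# call graph so callee summaries are available when each caller is compiled.
from graphlib import TopologicalSorter

# caller -> set of callees (invented example program)
call_graph = {
    "main":     {"setup", "solve"},
    "solve":    {"stencil", "exchange"},
    "setup":    set(),
    "stencil":  set(),
    "exchange": set(),
}

# TopologicalSorter takes predecessor sets: treating callees as predecessors
# emits leaf procedures first and "main" last.
order = list(TopologicalSorter(call_graph).static_order())
print(order)  # e.g. ['setup', 'stencil', 'exchange', 'solve', 'main']

for proc in order:
    print(f"compile {proc} using summaries of {sorted(call_graph[proc])}")
```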
Improving Data Locality with Loop Transformations - ACM Transactions on Programming Languages and Systems, 1996
"... ..."
(Show Context)
Fortran D Language Specification, 1990
"... This paper presents Fortran D, a version of Fortran enhanced with data decomposition specifications. It is designed to support two fundamental stages of writing a data-parallel program: problem mapping using sophisticated array alignments, and machine mapping through a rich set of data distribution ..."
Abstract
-
Cited by 303 (50 self)
- Add to MetaCart
This paper presents Fortran D, a version of Fortran enhanced with data decomposition specifications. It is designed to support two fundamental stages of writing a data-parallel program: problem mapping using sophisticated array alignments, and machine mapping through a rich set of data distribution functions. We believe that Fortran D provides a simple machine-independent programming model for most numerical computations. We intend to evaluate its usefulness for both programmers and advanced compilers on a variety of parallel architectures.
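A rough illustration of the kind of machine mapping a block-style data distribution expresses: each processor owns a contiguous slice of a global array. This is not Fortran D syntax, and the sizes and the exact partitioning rule are invented; actual Fortran D distribution semantics may differ in details.

```python
# Hypothetical sketch of a BLOCK-style distribution: map global array indices
# onto p processors in contiguous chunks.
def block_owner(global_index, n, p):
    """Processor owning element `global_index` of an n-element array over p procs."""
    block = -(-n // p)                 # ceiling division: block size
    return global_index // block

def local_range(proc, n, p):
    """Half-open range of global indices owned by processor `proc`."""
    block = -(-n // p)
    lo = proc * block
    return lo, min(lo + block, n)

n, p = 10, 4
for proc in range(p):
    print(proc, local_range(proc, n, p))   # 0 (0, 3)  1 (3, 6)  2 (6, 9)  3 (9, 10)
print(block_owner(7, n, p))                # 2
```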
Some efficient solutions to the affine scheduling problem -- Part I: One-dimensional Time, 1996
"... Programs and systems of recurrence equations may be represented as sets of actions which are to be executed subject to precedence constraints. In many cases, actions may be labelled by integral vectors in some iteration domain, and precedence constraints may be described by affine relations. A s ..."
Abstract
-
Cited by 266 (22 self)
- Add to MetaCart
Programs and systems of recurrence equations may be represented as sets of actions which are to be executed subject to precedence constraints. In many cases, actions may be labelled by integral vectors in some iteration domain, and precedence constraints may be described by affine relations. A schedule for such a program is a function which assigns an execution date to each action. Knowledge of such a schedule allows one to estimate the intrinsic degree of parallelism of the program and to compile a parallel version for multiprocessor architectures or systolic arrays. This paper deals with the problem of finding closed form schedules as affine or piecewise affine functions of the iteration vector. An efficient algorithm is presented which reduces the scheduling problem to a parametric linear program of small size, which can then be readily solved.
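A tiny sketch of what an affine schedule asserts: for a uniform dependence where S(i) uses the value produced by S(i-1), the schedule θ(i) = i gives every action a date strictly later than its source's date. The check below only verifies this causality condition by enumeration on a small domain; the paper finds such schedules symbolically via parametric linear programming. The dependence and domain are invented.

```python
# Hypothetical sketch: verify that an affine schedule respects a precedence
# constraint "source executes strictly before sink" over a small domain.
def theta(i):
    return i  # candidate affine schedule theta(i) = i

def respects_precedence(domain):
    # dependence edges (source iteration, sink iteration) for S(i-1) -> S(i)
    edges = [(i - 1, i) for i in domain if (i - 1) in domain]
    return all(theta(src) < theta(dst) for src, dst in edges)

print(respects_precedence(range(0, 16)))  # True: theta(i-1) = i-1 < i = theta(i)
```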
Global Optimizations for Parallelism and Locality on Scalable Parallel Machines - In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, 1993
"... Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key i ..."
Abstract
-
Cited by 256 (20 self)
- Add to MetaCart
Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key issue in compiling programs for these architectures.
Dataflow Analysis of Array and Scalar References - International Journal of Parallel Programming, 1991
"... Given a program written in a simple imperative language (assignment statements, for loops, affine indices and loop limits), this paper presents an algorithm for analyzing the patterns along which values flow as the execution proceeds. For each array or scalar reference, the result is the name an ..."
Abstract
-
Cited by 254 (3 self)
- Add to MetaCart
(Show Context)
Given a program written in a simple imperative language (assignment statements, for loops, affine indices and loop limits), this paper presents an algorithm for analyzing the patterns along which values flow as the execution proceeds. For each array or scalar reference, the result is the name and iteration vector of the source statement as a function of the iteration vector of the referencing statement. The paper discusses several applications of the method: conversion of a program to a set of recurrence equations, array and scalar expansion, program verification and parallel program construction.
Keywords: dataflow analysis, semantics analysis, array expansion.
1 Introduction
It is a well known fact that scientific programs spend most of their running time in executing loops operating on arrays. Hence if a restructuring or optimizing compiler is to do a good job, it must be able to do a thorough analysis of the addressing patterns in such loops. If taken in full generality, ...
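A brute-force sketch of the result this analysis computes symbolically: for each read of an array element, the source is the last statement instance (name and iteration vector) that wrote that element. The loop, its statements, and the bound are invented; the paper derives the source as a closed-form function of the reading iteration instead of replaying the execution as done here.

```python
# Hypothetical sketch: for the loop
#   for i in 0..n-1:
#     S1: a[i+1] = ...
#     S2: ...    = a[i]
# find, for each read in S2 at iteration i, the write instance that produced a[i].
def array_dataflow(n):
    last_writer = {}   # array element -> ("statement", iteration vector)
    sources = {}       # read instance  -> its source (None means read of an input)
    for i in range(n):
        last_writer[i + 1] = ("S1", (i,))            # S1 writes a[i+1]
        sources[("S2", (i,))] = last_writer.get(i)   # S2 reads a[i]
    return sources

for read, src in array_dataflow(4).items():
    print(read, "<-", src)
# ('S2', (0,)) <- None            : a[0] comes from outside the loop
# ('S2', (i,)) <- ('S1', (i-1,))  : for i >= 1
```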
Improving Register Allocation for Subscripted Variables, 1990
"... INTRODUCTION By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with ..."
Abstract
-
Cited by 225 (35 self)
- Add to MetaCart
By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with marked success [12, 13, 14, 6], allocation of array values to registers only occurred in rare circumstances because standard data-flow analysis techniques could not uncover the available reuse of array memory locations. This deficiency was especially problematic for scientific codes since a majority of the computation involves array references. Our original paper addressed this problem by presenting an algorithm and experiment for a loop transformation, called scalar replacement, that exposed the reuse available in array references in an innermost loop. It also demonstrated experimentally how another loop transformation, called unroll-and-jam [2], could expose more opportunities for scalar…
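A sketch of the scalar replacement idea in plain Python, only as an analogy (the paper targets Fortran loops and hardware registers): a value of a[i] that is reused in the next iteration as a[i-1] is carried in a scalar instead of being re-loaded from the array. The example recurrence is invented.

```python
# Hypothetical sketch of scalar replacement for
#   for i in 1..n-1:  b[i] = a[i] + a[i-1]
# The a[i-1] read in iteration i is the a[i] read in iteration i-1, so one of
# the two array accesses per iteration can become a scalar carried across iterations.
def before(a):
    b = [0.0] * len(a)
    for i in range(1, len(a)):
        b[i] = a[i] + a[i - 1]        # two array loads per iteration
    return b

def after(a):
    b = [0.0] * len(a)
    prev = a[0]                       # scalar holds the value reused next iteration
    for i in range(1, len(a)):
        cur = a[i]                    # one array load per iteration
        b[i] = cur + prev
        prev = cur
    return b

a = [1.0, 2.0, 4.0, 8.0]
assert before(a) == after(a)
print(after(a))  # [0.0, 3.0, 6.0, 12.0]
```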
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA - PPoPP'08, 2008
"... GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of tradition ..."
Abstract
-
Cited by 215 (11 self)
- Add to MetaCart
(Show Context)
GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor’s organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread’s resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve kernel speedups between 10.5X and 457X and total application speedups between 1.16X and 431X.
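A back-of-the-envelope sketch of the resource balance the abstract describes: how many thread blocks can be simultaneously resident on one multiprocessor given per-thread register use and per-block shared memory use. The hardware limits below are assumed values roughly in the spirit of the GeForce 8800 generation, not authoritative figures; consult vendor documentation for real numbers.

```python
# Hypothetical sketch: estimate resident blocks per multiprocessor from
# per-thread register usage and per-block shared memory usage.
SM_REGISTERS    = 8192    # registers per multiprocessor (assumed)
SM_SHARED_BYTES = 16384   # shared memory bytes per multiprocessor (assumed)
SM_MAX_THREADS  = 768     # resident threads per multiprocessor (assumed)
SM_MAX_BLOCKS   = 8       # resident blocks per multiprocessor (assumed)

def active_blocks(threads_per_block, regs_per_thread, smem_per_block):
    by_regs    = SM_REGISTERS // (regs_per_thread * threads_per_block)
    by_smem    = SM_SHARED_BYTES // smem_per_block if smem_per_block else SM_MAX_BLOCKS
    by_threads = SM_MAX_THREADS // threads_per_block
    return max(0, min(by_regs, by_smem, by_threads, SM_MAX_BLOCKS))

# Fewer registers per thread can allow more resident blocks, hiding more latency:
print(active_blocks(256, 10, 4096))  # 3 blocks per multiprocessor
print(active_blocks(256, 16, 4096))  # 2 blocks per multiprocessor
```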