U. Banerjee, S. Chen, D.J. Kuck, and R.A. Towle. Time and Parallel Processor Bounds for Fortran-Like Loops. IEEE Transactions on Computers, C-28(9):660--670, September 1979.
....The first 1 Let O1 and O2 be operations such that O1 precedes O2 in the original code. O2 must follow O1 if either of the following conditions holds: 1) O2 has a true data dependence on O1 if O2 reads data written by O1. 2) O2 is data anti-dependent on O1 if O2 destroys data required by O1 [BSKT79, Veg82]. [Figure 1.1: Loop Dependency Example. (a) Loop Body Code (b) DDG] type is called a loop independent dependency and shows dependencies ....
U. Banerjee, S. Chen, D.J. Kuck, and R.A. Towle. Time and Parallel Processor Bounds for Fortran-Like Loops. IEEE Transactions on Computers, C-28(9):660--670, September 1979.
....have been presented in the literature for partitioning DO loops into computations that can be executed in parallel. It is possible to obtain fully parallel code when there are no cyclic dependence chains, running all the iterations of the loop independently in a DOALL-like loop. Loop distribution [BCKT79] splits a loop into a sequence of loops, each containing either a single statement or a group of statements involved in a cyclic dependence relationship. Problems arise when dependences form recurrences or cycles in the dependence graph. In this case, parallelizing methods can be classified into (a) methods that try ....
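As an illustration of the loop distribution idea in the excerpt above, the following is a minimal sketch, not the algorithm of the cited paper. It assumes the statement-level dependence graph is already known and uses Tarjan's strongly connected components algorithm (a swapped-in, standard technique) to group statements that lie on a common dependence cycle. The names `strongly_connected_components` and `distribute` and the statement labels are hypothetical.

```python
def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm. Returns SCCs in reverse topological order."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for n in nodes:
        if n not in index:
            visit(n)
    return sccs


def distribute(nodes, edges):
    """Return groups of statements, one group per distributed loop.

    Each group is either a single statement or a set of statements on a
    dependence cycle. Groups are ordered so that every dependence edge
    points from an earlier group to a later (or the same) group.
    """
    return list(reversed(strongly_connected_components(nodes, edges)))


# Hypothetical loop body: S2 and S3 form a recurrence, S1 feeds it, S4 uses it.
statements = ["S1", "S2", "S3", "S4"]
dependences = [("S1", "S2"), ("S2", "S3"), ("S3", "S2"), ("S3", "S4")]
loops = distribute(statements, dependences)
```

In this sketch the single-statement groups correspond to loops that could run as DOALL-like loops, while the multi-statement group is the recurrence that needs further treatment.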
U. Banerjee, S. Chen, D.J. Kuck and R.A. Towle, "Time and Parallel Processor Bounds for Fortran-like Loops", IEEE Trans. on Computers, Vol. C-28, No. 9, Sept 1979.
....variety of tests, all of which can prove independence in some case. It is infeasible to solve the problem directly for large problems, even for linear functions, because finding dependences is equivalent to the NP-complete problem of finding integer solutions to systems of Diophantine equations [14]. This is also known as the fundamental theorem of dependence analysis as formulated by Banerjee [12]. Tests can be categorised as general or approximate tests, exact tests and multiple dimension exact tests, but exactness is not always possible and usually expensive. 2.2.5 False Dependences ....
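One widely used approximate test of the kind the excerpt alludes to is the GCD test; the excerpt does not name it, so it is used here only as an illustration. The sketch below assumes two one-dimensional array references with linear subscripts `a*i + b` and `c*j + d`, and ignores loop bounds, so it can only prove independence, never dependence. The function name `gcd_test_may_depend` is hypothetical.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """Conservative GCD test for subscripts a*i + b and c*j + d.

    Returns False only when the linear Diophantine equation
    a*i - c*j = d - b has no integer solution, in which case the two
    references can never touch the same element. Returns True when a
    solution may exist (loop bounds are not checked).
    """
    g = gcd(a, c)
    if g == 0:
        # Both subscripts are constants: they collide only if equal.
        return b == d
    return (d - b) % g == 0


# x[2*i] versus x[2*j + 1]: even and odd elements never overlap.
even_vs_odd = gcd_test_may_depend(2, 0, 2, 1)
# x[2*i] versus x[4*j + 2]: an integer solution exists, so a dependence is possible.
even_vs_even = gcd_test_may_depend(2, 0, 4, 2)
```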
Utpal Banerjee, Shyh-Ching Chen, David J. Kuck, and Ross A. Towle. Time and parallel processor bounds for FORTRAN-like loops. IEEE Transactions on Computers, C-28(9):660-670, 1979.
....When register assignment occurs before instruction scheduling (i.e. early register assignment) graph coloring register assignment generally assigns the program's values to a near minimal number of registers; however this process causes addition of false dependences. These false dependences [3] arise due to the re-definition of registers when multiple values are mapped to the same physical register. Consider the code in Figure 2.6. For this example, we shall assume an ILP level of two; that is [Figure: four-operation code example over t1, t2, sum, prod, A[i], B[i]] ....
U. Banerjee, S. Chen, D.J. Kuck, and R.A. Towle. Time and parallel processor bounds for FORTRAN-like loops. IEEE Transactions on Computers, C-28(9):660--670, Sep 1979.
....SGEFA. Although fairly different in structure, both applications presented result in clear programs with parallelism constructs appearing in natural ways. The importance of DOALL type constructs opens the way to application of the research already done in parallelizing Fortran loops automatically [27]. The barrier makes possible the separation of a computation into sequential phases without invoking the process environment management overhead of Fork Join. Producer consumer synchronization and critical sections make it easy to deal with mutual exclusion type restrictions on access to shared ....
U. Banerjee, S. C. Chen, D. J. Kuck and R. A. Towle, "Time and parallel processor bounds for FORTRAN-like loops," IEEE Trans. Comput., C-28 (9) (1979) 660-670.
....output dependences. Let O1 and O2 be operations such that O1 precedes O2 in the original code. O2 must follow O1 if any of the following conditions hold: (1) O2 is data dependent on O1 if O2 reads data written by O1, (2) O2 is anti-dependent on O1 if O2 destroys data required by O1 [9], or (3) O2 is output dependent on O1 if O2 writes to the same variable as does O1. The term dependence refers to data dependences, anti-dependences, and output dependences. [Figure: pipeline diagram with Source 1, Source 2, ALU and Multiplier stages, and Result Bus] ....
U. Banerjee, S. Chen, D.J. Kuck, and R.A. Towle. Time and Parallel Processor Bounds for Fortran-Like Loops. IEEE Transactions on Computers, C-28(9):660--670, September 1979.
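The three conditions quoted above can be sketched as set intersections. This is a minimal illustration, assuming each operation is summarized by the set of variables it reads and the set it writes; the dictionary layout and the name `classify_dependences` are hypothetical and do not come from the cited work.

```python
def classify_dependences(first, second):
    """Classify dependences of `second` on `first`, where `first` precedes
    `second` in the original code.

    Each operation is a dict with 'reads' and 'writes' sets of variable names.
    Returns a list drawn from 'data', 'anti', 'output' (possibly empty).
    """
    kinds = []
    if first["writes"] & second["reads"]:
        kinds.append("data")    # second reads data written by first
    if first["reads"] & second["writes"]:
        kinds.append("anti")    # second destroys data required by first
    if first["writes"] & second["writes"]:
        kinds.append("output")  # both write the same variable
    return kinds


# Hypothetical operations: o1 computes x, o2 uses x, o3 overwrites x.
o1 = {"reads": {"i"}, "writes": {"x"}}
o2 = {"reads": {"x", "b"}, "writes": {"c"}}
o3 = {"reads": {"c"}, "writes": {"x"}}

o2_on_o1 = classify_dependences(o1, o2)
o3_on_o2 = classify_dependences(o2, o3)
o3_on_o1 = classify_dependences(o1, o3)
```

If the list is empty, nothing in these three conditions forces the second operation to follow the first.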
....implementing a software mechanism similar to those proposed by Bernstein et al. [10] for proving the safety of speculative loads and measuring its impact on performance. The third is augmenting our scheduler to allow weak ordering of memory references by performing memory reference disambiguation [4, 32, 13, 5]. The fourth is in integrating a mechanism for register reallocation and spill code insertion into our scheduling framework (for related work, see [6]). Some of these efforts are currently underway. Acknowledgments This research was supported by NSF Grant No. MIP9007678 and by an NSF Graduate ....
U. Banerjee, S. Chen, D. Kuck, R. Towle. Time and Parallel Processor Bounds for Fortran-Like Loops. IEEE Trans. on Computers. C-28(9), Sept., 1979.
....primitives, and complex communication primitives such as parallel prefix [Thi92]. Compilers such as [AALL93, Arn82, THK93] find data parallelism in loops automatically. Data is often mapped across processors by the programmer, perhaps with some compiler assistance [THK93, CMZ92, Hig93, GOS94, BCKT79, KP96]. Pedigree automatically parallelizes a single program for execution across multiple processors. The key differences from previous work pertain to the degree of automation, generality, granularity, the overlap of inter-dependent code, and flexibility in mapping available parallelism to ....
Utpal Banerjee, Shyh-Ching Chen, David Kuck, and Ross Towle. Time and Parallel Processor Bounds for FORTRAN-Like Loops. IEEE Transactions on Computers, C-28(9), September 1979.
....the variables in loops as shown in this paper is to improve the generality of dependence testing in loops, generating more precise dependence graphs and allowing more aggressive optimizations. Dependence testing for linear induction variables is generally well covered in the literature [AK87, BCKT79, GKT91, MHL91]. Banerjee presents some algorithms for handling polynomial induction variables in his MS thesis [Ban76]. After a brief review, we focus on dependence with wrap-around variables, monotonic variables and periodic variables. The algorithm used to classify variables will actually ....
.... the loop: for i = 2 to 11 by 3 loop ... A(i) ... will be changed to: for i = 0 to (11 - 2) / 3 loop ... A(i * 3 + 2) ... This transformation was initially designed to simplify the formulation of data dependence testing algorithms [BCKT79]. Since normalization always puts the loop lower limits into subscript expressions, it can complicate life for simple dependence analyzers when the lower limit contains other variables, as shown by the work on Parafrase [SLY90]. For this reason, and because it can adversely affect the kinds of ....
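The loop normalization described in this excerpt (a loop from 2 to 11 by 3 becoming a loop from 0 to (11 - 2) / 3 whose subscript is i * 3 + 2) can be checked with a small sketch. It assumes an inclusive upper bound, a positive step, and integer division for the trip count; the function names are hypothetical.

```python
def original_indices(lower, upper, step):
    """Subscripts touched by: for i = lower to upper by step ... A(i) ..."""
    return list(range(lower, upper + 1, step))


def normalized_indices(lower, upper, step):
    """Subscripts touched by the normalized loop:
    for i = 0 to (upper - lower) / step ... A(i * step + lower) ...

    Assumes step > 0; the division is integer (floor) division.
    """
    last = (upper - lower) // step
    return [i * step + lower for i in range(0, last + 1)]


before = original_indices(2, 11, 3)
after = normalized_indices(2, 11, 3)
```

The sketch also shows the point made in the excerpt: after normalization the lower limit appears inside the subscript expression `i * step + lower`.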
Utpal Banerjee, Shyh-Ching Chen, David J. Kuck, and Ross A. Towle. Time and parallel processor bounds for Fortran-like loops. IEEE Trans. on Computers, C-28(9):660--670, September 1979.
.... Assignment Before Instruction Scheduling When register assignment is done before instruction scheduling (early register assignment) graph coloring register assignment generally assigns the program's values to a near minimal number of registers; however this process can cause anti-dependences [2] to be added. These anti-dependences arise due to the re-definition of registers that occurs when multiple values are mapped to the same physical register. Consider the code in Figure 1. For this example, we shall assume a machine that can initiate two operations per cycle. We further assume that ....
BANERJEE, U., CHEN, S., KUCK, D., AND TOWLE, R. Time and parallel processor bounds for FORTRAN-like loops. IEEE Transactions on Computers C-28, 9 (Sep 1979), 660--670.
.... Assignment Before Instruction Scheduling When register assignment is done before instruction scheduling (early register assignment) graph coloring register assignment generally assigns the program's values to a near minimal number of registers; however this process can cause anti-dependences [2] to be added. These anti-dependences arise due to the re-definition of registers that occurs when multiple values are mapped to the same physical register. Consider the code in Figure 1. For this example, we shall as [Figure: four-operation code example over t1, t2, sum, prod, A[i], B[i]] ....
BANERJEE, U., CHEN, S., KUCK, D., AND TOWLE, R. Time and parallel processor bounds for FORTRAN-like loops. IEEE Transactions on Computers C-28, 9 (Sep 1979), 660--670.
....of the loop several times without leaving it. In general, the problem of finding the longest dependence chain can be formulated as an integer programming problem. If the distances associated with dependences are small compared with loop limits and all recurrences are included in a single π-block [6], the problem can be directly formulated with the weight of recurrences in the dependence graph instead of the distances of individual arcs. Meanwhile, the parallelism of the most restrictive recurrence evaluated as (3) gives an upper bound of the loop parallelism and processors required to ....
Banerjee U., Chen S., Kuck D.J. and Towle R.A., "Time and Parallel Processor Bounds for Fortran-like Loops", IEEE Trans. on Computers, Vol. C-28, No. 9, Sept 1979.
....the following sections is included in the extended abstract; the example included is referred to in a subsequent section. #endif Loop Distribution. DIS is legal as long as any cycle of dependence relations is not broken across the distributed loop(s) some reordering of the code may be necessary [BCK79]. #ifdef short [example of how CSE disables DIS, and how CSE 1# can enable DIS] #else For instance, the following loop: for I = 1 to N do S 1# : A(I) B(I) 1 S 2# : B(I 1) B(I) C(I) D(I) endfor has the dependence relations S 2# d ( # S 1# S 2# d ( # S 2# The loop can be ....
U. Banerjee, S. Chen, D. J. Kuck and R. A. Towle, Time and Parallel Processor Bounds for Fortran-Like Loops, IEEE Trans. on Computers C-28, 9 (September 1979), 660-670.
....and dataflow analysis (Aho et al. [ASU86] or Hecht [Hec77]). Another analysis important for generating efficient compacted code is accurately determining data dependencies among a program's operations. Data dependence concepts and standard terminology are widely discussed in the literature [BSKT79, PKL80, Veg82, PW86, Ban88]. The three basic types of data dependence are: • Flow Dependence, sometimes called true dependence or data dependence. An operation m2 is flow dependent on operation m1 if m1 executes before m2 and m1 writes to some memory location read by m2. • ....
U. Banerjee, S. Chen, D.J. Kuck, and R.A. Towle. "Time and parallel processor bounds for Fortran-like loops". IEEE Transactions on Computers, C-28(9):660--670, Sep 1979.
....in terms of time complexity. 1. Introduction One of the most tedious tasks for many sequential algorithms is the execution of nested FOR (DO) loops with uniform data dependencies. Therefore, a significant number of papers have been devoted to the efficient parallelization of these loops [1], [3], [8]. Most of the well-known signal and image processing algorithms, such as LU decomposition, convolution, etc., belong in this category. Furthermore, a large class of algorithms with non-uniform dependencies falls into this class after applying methods such as the one presented in [17]. The ....
U. Banerjee, S.-C. Chen, D. J. Kuck and R. A. Towle, "Time and Parallel Processor Bounds for Fortran-Like Loops," IEEE Trans. Comput., vol. C-28, no. 9, pp. 660-670, Sept. 1979.
....and their dependent computation; 3. Applying the Regular Schedule. There exist several methods for compiler recognition of recurrences in loops that can be used in Step 1 of our approach described above (since this is not the topic of this paper, we briefly overview these works). Banerjee et al. [3] showed how the data dependence graph can be used to isolate recurrences so that pattern matching techniques can be applied. Ammarguellat and Harrison [2] put forth a method for recognizing recurrence relations automatically. Pinter and Pinter [25] gave a graph rewriting technique for recognizing ....
....[13] into array stb5[N 1] which enables us to recognize that the core recurrence is on the second statement in the new loop. We then construct a matrix chain multiplication for the core recurrence as follows: stb5[k] 1] stb5[1] 1] sb[1] 0 1 0 sa[1] 1 # sb[2] 0 1 0 sa[2] 1 # sb[3] 0 1 0 sa[3] 1 # 1 1 1 sb[k] 0 1 0 sa[k] 1 # : 4) where 1 k N 1. Once we have constructed the matrix chain multiplication for the core recurrence, we can easily apply our Regular Schedule to the chain for computing stb5[k] We organize the core recurrence with the computation ....
U. Banerjee, S. C. Chen, D. Kuck and R. Towle, "Time and parallel processor bounds for Fortran-like loops", IEEE Trans. on Computers, C-28(9), pp. 660-670, September 1979.
.... redundant tree height. #define PI RTH RTH period size. #define N2 N1 PI PI PI 2 padding to make array size a multiple of RTH RTH plus 2. register int k, m, i, j, l; double x[N2] a[N2] b[N2] c[N2] ar[N2] br[N2] initialize arrays a[N2] b[N2] c[N2] x[1] c[1] x[2]=c[2] do computation of one period at a time. L1: for(m=2; m =N2; m =PI) f do simultaneously 1st matrix multiplications on p redundant trees in a period. vectorizable. L2.1: for (i=m 1; i =m 1 RTH (RTH 1) i =RTH) f L4.1: ar[i 1] a[i] a[i 1] b[i 1] br[i 1] b[i] a[i 1] ....
....redundant tree height. #define PI RTH RTH period size. #define N2 N1 PI PI PI 2 padding to make array size a multiple of RTH RTH plus 2. register int k, m, i, j, l; double x[N2] a[N2] b[N2] c[N2] ar[N2] br[N2] initialize arrays a[N2] b[N2] c[N2] x[1] c[1] x[2] c[2]; do computation of one period at a time. L1: for(m=2; m =N2; m =PI) f do simultaneously 1st matrix multiplications on p redundant trees in a period. vectorizable. L2.1: for (i=m 1; i =m 1 RTH (RTH 1) i =RTH) f L4.1: ar[i 1] a[i] a[i 1] b[i 1] br[i 1] b[i] a[i 1] ....
U. Banerjee, S. C. Chen, D. Kuck and R. Towle, "Time and parallel processor bounds for Fortran-like loops", IEEE Trans. on Computers, C-28(9), pp. 660-670, September 1979.
....2. Construct the matrix chain multiplication for the BLR s and their dependent computation; 3. Applying the Regular Schedule. There exist several methods for compiler recognition of recurrences in loops (since this is not the topic of this paper, we briefly overview these works). Banerjee et al. [3] showed how the data dependence graph can be used to isolate recurrences so that pattern matching techniques can be applied. Ammarguellat and Harrison [2] put forth a method for recognizing recurrence relations automatically. Pinter and Pinter [23] gave a graph rewriting technique for recognizing ....
....into array stb5[N 1] which enables us to recognize that the core recurrence is on the second statement in the new loop. We then construct a matrix chain multiplication for the core recurrence as follows: stb5[k] 1] stb5[1] 1] sb[1] 0 1 0 sa[1] 1 # sb[2] 0 1 0 sa[2] 1 # sb[3] 0 1 0 sa[3] 1 # 1 1 1 sb[k] 0 1 0 sa[k] 1 # : 4) where 1 k N 1. Once we constructed the matrix chain multiplication for the core recurrence, we can easily apply our Regular Schedule to the chain for computing stb5[k] We organize the core recurrence with the computation around it ....
U. Banerjee, S. C. Chen, D. Kuck and R. Towle, "Time and parallel processor bounds for Fortran-like loops", IEEE Trans. on Computers, C-28(9), pp. 660-670, September 1979.
....type of technique that generates heterogeneous parallel code transforms serial do loops into two or more serial loops that execute in parallel with each other. The technique is based on a transformation called loop distribution, developed by Muraoka [66] and also described by Banerjee et al. [67] which partitions the statements in the loop body into a sequence of subsequences and creates a separate loop for each subsequence. Example 14 Consider the loop of Example 11. We can partition the statements in the loop body into three subsequences: do K 1 = 0; N S 1 : A(K 1 ) F 1 (A(K 1 ....
U. Banerjee, S. C. Chen, D. J. Kuck, and R. A. Towle. Time and Parallel Processor Bounds for Fortran-Like Loops. IEEE Trans. on Computers, C-28(9):660--670, Sept. 1979.
No context found.
U. Banerjee, S.C. Chen, D.J. Kuck, and R.A. Towle. Time and parallel processor bounds for Fortran-like loops. IEEE Trans. Computers, C-28(9), September 1979.
No context found.
U. Banerjee, S. Chen, D.J. Kuck, and R.A. Towle. "Time and parallel processor bounds for Fortran-like loops". IEEE Transactions on Computers, C-28(9):660--670, Sep 1979.