Polychronopoulos, C.: Parallel Programming and Compilers. Kluwer (1988)
.... compiler techniques that are primarily for the frontend modules of parallelizing compilers, such as: effective dependence detection techniques [14, 30, 60, 61, 73], dependence elimination techniques [21, 59, 69], the optimization techniques for parallel job and loop scheduling [46, 57], and the techniques for optimizing data locality and reuse in caches [38, 40, 72]. The work on parallelizing compilers for UMA machines also has given us the foundations necessary to tackle the study of compiler algorithms for more complex classes of machines like distributed memory ....
....in Section 3.1, where all parallel loop iterations are statically assigned to the processors. We use the same loop scheduling scheme for the distributed memory architectures as Polaris currently uses for UMA architectures: cyclic schedules for triangular loops, and block schedules for square loops [57]. According to our experiments, these conventional static scheduling schemes provide relatively good load balancing between processors at low loop scheduling costs in most cases. Figure 3.2 shows the results after the code in Figure 3.1 is transformed by the work partitioning module. P is the ....
C. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic Publishers, MA, 1988.
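The context above pairs cyclic schedules with triangular loops and block schedules with square loops. A minimal sketch of why that pairing balances load is given here, assuming a triangular nest in which outer iteration i performs i + 1 inner iterations. The function names are hypothetical and are not taken from Polaris or from the cited book.

```python
def block_schedule(n_iters, n_procs):
    """Give each processor one contiguous chunk of outer iterations."""
    chunk = -(-n_iters // n_procs)  # ceiling division
    return [list(range(p * chunk, min((p + 1) * chunk, n_iters)))
            for p in range(n_procs)]


def cyclic_schedule(n_iters, n_procs):
    """Deal outer iterations to processors round-robin."""
    return [list(range(p, n_iters, n_procs)) for p in range(n_procs)]


def square_load(iters, inner=8):
    """Assumed cost model: every outer iteration runs `inner` inner iterations."""
    return inner * len(iters)


def triangular_load(iters):
    """Assumed cost model: outer iteration i runs i + 1 inner iterations."""
    return sum(i + 1 for i in iters)


def loads(schedule, cost):
    """Total work per processor under the given cost model."""
    return [cost(iters) for iters in schedule]
```

With 8 outer iterations on 4 processors, the block schedule gives triangular loads of 3, 7, 11 and 15, while the cyclic schedule gives 6, 8, 10 and 12. For the square cost model both schedules are perfectly balanced, so the block schedule's contiguous chunks cost nothing in balance.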
....compile time model and decision methodology (Section 4) and describe the hybrid compile and run time system (Section 5). Finally, we present the modeling and experimental results (Section 6) and our conclusions (Section 7). 2. RELATED WORK Compile time static loop scheduling has been well studied [9, 14]. Static scheduling for heterogeneous NOWs was proposed in [4, 5, 7]. The task queue model for dynamic loop scheduling has targeted shared memory machines [11, 14], while the diffusion model has been used for distributed-memory machines [10]. A method for task level scheduling in heterogeneous ....
....results (Section 6) and our conclusions (Section 7). 2. RELATED WORK Compile time static loop scheduling has been well studied [9, 14]. Static scheduling for heterogeneous NOWs was proposed in [4, 5, 7]. The task queue model for dynamic loop scheduling has targeted shared memory machines [11, 14], while the diffusion model has been used for distributed-memory machines [10]. A method for task level scheduling in heterogeneous programs was proposed in [13], and [2] presents an application specific approach to schedule individual parallel applications. A common approach taken for dynamic ....
Polychronopoulos, C. D. Parallel Programming and Compilers. Kluwer Academic, 1988.
....two scheduling techniques, we introduce the terms M task and S task to denote a task that can run on multiple processors and a single processor, respectively. An M task can be either a purely data parallel task, or a mixed task data parallel routine. While pure data parallel scheduling techniques [3, 11, 12, 15, 24] could still be applied within data parallel M tasks, pure task scheduling techniques [17, 18, 19, 25, 26] are no longer applicable to schedule M tasks. As a result, new approaches have to be found that fully exploit the available parallelism. Scheduling is known to be NP complete even for the ....
C. D. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic, 1988.
....two scheduling techniques, we shall use the terms M task and S task to denote a task that can run on multiple processors and a single processor, respectively. An M task can be either a purely data parallel task, or a mixed task data parallel routine. While pure data parallel scheduling techniques [1, 5, 13, 14, 16, 19, 29] could still be applied within data parallel M tasks, pure task scheduling techniques [12, 15, 17, 22, 23, 24, 30, 31] are no longer applicable to schedule M tasks. As a result, new approaches have to be found that fully exploit the available parallelism. Scheduling is known to be NP complete ....
C. D. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic, 1988.
....two scheduling techniques, we introduce the terms M task and S task to denote a task that can run on multiple processors and a single processor, respectively. An M task can be either a purely data parallel task, or a mixed task data parallel routine. While pure data parallel scheduling techniques [3, 11, 12, 15, 24] could still be applied within data parallel M tasks, pure task scheduling techniques [17, 18, 19, 25, 26] are no longer applicable to schedule M tasks. As a result, new approaches have to be found that fully exploit the available parallelism. Scheduling is known to be NP complete even for the ....
C. D. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic, 1988.
....by the product of the elements in the diagonal. In [ShFo88] two approaches, called the Partitioning Vector and the Smith Normal Form approaches, for identifying independent partitions of algorithms with uniform dependences are given. Other methods such as DOACROSS [Cytr86] and Cycle Shrinking [Poly88] obtain more parallelism by synchronizing the computations assigned to the processors. The first one partially overlaps the execution of successive iterations of the loop in order to satisfy all the dependences of the graph. The second one executes in parallel as many successive iterations in each ....
C.D. Polychronopoulos, "Parallel Programming and Compilers", Kluwer Academic Pub., London, 1988.
....two scheduling techniques, we introduce the terms M task and S task to denote a task that can run on multiple processors and a single processor, respectively. An M task can be either a purely data parallel task, or a mixed task data parallel routine. While pure data parallel scheduling techniques [1, 5, 14, 15, 16, 19, 29] could still be applied within data parallel M tasks, pure task scheduling techniques [13, 21, 22, 23, 24, 30, 31] are no longer applicable to schedule M tasks. As a result, new approaches have to be found that fully exploit the available parallelism. Scheduling is known to be NP complete even for ....
C. D. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic, 1988.
....on a queue by a master process and independently removed by slave processes running in parallel. Consider the following pair of example code segments: In both loops, all the computations are independent, so if there were 10000 processors, each processor could execute a single iteration. In [Poly88] such a loop with no cross iteration dependencies is called a DOALL loop. In the vector add example, each iteration would be relatively short and the execution time would be relatively constant from iteration to iteration. In the particle tracking example, each iteration will choose a random ....
Polychronopoulos C, "Parallel Programming and Compilers", Kluwer Academic, Boston.
....suited to parallel execution. 2.1 Distributed systems Advances in execution speed have traditionally only been reached by developments in circuit design, i.e. processor speed. As the limitations in the physical area of design are coming to an end and the price per performance ratio is very high [Poly88], a system that provides greater processing power for the same or reduced cost, than that of today's fastest uniprocessor or supercomputers, would be much more appealing to the user. A distributed computer system provides greater overall processing power because of its low price per performance ....
....a major factor in determining the granularity that can be used. Say for example a program has many subroutines, each subroutine could be executed in parallel. This level of abstraction is coarse grain, and depending on the size of the subroutines could be very well suited to a distributed system [Poly88]. Many scientific and engineering applications use loop structures to perform iterative computations [Evan95a]. Several different types of parallel loops exist, each depends on the kind ....
[Article contains additional citation context not shown here]
C.D. Polychronopoulos. "Parallel Programming and Compilers", Kluwer Academic Publishers, 1988.
....as the vertical flow in Figure 5. The first problem with such a view is that the compilation phase that is the phase where a task graph is generated from the expression of the sequential program is not usually tractable. For a review of the subject see Padua and Wolfe [1986] or Polychronopoulos [1988]. In practice the best that compiler technology can do, with standard languages, is the partial unravelment of parallelism. In general short range independencies can be found aided by the programmer's use of constructs such as DOACROSS (proposed by Cytron [1986]) and DOALL (see Zima [1990] ....
Polychronopoulos, C. (1988). Parallel Programming and Compilers. Kluwer Academic Publishers, Norwell Mass.
....of standard sequential languages there is a major problem with such a view, namely that the compilation phase that is the phase where a task graph is generated from the expression of the sequential program is not usually tractable. For a review of the subject see Padua and Wolfe [1986] or Polychronopoulos [1988]. In practice the best that compiler technology can do, with standard languages, is the partial unravelment of parallelism. In general short range independencies can be found aided by the programmer's use of constructs such as doacross (proposed by Cytron [1986]) and doall (see [Zima and Chapman ....
Polychronopoulos, C. (1988). Parallel Programming and Compilers. Kluwer Academic Publishers, Norwell Mass.
....optimizations can move instructions between basic blocks and outside of loops so that expansion of registers used in address calculations becomes more difficult. The analysis described here is similar to the data dependence analysis that is performed by vectorizing and parallelizing compilers [5, 6, 7, 21, 28, 29]. However, data dependence analysis is typically performed on a high level representation. Our analysis had to be performed on a low level representation after code generation and all optimizations had been applied. The calculation of relative addresses involves the following steps. 1. The ....
....# 1. save sp, 96) sp r[20] HI[ B] # 2. sethi hi( B) l4 r[18] HI[ A] # 3. sethi hi( A) l2 r[25] 204; # 4. mov 204, i1 r[26] HI[10200] # 5. sethi hi(10200) i2 r[26] r[26] LO[10200] # 6. add i2, lo(10200) i2 r[20] 4 r[25] # 13. add i1,4, l4 r[20] r[20] r[28] # 14. add l4, i4, l4 r[21]=r[25] r[27] # 15. add i1, i3, l5 r[22] r[25] r[28] # 16. add i1, i4, l6 r[21] r[21] r[22] # 17. sub l5, l6, l5 r[22] r[24] # 18. mov i0, l6 ST=HI[ Rand] LO[ Rand] 68,0; # 21. call Rand,0 R[r[20] r[21] r[8] # 22. st o0, l4 l5] R[r[20] r[8] # 23. st o0, l4] PC=RT; # 32. ret ....
[Article contains additional citation context not shown here]
C. D. Polychronopoulos. Parallel Programming and Compilers. Kluwer, 1988.
.... compiler techniques that are primarily for the frontend modules of parallelizing compilers, such as: effective dependence detection techniques [14, 30, 60, 61, 73], dependence elimination techniques [21, 59, 69], the optimization techniques for parallel job and loop scheduling [46, 57], and the techniques for optimizing data locality and reuse in caches [38, 40, 72]. The work on parallelizing compilers for UMA machines also has given us the foundations necessary to tackle the study of compiler algorithms for more complex classes of machines like distributed memory ....
....Section 3.1, where all parallel loop iterations are statically assigned to the processors. We use the same loop scheduling scheme for the distributed memory architectures as Polaris currently uses for UMA architectures: cyclic schedules for triangular loops, and block schedules for square loops [57]. According to our experiments, these conventional static scheduling schemes provide relatively good load balancing between processors at low loop scheduling costs in most cases. Figure 3.2 shows the results after the code in Figure 3.1 is transformed by the work partitioning module. P is the ....
C. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic Publishers, MA, 1988.
....load, reducing communication overhead, and overlapping communication with computation. Using DAG scheduling algorithms for ITGs is not feasible since the number of iterations may be too large or may not even be known at compile time. Loop parallelism can be uncovered by transformation methods [P88, SK92, WF89] and various loop scheduling techniques have been proposed. Self scheduling (e.g. [P88]) is a dynamic method for DOALL parallelism when task weights are not predictable at static time. Compared to this work, we are interested in computation in which there exist dependencies between tasks; task ....
....scheduling algorithms for ITGs is not feasible since the number of iterations may be too large or may not even be known at compile time. Loop parallelism can be uncovered by transformation methods [P88, SK92, WF89] and various loop scheduling techniques have been proposed. Self scheduling (e.g. [P88]) is a dynamic method for DOALL parallelism when task weights are not predictable at static time. Compared to this work, we are interested in computation in which there exist dependencies between tasks; task weights are predictable statically and do not change significantly at run time. The ....
[Article contains additional citation context not shown here]
C. D. Polychronopoulos, Parallel Programming and Compilers, Kluwer Academic, 1988.
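The contexts above describe self scheduling as a dynamic method for DOALL loops whose iteration costs are not known in advance. A minimal sketch of the general idea follows, assuming a shared iteration counter that idle workers advance under a lock. The function `self_schedule` and its parameters are hypothetical and do not reproduce any specific scheme from the cited book.

```python
import threading


def self_schedule(n_iters, n_workers, body):
    """Run body(i) for i in range(n_iters); each idle worker claims the next
    unclaimed iteration at run time instead of receiving a fixed share."""
    next_iter = 0
    lock = threading.Lock()
    executed = [[] for _ in range(n_workers)]  # iterations run by each worker

    def worker(wid):
        nonlocal next_iter
        while True:
            with lock:  # claiming an iteration is the only serialized step
                i = next_iter
                if i >= n_iters:
                    return
                next_iter += 1
            body(i)  # iterations are independent (DOALL), so no further sync
            executed[wid].append(i)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed
```

Which worker runs which iteration varies from run to run; the only guarantee in this sketch is that every iteration is executed exactly once. Claiming one iteration per lock acquisition keeps the load balanced at the price of the most scheduling overhead.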
....architectures, a desirable feature is that the partitioning is a coarse grain DAG. This is because it is easier to develop efficient scheduling algorithms for a DAG than for a directed graph. For a DAG a task needs to communicate only at the beginning and the end of its execution. Polychronopoulos [19] describes an algorithm that transforms DDG task graphs into DAGs by merging strongly connected tasks. In this paper, we consider the clustering problem for DAGs. Clustering is a mapping of the nodes of a task graph onto labeled clusters. A cluster consists of a set of tasks while a task is an ....
....clustering which is an important special case of clustering. Sarkar [20] presents an algorithm for scheduling on an unbounded number of processors. Kruatrachue and Lewis [14] study the grain size determination for DAGs and propose a scheduling heuristic with duplication of tasks. Polychronopoulos [19] presents an extensive analysis of the partitioning problem and a local algorithm for merging tasks. Wu and Gajski [22] developed a programming aid for hypercube architectures using a clustering algorithm that eliminates unnecessary processors without increasing parallel time. A comparison of ....
[Article contains additional citation context not shown here]
C. D. Polychronopoulos. "Parallel Programming and Compilers." Kluwer Academic Publishers, 1988.
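The contexts above mention transforming a dependence graph into a DAG by merging strongly connected tasks. A minimal sketch of that idea is given here using the standard strongly connected component condensation (Kosaraju's two-pass method). This is a textbook technique swapped in for illustration, not the algorithm from the cited book, and the name `condense` and the graph encoding are assumptions.

```python
def condense(graph):
    """graph maps every task to a list of its successors (every task must be
    a key). Returns (component_of, dag_edges): tasks on a common cycle share
    one representative, and dag_edges link distinct representatives."""
    nodes = list(graph)

    # Pass 1: iterative DFS recording tasks in order of completion.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all successors explored
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    reverse = {n: [] for n in nodes}
    for u in nodes:
        for v in graph[u]:
            reverse[v].append(u)
    component_of = {}
    for root in reversed(order):
        if root in component_of:
            continue
        component_of[root] = root
        stack = [root]
        while stack:
            u = stack.pop()
            for v in reverse[u]:
                if v not in component_of:
                    component_of[v] = root
                    stack.append(v)

    # Edges between different merged tasks form an acyclic graph.
    dag_edges = {(component_of[u], component_of[v])
                 for u in nodes for v in graph[u]
                 if component_of[u] != component_of[v]}
    return component_of, dag_edges
```

For a graph with the cycle A, B, C followed by D and E, the three cyclic tasks collapse into one merged task and the result has two edges: merged task to D, and D to E.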
....of tasks in G that T_x depends on. • The expanded graph E(G, N) contains all N·v task instances T_i^k (1 ≤ i ≤ v, 1 ≤ k ≤ N), and these tasks and their dependencies in E(G, N) constitute a DAG. [Footnote 1: Our results can be extended for inexact dependence values such as and used in the literature [18].] • Each task T_x in an ITG has a computation cost x and there is a communication cost for sending a message from task T_x on one processor to task T_y on another processor. The cost is denoted as c_{x,y}, which is usually estimated as startup cost + transmission speed × size of ....
C. D. Polychronopoulos, Parallel Programming and Compilers, Kluwer Academic, 1988.
....by the use of the PARFOR syntax in Figure 5.6. Any FOR loop for which IDDL(S ij k ) is empty has no loop carried dependences and may be replaced in its entirety by the equivalent PARFOR loop. The class compiler can generate code which replaces PARFOR statements with the corresponding unravelled [Pol88] method steps. This is possible because the value of MaxDepDist is known at compile time and thus the PARFOR loop has constant lower and upper bounds. Knowing the bounds at compile time permits unravelling. The unravelling .... [Footnote 3: For convenience, it is assumed that arrays are subscripted from zero ....]
C.D. Polychronopoulos. Parallel Programming and Compilers. Kluwer Academic, 1988.
....of standard sequential languages there is a major problem with such a view, namely that the compilation phase that is the phase where a task graph is generated from the expression of the sequential program is not usually tractable. For a review of the subject see Padua and Wolfe [1986] or Polychronopoulos [1988]. In practice the best that compiler technology can do, with standard languages, is the partial unravelment of parallelism. In general short range independencies can be found aided by the programmer's use of constructs such as doacross (proposed by Cytron [1986]) and doall (see [Zima and ....
Polychronopoulos, C. (1988). Parallel Programming and Compilers. Kluwer Academic Publishers, Norwell Mass.
....DAG (Directed Acyclic Graphs) scheduling [4, 16, 20, 24, 27]. Using graph scheduling algorithms for iterative computation is not feasible since the number of iterations may be too large or may not even be known at compile time. Loop scheduling has been studied extensively in the previous work e.g. [6, 17, 21]. Software pipelining [2, 11, 15, 19, 22] is an important technique proposed for instruction level loop scheduling on VLIW and superscalar architectures. Loop unrolling or graph unfolding techniques [2, 18] have also been developed to allow a compiler to explore more parallelism. Our work has been ....
....the SOR ITG for BCSSTK14 and BCSSTK15. 7 Conclusions Our experiments show that the automatic scheduling algorithm for ITGs delivers good performance on message passing architectures. Our work is useful for assisting performance prediction in compilation optimization such as program partitioning [3, 17, 21]. Currently we are implementing code generation and runtime support for executing ITG schedules and investigating applications in sparse matrix computations. Acknowledgement This was supported in part by NSF RIA CCR9409695 and by ARPA contract DABT 63 93 C 0064. We thank Pedro Diniz for his help ....
C. D. Polychronopoulos, Parallel Programming and Compilers, Kluwer Academic, 1988.
....model consists of a set of tasks and a set of distinct data objects. Each task reads and writes a subset of data objects. Data dependence graphs (DDG) derived from partitioned code normally have three types of dependence between tasks: true dependence, antidependence, and output dependence [Polychronopoulos 1988]. In a DDG, some anti or output dependence edges may be redundant if they are subsumed by other true data dependence edges. Other anti output dependence edges can be eliminated by program transformation. This article deals with a transformed dependence graph that contains only acyclic true ....
Polychronopoulos, C. D. 1988. Parallel Programming and Compilers. Kluwer Academic Publishers.
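The context above names three kinds of dependence between tasks: true dependence, antidependence, and output dependence. A minimal sketch of how they can be told apart from read and write sets is given here, assuming each task is described by the data objects it reads and writes and that `first` runs before `second`. The function name and the dictionary layout are hypothetical, and the sketch ignores intervening writes by other tasks.

```python
def classify_dependences(first, second):
    """first and second are dicts with 'reads' and 'writes' sets of data
    objects; first executes before second. Returns the objects that carry
    each kind of dependence, omitting kinds that do not occur."""
    kinds = {
        # true (flow) dependence: second reads what first wrote
        "true": first["writes"] & second["reads"],
        # antidependence: second overwrites what first read
        "anti": first["reads"] & second["writes"],
        # output dependence: both write the same object
        "output": first["writes"] & second["writes"],
    }
    return {kind: objs for kind, objs in kinds.items() if objs}
```

For S1: a = b + c followed by S2: b = a * 2, this reports a true dependence on a and an antidependence on b. Only the true dependence reflects a value actually flowing between tasks, which is why the anti and output edges are the ones the quoted article treats as candidates for elimination.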
....High performance on parallel computers can be achieved only when they are programmed effectively. The complexity of this task and portability issues make automatic support of parallel program development highly desirable [Allen and Kennedy 1987; Kuck et al. 1980; Padua and Wolfe 1986; Polychronopoulos 1988]. Despite all the effort spent on the automatic parallelization of programs, existing parallelizing compilers are not able to deliver the desired performance over a wide range of real applications [Blume and Eigenmann 1992]. The challenging problem confronting designers of parallelizing compilers ....
....the dependence problem, but also complicates subscript expressions and can raise difficulties for dependence analysis. However, there are cases where loop normalization becomes necessary. The advantages and disadvantages of loop normalization for dependence testing have been studied by Girkar and Polychronopoulos [1988] and by Wolfe [1993]. Parafrase-2 tries to avoid loop normalizations that complicate subscript expressions. Even in the cases where normalization becomes necessary, it is done in such a way that usually simplifies subscript expressions and thereafter dependence analysis ....
Polychronopoulos, C. D. 1988. Parallel Programming and Compilers. Kluwer Academic Publishers, Norwell, Mass.
No context found.
Polychronopoulos, C.: Parallel Programming and Compilers. Kluwer (1988)
No context found.
POLYCHRONOPOULOS, C. Parallel Programming and Compilers. Kluwer, Boston, 1988.
No context found.
Polychronopoulos, C.: Parallel Programming and Compilers. Kluwer (1988)
No context found.
C. D. Polychronopoulos, "Parallel Programming and Compilers," Kluwer Academic Publishers, 1988, ISBN: 0-89838-288-2, QA76.6.p653 1988
No context found.
Polychronopoulos C.D., Parallel Programming and Compilers. Boston, MA: Kluwer, 1988.