# S. Leung and J. Zahorjan, "Improving the Performance of Runtime Parallelization," Proc. Fourth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP), pp. 83--91, May 1993.
....overhead can easily overwhelm any potential benefits due to the exploitation of parallelism. Although recent research has yielded some progress on reducing message based communication overhead [81] and supporting efficient execution of certain kinds of data-dependent communication patterns [50, 67], the applicability of the static software DSM approach appears to remain fairly limited. Dynamic Approaches Dynamic software DSM systems typically support a more general programming model than their static counterparts, typically allowing multiple independent threads of control to operate ....
Shun-tak Leung and John Zahorjan. Improving the Performance of Runtime Parallelization. In Proceedings of the Fourth Symposium on Principles and Practice of Parallel Programming, pages 83--91, May 1993.
....an almost sequential execution schedule. It would therefore be beneficial to use the (iteration) data dependence graph (DDG) and, with the help of an efficient graph partitioning routine generate an optimal schedule. Somewhat similar techniques have been previously presented in the literature [6, 15, 21, 12, 5], but apply only to loops from which a proper inspector can be extracted. We will now present a technique that employs the R LRPD test to efficiently extract the necessary information and construct the DDG for any loop. We use the Sliding Window implementation of the R LRPD test to detect, window ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pages 83--91, May 1993.
....dependences in the presence of subscripted subscripts. Although more powerful analysis techniques could remove this last limitation when the index arrays are computed using only statically known values, nothing can be done at compile time when the index arrays are a function of the input data [19, 29, 36]. Most previous approaches to run time parallelization have concentrated on developing methods for constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. These ....
....parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. These methods are centered around the extraction of an inspector loop that analyzes the data access pattern off line, i.e. without side effects [8, 19, 22, 26, 27, 28, 29, 35, 36]. The inspection phase of these schemes usually yields a partitioning of the set of iterations into subsets that can be executed in parallel. These subsets, sometimes called wavefronts, are scheduled sequentially by placing synchronization barriers between them. Unfortunately the distribution of ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pages 83--91, May 1993.
....pattern is input data dependent. For example, most dependence analysis algorithms conservatively assume dependences when presented with non linear or subscripted subscript expressions. During the past few years, techniques have been developed for the run time analysis and scheduling of loops [5, 9, 13, 17, 20, 23, 25, 26, 27, 28, 29, 30, 33, 34]. The majority of this work has concentrated on developing run time methods for constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original, or source ....
....sequential code. Since compile time data dependence analysis techniques cannot be used on such programs, methods of performing the analysis at run time are required. Several techniques have been developed for the run time analysis and scheduling of loops with cross iteration dependences [5, 9, 13, 17, 20, 23, 28, 29, 30, 33, 34]. However, for various reasons, such techniques have not achieved wide spread use in current parallelizing compilers. In the following we describe a new run time scheme for constructing a parallel execution schedule for the iterations of a loop. The general structure of our method is similar to ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pp. 83--91, May 1993.
.... where [Clark et al. 92, von Hanxleden et al. 92] To reduce this overhead, the run time system can analyze a communication pattern once at run time, and amortize the cost of the analysis over many reuses of the pattern [Wu et al. 91] Also, researchers have optimized the analysis phase itself [Leung Zahorjan 93] However, we wish to reduce this overhead further, even in cases where the above techniques are not applicable. 4.1.3 Our Compilation Strategy We now describe the strategy our compiler uses to generate code for shared memory and distributed memory target architectures. Distributed Memory As ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83--91, July 1993.
....program. Several systems address this problem by parallelizing programs dynamically using information that is available only as the program runs. The inspector executor approach dynamically analyzes the values in index arrays to automatically parallelize computations that access irregular meshes [Leung and Zahorjan 1993; Saltz et al. 1991] The Jade implementation dynamically analyzes how tasks access data to exploit the concurrency in coarse grain parallel programs [Rinard et al. 1992] Speculative approaches optimistically execute loops in parallel, rolling back the computation if the parallel execution ....
LEUNG, S. AND ZAHORJAN, J. 1993. Improving the performance of runtime parallelization. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press, New York, NY, 83--91.
....out to have a large amount of parallelism. Software transformations are a possible way of extracting some parallelism from these codes. Some software schemes analyze the dependence structure of the code at run time and try to run parts of it in parallel protected by synchronization (for example [13]) Other software schemes speculatively run the code in parallel and later recover if a dependence violation is detected [5, 15] While these techniques are certainly promising, they all have various amounts of software overhead, which may limit their scalability. On the hardware side, there have ....
S.-T. Leung and J. Zahorjan. "Improving the Performance of Runtime Parallelization." Symp. on Principles and Practice of Parallel Programming, pages 83-91, May 1993.
....dependences in the presence of subscripted subscripts. Although more powerful analysis techniques could remove this last limitation when the index arrays are computed using only statically known values, nothing can be done at compile time when the index arrays are a function of the input data [20, 30, 38]. Most previous approaches to run time parallelization have concentrated on developing methods for constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. These ....
....conditions in parallel programs (see, e.g. [13, 24, 31]) However, these methods are generally not appropriate for run time loop parallelization since they are optimized for other purposes, e.g. for them minimizing memory requirements is more important than speed. i.e. without side effects [8, 20, 23, 26, 28, 29, 30, 37, 38, 12]. The inspection phase of these schemes usually yields a partitioning of the set of iterations into subsets that can be executed in parallel. These subsets, sometimes called wavefronts, are scheduled sequentially by placing synchronization barriers between them. Unfortunately the distribution of ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pages 83--91, May 1993.
....generally conservatively assume data dependences. Although more powerful analysis techniques could remove this last limitation when the index arrays are computed using only statically known values, nothing can be done at compile time when the index arrays are a function of the input data [12, 25, 28]. We will present the principles of the design and implementation of a compiling system that employs run time and classic techniques in tandem to automatically parallelize irregular, dynamic applications. We will show that run time optimizations always represent a tradeoff between a speculated ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pp. 83--91, May 1993.
....dependences in the presence of subscripted subscripts. Although more powerful analysis techniques could remove this last limitation when the index arrays are computed using only statically known values, nothing can be done at compile time when the index arrays are a function of the input data [22], [37], [49] A. Speculative doall parallelization In this paper we propose a novel framework for parallelizing do loops at run time. The proposed framework differs conceptually from previous methods in two major points. • Instead of finding a valid parallel execution schedule for the loop ....
.... of Saltz and Mirchandaney [35] in which processors are assigned iterations in a wrapped manner, and busy waits are used to ensure that values have been produced before they are used (again, this is only possible if the original loop has no output dependences) Recently, Leung and Zahorjan [22] have proposed some other methods of parallelizing the inspector of Saltz et al. These techniques are also restricted to loops with no output dependences. In sectioning, each processor computes an optimal parallel schedule for a contiguous set of iterations, and then the stages are concatenated ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pages 83--91, May 1993.
....constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Briefly, run time methods for parallelizing loops rely heavily on global synchronizations (communication) [13, 21, 26, 31, 35, 41, 43, 49], are applicable only to restricted types of loops [26, 41, 43], have significant sequential components [35, 41, 43] and/or do not extract the maximum available parallelism (they make conservative assumptions) [13, 26, 35, 41, 43, 49] The only method that manages to combine the most advantageous ....
....parallelization requires synchronization to ensure that the iterations are executed in the correct order. Briefly, run time methods for parallelizing loops rely heavily on global synchronizations (communication) [13, 21, 26, 31, 35, 41, 43, 49], are applicable only to restricted types of loops [26, 41, 43], have significant sequential components [35, 41, 43] and/or do not extract the maximum available parallelism (they make conservative assumptions) [13, 26, 35, 41, 43, 49] The only method that manages to combine the most advantageous features is that of [37] It does however rely on the ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pages 83--91, May 1993.
....the available loop level parallelism in a large number of cases, due to complex data access patterns in programs or inadequate level of static analysis. This has motivated efforts to complement compiler analysis with run time techniques to extract parallelism. A number of previous approaches [21, 11, 9, 16, 17, 10, 2] have focussed on constructing execution schedules to extract parallelism out of doacross loops, i.e. loops that need synchronization for parallelization. These techniques rely on an inspector computation, which pre processes the relevant data access patterns at run time to determine the ....
S. Leung and J. Zahorjan. Improving the performance of run-time parallelization. In Proc. ACM Symposium on Principles and Practices of Parallel Programming, pages 83--91, May 1993.
....dependences in the presence of subscripted subscripts. Although more powerful analysis techniques could remove this last limitation when the index arrays are computed using only statically known values, nothing can be done at compile time when the index arrays are a function of the input data [5, 16, 20]. In [12] we have presented the general principles of run time parallelization implementation. Briefly, such run time parallelization can be effective, i.e. obtain a large fraction of the available speedup, by reducing the associated run time overhead. This can be achieved through a careful ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In 4th PPOPP, pages 83--91, May 1993.
....be run fully or partially in parallel in an effective manner on Distributed Shared Memory (DSM) multiprocessors, some important codes would benefit significantly. To run these codes in a fully or partially parallel manner, software approaches based on an inspector executor pair have been proposed ([3, 11, 12, 16] for example) An inspector loop analyzes the data access patterns at run time and yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in parallel by the executor, with synchronization separating the wavefronts. In general, however, the ....
S.-T. Leung and J. Zahorjan. Improving the Performance of Runtime Parallelization. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83-91, May 1993.
....execution schedules for partially parallel loops. These are loops whose parallelization may require synchronization to ensure that the iterations are executed in the correct order. These methods are often based on the extraction of an inspector loop that analyzes the data access patterns ([5, 12, 13, 17] to name a few) The inspector usually yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in parallel by the executor, with barriers separating the wavefronts. Unfortunately, the inspector may be both computationally expensive and have ....
S.-T. Leung and J. Zahorjan. Improving the Performance of Runtime Parallelization. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83--91, May 1993.
....approaches have been proposed. These schemes use information available at run time to construct execution schedules that are partially parallel. The right schedule is forced with direct synchronization. These methods are often based on an inspector loop that analyzes the data access patterns ([4, 16, 19, 22] to name a few) If the loop is not fully parallel, the inspector usually yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in parallel by the executor, with barriers separating the wavefronts. This inspector executor method is also ....
S.-T. Leung and J. Zahorjan. Improving the Performance of Runtime Parallelization. In 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pp. 83--91, May 1993.
....program. Several systems address this problem by parallelizing programs dynamically using information that is available only as the program runs. The inspector executor approach dynamically analyzes the values in index arrays to automatically parallelize computations that access irregular meshes [26, 37]. The Jade implementation dynamically analyzes how tasks access data to exploit the concurrency in coarse grain parallel programs [34] Speculative approaches optimistically execute loops in parallel, rolling back the computation if the parallel execution violates the data dependences [32] A ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83--91, San Diego, CA, May 1993.
....data dependences in the presence of subscripted subscripts. More powerful analysis techniques could remove this last limitation when the index arrays are computed using only statically known values. However, nothing can be done at compile time when the index arrays are a function of the input data [13, 21, 29]. Run time techniques have been used practically from the beginning of parallel computing. During the 1960s, relatively simple run time techniques, used to detect parallelism between scalar operations, were implemented in the hardware of the CDC 6600 and the IBM 360/91 [23, 24] Some of today's ....
....either the original serial loop or its parallel version. The boolean expression in the if statement typically tests the value of a scalar variable. During the last few years, new techniques have been developed for the run time analysis and scheduling of loops with cross iteration dependences [5, 13, 16, 19, 20, 21, 28, 29]. Most of this work has focussed on developing run time methods for constructing parallel schedules for DOACROSS loops. Unfortunately, these methods have significant sequential components, rely heavily on global synchronizations (communication) or do not extract the maximum available parallelism ....
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proc. 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP), pages 83--91, May 1993.
....degree and granularity. Section 5 concludes the paper with a summary of evaluation results. 2 Run time Parallelization Techniques In the past, many run time parallelization algorithms have been developed for different types of loops on both shared memory and distributed memory machines [6, 9, 14]. Most of the algorithms follow a so called INSPECTOR EXECUTOR approach. With this approach, a loop under consideration is transformed at compile time into an inspector routine and an executor routine. At run time, the inspector detects cross iteration dependences and produces a parallel schedule; ....
....the loop operations according to the wavefronts of iterations. Note that the inspector in the above scheme is sequential. It requires time commensurate with that of a serial loop execution. Parallelization of the inspector loop was also investigated by Saltz, et al. 15] and Leung and Zahorjan [9]. Their techniques respect flow dependences, but ignore anti flow and output dependences. Most recently, Rauchwerger, Amato and Padua presented a parallel inspector algorithm for a general form of loops [13] They extracted the function of scheduling and explicitly presented an inspector ....
S.-T. Leung and J. Zahorjan. "Improving the performance of runtime parallelization". In 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 83--91, May 1993.
....methods for constructing execution schedules for partially parallel loops. These are loops whose parallelization may require synchronization to ensure that the iterations are executed in the correct order. These methods are often based on an inspector loop that analyzes the data access patterns ([4, 10, 13, 15] to name a few) The inspector usually yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in 1 This work was supported in part by the National Science Foundation under grants NSF Young Investigator Award MIP 9457436, ASC 9612099 and ....
S.-T. Leung and J. Zahorjan. Improving the Performance of Runtime Parallelization. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83--91, May 1993.
....execution may enhance the performance of these techniques by simplifying and improving the memory reference behavior. Several speculative and run time parallelization methods have been proposed to attempt parallel execution of loops that cannot be analyzed sufficiently accurately at compile time [12, 15]. Like cascaded execution, these techniques make use of processors that would otherwise be idle if the compiler resorted to simple, sequential execution. In cases where enough parallelism is available at run time to overcome the overheads associated with run time parallelization, or when memory ....
S.-T. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83--91, San Diego, CA, May 1993.
No context found.
# S. Leung and J. Zahorjan, "Improving the Performance of Runtime Parallelization," Proc. Fourth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP), pp. 83--91, May 1993.
No context found.
S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83--91, San Diego, CA, May 1993.
No context found.
S. Leung and J. Zahorjan, Improving the performance of runtime parallelization, In Proc. 4th ACM SigPlan Symp. Prin. Pract. Parall. Prog., pages 83--91, 1993.