31 citations found.
# D.K. Chen, P.C. Yew, and J. Torrellas, "An Efficient Algorithm for the Run-Time Parallelization of doacross Loops," Proc. Supercomputing 1994.

This paper is cited in the following contexts:
Exploiting Locality in the Run-Time Parallelization.. - Martín, Singh.. (2002)

....In this strategy, inspector and executor are uncoupled. Leung and Zahorjan [12] extend the previous work to consider output dependences and propose different strategies to parallelize the inspector. These proposals exploit iteration level parallelism. A different approach can be found in [2], where finer grain parallelism (operation level) is exploited also using an inspector executor method called CYT algorithm. Dependences are analyzed in the inspector phase; if the indirection arrays do not change between invocations of the loop, then the inspector can be reused. In the executor ....

....patterns and computational workloads. This work shows experimentally that operation-level methods outperform iteration-level methods. This paper describes two new operation-level algorithms: Local CYT (LCYT from now on) and Low Overhead LCYT (LO LCYT). They are based on the approach developed in [2], but they use different iteration distribution schemes to exploit data locality. The effectiveness of our algorithms is assessed on an SGI Origin 2000, and the results are compared with those obtained in the CYT proposal. 3. LCYT algorithms. Our methods are split into two phases: inspector, where ....

[Article contains additional citation context not shown here]

D.-K. Chen, J. Torrellas, and P.-C. Yew. An Efficient Algorithm for the Run-Time Parallelization of DOACROSS Loops. In Supercomputing Conference, pages 518--527, Washington DC, 1994.
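The reuse condition quoted above (if the indirection arrays do not change between invocations of the loop, the inspector can be reused) can be sketched as a small cache in front of the inspector. This is a minimal Python sketch under stated assumptions: `CachedInspector` and `toy_inspector` are hypothetical names, the cache key is a full copy of the indirection arrays, and the toy inspector only stands in for a real dependence analysis such as CYT's.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class CachedInspector:
    """Hypothetical wrapper: reuse the last schedule while the indirection
    arrays are unchanged, and re-run the inspector otherwise."""

    def __init__(self, inspector: Callable[..., List[int]]) -> None:
        self.inspector = inspector
        self._key: Optional[Tuple[Tuple[int, ...], ...]] = None
        self._schedule: Optional[List[int]] = None
        self.runs = 0  # number of times the inspector actually executed

    def schedule(self, *index_arrays: Sequence[int]) -> List[int]:
        # Snapshot the indirection arrays; comparing snapshots costs O(n),
        # which is assumed to be much cheaper than a full dependence analysis.
        key = tuple(tuple(a) for a in index_arrays)
        if key != self._key:
            self._schedule = self.inspector(*index_arrays)
            self._key = key
            self.runs += 1
        return self._schedule


def toy_inspector(index: Sequence[int]) -> List[int]:
    """Toy stand-in: iteration i must follow the previous iteration that
    touched index[i]; returns the earliest wavefront of each iteration."""
    last = {}
    stages = []
    for x in index:
        s = last.get(x, -1) + 1
        last[x] = s
        stages.append(s)
    return stages


cache = CachedInspector(toy_inspector)
idx = [0, 1, 0, 2]
first = cache.schedule(idx)   # inspector runs
again = cache.schedule(idx)   # same arrays: previous schedule reused
idx[1] = 0                    # indirection array changed between invocations
third = cache.schedule(idx)   # inspector runs again
print(first, third, cache.runs)  # [0, 0, 1, 0] [0, 1, 2, 0] 2
```

A version counter bumped on every write to the indirection arrays would avoid the O(n) comparison, at the cost of instrumenting those writes.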


New OPENMP Directives for Irregular Data Access Loops - Labarta, Ayguadé.. (2000)   (1 citation)

....analysis is restricted) to runtime. In most of the reviewed implementations, at runtime, the inspector builds data dependence graphs based on the addresses accessed by the loop and, based on those graphs, schedules iterations in wavefronts (sets of iterations which are dependence-free among them) [1, 2, 3, 4, 5, 6, 7]. Many of these implementations can be applied by the compiler to implement the semantics of the indirect clause applied to an ordered directive. 6. Experiments. The results presented here come from a code whose major computational loop is of the same form as that discussed in Section 2. The ....

D. K. Chen, P. C. Yew and J. Torrellas, "An efficient algorithm for the run-time parallelization of doacross loops", In proceedings of Supercomputing '94, Nov. 1994.
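The inspector behaviour quoted above (derive dependences from the addresses each iteration accesses, then group iterations into dependence-free wavefronts) can be sketched as follows. This is a minimal, conservative Python illustration, not any of the cited implementations: it assumes each iteration's accessed addresses are known before execution, treats any two accesses to the same address as conflicting (reads included), and uses the hypothetical name `wavefronts_conservative`.

```python
from typing import Dict, List, Sequence


def wavefronts_conservative(accesses: Sequence[Sequence[int]]) -> List[List[int]]:
    """accesses[i] lists the addresses touched by iteration i.

    Conservative rule (assumption): any two accesses to one address conflict.
    Each iteration goes into the earliest wavefront after every earlier
    iteration that touched one of its addresses.
    """
    last: Dict[int, int] = {}  # address -> wavefront of latest iteration touching it
    waves: List[List[int]] = []
    for i, addrs in enumerate(accesses):
        s = 1 + max((last.get(x, -1) for x in addrs), default=-1)
        for x in addrs:
            last[x] = s
        if s == len(waves):  # s never exceeds the number of wavefronts so far
            waves.append([])
        waves[s].append(i)
    return waves


# Iterations 0 and 1 are independent; 2 conflicts with 0, and 3 conflicts with 1.
waves = wavefronts_conservative([[0], [1], [0], [2, 1]])
print(waves)  # [[0, 1], [2, 3]]
```

Iterations inside one returned list touch disjoint addresses, so an executor may run them concurrently and separate consecutive lists with a barrier.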


Run-Time Methods for Parallelizing Partially Parallel Loops - Rauchwerger, Amato, Padua (1995)   (18 citations)

....pattern is input data dependent. For example, most dependence analysis algorithms conservatively assume dependences when presented with non linear or subscripted subscript expressions. During the past few years, techniques have been developed for the run time analysis and scheduling of loops [5, 9, 13, 17, 20, 23, 25, 26, 27, 28, 29, 30, 33, 34]. The majority of this work has concentrated on developing run time methods for constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original, or source ....

....sequential code. Since compile time data dependence analysis techniques cannot be used on such programs, methods of performing the analysis at run time are required. Several techniques have been developed for the run time analysis and scheduling of loops with cross iteration dependences [5, 9, 13, 17, 20, 23, 28, 29, 30, 33, 34]. However, for various reasons, such techniques have not achieved widespread use in current parallelizing compilers. In the following we describe a new run time scheme for constructing a parallel execution schedule for the iterations of a loop. The general structure of our method is similar to ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proc. of Supercomputing 1994, pp. 518--527, Nov. 1994.


OPENMP Directives for Irregular Data Access Loops - Labarta, Ayguade, Oliver, Henty

....analysis is restricted) to runtime. In most of the reviewed implementations, at runtime, the inspector builds data dependence graphs based on the addresses accessed by the loop and, based on those graphs, schedules iterations in wavefronts (sets of iterations which are dependence free among them) [1, 2, 3, 4, 5, 6, 7]. Many of these implementations can be applied by the compiler to implement the semantics of the indirect clause applied to an ordered directive. 6. Experiments. The results presented here come from a code whose major computational loop is of the same form as that discussed in Section 2. The ....

D. K. Chen, P. C. Yew and J. Torrellas, "An efficient algorithm for the run-time parallelization of doacross loops", In proceedings of Supercomputing '94, Nov. 1994.


EXPLORER: Supporting Run-time Parallelization of DO-ACROSS Loops .. - Liu, King

....in EXPLORER were implemented using UNIX sockets through TCP/IP. which supported the Pthread package. The communications are designed to be implemented using UNIX sockets on top of TCP/IP protocols. In the experiments we used a synthetic loop, which is shown in Fig. 4 and modified from that in [2]. The iteration count N is set to half of the size of array a. The parameter r is the number of references to array a per iteration, and W is the workload per iteration. The base value of W was taken to be one fourth of the message startup time in our system. The array INDEX controls the locations ....

D. K. Chen, J. Torrellas, and P. C. Yew, "An efficient algorithm for the run-time parallelization of DOACROSS loops," Proc. of Supercomputing 1994, pp.518-527, November 1994.


Coarse-Grained Thread Pipelining - A Speculative Parallel.. - Kazi, Lilja

....the necessary dependence analysis must be performed at run time to be able to exploit whatever parallelism may be available. Several run time parallelization schemes have been proposed that can improve the performance of application programs that would otherwise have to be executed sequentially [1, 8, 9, 16]. These approaches typically use an inspector phase to determine the dependences that actually exist at run time, followed by an executor phase that actually ....

....active simultaneously, the worst case overhead due to misspeculation will be $W(P-1) \times T_{wb}$. Thus, in the worst case, the cost of misspeculation is determined by the work done within each thread and the total number of threads initiated. 5. Related Work. A number of run time schemes [1, 8, 9, 10, 16] have been developed to exploit medium to coarse grained loop-level parallelism in programs in which the parallelism cannot be detected at compile time. In general, these schemes consist of an inspector stage followed by an executor stage. The inspector determines the dependence relations among ....

[Article contains additional citation context not shown here]

D.K. Chen, P.C. Yew, and J. Torrellas, An Efficient Algorithm for the Run-time Parallelization of Doacross Loops, Proceedings of Supercomputing 1994, pp. 518-527, Nov. 1994.


Coarse-Grained Speculative Execution in Shared-Memory.. - Kazi (1998)   (12 citations)

....the necessary dependence analysis must be performed at runtime to be able to exploit whatever parallelism may be available. Several run time parallelization schemes have been proposed that can improve the performance of application programs that would otherwise have to be executed sequentially [1,5,6,10,11]. These approaches typically use an inspector phase to determine the dependences that actually exist at run time, followed by an executor phase that actually performs the computation. Although the executor phase can often be run speculatively at the same time as the inspector phase, these ....

....multiprocessor. In this chapter, we also compare the coarse grained thread pipelining model with other run time parallelization schemes based on the performance results. Finally, Section 6 summarizes our results and conclusions. Chapter 2. Related Work. A number of run time schemes [1,5,6,10,11] have been developed to exploit medium to coarse grained loop level parallelism in programs in which the parallelism cannot be detected at compile time. In general, these schemes consist of an inspector stage followed by an executor stage. The inspector determines the dependence relations among ....

[Article contains additional citation context not shown here]

D.K. Chen, P.C. Yew, and J. Torrellas. An Efficient Algorithm for the Run-time Parallelization of Doacross Loops. In Proceedings of Supercomputing 1994, pp. 518-527, Nov. 1994.


The LRPD Test: Speculative Run-Time Parallelization of.. - Rauchwerger, Padua (1995)   (57 citations)

....conditions in parallel programs (see, e.g., [13, 24, 31]). However, these methods are generally not appropriate for run time loop parallelization since they are optimized for other purposes, e.g. for them, minimizing memory requirements is more important than speed. i.e. without side effects [8, 20, 23, 26, 28, 29, 30, 37, 38, 12]. The inspection phase of these schemes usually yields a partitioning of the set of iterations into subsets that can be executed in parallel. These subsets, sometimes called wavefronts, are scheduled sequentially by placing synchronization barriers between them. Unfortunately the distribution of ....

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, pages 518--527, Nov. 1994.


The LRPD Test: Speculative Run-Time Parallelization of.. - Rauchwerger, Padua (1995)   (57 citations)

....such iteration in a shadow version of that variable. By using separate shadow variables to process the read and write operations, Midkiff and Padua [27] improved this basic method so that concurrent reads from a memory location are allowed in multiple iterations. Recently, Chen, Yew and Torrellas [13] proposed another variant of the Zhu and Yew method which improves performance in the presence of hot spots (i.e. many accesses to the same memory location) by first doing some of the computation in private storage. Xu and Chaudhary [46, 45] improve upon [13] by not serializing on multiple ....

....Recently, Chen, Yew and Torrellas [13] proposed another variant of the Zhu and Yew method which improves performance in the presence of hot spots (i.e. many accesses to the same memory location) by first doing some of the computation in private storage. Xu and Chaudhary [46, 45] improve upon [13] by not serializing on multiple reads to the same location. All of the above mentioned methods construct maximal stages in the sense that each iteration is placed in the earliest possible stage, giving a minimal depth schedule, i.e. a minimal number of stages. Polychronopoulos [30] gives a ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, pages 518--527, Nov. 1994.
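The contexts above contrast serializing every access to a memory location with not serializing multiple reads of it, and describe schedules that place each iteration in its earliest possible stage. The following Python sketch computes such earliest stages while letting consecutive reads of an address share a stage. It is an illustration in the spirit of that refinement, not Xu and Chaudhary's published algorithm; `stages_shared_reads` is a hypothetical name, and each iteration's accesses are assumed to be given as separate read and write address lists.

```python
from typing import Dict, List, Sequence


def stages_shared_reads(reads: Sequence[Sequence[int]],
                        writes: Sequence[Sequence[int]]) -> List[int]:
    """Earliest stage of each iteration, in original iteration order.

    Reads wait only for the latest earlier write of the same address (flow
    dependence); writes also wait for earlier reads (anti) and writes (output).
    """
    last_write: Dict[int, int] = {}  # address -> stage of the latest write
    last_read: Dict[int, int] = {}   # address -> latest stage reading it since that write
    stages: List[int] = []
    for rs, ws in zip(reads, writes):
        s = 0
        for x in rs:
            s = max(s, last_write.get(x, -1) + 1)
        for x in ws:
            s = max(s, last_write.get(x, -1) + 1, last_read.get(x, -1) + 1)
        for x in rs:
            last_read[x] = max(last_read.get(x, -1), s)
        for x in ws:
            last_write[x] = s
            last_read.pop(x, None)  # earlier reads all precede this write's stage
        stages.append(s)
    return stages


# Iteration 0 writes address 0, iterations 1 and 2 read it, iteration 3 rewrites it.
stages = stages_shared_reads(reads=[[], [0], [0], []], writes=[[0], [], [], [0]])
print(stages)  # [0, 1, 1, 2]
```

Treating the two reads as conflicting would instead yield stages 0, 1, 2, 3; letting them share a stage shortens the schedule from four stages to three.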


Run-Time Parallelization: Its Time Has Come - Rauchwerger (1998)   (3 citations)

....constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Briefly, run time methods for parallelizing loops rely heavily on global synchronizations (communication) [13, 21, 26, 31, 35, 41, 43, 49], are applicable only to restricted types of loops [26, 41, 43], have significant sequential components [35, 41, 43], and/or do not extract the maximum available parallelism (they make conservative assumptions) [13, 26, 35, 41, 43, 49]. The only method that manages to combine the most advantageous ....

.... rely heavily on global synchronizations (communication) [13, 21, 26, 31, 35, 41, 43, 49], are applicable only to restricted types of loops [26, 41, 43], have significant sequential components [35, 41, 43], and/or do not extract the maximum available parallelism (they make conservative assumptions) [13, 26, 35, 41, 43, 49]. The only method that manages to combine the most advantageous features is that of [37]. It does, however, rely on the availability of an inspector loop, which is not a generally applicable technique. A high level comparison of the various methods is given in Table 2. ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, pages 518--527, Nov. 1994.


Techniques for Speculative Run-Time Parallelization of Loops - Gupta, Nim (1998)   (14 citations)

....the available loop level parallelism in a large number of cases, due to complex data access patterns in programs or inadequate level of static analysis. This has motivated efforts to complement compiler analysis with run time techniques to extract parallelism. A number of previous approaches [21, 11, 9, 16, 17, 10, 2] have focussed on constructing execution schedules to extract parallelism out of doacross loops, i.e. loops that need synchronization for parallelization. These techniques rely on an inspector computation, which pre processes the relevant data access patterns at run time to determine the ....

D.K. Chen, P.C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proc. Supercomputing '94, pages 518--527, November 1994.


Run-Time Methods For Parallelizing Do Loops - Rauchwerger, Padua

....in A_np, i.e. mark it as non-privatizable. (c) Count the total number of write accesses to A that are marked in this iteration, and store the result in tw_i(A), where i is the iteration number.

    S1: DO i = 1, n
    S2:   A[R1[i]]
    S3:   A[W[i]]
    S4:   A[R2[i]]
    S5: ENDDO

    R1[1:8] = [ 2  2  2 10  8  8  8 10]
    W[1:8]  = [ 1  3  5  4  7  3  6 12]
    R2[1:8] = [ 1  3  2 10  7  3  8 12]

    Position in shadow arrays   1  2  3  4  5  6  7  8  9 10 11 12   tw(A)  tm(A)
    Aw[1:12]                    1  0  1  1  1  1  1  0  0  0  0  1     8      7
    Ar[1:12]                    0  1  0  0  0  0  0  1  0  1  0  0
    Anp[1:12]                   0  1  0  0  0  0  0  1  0  1  0  0
    Aw[1:12] ∧ Ar[1:12]         0  0  0  0  0  0  0  0  0  0  0  0
    Aw[1:12] ∧ Anp[1:12]        0  0  0  0  0  0  0  0  0 ....

D. K. Chen, P. C. Yew, and J. Torrellas, An efficient algorithm for the run-time parallelization of doacross loops, manuscript, 1994.
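The marking phase and final analysis illustrated by the example above can be sketched in Python. This is a simplified sketch, not the published LRPD test: it assumes S2 and S4 are reads and S3 is a write (consistent with the example's shadow arrays), marks a read in Ar and Anp only when the element has not yet been written in the same iteration, and applies a decision sequence assumed to follow the LRPD analysis (Aw ∧ Ar, then tw = tm, then Aw ∧ Anp). `lrpd_mark` and `lrpd_verdict` are hypothetical names, and the paper's exact marking rules may differ in cases this example does not exercise.

```python
from typing import List, Sequence, Tuple


def lrpd_mark(r1: Sequence[int], w: Sequence[int], r2: Sequence[int],
              size: int) -> Tuple[List[int], List[int], List[int], int, int]:
    """Mark shadow arrays for: DO i; read A[R1[i]]; write A[W[i]]; read A[R2[i]].

    Subscripts are 1-based as in the example; slot 0 of each shadow array is unused.
    """
    aw, ar, anp = [0] * (size + 1), [0] * (size + 1), [0] * (size + 1)
    tw = 0  # sum over iterations of the distinct elements written in each
    for i in range(len(w)):
        written = set()  # elements already written in iteration i
        for is_write, x in ((False, r1[i]), (True, w[i]), (False, r2[i])):
            if is_write:
                if x not in written:  # count each element once per iteration
                    written.add(x)
                    aw[x] = 1
                    tw += 1
            elif x not in written:    # read not covered by a write earlier in i
                ar[x] = 1
                anp[x] = 1            # so the element is not privatizable
    tm = sum(aw)                      # number of distinct elements written overall
    return aw[1:], ar[1:], anp[1:], tw, tm


def lrpd_verdict(aw, ar, anp, tw, tm) -> str:
    """Assumed decision sequence for the shadow arrays produced above."""
    if any(a and b for a, b in zip(aw, ar)):
        return "not a doall"
    if tw == tm:
        return "doall"
    if any(a and b for a, b in zip(aw, anp)):
        return "not a doall"
    return "doall after privatization"


R1 = [2, 2, 2, 10, 8, 8, 8, 10]
W = [1, 3, 5, 4, 7, 3, 6, 12]
R2 = [1, 3, 2, 10, 7, 3, 8, 12]
aw, ar, anp, tw, tm = lrpd_mark(R1, W, R2, size=12)
print(aw, tw, tm)  # [1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1] 8 7
print(lrpd_verdict(aw, ar, anp, tw, tm))  # doall after privatization
```

Here tw(A) = 8 exceeds tm(A) = 7 because element 3 is written in iterations 2 and 6; since no written element is also an exposed read, privatizing A removes that output dependence.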


JavaSpMT: A Speculative Thread Pipelining Parallelization Model .. - Kazi, Lilja (2000)   (1 citation)

....structures, including do while loops and loops with cross iteration dependences that cannot be resolved statically. Although there are no other existing Java parallelization techniques that support run time data dependence checking or speculative execution, a number of inspector executor schemes [5, 14, 15, 19] have been developed for parallelization of compiled languages, such as Fortran and C. These techniques exploit medium to coarsegrained loop level parallelism in programs in which the parallelism cannot be detected at compile time. These approaches differ in the types of dependence patterns they ....

D.K. Chen, P.C. Yew, and J. Torrellas, An Efficient Algorithm for the Run-time Parallelization of Doacross Loops, Proceedings of Supercomputing 1994, Nov. 1994, pp. 518-527.


Automating Runtime Optimizations For Parallel Object-Oriented.. - Krishnan

....use profile information to accurately find the cost of various computation and communication operations. Parallelizing compilers for loop based programs have explored the use of runtime dependence analysis and parallelism detection (e.g. using inspector and executor code) and speculative execution [47, 48, 49, 50], especially for irregular programs with input dependent data access patterns. However, such work has been restricted to loop-based scientific programs, usually running on shared memory multiprocessors. To the best of our knowledge, our framework is one of the first efforts in parallel ....

D. K. Chen, J. Torrellas, and P. C. Yew. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, November 1994.


Coarse-Grained Speculative Execution in Shared-Memory.. - Kazi, Lilja (1998)   (12 citations)

....much of the available loop level parallelism in application programs due to the inherent limitations of compile information. Several run time parallelization schemes have been proposed that can improve the performance of application programs that would otherwise have to be executed sequentially [1,5,6,10,11]. These approaches typically use an inspector phase to determine the dependences that actually exist at run time, followed by an executor phase that actually performs the computation. Although the executor phase can often be run speculatively at the same time as the inspector phase, these ....

....threads will complete their computation stages and begin their writeback stages when they detect the abort command. So, in the worst case, the overhead due to incorrect speculation will be the time for the last thread to complete its computation stage. 5. Related Work. A number of run time schemes [1,5,6,10,11] have been developed to exploit medium to coarse grained loop level parallelism in programs in which the parallelism cannot be detected at compile time. In general, these schemes consist of an inspector stage followed by an executor stage. The inspector determines the dependence relations among ....

[Article contains additional citation context not shown here]

D.K. Chen, P.C. Yew, and J. Torrellas. An Efficient Algorithm for the Run-time Parallelization of Doacross Loops. In Proceedings of Supercomputing 1994, pp. 518-527, Nov. 1994.


Effects of Parallelism Degree on Run-Time Parallelization of Loops - Xu (1998)   (1 citation)

....degree and granularity. Section 5 concludes the paper with a summary of evaluation results. 2 Run time Parallelization Techniques In the past, many run time parallelization algorithms have been developed for different types of loops on both shared memory and distributed memory machines [6, 9, 14]. Most of the algorithms follow a so called INSPECTOR EXECUTOR approach. With this approach, a loop under consideration is transformed at compile time into an inspector routine and an executor routine. At run time, the inspector detects cross iteration dependences and produces a parallel schedule; ....

....analysis of cross iteration dependences, tight coupling of the dependence analysis and the executor causes high synchronization overhead in the executor. Most recently, Chen, et al. developed the DOACROSS technique by decoupling the function of the dependence analysis from the executor [6]. We refer to their technique as the CTY algorithm. Separation of the inspector and executor not only reduces synchronization overhead in the executor, but also provides the possibility of reusing the dependence information developed in the inspector across multiple invocations of the same ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. "An efficient algorithm for the run-time parallelization of doacross loops". In Proc. of Supercomputing 1994, pages 518-527, Nov. 1994.


Run-Time Parallelization: A Framework For Parallel Computation - Lawrence Rauchwerger (1995)   (8 citations)

....not possible because inter procedural analysis generates extremely complex expressions that are in the end intractable although statically defined. This thesis will present several new run time methods; some are improvements of previous ones [BS90, LZ93, MP87, SM91, SMC89, SMC91, WSHB91, ZY87, CYT94] and others represent totally new approaches. They collectively constitute an effective framework for run time parallelization. In the next sections we will be mostly concerned with the parallelism detection side of the compiler. We will narrow the scope of our discussion to loop level ....

....fully parallel loops when the two most important transformations are applied. This method is centered around the possibility of extracting an inspector loop that analyzes the data access pattern off line, i.e. without side effects [BS90, LZ93, MP87, RP94a, SM91, SMC89, SMC91, WSHB91, ZY87, CYT94]. Unfortunately the extraction of an inspector loop that can traverse the access pattern without actually having to perform the data computation is often not possible: if the address computation of the array under test depends on the actual data computation, as exemplified by Fig. 2.1(a), then ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, pages 518--527, Nov. 1994.


The Privatizing DOALL Test: A Run-Time Technique for DOALL.. - Rauchwerger, Padua (1994)   (23 citations)

....either the original serial loop or its parallel version. The boolean expression in the if statement typically tests the value of a scalar variable. During the last few years, new techniques have been developed for the run time analysis and scheduling of loops with cross iteration dependences [7, 11, 16, 19, 22, 25, 27, 28, 29, 36, 37]. The majority of this work has concentrated on developing run time methods for constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Most of these schemes ....

....such iteration in a shadow version of that variable. By using separate shadow variables to process the read and write operations, Midkiff and Padua [22] improved this basic method so that concurrent reads from a memory location are allowed in multiple iterations. Recently, Chen, Yew and Torrellas [11] proposed another variant of the Zhu and Yew method which improves performance in the presence of hot spots (i.e. many accesses to the same memory location) by first doing some of the computation in private storage. All of the above mentioned methods construct maximal stages in the sense that ....

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. manuscript, 1994.


Run-time parallelization of irregular DOACROSS loops - Thulasiraman, Krothapalli, .. (1995)

....For loops not amenable to compile time parallelization, we can still perform the dependence analysis at run time and maybe execute the loop in parallel. This approach is called run time parallelization. Much previous research has been done to design effective run time parallelization algorithms [2, 4, 5, 6, 8, 9, 12, 17]. The main differences among the schemes proposed are the types of dependence patterns that are handled and the required system or architecture support. The key to success in these schemes is to minimize the time spent on dependency analysis and on process synchronization. Indeed, if too much time ....

....success in these schemes is to minimize the time spent on dependency analysis and on process synchronization. Indeed, if too much time is spent on these overheads, it is better to run the loop serially. In general, run time parallelization schemes have two stages, namely inspector and the executor [17, 6, 12, 2]. The inspector determines the dependence relations among the data accesses. The executor uses this information to execute the iterations in parallel in an order that preserves the dependences. In this paper, we describe and evaluate a new algorithm for the run time parallelization of DOACROSS ....

[Article contains additional citation context not shown here]

D.-K. Chen, J. Torrellas and P.-C. Yew, An efficient algorithm for the run-time parallelization of DOACROSS Loops, Supercomputing 1994.


Time-Stamping Algorithms For Parallelization of Loops at Run-Time - Xu, Chaudhary (1997)   (2 citations)

.... circuit simulation, CHARMM and DISCOVER for molecular dynamics simulation of organic systems, and FIDAP for modeling complex fluid flows [2]. In the past, many run time parallelization algorithms have been developed for different types of loops on both shared memory and distributed memory machines [4, 10, 6]. Most of the algorithms follow a so called INSPECTOR EXECUTOR approach. With this approach, a loop under consideration is transformed at compile time into an inspector routine and an executor routine. At run time, the inspector examines loop carried dependencies between the statements in the ....

....analysis of cross iteration dependencies, tight coupling of the dependence analysis and the executor incurs high synchronization overhead in the executor. Most recently, Chen, Torrellas and Yew developed the DOACROSS technique by decoupling the function of the dependence analysis from the executor [4]. Separation of inspector and executor not only reduces synchronization overhead in the executor, but also provides the possibility of reusing the dependence information developed in the inspector across multiple invocations of the same loop. Their inspector is parallel, but sacrifices concurrent ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. "An efficient algorithm for the run-time parallelization of doacross loops". In Proc. of Supercomputing 1994, pp. 518-527, Nov. 1994.


Implementation Of Run Time Techniques In The Polaris Fortran.. - Lawrence (1996)   (4 citations)

....and output dependences. Their method dynamically allocates additional storage to break the output dependences and maps the array references to the new storage using indirection. Flow dependences are enforced using synchronization on the data elements with full empty bits. Chen, Yew, and Torrellas [8] have proposed a method which performs part of the work of an inspector in private storage. Each processor builds a list of all accesses to each memory location. Then the lists are linked across processors and analyzed in a manner similar to that proposed by Zhu and Yew [36]. The loop is ....

D. Chen, P. Yew, J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, pages 518-527, November 1994.
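The two-step idea described above (each processor builds per-location access lists in private storage, then the lists are linked across processors) can be sketched as follows. This is a loose, sequential Python illustration under assumptions, not the CYT algorithm: iterations are block-distributed, the "processors" are simulated one after another, every access to a location is treated as conflicting (a simplification with no read/write distinction), and `local_access_lists` and `link_and_analyze` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def local_access_lists(accesses: Sequence[Sequence[int]], lo: int, hi: int) -> Dict[int, List[int]]:
    """Per-processor step: for iterations lo..hi-1, record which iterations
    touch each address, in iteration order (private storage only)."""
    lists: Dict[int, List[int]] = defaultdict(list)
    for i in range(lo, hi):
        for addr in accesses[i]:
            if not lists[addr] or lists[addr][-1] != i:  # one entry per iteration
                lists[addr].append(i)
    return lists


def link_and_analyze(accesses: Sequence[Sequence[int]], nprocs: int) -> List[Set[int]]:
    n = len(accesses)
    # Block distribution: processor p owns a contiguous range of iterations.
    bounds = [(p * n // nprocs, (p + 1) * n // nprocs) for p in range(nprocs)]
    private = [local_access_lists(accesses, lo, hi) for lo, hi in bounds]

    # Linking the private lists in processor order gives, for each address,
    # every touching iteration in the original sequential order.
    linked: Dict[int, List[int]] = defaultdict(list)
    for lists in private:
        for addr, its in lists.items():
            linked[addr].extend(its)

    # Each iteration waits for its predecessor on every address it touches.
    waits: List[Set[int]] = [set() for _ in range(n)]
    for its in linked.values():
        for prev, cur in zip(its, its[1:]):
            waits[cur].add(prev)
    return waits


waits = link_and_analyze([[0], [1], [0, 2], [2], [1]], nprocs=2)
print(waits)  # [set(), set(), {0}, {2}, {1}]
```

Because each block of iterations is contiguous, simple concatenation in processor order is enough to recover the sequential access order per location.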


A Scalable Method for Run-Time Loop Parallelization - Rauchwerger, Amato, Padua (1995)   (4 citations)

....consist of an if statement that selects either the original serial loop or its parallel version. The boolean expression in the if statement typically tests the value of a scalar variable. During the last few years, techniques have been developed for the run time analysis and scheduling of loops [5, 9, 16, 21, 24, 28, 33, 34, 30, 31, 32, 35, 42, 43]. The majority of this work has concentrated on developing run time methods for constructing execution schedules for partially parallel loops, i.e. loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original, or source ....

....compile time data dependence analysis techniques cannot be used on such programs, methods of performing the analysis at run time are required. During the past few years, several techniques have been developed for the run time analysis and scheduling of loops with cross iteration dependences [5, 9, 16, 21, 24, 28, 33, 34, 35, 42, 43]. However, for various reasons, such techniques have not achieved widespread use in current parallelizing compilers. In the following we describe a new run time scheme for constructing a parallel execution schedule for the iterations of a loop. The general structure of our method is similar to ....

[Article contains additional citation context not shown here]

D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proceedings of Supercomputing 1994, pages 518--527, Nov. 1994.


Run-Time Parallelization Of Irregular Doacross Loops - Jeyaraman, Krothapalli.. (1996)

....to remain constant during the execution of one invocation of the loop, although they are allowed to change outside the loop. In order to parallelize the loop, it is necessary to perform run time analysis. Much previous research has been done to design effective run time parallelization algorithms [2,3,4,5,6,7,8,9,10]. The main differences among the schemes proposed are the types of dependence patterns that are handled and the required system or architecture support. In general, run time parallelization schemes have two stages, namely the inspector and the executor [2,6,7,10]. In our scheme, the inspector ....

....algorithms [2,3,4,5,6,7,8,9,10]. The main differences among the schemes proposed are the types of dependence patterns that are handled and the required system or architecture support. In general, run time parallelization schemes have two stages, namely the inspector and the executor [2,6,7,10]. In our scheme, the inspector inspects the various dependencies existing between the iterations and constructs an iteration schedule. This involves reordering the iterations. The entire set of iterations is partitioned into subsets called wavefronts. The executor is a transformed version of the ....

D.-K. Chen, J. Torrellas and P.-C. Yew, An efficient algorithm for the run-time parallelization of DOACROSS Loops, Proceedings of Supercomputing, 1994.
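The inspector/executor structure described in this context (an inspector partitions the iterations into wavefronts, and the executor runs each wavefront in parallel) can be illustrated with a small sketch. This is a generic iteration-level wavefront scheme, not the algorithm of the cited paper or of the citing paper. It assumes that iteration i reads A[reads[i]] and then writes A[writes[i]]. The names `reads`, `writes`, `body`, `build_wavefronts`, `group_wavefronts` and `run_wavefronts` are hypothetical. Python threads only show the structure; they give no real speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_wavefronts(reads, writes):
    """Inspector (sketch): give each iteration the earliest wavefront that
    respects flow, anti, and output dependences through the array A."""
    last_write = {}  # element -> wavefront of its most recent writer
    max_read = {}    # element -> highest wavefront of any earlier reader
    wavefront = []
    for r, w in zip(reads, writes):
        level = 0
        if r in last_write:  # flow: read after an earlier write
            level = max(level, last_write[r] + 1)
        if w in last_write:  # output: write after an earlier write
            level = max(level, last_write[w] + 1)
        if w in max_read:    # anti: write after an earlier read
            level = max(level, max_read[w] + 1)
        wavefront.append(level)
        max_read[r] = max(max_read.get(r, -1), level)
        last_write[w] = level
    return wavefront


def group_wavefronts(wavefront):
    """Partition iteration indices into wavefronts, in execution order."""
    fronts = defaultdict(list)
    for i, level in enumerate(wavefront):
        fronts[level].append(i)
    return [fronts[level] for level in sorted(fronts)]


def body(A, reads, writes, i):
    # Hypothetical loop body: A(writes(i)) = A(reads(i)) + i + 1
    A[writes[i]] = A[reads[i]] + i + 1


def run_sequential(A, reads, writes):
    for i in range(len(reads)):
        body(A, reads, writes, i)
    return A


def run_wavefronts(A, reads, writes, workers=4):
    """Executor (sketch): run each wavefront's iterations concurrently.
    Waiting for the whole wavefront to finish acts as the barrier."""
    fronts = group_wavefronts(build_wavefronts(reads, writes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for front in fronts:
            list(pool.map(lambda i: body(A, reads, writes, i), front))
    return A


reads, writes = [0, 1, 0, 2, 3], [1, 2, 3, 0, 4]
print(group_wavefronts(build_wavefronts(reads, writes)))  # [[0, 2], [1, 4], [3]]
print(run_wavefronts([10, 20, 30, 40, 50], reads, writes))  # [17, 11, 13, 13, 18]
```

No two iterations in the same wavefront touch an element that one of them writes, so the wavefront order gives the same result as the sequential loop.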


A Dynamically Adaptive Parallelization Model Based on Speculative.. - Kazi (2000)   Self-citation (Yew)

....implementation issues. Chapter 5 evaluates the performance of these proposed techniques on shared-memory multiprocessor systems. Finally, the thesis contributions are summarized in Chapter 6. Chapter 2: Related Work. A variety of existing run-time parallelization techniques, both software-based [11, 37, 38, 39, 54] and hardware-based [34, 43, 53], can exploit coarse-grained loop-level parallelism from application programs that cannot be easily parallelized using traditional parallelization tools. Most of the software-based techniques can handle loops with run-time data dependences only. Other techniques can ....

....for loops with early exits, also cannot be parallelized by these techniques since the loop bound is not known at compile time. A number of schemes have been developed to exploit medium- to coarse-grained loop-level parallelism in programs in which the parallelism cannot be detected at compile time [11, 34, 37, 38, 39, 43, 53, 54]. These schemes are generally referred to as run-time parallelization schemes. Most of the run-time parallelization techniques are based on inspector-executor algorithms [11, 37, 38, 54]. These techniques, most of which are implemented entirely in software, consist of an inspector stage followed ....

D.K. Chen, P.C. Yew, and J. Torrellas, An Efficient Algorithm for the Run-time Parallelization of Doacross Loops, Supercomputing, Nov. 1994, pp. 518-527.


Hardware for Speculative Run-Time Parallelization in.. - Zhang, Rauchwerger.. (1997)   (25 citations)  Self-citation (Torrellas)

....execution schedules for partially parallel loops. These are loops whose parallelization may require synchronization to ensure that the iterations are executed in the correct order. These methods are often based on the extraction of an inspector loop that analyzes the data access patterns ([5, 12, 13, 17], to name a few). The inspector usually yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in parallel by the executor, with barriers separating the wavefronts. Unfortunately, the inspector may be both computationally expensive and have ....

D. K. Chen, J. Torrellas, and P. C. Yew. An Efficient Algorithm for the Run-Time Parallelization of Do-Across Loops. In Supercomputing '94, pages 518--527, November 1994.


Speculative Parallel Execution of Loops with.. - Zhang, Rauchwerger.. (1997)   Self-citation (Torrellas)

....approaches have been proposed. These schemes use information available at run time to construct execution schedules that are partially parallel. The right schedule is enforced with direct synchronization. These methods are often based on an inspector loop that analyzes the data access patterns ([4, 16, 19, 22], to name a few). If the loop is not fully parallel, the inspector usually yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in parallel by the executor, with barriers separating the wavefronts. This inspector-executor method is also ....

D. K. Chen, J. Torrellas, and P. C. Yew. An Efficient Algorithm for the Run-Time Parallelization of Do-Across Loops. In Supercomputing '94, pp. 518--527, November 1994.


Hardware for Speculative Run-Time Parallelization in.. - Zhang, Rauchwerger.. (1998)   (25 citations)  Self-citation (Torrellas)

....methods for constructing execution schedules for partially parallel loops. These are loops whose parallelization may require synchronization to ensure that the iterations are executed in the correct order. These methods are often based on an inspector loop that analyzes the data access patterns ([4, 10, 13, 15], to name a few). The inspector usually yields a partitioning of the iteration space into subsets called wavefronts. Each wavefront is then executed in .... [Footnote 1: This work was supported in part by the National Science Foundation under grants NSF Young Investigator Award MIP 9457436, ASC 9612099 and ....]

D. K. Chen, J. Torrellas, and P. C. Yew. An Efficient Algorithm for the Run-Time Parallelization of Do-Across Loops. In Supercomputing '94, pages 518--527, November 1994.


On Effective Execution of Non-Uniform DOACROSS Loops - Chen, Yew (1996)   (4 citations)  Self-citation (Chen Yew)

....17]. If this is not possible, we can still execute the loop in parallel, provided proper synchronization is added to enforce loop-carried dependences. The need for synchronization can be determined either at compile time [15, 25] for uniform (i.e., constant-distance) dependences, or at run time [30, 6] for complicated non-uniform dependences. If all these techniques fail, the DOACROSS loop must be executed serially, which, according to Amdahl's law [2], could severely degrade performance [7]. The most common type of loop-carried dependences in engineering and scientific programs arises from ....

D.-K. Chen, J. Torrellas, and P.-C. Yew. An efficient algorithm for the run-time parallelization of DOACROSS loops. In Supercomputing '94, pages 518--527, November 1994. Also available as CSRD tech report No. 1345.
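The alternative mentioned in the Chen and Yew context enforces loop-carried dependences with run-time synchronization instead of wavefront barriers. The sketch below illustrates that idea generically. It is not the method of [30, 6] or of the cited paper. It assumes the same hypothetical model: iteration i reads A[reads[i]] and then writes A[writes[i]]. The inspector `predecessors` records, for each iteration, the earlier iterations it must wait for. The executor `run_doacross` uses post/wait signalling through `threading.Event`. Both function names are illustrative. The sketch starts one thread per iteration only to stay small and deadlock-free.

```python
import threading


def predecessors(reads, writes):
    """Inspector (sketch): for each iteration, list the earlier iterations
    it must wait for (flow, output, and anti dependences through A)."""
    last_writer = {}  # element -> index of its most recent writer
    readers = {}      # element -> iterations that read it since that write
    preds = []
    for i, (r, w) in enumerate(zip(reads, writes)):
        p = set()
        if r in last_writer:          # flow dependence
            p.add(last_writer[r])
        if w in last_writer:          # output dependence
            p.add(last_writer[w])
        p.update(readers.get(w, []))  # anti dependences
        preds.append(sorted(p))
        readers.setdefault(r, []).append(i)
        last_writer[w] = i
        readers[w] = []  # older readers are already ordered before this writer
    return preds


def run_doacross(A, reads, writes):
    """Executor (sketch): all iterations start at once; each waits for its
    predecessors, runs the body, then posts its own event."""
    preds = predecessors(reads, writes)
    done = [threading.Event() for _ in reads]

    def iteration(i):
        for j in preds[i]:
            done[j].wait()                   # wait on predecessor j
        A[writes[i]] = A[reads[i]] + i + 1   # hypothetical loop body
        done[i].set()                        # post

    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(len(reads))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A


reads, writes = [0, 1, 0, 2, 3], [1, 2, 3, 0, 4]
print(predecessors(reads, writes))  # [[], [0], [], [0, 1, 2], [2]]
print(run_doacross([10, 20, 30, 40, 50], reads, writes))  # [17, 11, 13, 13, 18]
```

In this form, an iteration waits only for the specific earlier iterations it depends on, not for a whole wavefront. The cost is one synchronization event per iteration.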


The Illinois Aggressive COMA Multiprocessor Project (I-ACOMA) - Torrellas, Padua (1996)   (12 citations)  Self-citation (Torrellas)

....loops often are fully parallel.

    DO i = 1, N
    S1:  A(f(i))
    S2:  A(g(i))
    ENDDO

Figure 2: Typical loop that, in most cases, cannot be analyzed at compile time. One possibility is to parallelize these loops with costly run-time parallelization algorithms [4, 10, 12, 13, 16, 17, 19, 28]. In most of these proposals, the compiler inserts code that, at run time, determines which iterations do not depend on each other (inspector phase) and then executes these iterations in parallel (executor phase). However, these algorithms often have high overhead that precludes any speedup from ....

D. K. Chen, J. Torrellas, and P. C. Yew. An Efficient Algorithm for the Run-Time Parallelization of Do-Across Loops. In Supercomputing '94, pages 518--527, November 1994.


The LRPD Test: Speculative Run-Time Parallelization of.. - Rauchwerger, Padua (1999)   (57 citations)

No context found.

CiteSeer.IST - Copyright Penn State and NEC