C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. J. Parallel and Distributed Computing, 42(2):143--156, 1997.
....in running time is not as high as the worst complexity shows, as CPR runs with no more than slower than TwoL, and in some cases CPR is even faster than TwoL. Our future work includes moving the M task scheduling at run time in a manner similar to the RAPID system for S task scheduling [6]. The reason for moving the scheduling at run time is that the scheduling is dependent on the problem size, and therefore needs to be recomputed for each problem size. Moving the scheduling at runtime implies only one compilation at the expense of a scheduling offset. In order to reduce this ....
C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. J. Parallel and Distributed Computing, 42(2):143--156, 1997.
....3 P 2 ) This is also confirmed by the running times that were measured, showing that CPA runs up to 20 and 15 times faster compared to TwoL and CPR, respectively. Our future work includes moving the M task scheduling at run time in a manner similar to the RAPID system for S task scheduling [8]. The reason for moving the scheduling at run time is that the scheduling is dependent on the problem size, and therefore needs to be recomputed for each problem size. Moving the scheduling at runtime implies only one compilation at the expense of extra scheduling runtime cost. CPA is a good ....
C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. Journal of Parallel and Distributed Computing, 42(2):143--156, May 1997.
....running time is not as high as the worst complexity shows, as CPR runs with no more than 60 slower than TwoL, and in some cases CPR is even faster than TwoL. Our future work includes moving the M task scheduling at run time in a manner similar to the RAPID system for S task scheduling [6]. The reason for moving the scheduling at run time is that the scheduling is dependent on the problem size, and therefore needs to be recomputed for each problem size. Moving the scheduling at runtime implies only one compilation at the expense of a scheduling offset. In order to reduce this ....
C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. J. Parallel and Distributed Computing, 42(2):143--156, 1997.
....the increase in running time is not as high as the O(V) suggests, as CPR runs no more than 60 slower than TwoL, and in some cases is even faster than TwoL. Our future work includes moving the M task scheduling at run time in a manner similar to the RAPID system for S task scheduling [8]. The reason for moving the scheduling at run time is that the scheduling is dependent on the problem size, and therefore needs to be recomputed for each problem size. Moving the scheduling at runtime implies only one compilation at the expense of a scheduling runtime cost offset. In order to ....
C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. Journal of Parallel and Distributed Computing, 42(2):143--156, May 1997.
....we assume that the factorization is done on a distributed-memory system with sufficient memory to handle the work assigned to each processor. Even though there has been an effort to use DAG based scheduling for irregular computations on a parallel system with a low overhead communication mechanism [9], this paper presents the first work that deals with the entire framework of applying a scheduling approach for block oriented sparse Cholesky factorization in a distributed system. The next section describes the block fan out method for parallel sparse Cholesky factorization. In Section 3, the ....
C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. J. of Parallel and Distributed Computing, 42:143--156, 1997.
....They are efficient for solving iterative irregular problems in which communication and computation phases alternate; indeed, in those kinds of applications, the cost of optimizations performed at the inspector stage can be amortized over many computation iterations at the executor stage. RAPID [9] is another run time system based on a computation specification library for specifying irregular data objects and tasks manipulating them. The inspector extracts a task dependence graph from the accessing patterns, schedules the tasks and distributes the data and tasks onto processors for ....
C. Fu and T. Yang. Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 42:143-- 156, 1997.
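The inspector step mentioned in the excerpt above (extracting a task dependence graph from access patterns) can be illustrated with a minimal Python sketch. This is not RAPID's API or algorithm: the names `build_true_dependence_graph` and `wavefront_levels`, the `(name, reads, writes)` task format, and the example tasks are all assumptions made for illustration. The sketch only records true (read-after-write) dependences and groups tasks into levels that could run concurrently.

```python
def build_true_dependence_graph(tasks):
    """Hypothetical inspector step.

    tasks: list of (name, reads, writes) in sequential program order,
    where reads and writes are sets of data object IDs.
    Returns a dict mapping each task name to the set of tasks that
    read a value it wrote (true dependences only).
    """
    last_writer = {}
    succ = {name: set() for name, _, _ in tasks}
    for name, reads, writes in tasks:
        for obj in reads:
            writer = last_writer.get(obj)
            if writer is not None and writer != name:
                succ[writer].add(name)
        for obj in writes:
            last_writer[obj] = name
    return succ


def wavefront_levels(tasks, succ):
    """Assign each task the length of the longest dependence chain before it.

    Program order is a valid topological order here because every edge
    points from an earlier task to a later one.
    """
    level = {name: 0 for name, _, _ in tasks}
    for name, _, _ in tasks:
        for s in succ[name]:
            level[s] = max(level[s], level[name] + 1)
    return level


# Illustrative input (assumed, not taken from the cited paper).
tasks = [
    ("T1", set(), {"a"}),
    ("T2", set(), {"b"}),
    ("T3", {"a", "b"}, {"c"}),
    ("T4", {"a"}, {"d"}),
]
graph = build_true_dependence_graph(tasks)
levels = wavefront_levels(tasks, graph)
```

Tasks that share a level have no true dependence between them, so a scheduler is free to place them on different processors; a real system would also weigh communication cost and data placement, which this sketch ignores.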
....transformation. A transformed dependence graph contains true dependencies only. An extension to the classical task graph model is that commuting tasks can be marked in a task graph so that it can capture parallelism arising from commutative operations. The details on this parallelism model are in [5, 7] and this paper deals with scheduling and execution of a transformed task graph with an acyclic structure (DAG). The proposed memory optimizing techniques are intended for executing general task parallelism. The experiments are conducted in the context of RAPID [5] which is a run time system that ....
....data access patterns. [Figure 1: The stages of run time parallelization in RAPID. Recoverable stage labels: data dependence graph (DDG); dependence-complete task graph; task assignments, data object owners; schedules and iterative asynchronous execution. Panels (a) to (c), showing an example task graph and its Proc0/Proc1 schedules, are not recoverable.] ....
C. Fu and T. Yang. Run-time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 1997. Accepted for publication. Also as UCSB technical report TRCS97-03.
....a DDG of a program to a task graph, we first delete all redundant output and anti dependence edges. If the resulting graph contains true data dependence only, then this graph is a task graph. Otherwise other transformations are necessary to remove the remaining anti and output dependence [FY95]. We call this kind of task graph a dependence complete task graph. Formally a task graph G is dependence complete if it satisfies the following four properties: DA1: A task only uses distinct data items. A task Tx does not receive data items with the same ID from different predecessors. ....
....without any transformation. For example, the sparse LU graphs discussed in Section 5.1 are dependence complete. One disadvantage of a dependence complete task graph (or DDG) is that it cannot model parallelism arising from commutative operations. Cholesky factorization is one typical example. In [FY95] we have proposed an approach that uses a dependence incomplete task graph to model the Cholesky factorization before graph scheduling. After the scheduling algorithm exploits the commutativity of the operations, we transform the scheduled Cholesky graph into a dependence complete graph and then ....
C. Fu and T. Yang. Run-time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Technical Report TRCS95-21, Dept. of Computer Science, UCSB, 1995. http://www.cs.ucsb.edu/Research/rapid sweb/RAPID.html.
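Property DA1 quoted in the excerpt above (a task does not receive data items with the same ID from different predecessors) lends itself to a small check. The sketch below is only an illustration under assumptions: the `(pred, succ, item_id)` edge format and the function name `da1_violations` are hypothetical, and the other three properties of a dependence-complete graph are not shown in the excerpt, so they are not covered.

```python
def da1_violations(edges):
    """Return the (task, item_id) pairs that break property DA1.

    edges: iterable of (pred, succ, item_id) tuples, meaning task `pred`
    sends data item `item_id` to task `succ` (assumed representation).
    A pair is reported when the same item ID reaches one task from
    more than one distinct predecessor.
    """
    sources = {}
    for pred, succ, item_id in edges:
        sources.setdefault((succ, item_id), set()).add(pred)
    return sorted(key for key, preds in sources.items() if len(preds) > 1)


# Illustrative graphs (assumed, not taken from the cited report).
ok_edges = [("T1", "T3", "a"), ("T2", "T3", "b")]
bad_edges = [("T1", "T3", "a"), ("T2", "T3", "a")]

ok_result = da1_violations(ok_edges)
bad_result = da1_violations(bad_edges)
```

In the second graph, T3 would receive item "a" from both T1 and T2, which is the situation DA1 rules out.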
....irregular applications at run time. The original schedule execution scheme in RAPID does not support incremental memory allocation. With sufficient space, RAPID delivers good performance for several tested irregular programs such as sparse Cholesky factorization, sparse triangular solvers [Fu and Yang 1997], and Fast Multipole Method for N body simulation [Fu 1997]. In particular we show that RAPID can be used to parallelize sparse LU (Gaussian elimination) with dynamic partial pivoting, which is an important open parallelization problem in the literature, and deliver high megaflops on the ....
....scheme in RAPID does not support incremental memory allocation. With sufficient space, RAPID delivers good performance for several tested irregular programs such as sparse Cholesky factorization, sparse triangular solvers [Fu and Yang 1997] and Fast Multipole Method for N body simulation [Fu 1997]. In particular we show that RAPID can be used to parallelize sparse LU (Gaussian elimination) with dynamic partial pivoting, which is an important open parallelization problem in the literature, and deliver high megaflops on the Cray T3D/T3E [Fu and Yang 1996b]. Another usage of RAPID is ....
Fu, C. and Yang, T. 1997. Run-time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Journal of Parallel and Distributed Computing 42, 143--156. Also as UCSB technical report TRCS97-03.