Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....threads libraries. In addition, programming languages such as Cilk [7, 21] and Java [3] support multithreading with linguistic abstractions. A major factor in the performance of such multithreaded parallel applications is the operation of the thread scheduler. Prior work on thread scheduling [4, 5, 8, 13, 14] has dealt exclusively with non-multiprogrammed environments in which a multithreaded computation executes on P dedicated processors. Such scheduling algorithms dynamically map threads onto the processors with the goal of achieving P-fold speedup. Though such algorithms will work in some ....
....to page faults [6]. For these reasons, work stealing is practical, and variants have been implemented in many systems [7, 19, 20, 24, 34, 38]. For general multithreaded computations, other scheduling algorithms have also been shown to be simultaneously efficient with respect to time and space [4, 5, 13, 14]. Of particular interest here is the idea of deriving parallel depth-first schedules from serial schedules [4, 5], which produces strong upper bounds on time and space. The practical application and possible adaptation of this idea to multiprogrammed environments is an open question. Prior work ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....that they contained no provision for handling overflow of renaming buffers [27]. Recently, work on efficient task queue implementations for explicitly parallel functional programming languages [85] has been extended to provide theoretical bounds on the renaming resources required by such systems [18, 17]. SUDS provides constant-bounded resource guarantees through its checkpoint repair mechanism. This mechanism allows SUDS to roll back and sequentially re-execute any program fragment that exhausts renaming resources when run in parallel. Further, the SUDS memory dependence speculation mechanism ....
....of the last decade. The dataflow machines of the past, however, had two problems. Fortunately, a system like SUDS can help to address these problems. The first problem was that dataflow machines did not run imperative programs, but only programs written in functional programming languages [35, 34, 91, 85, 28, 18, 17]. Scalar queue conversion can help address this problem because it converts scalar updates into function (closure) calls. The second problem with dataflow machines was that their renaming mechanisms were not fundamentally deadlock-free [27]. Checkpoint repair mechanisms, like that provided by ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with finegrained parallelism. Journal of the ACM, 46(2):281--321, 1999.
....on an allocation of ready tasks to idle processors. Using it, several results have been obtained that prove that well-defined classes of parallel programs can be executed in asymptotically optimal time on theoretical machine models such as the PRAM or the local PRAM, including scheduling overheads [2, 5, 3]. Basically, list scheduling does not require much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks allows one to bound the amount of space required for the computation while achieving linear speedup for ....
....processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks allows one to bound the amount of space required for the computation while achieving linear speedup for certain classes of programs: strict computations [5], nested computations [2], or planar graphs [3]. Furthermore, in practice, due to the magnitude of the ratio between local and remote memory access costs, a significant improvement can be brought to a schedule by some knowledge about the data-flow graph corresponding to the execution [10]. Exploiting this knowledge leads ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa-Barbara, California, 1995. ACM Press.
....number IST 1999 14186 (ALCOM FT) dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....
....that is the maximum length of any path in the computation graph. For computations with sufficient parallelism, depth-first schedulers improve upon the previous space bound of $O(P \cdot S_1)$ achieved by work-stealing algorithms [7, 9, 15, 16]. Moreover, depth-first schedulers cope with heap allocation [5], which is more general than the stack-based model assumed in work on work stealing [7, 8, 9, 15, 16]. However, depth-first schedulers use a globally ordered centralized data structure of active threads and thus they are not as practical as work-stealing schedulers. Especially for fine-grained ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the ACM, 46(2):281--321, March 1999.
....number IST 1999 14186 (ALCOM FT) dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....
....a scheduling algorithm to achieve all of the above goals is not a trivial task. Several algorithms [7, 9, 15, 16] employ work stealing, a technique in which underutilized processors try to steal work from overutilized ones, to achieve the above scheduling goals. Recently, a flurry of research [4, 6, 24, 26] has resulted in depth-first schedulers, which schedule threads prioritized by their (serial) left-to-right depth-first execution order and are highly space-efficient; the space complexity of an algorithm is the total amount of memory used by all processors to execute the computation. ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12. Santa Barbara, California, July 1995.
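The context above describes depth-first schedulers as prioritizing ready threads by their serial left-to-right depth-first execution order. A minimal Python sketch of that idea follows. The function names, the 8-node example dag, and the unit-time step model are assumptions made for illustration; they are not taken from the cited paper, and the sketch ignores memory accounting entirely.

```python
import heapq

def _indegrees(children):
    # Count incoming edges for every node of the dag.
    indegree = {v: 0 for v in children}
    for kids in children.values():
        for k in kids:
            indegree[k] += 1
    return indegree

def serial_depth_first_order(children, root):
    """Order in which a 1-processor, left-to-right depth-first run executes nodes."""
    indegree = _indegrees(children)
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # Push children right-to-left so the leftmost ready child runs next.
        for k in reversed(children[v]):
            indegree[k] -= 1
            if indegree[k] == 0:
                stack.append(k)
    return order

def depth_first_schedule(children, root, p):
    """Greedy p-processor schedule: each step runs the (at most) p ready
    nodes that come earliest in the serial depth-first order."""
    order = serial_depth_first_order(children, root)
    priority = {v: i for i, v in enumerate(order)}
    indegree = _indegrees(children)
    ready = [(priority[root], root)]
    steps = []
    while ready:
        batch = [heapq.heappop(ready)[1] for _ in range(min(p, len(ready)))]
        steps.append(batch)
        # Nodes enabled in this step become ready for the next step.
        for v in batch:
            for k in children[v]:
                indegree[k] -= 1
                if indegree[k] == 0:
                    heapq.heappush(ready, (priority[k], k))
    return steps

# Hypothetical fork-join dag: 0 forks to 1 and 2; 1 forks to 3 and 4,
# which join at 5; 2 continues to 6; 5 and 6 join at 7.
dag = {0: [1, 2], 1: [3, 4], 2: [6], 3: [5], 4: [5], 5: [7], 6: [7], 7: []}
print(serial_depth_first_order(dag, 0))  # [0, 1, 3, 4, 5, 2, 6, 7]
print(depth_first_schedule(dag, 0, 2))   # [[0], [1, 2], [3, 4], [5, 6], [7]]
```

With work W = 8 and depth D = 5 (nodes on the longest path), the 2-processor run takes 5 steps, within the greedy bound W/p + D = 9. With p = 1 the schedule reduces to the serial depth-first order.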
....the performance model for a given program. In particular, tools for measuring the critical path of a given program are essential. Determining the critical path length is a well-known and extremely effective way of understanding the performance of parallel programs [BL94, BJK 95, BJK 96, BGM95, NB97, Nar99]. 1.4 Contributions. The principal contributions of our work are as follows. • We propose a technique for achieving efficient execution in bottlenecks. It reduces the number of mutual exclusion operations that accompany bottleneck modules and enhances the cache efficiency in the ....
....excessive demands on memory because processors can rapidly send repeated requests to an owner, resulting in the creation of a huge number of data structures containing the information needed for the execution of the requested method. See Cilk's work [BL94, BJK 95, BJK 96] and NESL's work [BGM95, BGMN97, NB97] for a theoretical background on space efficiency. Local-based execution, however, is not always the best choice. If an object is updated frequently by multiple processors, for example, local-based execution will be subject to serious slowdowns caused by overheads, such as cache ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably Efficient Scheduling for Languages with Fine-Grained Parallelism. In Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA '95), pages 1--12, Santa Barbara, CA, July 1995.
....at run time. Cilk can handle large problems, but communication cost is not taken into consideration. The Cilk system performs very well for tree-like computations (min-max search, backtrack exploration, etc.), but it has not been designed for scientific loop-nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirement, but these works do not consider DAG memory requirement. In [1], a tool, CASCH, is presented. It generates a schedule and parallel code for a sequential program. Nevertheless ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....at run time. Cilk can handle large problems, but communication cost is not taken into consideration. The Cilk system performs very well for tree-like computations (min-max search, backtrack exploration, etc.), but it has not been designed for scientific loop-nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirement, but these works do not consider DAG memory requirement. In [1], a tool, CASCH, is presented. It generates a schedule and parallel code for a sequential program. Nevertheless ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....a FORALL construct. The compile time component constructs the threads from the nested loops. A run time component dynamically schedules these threads across processors. Recent work has resulted in run time scheduling techniques that minimize completion time and memory use of the generated threads [9, 6, 20]. Scheduling very fine grained threads (e.g. a single multiplication in the sparse matrix vector product example) is impractical, hence compile time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. FORALL (i = ....
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, June 1995.
....scheduled at run time. Cilk can handle large problems, but communication cost is not taken into consideration. The Cilk system gives good results with tree-like computations (min-max search, backtrack exploration, etc.), but it has not been designed for scientific loop-nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirement, but these works do not consider DAG memory requirement. In [1], a tool, CASCH, is presented. It generates a schedule and parallel code for a sequential program. Nevertheless ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....serial, depth-first space requirement [9]. A computation with $W$ work (total number of operations) and $D$ depth (length of the critical path) was shown to require $W/p + O(D)$ time on $p$ processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 34] has resulted in depth-first scheduling algorithms that require $S_1 + O(p \cdot D)$ space for nested-parallel computations with depth $D$. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is ....
....in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is asymptotically lower than the work-stealing bound of $p \cdot S_1$. Further, the depth-first approach allows a more general memory allocation model compared to the stack-based allocations assumed in space-efficient work stealing [6]. The depth-first approach has been extended to handle computations with futures [39] or I-structures [16], resulting in similar space bounds [4]. Experiments showed that an asynchronous, depth-first scheduler often results in lower space requirement in practice, compared to a work-stealing ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. ACM symp. Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995.
....ignoring, though, space requirements and communication costs. Burton shows in [10] how to limit space in certain parallel computations without causing deadlock. More recently, Burton [9] has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch et al. [3, 4] have also recently developed and analyzed scheduling algorithms with provably good time and space bounds for languages with nested fine-grained parallelism (that is, languages that lead to series-parallel DAGs). All these algorithms are analyzed only for shared-memory machines and do not account ....
G. E. Blelloch, P.B. Gibbons and Y. Matias, "Provably efficient scheduling for languages with fine-grained parallelism," Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, Santa Barbara, California, pp. 1--12, July 1995.
....a FORALL construct. The compile-time component constructs the threads from the nested loops. A run-time component dynamically schedules these threads across processors. Recent work has resulted in run-time scheduling techniques that minimize completion time and memory use of the generated threads [9, 6, 20]. Scheduling very fine-grained threads (e.g., a single multiplication in the sparse matrix-vector product example) is impractical, hence compile-time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. FORALL (i = ....
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, June 1995.
....is impractical, hence compile-time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. Recent work has resulted in run-time scheduling techniques that minimize completion time and memory use of the generated threads [11, 8, 27]. The automatic construction of threads of appropriate granularity is currently being investigated by several researchers [26, 19]. In Fig. 4(c) we show a decomposition of the total work into four parallel threads $T_1, \ldots, T_4$. In this decomposition the body of the inner FORALL loop has been ....
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, June 1995.
....where $S_1$ is the serial, depth-first space requirement [9]. A computation with $W$ work (total number of operations) and $D$ depth (length of the critical path) was shown to require $W/p + O(D)$ time on $p$ processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 34] has resulted in depth-first scheduling algorithms that require $S_1 + O(p \cdot D)$ space for nested-parallel computations with depth $D$. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is ....
....in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is asymptotically lower than the work-stealing bound of $p \cdot S_1$. Further, the depth-first approach allows a more general memory allocation model compared to the stack-based allocations assumed in space-efficient work stealing [6]. The depth-first approach has been extended to handle computations with futures [39] or I-structures [16], resulting in similar space bounds [4]. Experiments showed that an asynchronous, depth-first scheduler often results in lower space requirement in practice, compared to a work-stealing ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. ACM symp. Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995.
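The two space bounds quoted in the context above can be set side by side. The following worked comparison is a sketch only: the constant $c$ hidden in the $O(\cdot)$ term and the numeric values are assumptions chosen for illustration, not figures from the cited paper.

```latex
\begin{align*}
  \text{work stealing:} \quad & S_p \le p \cdot S_1, \\
  \text{depth-first:}   \quad & S_p \le S_1 + c \, p \, D
      \quad \text{(assumed constant } c \text{ for the } O(p \cdot D) \text{ term)}, \\[4pt]
  \frac{S_1 + c \, p \, D}{p \cdot S_1} &= \frac{1}{p} + \frac{c \, D}{S_1}.
\end{align*}
% Assumed example: S_1 = 10^6, D = 10^2, p = 64, c = 1.
%   p * S_1        = 6.4 * 10^7
%   S_1 + c*p*D    = 10^6 + 6400 = 1.0064 * 10^6
%   ratio          = 1/64 + 10^{-4} \approx 0.0157
```

The ratio shows why low depth matters: when $D$ is small relative to $S_1$, the depth-first bound approaches a $1/p$ fraction of the work-stealing bound.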
....by the owning thread (unlike shared variables which can be changed by simultaneously executing threads) Private variables are useful for communicating between a Cilk thread and C functions it calls, because these C functions are completely contained in the Cilk thread. An private char alternates[10][MAXWORDLEN] int checkword(const char word) Check spelling of word . If spelling is correct, return 0. Otherwise, put up to 10 alternate spellings in alternates array. Return number of alternate spellings. cilk void spellcheck(const char wordarray, int num) if (num = 1) ....
.... running time T P (C; n) The computational work of blockedmul is T 1 (n) Theta(n 3 ) so the total work is T 1 (C; n) T 1 (n) mF 1 (C; n) Theta(n 3 ) The critical path is T1 = Theta(lg 2 n) so using our performance model, the .... [Footnote 4] In recent work, Blelloch, Gibbons, and Matias [10] have shown that series-parallel dag computations can be scheduled to achieve substantially better space bounds than we report here. For example, they give a bound of $S_P(n) = O(n^2 + P \lg^2 n)$ for matrix multiplication. Their improved space bounds come at the cost of substantially more ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1--12, Santa Barbara, California, July 1995.
....on an allocation of ready tasks to idle processors. Using it, several results have been obtained that prove that well-defined classes of parallel programs can be executed in asymptotically optimal time on theoretical machine models such as the PRAM or the local PRAM, including scheduling overheads [2, 5, 3]. Basically, list scheduling does not require much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks allows one to bound the amount of space required for the computation while achieving linear speedup for ....
....processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks allows one to bound the amount of space required for the computation while achieving linear speedup for certain classes of programs: strict computations [5], nested computations [2], or planar graphs [3]. Furthermore, in practice, due to the magnitude of the ratio between local and remote memory access costs, a significant improvement can be brought to a schedule by some knowledge about the data-flow graph corresponding to the execution [10]. For instance, schedules computed ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa-Barbara, California, 1995. ACM Press.
....$= \Theta(P (n/P^{1/3})^2) = \Theta(n^2 P^{1/3})$. The work and critical-path length for matrixmul can also be computed using recurrences. The computational work $T_1(n)$ to multiply $n \times n$ matrices satisfies $T_1(n) = 8 T_1(n/2) + \Theta(n^2)$, since .... [Footnote 4] In recent work, Blelloch, Gibbons, and Matias [6] have shown that series-parallel dag computations can be scheduled to achieve substantially better space bounds than we report here. For example, they give a bound of $S_P(n) = O(n^2 + P \lg^2 n)$ for matrix multiplication. Their improved space bounds come at the cost of substantially more ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
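The context above states that the work of matrixmul satisfies a divide-and-conquer recurrence. As a sketch of how that recurrence resolves, assuming the $\Theta(n^2)$ term is exactly $c\,n^2$ for some constant $c$ and that $n$ is a power of two (both are simplifying assumptions, not statements from the citing paper):

```latex
\begin{align*}
  T_1(n) &= 8\,T_1(n/2) + c\,n^2 \\
         &= \sum_{i=0}^{\lg n - 1} 8^i \, c \left(\frac{n}{2^i}\right)^{2}
            + 8^{\lg n}\, T_1(1)
            && \text{(level } i \text{ has } 8^i \text{ subproblems of size } n/2^i\text{)} \\
         &= c\,n^2 \sum_{i=0}^{\lg n - 1} 2^i + n^3\, T_1(1) \\
         &= c\,n^2 (n - 1) + n^3\, T_1(1) \\
         &= \Theta(n^3).
\end{align*}
```

The per-level cost doubles at each level, so the leaves dominate and the total matches the $\Theta(n^3)$ work of the standard algorithm.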
....For applications where the amount of computation grows faster with problem size than communication, choosing a bigger problem size can reduce the relative impact of overheads such as communication latencies. Basically, we are applying Amdahl's law here, improving speedup by reducing critical path [8, 7]. In our situation, we could have increased the problem sizes to compensate for the slowness of the WANs. However, we have decided not to do so, since determining the impact of the WAN is precisely what we want to do. Thus, we believe and expect that the speedup figures that follow can be ....
G.E. Blelloch, P.B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. 7th ACM Symp. Par. Alg. and Arch. (SPAA), pages 1--12, July 1995.
....in the fast communication research such as active messages. Thus we expect other software system researchers can also benefit from our results in using fast communication support to design software layers. Most previous research on scheduling [16, 19, 20] does not address memory issues. In [1], a dynamic scheduling algorithm for directed acyclic graphs is proposed with memory space usage $S_1/p + O(D)$ on each processor, where $S_1$ is the sequential space requirement, $p$ is the total number of processors, and $D$ is the depth of a DAG. This work provides a solid theoretical ground for ....
....at most $S_1$ space per processor. This paper assumes that each processor has a maximum space limit, and the goal is to make the data space cost close to $S_1/p$ per processor in order to solve large-scale problems. The scheduling scheme we use is static in the run-time preprocessing stage, while [1] and [2] use dynamic scheduling. This is mainly because in practice it is difficult to minimize the run-time control overhead of dynamic scheduling in parallelizing sparse code with mixed granularities. It should be noted that there exists other space overhead, which includes the space for the ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably Efficient Scheduling for Languages with Fine-Grained Parallelism. In Proceedings of 7th ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....applications, multiprocessors also support multiprogrammed workloads in which a mix of serial and parallel, interactive and batch applications may execute concurrently. A major factor in the performance of such workloads is the operation of the thread scheduler. Prior work on thread scheduling [4, 5, 8, 11, 12] has dealt exclusively with non-multiprogrammed environments in which a multithreaded computation executes on P dedicated processors. Such scheduling algorithms dynamically map threads onto the processors with the goal of achieving P-fold speedups. Though such algorithms will work in some ....
....to page faults [6]. For these reasons, work stealing is practical, and variants have been implemented in many systems [7, 16, 17, 21, 30, 34]. For general multithreaded computations, other scheduling algorithms have also been shown to be simultaneously efficient with respect to time and space [4, 5, 11, 12]. Of particular interest here is the idea of deriving parallel schedules from serial schedules [4, 5], which produces strong upper bounds on time and space. The practical application and possible adaptation of this idea to multiprogrammed environments is an open question. Prior work that has ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....that results in a correct sequential execution) may be used. In this way, list scheduling leads to parallel computations achieving a linear speedup while requiring a space related to that of the sequential execution for certain classes of programs: strict computations [4], nested computations [2], or planar graphs [3]. Furthermore, in practice, due to the magnitude of the ratio between local and remote memory access costs, a significant improvement can be brought to a schedule by some knowledge about the data flow corresponding to the execution [8]. Some programming environments use such a ....
G. E. Blelloch, P. B. Gibbons and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. of the 7th Symp. on Parallel Algorithms and Architectures, pp 1-12, Santa-Barbara, 1995. ACM Press.
....where $S_1$ is the serial, depth-first space requirement [9]. A computation with $W$ work (total number of operations) and $D$ depth (length of the critical path) was shown to require $W/p + O(D)$ time on $p$ processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 36] has resulted in depth-first scheduling algorithms that require $S_1 + O(p \cdot D)$ space for nested-parallel computations with depth $D$. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [15], the space bound of $S_1 + O(p \cdot D)$ is ....
....in the class NC [15], the space bound of $S_1 + O(p \cdot D)$ is asymptotically lower than the work-stealing bound of $p \cdot S_1$. Further, the depth-first approach allows a more general memory allocation model compared to the stack-based allocations assumed in space-efficient work stealing [6]. The depth-first approach has been extended to handle computations with futures [41] or I-structures [17], resulting in similar space bounds [4]. Experiments showed that an asynchronous, depth-first scheduler often results in lower space requirement in practice, compared to a work-stealing ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. ACM symp. Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995.
.... Prolog. 2. We argue that the performance and execution models for Reform Prolog are clearer than for implicitly AND-parallel systems. Possible future work includes investigating to what extent the results for scheduling data parallelism in imperative languages, e.g. Blelloch, Gibbons and Matias [39], can be applied in Reform Prolog. 5.5 CONCLUSION This paper aims to show that there is no need for a complicated implementation technique to efficiently take advantage of nested data parallelism, and consequently a substantial part of nested dependent AND-parallelism using Lindgren's ....
G. Blelloch, P. Gibbons, Y. Matias, Provably Efficient Scheduling for Languages with Fine-Grained Parallelism, 7th ACM Symp. on Parallel Algorithms and Architectures, 1995.
....of algorithms for the scheduling of dynamically unfolding DAGs on p parallel processors so as to minimize the completion time. The work of Blelloch et al. studies the space complexity of the algorithm, the time complexity of the algorithm, and the quality of the performance guarantee [2]. When average completion time is used as the cost, Motwani et al. have shown that a preemptive time-sharing policy for uniprocessor systems, RoundRobin, can achieve the optimal competitive ratio. It guarantees an average completion time which is within $2 - 2/(n+1)$ times optimal and no ....
G.E. Blelloch, P.B. Gibbons, and Y. Matias, "Provably Efficient Scheduling for Languages with Fine-Grained Parallelism", Proceedings of 7th ACM Symposium on Parallel Algorithms and Architecture, pp. 1--12, 1995.
....LU is written as $\begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix} = \begin{pmatrix} L_{00} & 0 \\ L_{10} & L_{11} \end{pmatrix} \cdot \begin{pmatrix} U_{00} & U_{01} \\ 0 & U_{11} \end{pmatrix}$. The parallel algorithm computes $L$ and $U$ as follows. It recursively factors $A_{00}$ into $L_{00} \cdot U_{00}$. Then, it uses back substitution to solve .... [Footnote 4] In recent work, Blelloch, Gibbons, and Matias [6] have shown that series-parallel dag computations can be scheduled to achieve substantially better space bounds than we report here. For example, they give a bound of $S_P(n) = O(n^2 + P \lg^2 n)$ for matrix multiplication. Their improved space bounds come at the cost of substantially more ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....approaches. This algorithm has good memory access locality while keeping the memory overhead below a fixed fraction of the total memory usage. Combining breadth-first and depth-first approaches to bound memory overhead has proven successful in the parallel computation communities [7, 2, 12]. Experimental results on ISCAS85 [4] and multiplier circuits [6] show that our new approach is generally faster than other breadth-first and depth-first implementations, while keeping memory overhead comparable to the depth-first approach. In particular, for the 13-bit multiplier circuit, our ....
BLELLOCH, G. E., GIBBONS, P. B., AND MATIAS, Y. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 1995 ACM Symposium on Parallel Algorithms and Architectures (Santa Barbara, July 1995), pp. 420--430.
....cost model as a DAG of depth 2 and breadth n. When coming to an array scan in the code, the implementation spawns n threads and places them in the set of active threads. Since creating n threads could take more than constant time on p processors, they are created lazily using a stub as described in [8]: threads are expanded when taken from S instead of when inserted. For each block of p or fewer threads that are scheduled from the set in a particular step, we can use the unit-time scan primitive assumed in the machine model to execute the scan across that subset and place the new running sum ....
....is through the future cells themselves and there is no specification in the algorithms of what happens on what step. This gives freedom to the implementation as to how to schedule the tasks. The implementation, for example, could optimize the schedule for either space efficiency [12, 8, 9] or locality [13]. On a uniprocessor the implementation could run the code in a purely sequential mode without any need for synchronization. We are not yet sure how general the approach is. We have not been able to show, for example, whether the method can be used to generate a sort that has ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....fork join or loop parallelism using nonpreemptive, stateless threads; it further reduces overheads by coarsening and pruning excess parallelism. Recent work has resulted in provably efficient scheduling techniques that provide upper bounds on the space required by the parallel computation [9, 11, 12, 13, 35]. Since there are several possible execution orders for lightweight threads in a computation with a high degree of parallelism, the provably space efficient schedulers restrict the execution order for the threads to bound the space requirement. For example, the Cilk multithreaded system [11] ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995. ACM SIGACT/SIGARCH and EATCS.
....static partioning Figure 1: The speedup obtained by three different over relaxation algorithms. nested parallel computation on P processors is O( T 1 (C) P md m s eCT1 (m s)T1) where T1(C) is the uniprocessor execution time of the computation including cache misses. As in previous work [6, 9], we represent a multithreaded computation as a directed, acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested parallel computation [5, 6] is a race free computation that can be represented with a ....
....time of the computation including cache misses. As in previous work [6, 9] we represent a multithreaded computation as a directed, acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested parallel computation [5, 6] is a race free computation that can be represented with a series parallel dag [33] Nested parallel computations include computations consisting of parallel loops and forks and joins and any nesting of them. This class includes most computations that can be expressed in Cilk [8] and all ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1--12, Santa Barbara, California, July 1995.
....a total of w operations (work) and has a critical path length (depth) of d can be implemented to run in O(w/p + d) time, which is within a constant factor of optimal. These results were used to bound the time and space used by the Cilk programming language [7] Blelloch, Gibbons and Matias [4] showed that for nested computations, the time bounds can be maintained while bounding the space by s1 + O(pd) which for sufficient parallelism is just an additive factor over the sequential space. This was used to bound the space of the nesl programming language [5] Narlikar and Blelloch [30] ....
..... In addition, we show that if the dag is planar, or close to it, then the algorithm executes the computation in s1 + O(pd log p) space and O(w/p + d log p) time, independent of the number of synchronizations. Planar dags are a more general class of dags than the computation dags considered in [8, 9, 4]. Previously, no space bounds were known for computations with synchronization variables, even in the case where the dags are planar. As with previous work [4, 29] the idea behind the implementation is to schedule the threads in an order that is as close as possible to the sequential order (while ....
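The O(w/p + d) time bound quoted above can be checked on a small example with a minimal Python sketch of a greedy schedule. It assumes unit-work nodes, nodes numbered in topological order, and a depth measured in nodes; the function names are hypothetical and this is not the scheduler from any of the cited papers.

```python
from collections import defaultdict

def greedy_schedule(num_nodes, edges, p):
    """Run a greedy p-processor schedule on a unit-work DAG; return the number of steps."""
    indegree = [0] * num_nodes
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    ready = [v for v in range(num_nodes) if indegree[v] == 0]
    steps = 0
    while ready:
        # Greedy rule: execute min(p, number of ready nodes) nodes in this step.
        running, ready = ready[:p], ready[p:]
        for u in running:
            for v in children[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)  # becomes runnable from the next step on
        steps += 1
    return steps

def depth(num_nodes, edges):
    """Longest path length in nodes, assuming u < v for every edge (u, v)."""
    longest = [1] * num_nodes
    for u, v in sorted(edges):
        longest[v] = max(longest[v], longest[u] + 1)
    return max(longest)

# Fork-join DAG: node 0 forks eight children (1..8) that all join at node 9.
edges = [(0, c) for c in range(1, 9)] + [(c, 9) for c in range(1, 9)]
w, d, p = 10, depth(10, edges), 4
steps = greedy_schedule(10, edges, p)
print(steps, w / p + d)
```

On this DAG, w = 10 and d = 3, so with p = 4 the greedy schedule finishes within w/p + d steps.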
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....parallel programs [6, 8, 9] If S1 is the space required by the serial execution, these techniques generate executions for a multithreaded computation on p processors that require no more than p · S1 space. A scheduling algorithm that significantly improved this bound was recently proposed [2], and was used to prove time and space bounds for the implementation of NESL [4] It generates a schedule that uses only S1 + O(p · D · log p) space on a standard p processor EREW PRAM, where D is the depth of the parallel computation (i.e. the longest sequence of dependencies or the ....
....for any task graph, and prove upper bounds on their space and time requirements. We then present an online, asynchronous scheduling algorithm called Async Q, which generates a schedule with the same space bound of S1 + O(p · D · log p) (including scheduler space) that is obtained in [2]; the algorithm assumes an EREW PRAM with a unit time fetch and add operation. As with their scheduling algorithm, our Async Q algorithm applies to task graphs representing nested parallelism. However, our algorithm overcomes the above problems with their scheduling algorithm: it allows a thread ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 1995 ACM Symposium on Parallel Algorithms and Architectures, Santa Barbara, July 1995. ACM.
....computation on p processors that require no more than p · S1 space. These ideas are used in the implementation of the Cilk programming language [5] A recent scheduling algorithm improved these space bounds from a multiplicative factor on the number of processors to an additive factor [3]. The algorithm generates a schedule that uses only S1 + O(p · D) space, where D is the depth of the parallel computation (i.e. the longest sequence of dependencies or the critical path in the computation) This bound is asymptotically lower than the previous bound of p · S1 when D = o(S1) ....
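To make the gap between the multiplicative and additive space bounds concrete, here is a small worked comparison in Python. The constant `c` hidden in the O(p · D) term and the sample values of p, S1, and D are assumptions chosen purely for illustration, not figures from the cited papers.

```python
def multiplicative_bound(p, s1):
    """Earlier bound: p times the serial space S1."""
    return p * s1

def additive_bound(p, s1, d, c=1):
    """Improved bound: S1 plus c * p * D, where c stands in for the hidden constant (assumed)."""
    return s1 + c * p * d

# Illustrative values only: 64 processors, serial space 1,000,000 units, depth 100.
p, s1, d = 64, 1_000_000, 100
print(multiplicative_bound(p, s1))  # 64000000
print(additive_bound(p, s1, d))     # 1006400
```

When D is small relative to S1, the additive term p · D is dwarfed by S1, which is the regime in which the excerpt calls the new bound asymptotically lower.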
....algorithms in Cilk [4] rescheduled after every unit computation to guarantee the space bounds. Moreover, it ignores the issue of locality a thread may be moved from processor to processor at every timestep. This paper presents a variant on the scheduling algorithm proposed in [3] that overcomes the above mentioned problems. The paper then gives experimental results that demonstrate that the algorithm does achieve good performance both in terms of memory and time. The main goal in the design of the algorithm was to allow threads to execute nonpreemptively and ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. Symposium on Parallel Algorithms and Architectures, Santa Barbara, July 1995.
....on p processors that require no more than p · S1 space. These ideas are used in the implementation of the Cilk programming language [Blumofe et al. 1995] A recent scheduling algorithm improved these space bounds from a multiplicative factor on the number of processors to an additive factor [Blelloch et al. 1995]. The algorithm generates a schedule that uses only S1 + O(p · D) space, where D is the depth of the parallel computation (i.e. the length of the longest sequence of dependencies or the critical path in the computation) This bound is asymptotically lower than the previous bound of p · ....
....a stronger upper bound than p · S1 for space requirements of regular divide and conquer algorithms in Cilk [Blumofe et al. 1996] AsyncDF. This algorithm is a variant of the synchronous scheduling algorithm proposed in previous work [Blelloch et al. 1995], and overcomes the above mentioned problems. We also provide experimental results that demonstrate that the AsyncDF algorithm does achieve good performance both in terms of memory and time. The main goal in the design of the algorithm was to allow threads to execute nonpreemptively and ....
[Article contains additional citation context not shown here]
Blelloch, G. E., Gibbons, P. B., and Matias, Y. 1995. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. Symposium on Parallel Algorithms and Architectures, Santa Barbara, pp. 1--12.
....are scheduled while still maintaining sufficient parallelism. In particular, we schedule at most pT fetchadd (p) threads per step. Also, if we can choose the scheduled states appropriately, we might also be able to minimize the maximum number of active states on any step, for space efficiency [1]. But it seems unlikely that call by speculation allows an efficient implementation of a depth first p traversal of the computation DAG. Consider a modification of the FSAM model, called the Partially Speculative Abstract Machine (PSAM) which incorporates these changes. Before we can discard ....
Guy Blelloch, Phil Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In ACM Symposium on Parallel Algorithms and Architectures, July 1995.
....fork join or loop parallelism using non preemptive, stateless threads; it further reduces overheads by coarsening and pruning excess parallelism. Recent work has resulted in provably efficient scheduling techniques that provide upper bounds on the space required by the parallel computation [11, 12, 10, 8, 32]. Since there are several possible execution orders for lightweight threads in a computation with a high degree of parallelism, the provably space efficient schedulers restrict the execution order for the threads to bound the space requirement. For example, the Cilk multithreaded system [10] ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995. ACM SIGACT/SIGARCH and EATCS.
....that n could be much larger than p) When coming to an array scan in the code the implementation spawns n threads and places them in the set of active threads. Since creating n threads could take more than constant time on p processors, they are created lazily using a stub as described in [7] threads are expanded when taken from S instead of when inserted. For each block of p or less threads that are scheduled from the set in a particular step, we can use the scan primitive assumed in the machine model to execute the scan across that subset and place the new running sum back into ....
....synchronization is through the futurecells themselves and there is no specification in the algorithms of what happens on what step. This gives freedom to the implementation as to how to schedule the tasks. The implementation, for example, could optimize the schedule for either space efficiency [12, 7, 8] or locality [13] On a uniprocessor the implementation could run the code in a purely sequential mode without any need for synchronization. We are not yet sure how general the approach is. We have not yet been able to show, for example, whether the method can be used to generate a sort that has ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....on the space needed by the implementation based on this measure. These bounds show that for programs with sufficient parallelism, the parallel execution requires very little extra memory beyond a standard call by value sequential execution. These space bounds use recent results on DAG scheduling [2] and are non trivial. Although we use these extensions to prove bounds for Nesl, the techniques and results can be applied in a broader context. In particular we translate Nesl into a generic array language which could be used to express other array extensions, and the space bounds we derive can ....
....in the store. Our aim is to place bounds on how much extra space is needed. As mentioned, the idea behind the proof is to show that the P-CEK(q) executes a p-DFT traversal of the DAG g returned by the semantics, then use previous results on the number of nodes scheduled prematurely in a p-DFT [2], and finally use these results to bound the space. By the machine traversing the DAG we mean there is a one to one correspondence between substate transitions and nodes in the DAG. This implies that each parallel step of the P-CEK(q) processes min(q, |Q|) nodes of the DAG, and the total number of ....
[Article contains additional citation context not shown here]
Guy Blelloch, Phil Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In ACM Symposium on Parallel Algorithms and Architectures, July 1995.
No context found.
G. E. Blelloch, P. B. Gibbons and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. of the 7th Symp. on Parallel Algorithms and Architectures, pp 1-12, Santa-Barbara, 1995. ACM Press.