Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....threads libraries. In addition, programming languages such as Cilk [7, 21] and Java [3] support multithreading with linguistic abstractions. A major factor in the performance of such multithreaded parallel applications is the operation of the thread scheduler. Prior work on thread scheduling [4, 5, 8, 13, 14] has dealt exclusively with non-multiprogrammed environments in which a multithreaded computation executes on P dedicated processors. Such scheduling algorithms dynamically map threads onto the processors with the goal of achieving P-fold speedup. Though such algorithms will work in some ....
.... to page faults [6]. For these reasons, work stealing is practical and variants have been implemented in many systems [7, 19, 20, 24, 34, 38]. For general multithreaded computations, other scheduling algorithms have also been shown to be simultaneously efficient with respect to time and space [4, 5, 13, 14]. Of particular interest here is the idea of deriving parallel depth-first schedules from serial schedules [4, 5], which produces strong upper bounds on time and space. The practical application and possible adaptation of this idea to multiprogrammed environments is an open question. Prior work ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
.... that they contained no provision for handling overflow of renaming buffers [27]. Recently, work on efficient task queue implementations for explicitly parallel functional programming languages [85] has been extended to provide theoretical bounds on the renaming resources required by such systems [18, 17]. SUDS provides constant bounded resource guarantees through its checkpoint repair mechanism. This mechanism allows SUDS to roll back and sequentially reexecute any program fragment that exhausts renaming resources when run in parallel. Further, the SUDS memory dependence speculation mechanism ....
....of the last decade. The dataflow machines of the past, however, had two problems. Fortunately, a system like SUDS can help to address these problems. The first problem was that dataflow machines did not run imperative programs, but only programs written in functional programming languages [35, 34, 91, 85, 28, 18, 17]. Scalar queue conversion can help address this problem because it converts scalar updates into function (closure) calls. The second problem with dataflow machines was that their renaming mechanisms were not fundamentally deadlock-free [27]. Checkpoint repair mechanisms, like that provided by ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with finegrained parallelism. Journal of the ACM, 46(2):281--321, 1999.
....on an allocation of ready tasks to idle processors. Using it, several results have been obtained that prove that well-defined classes of parallel programs can be executed in asymptotically optimal time on theoretical machine models such as the PRAM or the local PRAM, including scheduling overheads [2, 5, 3]. Basically, list scheduling does not require much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for ....
....processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for certain classes of programs: strict computations [5], nested computations [2], or planar graphs [3]. Furthermore, in practice, because of the large ratio between local and remote memory access costs, a schedule can be improved significantly with some knowledge of the data flow graph corresponding to the execution [10]. Exploiting this knowledge leads ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, 1995. ACM Press.
....dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads, which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....
....that is the maximum length of any path in the computation graph. For computations with sufficient parallelism, depth-first schedulers improve upon the previous space bound of $O(P S_1)$ achieved by work-stealing algorithms [7, 9, 15, 16]. Moreover, depth-first schedulers cope with heap allocation [5], which is more general than the stack-based model assumed in work on work stealing [7, 8, 9, 15, 16]. However, depth-first schedulers use a globally ordered, centralized data structure of active threads, and thus they are not as practical as work-stealing schedulers. Especially for fine-grained ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the ACM, 46(2):281--321, March 1999.
....dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads, which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....
....a scheduling algorithm to achieve all of the above goals is not a trivial task. Several algorithms [7, 9, 15, 16] employ work stealing, a technique in which underutilized processors try to steal work from overutilized ones, to achieve the above scheduling goals. Recently, a flurry of research [4, 6, 24, 26] has resulted in depth-first schedulers, which schedule threads prioritized by their (serial) left-to-right depth-first execution order and are highly space efficient; the space complexity of an algorithm is the total amount of memory used by all processors to execute the computation. ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12. Santa Barbara, California, July 1995.
....the performance model for a given program. In particular, tools for measuring the critical path of a given program are essential. Determining the critical path length is a well-known and extremely effective way of understanding the performance of parallel programs [BL94, BJK 95, BJK 96, BGM95, NB97, Nar99]. 1.4 Contributions. The principal contributions of our work are as follows. • We propose a technique for achieving efficient execution in bottlenecks. It reduces the number of mutual exclusion operations that accompany bottleneck modules and enhances the cache efficiency in the ....
....excessive demands on memory because processors can rapidly send repeated requests to an owner, resulting in the creation of a huge number of data structures containing the information needed for the execution of the requested method. See Cilk's work [BL94, BJK 95, BJK 96] and NESL's work [BGM95, BGMN97, NB97] for a theoretical background on space efficiency. Local-based execution, however, is not always the best choice. If an object is updated frequently by multiple processors, for example, local-based execution will be subject to serious slowdowns caused by overheads, such as cache ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably Efficient Scheduling for Languages with Fine-Grained Parallelism. In Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA '95), pages 1--12, Santa Barbara, CA, July 1995.
....at run time. Cilk can handle large problems, but communication cost is not taken into consideration. The Cilk system performs very well for tree-like computations (min-max search, backtrack exploration, etc.), but it has not been designed for scientific loop nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirements, but these works do not consider DAG memory requirements. In [1], a tool, CASCH, is presented. It generates a schedule and parallel code for a sequential program. Nevertheless ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....at run time. Cilk can handle large problems, but communication cost is not taken into consideration. The Cilk system performs very well for tree-like computations (min-max search, backtrack exploration, etc.), but it has not been designed for scientific loop nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirements, but these works do not consider DAG memory requirements. In [1], a tool, CASCH, is presented. It generates a schedule and parallel code for a sequential program. Nevertheless ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....a FORALL construct. The compile-time component constructs the threads from the nested loops. A run-time component dynamically schedules these threads across processors. Recent work has resulted in run-time scheduling techniques that minimize completion time and memory use of the generated threads [9, 6, 20]. Scheduling very fine-grained threads (e.g., a single multiplication in the sparse matrix-vector product example) is impractical, hence compile-time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. FORALL (i = ....
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, June 1995.
....scheduled at run time. Cilk can handle large problems, but communication cost is not taken into consideration. The Cilk system gives good results with tree-like computations (min-max search, backtrack exploration, etc.), but it has not been designed for scientific loop nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirements, but these works do not consider DAG memory requirements. In [1], a tool, CASCH, is presented. It generates a schedule and parallel code for a sequential program. Nevertheless ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....serial, depth-first space requirement [9]. A computation with work $W$ (total number of operations) and depth $D$ (length of the critical path) was shown to require $W/p + O(D)$ time on $p$ processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 34] has resulted in depth-first scheduling algorithms that require $S_1 + O(p \cdot D)$ space for nested-parallel computations with depth $D$. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is ....
....in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is asymptotically lower than the work-stealing bound of $p \cdot S_1$. Further, the depth-first approach allows a more general memory allocation model compared to the stack-based allocations assumed in space-efficient work stealing [6]. The depth-first approach has been extended to handle computations with futures [39] or I-structures [16], resulting in similar space bounds [4]. Experiments showed that an asynchronous, depth-first scheduler often results in lower space requirement in practice, compared to a work-stealing ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. ACM symp. Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995.
....ignoring, though, space requirements and communication costs. Burton shows in [10] how to limit space in certain parallel computations without causing deadlock. More recently, Burton [9] has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch et al. [3, 4] have also recently developed and analyzed scheduling algorithms with provably good time and space bounds for languages with nested fine-grained parallelism (that is, languages that lead to series-parallel DAGs). All these algorithms are analyzed only for shared-memory machines and do not account ....
G. E. Blelloch, P.B. Gibbons and Y. Matias, "Provably efficient scheduling for languages with fine-grained parallelism," Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, Santa Barbara, California, pp. 1--12, July 1995.
....a FORALL construct. The compile-time component constructs the threads from the nested loops. A run-time component dynamically schedules these threads across processors. Recent work has resulted in run-time scheduling techniques that minimize completion time and memory use of the generated threads [9, 6, 20]. Scheduling very fine-grained threads (e.g., a single multiplication in the sparse matrix-vector product example) is impractical, hence compile-time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. FORALL (i = ....
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, June 1995.
....is impractical, hence compile-time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. Recent work has resulted in run-time scheduling techniques that minimize completion time and memory use of the generated threads [11, 8, 27]. The automatic construction of threads of appropriate granularity is currently being investigated by several researchers [26, 19]. In Fig. 4(c) we show a decomposition of the total work into four parallel threads $T_1, \ldots, T_4$. In this decomposition the body of the inner FORALL loop has been ....
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, June 1995.
....where $S_1$ is the serial, depth-first space requirement [9]. A computation with $W$ work (total number of operations) and $D$ depth (length of the critical path) was shown to require $W/p + O(D)$ time on $p$ processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 34] has resulted in depth-first scheduling algorithms that require $S_1 + O(p \cdot D)$ space for nested-parallel computations with depth $D$. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is ....
....in the class NC [14], the space bound of $S_1 + O(p \cdot D)$ is asymptotically lower than the work-stealing bound of $p \cdot S_1$. Further, the depth-first approach allows a more general memory allocation model compared to the stack-based allocations assumed in space-efficient work stealing [6]. The depth-first approach has been extended to handle computations with futures [39] or I-structures [16], resulting in similar space bounds [4]. Experiments showed that an asynchronous, depth-first scheduler often results in lower space requirement in practice, compared to a work-stealing ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. ACM symp. Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995.
....by the owning thread (unlike shared variables which can be changed by simultaneously executing threads) Private variables are useful for communicating between a Cilk thread and C functions it calls, because these C functions are completely contained in the Cilk thread. An private char alternates[10][MAXWORDLEN] int checkword(const char word) Check spelling of word . If spelling is correct, return 0. Otherwise, put up to 10 alternate spellings in alternates array. Return number of alternate spellings. cilk void spellcheck(const char wordarray, int num) if (num = 1) ....
.... running time $T_P(C, n)$. The computational work of blockedmul is $T_1(n) = \Theta(n^3)$, so the total work is $T_1(C, n) = T_1(n) + m F_1(C, n) = \Theta(n^3)$. The critical path is $T_\infty = \Theta(\lg^2 n)$, so using our performance model, the .... [Footnote 4:] In recent work, Blelloch, Gibbons, and Matias [10] have shown that series-parallel dag computations can be scheduled to achieve substantially better space bounds than we report here. For example, they give a bound of $S_P(n) = O(n^2 + P \lg^2 n)$ for matrix multiplication. Their improved space bounds come at the cost of substantially more ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1--12, Santa Barbara, California, July 1995.
....on an allocation of ready tasks to idle processors. Using it, several results have been obtained that prove that well-defined classes of parallel programs can be executed in asymptotically optimal time on theoretical machine models such as the PRAM or the local PRAM, including scheduling overheads [2, 5, 3]. Basically, list scheduling does not require much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for ....
....processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for certain classes of programs: strict computations [5], nested computations [2], or planar graphs [3]. Furthermore, in practice, because of the large ratio between local and remote memory access costs, a schedule can be improved significantly with some knowledge of the data flow graph corresponding to the execution [10]. For instance, schedules computed ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, 1995. ACM Press.
....$= \Theta(P (n/P^{1/3})^2) = \Theta(n^2 P^{1/3})$ [footnote 4]. The work and critical path length for matrixmul can also be computed using recurrences. The computational work $T_1(n)$ to multiply $n \times n$ matrices satisfies $T_1(n) = 8 T_1(n/2) + \Theta(n^2)$, since .... [Footnote 4:] In recent work, Blelloch, Gibbons, and Matias [6] have shown that series-parallel dag computations can be scheduled to achieve substantially better space bounds than we report here. For example, they give a bound of $S_P(n) = O(n^2 + P \lg^2 n)$ for matrix multiplication. Their improved space bounds come at the cost of substantially more ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....For applications where the amount of computation grows faster with problem size than communication, choosing a bigger problem size can reduce the relative impact of overheads such as communication latencies. Basically, we are applying Amdahl's law here, improving speedup by reducing critical path [8, 7]. In our situation, we could have increased the problem sizes to compensate for the slowness of the WANs. However, we have decided not to do so, since determining the impact of the WAN is precisely what we want to do. Thus, we believe and expect that the speedup figures that follow can be ....
G.E. Blelloch, P.B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. 7th ACM Symp. Par. Alg. and Arch. (SPAA), pages 1--12, July 1995.
....in the fast communication research such as active messages. Thus we expect that other software system researchers can also benefit from our results in using fast communication support to design software layers. Most previous research on scheduling [16, 19, 20] does not address memory issues. In [1], a dynamic scheduling algorithm for directed acyclic graphs is proposed with memory space usage $S_1/p + O(D)$ on each processor, where $S_1$ is the sequential space requirement, $p$ is the total number of processors, and $D$ is the depth of a DAG. This work provides a solid theoretical ground for ....
....at most $S_1$ space per processor. This paper assumes that each processor has a maximum space limit, and the goal is to bring the data space cost close to $S_1/p$ per processor in order to solve large-scale problems. The scheduling scheme we use is static in the run-time preprocessing stage, while [1] and [2] use dynamic scheduling. This is mainly because in practice it is difficult to minimize the run-time control overhead of dynamic scheduling in parallelizing sparse code with mixed granularities. It should be noted that there exist other space overheads, which include the space for the ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably Efficient Scheduling for Languages with Fine-Grained Parallelism. In Proceedings of 7th ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....applications, multiprocessors also support multiprogrammed workloads in which a mix of serial and parallel, interactive and batch applications may execute concurrently. A major factor in the performance of such workloads is the operation of the thread scheduler. Prior work on thread scheduling [4, 5, 8, 11, 12] has dealt exclusively with non-multiprogrammed environments in which a multithreaded computation executes on P dedicated processors. Such scheduling algorithms dynamically map threads onto the processors with the goal of achieving P-fold speedups. Though such algorithms will work in some ....
.... to page faults [6]. For these reasons, work stealing is practical and variants have been implemented in many systems [7, 16, 17, 21, 30, 34]. For general multithreaded computations, other scheduling algorithms have also been shown to be simultaneously efficient with respect to time and space [4, 5, 11, 12]. Of particular interest here is the idea of deriving parallel schedules from serial schedules [4, 5], which produces strong upper bounds on time and space. The practical application and possible adaptation of this idea to multiprogrammed environments is an open question. Prior work that has ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 1995.
....cost model as a DAG of depth 2 and breadth n. When coming to an array scan in the code, the implementation spawns n threads and places them in the set of active threads. Since creating n threads could take more than constant time on p processors, they are created lazily using a stub as described in [8]: threads are expanded when taken from S instead of when inserted. For each block of p or fewer threads that are scheduled from the set in a particular step, we can use the unit-time scan primitive assumed in the machine model to execute the scan across that subset and place the new running sum ....
....is through the future cells themselves and there is no specification in the algorithms of what happens on what step. This gives freedom to the implementation as to how to schedule the tasks. The implementation, for example, could optimize the schedule for either space efficiency [12, 8, 9] or locality [13]. On a uniprocessor the implementation could run the code in a purely sequential mode without any need for synchronization. We are not yet sure how general the approach is. We have not been able to show, for example, whether the method can be used to generate a sort that has ....
G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
....fork join or loop parallelism using nonpreemptive, stateless threads; it further reduces overheads by coarsening and pruning excess parallelism. Recent work has resulted in provably efficient scheduling techniques that provide upper bounds on the space required by the parallel computation [9, 11, 12, 13, 35]. Since there are several possible execution orders for lightweight threads in a computation with a high degree of parallelism, the provably space efficient schedulers restrict the execution order for the threads to bound the space requirement. For example, the Cilk multithreaded system [11] ....
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, California, July 17--19, 1995. ACM SIGACT/SIGARCH and EATCS.
....[Figure 1: The speedup obtained by three different over-relaxation algorithms.] ....nested-parallel computation on P processors is $O(T_1(C)/P + m \lceil m/s \rceil C T_\infty + (m + s) T_\infty)$, where $T_1(C)$ is the uniprocessor execution time of the computation including cache misses. As in previous work [6, 9], we represent a multithreaded computation as a directed, acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a ....
....time of the computation including cache misses. As in previous work [6, 9], we represent a multithreaded computation as a directed, acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a series-parallel dag [33]. Nested-parallel computations include computations consisting of parallel loops and fork and joins and any nesting of them. This class includes most computations that can be expressed in Cilk [8] and all ....
[Article contains additional citation context not shown here]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1--12, Santa Barbara, California, July 1995.
....a total of w operations (work) and has a critical path length (depth) of d can be implemented to run in $O(w/p + d)$ time, which is within a constant factor of optimal. These results were used to bound the time and space used by the Cilk programming language [7]. Blelloch, Gibbons and Matias [4] showed that for nested computations, the time bounds can be maintained while bounding the space by $s_1 + O(pd)$, which for sufficient parallelism is just an additive factor over the sequential space. This was used to bound the space of the NESL programming language [5]. Narlikar and Blelloch [30] ....
.... In addition, we show that if the dag is planar, or close to it, then the algorithm executes the computation in $s_1 + O(pd \log p)$ space and $O(w/p + d \log p)$ time, independent of the number of synchronizations. Planar dags are a more general class of dags than the computation dags considered in [8, 9, 4]. Previously, no space bounds were known for computations with synchronization variables, even in the case where the dags are planar. As with previous work [4, 29], the idea behind the implementation is to schedule the threads in an order that is as close as possible to the sequential order (while ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. Symposium on Parallel Algorithms and Architectures, pages 1--12, July 1995.
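[Editorial worked example] Several of the contexts above compare the work-stealing space bound with the depth-first space bound. As a rough illustration only, the two can be instantiated for the matrix multiplication case mentioned in the Cilk contexts. The symbols $S_1$, $D$, and $p$ are used as in those contexts; the value $D = \Theta(\lg^2 n)$ is the critical path quoted there, while $S_1 = \Theta(n^2)$ is an assumption made here for the sake of the example and is not stated in any quoted fragment.

```latex
\begin{align*}
\text{work stealing:} \quad & S_p \le p \cdot S_1 = O(p\, n^2) \\
\text{depth-first:}   \quad & S_p \le S_1 + O(p \cdot D) = O(n^2 + p \lg^2 n)
\end{align*}
```

Under that assumption the depth-first expression agrees with the $O(n^2 + P \lg^2 n)$ bound quoted for matrix multiplication, and it is asymptotically smaller than the work-stealing expression whenever $\lg^2 n$ grows more slowly than $n^2$, which is the "low depth" situation the contexts describe.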