Robert D. Blumofe and Charles E. Leiserson. 1993. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing (STOC '93), pages 362--371, San Diego, CA, USA, May. Also in SIAM Journal on Computing.
....on an allocation of ready tasks to idle processors. Using it, several results have been obtained that prove that well-defined classes of parallel programs can be executed in asymptotically optimal time on theoretical machine models such as the PRAM or the local PRAM, including scheduling overheads [2, 5, 3]. Basically, list scheduling does not require much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for ....
....much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for certain classes of programs: strict computations [5], nested computations [2], or planar graphs [3]. Furthermore, in practice, due to the magnitude of the ratio between local and remote memory access costs, some significant improvement can be brought to a schedule by some knowledge about the data flow graph corresponding to the execution [10]. Exploiting ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations . SIAM Journal on Computing, 27(1):202--229, 1998.
....pipeline depends on the input data, such as the treap algorithms we describe. This would be considerably more difficult to do by hand and we know of no previous PRAM algorithms with dynamic pipelines. 2. The Model. As with the work of Blumofe and Leiserson [12, 13], we model a computation as a set of threads and the cost as the size of the computation DAG. Threads can fork new threads using a future, and can synchronize by requesting a value written by another thread. A computation begins with a single thread and completes when all threads have ....
....constant time, the whole step takes constant time. Since, on each step, the implementation processes min(|S|, p) threads, and S holds all the active threads (by definition), the implementation executes a greedy schedule of the computation DAG. The number of steps is therefore bounded by w/p + d [12] and the total time by O(w/p + d). Note that for the time bounds it does not matter which threads are taken from S on each step, allowing the implementation some freedom in selecting a schedule that is space or communication efficient. The stack discipline we describe above, however, is probably ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 362--371, May 1993.
....Lab. de Recherche en Informatique, 91405 Orsay cedex France, fci,pierre lri.fr) y Dept. of Computing, Sydney, NSW 2109, Australia, bmans ics.mq.edu.au) z Dept. of Computer Science, Amherst, MA 01003, USA, rsnbrg cs.umass.edu) gorithm designer from the details of specific architectures; cf. [12, 14, 23, 27]. The further difficulties caused by NOWs' asynchrony and loose coupling have led to yet more insulating models; cf. [13, 15] for dedicated NOWs and [4, 11] for borrowed NOWs. NOWs' algorithmic intransigence increases as NOWs lose their homogeneity [5, 25] and/or communication flatness [9, 10, ....
....essential ingredients in many parallel algorithms, while demonstrating the algorithmic tractability of HiHCoHP. 3.1. Collective Communication in Hyperclusters. (Near-)optimal scheduling algorithms exist for many cluster computing applications that require only point-to-point communication; cf. [2, 11, 12, 24, 25]. In contrast, applications that require collective communication, e.g. for cluster synchronization [16, 20], generally use heuristics whose quality is known only for special computations on specific clusters. We now develop a broadcast reduction algorithm whose efficiency and optimality ....
R. Blumofe and C.E. Leiserson (1998): Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 202--229.
....itself spawns threads, and local registers of the parent thread may be needed as global registers. It is beyond the scope of the current paper to describe this in greater detail. Consider the issues addressed by the left to right depth first policy for initiation and suspension of threads, due to [BL93]; recall that those issues include generating sufficient parallelism out of a parallel program while keeping the memory requirements under control; these issues are suppressed in the current paper. It will be interesting to see the extent to which recent work on scheduling of parallelism for ....
R.D. Blumofe and C.E. Leiserson. Space-efficient scheduling of multi-threaded computations. In Proc. 25th ACM-STOC, 362--371, 1993.
....we derive our scheduling guidelines in Section 3. We illustrate the application of the guidelines in a variety of scenarios in Section 4 and end with open problems in Section 5. Other noteworthy studies of scheduling algorithms for NOWs, which differ from ours in focus or objectives, appear in [1, 2, 4, 5, 6]. Of these, only [2] deals with the present adversarial scenario of stealing cycles; its main contribution is a randomized strategy that, with high probability, steals cycles within a logarithmic factor of optimally. We do not list the many empirical studies of computation on NOWs whose main foci ....
R. Blumofe and C.E. Leiserson. Space-efficient scheduling of multithreaded computations. 25th ACM Symp. on Theory of Computing, pages 362--371, 1993.
....processors steal work from other processors. The work stealing paradigm dates back at least as far as Burton and Sleep's research [11] on parallel execution of functional programs and Halstead's implementation of Multilisp [18]. Since then a lot of work has been done in this direction (see e.g. [1, 4, 5, 6, 7, 8, 15]). Three significant performance parameters of any scheduling algorithm for multithreaded computations are the required space, the execution time, and the communication cost incurred. The execution time is the total time needed by the algorithm to execute the instructions of all threads ....
....desirable in several cases. For all these reasons, strictification is not considered an efficient solution. It is thus interesting to discover other techniques or algorithms that schedule, in a provably efficient way, computations more general than fully strict multithreaded computations. Blumofe and Leiserson [7] have proved that there exists no scheduling algorithm for general multithreaded computations that achieves both linear speedup and linear expansion of memory. They have presented a multithreaded computation for which every algorithm cannot achieve even a factor of two speedup without avoiding ....
R. D. Blumofe and C. E. Leiserson, "Space-Efficient Scheduling of Multithreaded Computations," SIAM Journal on Computing, Vol. 27, No. 1, pp. 202--229, February 1998.
....for fine grain computations [5] and at most at a factor two from the optimal in the general case [18]. (Footnote 3: At most at the cost of cache miss or false sharing due to a bad organization of data structure. Footnote 4: Note that, at a finer grain, this should be considered for programs on SMP architecture in order to efficiently reuse data in cache memory.) To decrease concurrency on the queue of tasks, it is often implemented in a distributed way following a work stealing scheme. Each processor manages its own queue and attempts to steal work from other processors only when it ....
.... appeared useful to efficiently solve sparse linear systems where the matrix requires a huge amount of memory [12]. The main problem encountered in the use of such a scheduling is related to memory space exhaustion [1, 4]. This exhaustion cannot be avoided for a general multithreading model [5] where synchronization operators, such as semaphores, are unpredictable and forbid a serial execution. However, if a correct serial execution order is a priori known, managing a priority queue according to this order makes it possible to bound the memory space with respect to the one required by the serial ....
[Article contains additional citation context not shown here]
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202--229, 1998.
....the number of tree nodes stored in the message queues may be very large. The breadth first nature of the protocol maintains a frontier of the tree within the message queues of all processors, and the size of the frontier may be much larger than the number of queues. Leiserson and Blumofe [12] proposed a new search procedure in which each processor performs a depth first search so that space requirement is significantly reduced. Our reactive protocol for the atomic message model will be able to reduce queue length requirements if it can adopt Leiserson and Blumofe's space efficient ....
R. Blumofe and C. Leiserson. Space-efficient scheduling of multi-threaded computations. In 25th Annual ACM Symposium on Theory of Computing, 1993.
....on an allocation of ready tasks to idle processors. Using it, several results have been obtained that prove that well-defined classes of parallel programs can be executed in asymptotically optimal time on theoretical machine models such as the PRAM or the local PRAM, including scheduling overheads [2, 5, 3]. Basically, list scheduling does not require much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for ....
....much information about processes in the application, although some knowledge may prove useful. For instance, a total sequential ordering of tasks makes it possible to bound the amount of space required for the computation while achieving linear speedup for certain classes of programs: strict computations [5], nested computations [2], or planar graphs [3]. Furthermore, in practice, due to the magnitude of the ratio between local and remote memory access costs, some significant improvement can be brought to a schedule by some knowledge about the data flow graph corresponding to the execution [10]. For ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202--229, 1998.
....required by the parallel execution to the space s1 required by the sequential execution. Burton [11] first showed that for a certain class of computations the space required by a parallel implementation on p processors can be bounded by p · s1 (s1 space per processor). Blumofe and Leiserson [8, 9] then showed that this space bound can be maintained while also achieving good time bounds. They showed that a fully strict computation that executes a total of w operations (work) and has a critical path length (depth) of d can be implemented to run in O(w/p + d) time, which is within a constant ....
..... In addition, we show that if the dag is planar, or close to it, then the algorithm executes the computation in s1 + O(pd log p) space and O(w/p + d log p) time, independent of the number of synchronizations. Planar dags are a more general class of dags than the computation dags considered in [8, 9, 4]. Previously, no space bounds were known for computations with synchronization variables, even in the case where the dags are planar. As with previous work [4, 29], the idea behind the implementation is to schedule the threads in an order that is as close as possible to the sequential order (while ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. Symposium on Theory of Computing, pages 362--371, May 1993.
....architecture. One of the most commonly used techniques is greedy: when a node becomes idle, it picks some ready tasks, if any, from a list that may be distributed among nodes. On uniform memory architectures, this technique leads to provable performance for executions that have a small critical path [10, 5]. Various practical implementations on such architectures have given experimentally good performance for a wide range of applications [14, 5]. A low level scheduling, on any particular node, overlaps part of system overheads by effective computation. Here, the most commonly used technique is ....
....be distributed among nodes. On uniform memory architectures, this technique leads to provable performance for executions that have a small critical path [10, 5]. Various practical implementations on such architectures have given experimentally good performance for a wide range of applications [14, 5]. A low level scheduling, on any particular node, overlaps part of system overheads by effective computation. Here, the most commonly used technique is multithreading. In theoretical works [13, 11], this technique, which is often referred to as parallel slackness [13], makes it possible to provide ....
[Article contains additional citation context not shown here]
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202--229, 1998.
....path lower bound. (5.2) Brent showed [Bre74, Lemma 2] that for every dataflow graph there is a schedule that executes the graph in no more than the sum of the linear speedup term and the critical path term. That is, there are schedules such that T_P ≤ W/P + C. (5.3) Blumofe and Leiserson [BL93] show that any greedy schedule that has no overheads achieves Brent's bounds. The work W and the critical path length C can be combined to give the average available parallelism of a program. If you know W and C, you can produce lower bounds on the time to run on P processors. If C is as large as ....
....wish to achieve good time performance, but we must not use too much memory. For example, a breadth first greedy search of the game tree is one of the schedules that achieves Brent's bound, but the amount of memory needed is nearly proportional to the amount of total work. Blumofe and Leiserson [BL93], inspired by memory exhaustion problems I had in early versions of StarTech, showed that for certain kinds of dataflow graphs, such as the dataflow graph for Jamboree search, there are easy to find schedules that not only achieve Brent's time bounds, but also use no more memory per processor than ....
[Article contains additional citation context not shown here]
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 362--371, San Diego, California, May 1993.
....a total ordering of tasks that results in a correct sequential execution) may be used. In this way, list scheduling leads to parallel computations achieving a linear speedup while requiring a space related to that of the sequential execution for certain classes of programs: strict computations [4], nested computations [2], or planar graphs [3]. Furthermore, in practice, due to the magnitude of the ratio between local and remote memory access costs, some significant improvement can be brought to a schedule by some knowledge about the data flow corresponding to the execution [8]. Some ....
....schedules: execution of a closure is then delayed until its parent (the task that has created it) resumes. From this reference order, costs (time, space, depth, communications) are defined directly on the code itself by related recurrence equations (max, 10, 1] Here, following notations in [1, 4], those costs are defined on the trace of the execution. This trace can be represented by a bipartite DAG G (see Fig. 4), with node sets corresponding respectively to tasks (oval nodes in Fig. 4) and shared data versions (box nodes in Fig. 4). Each task node is weighted by its computation cost, ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202-229, 1998.
....memory. Consider the issues addressed by a left to right depth first policy for initiation and suspension of threads; recall that those important issues include generating sufficient parallelism out of a parallel program while keeping the memory requirements under control, as per papers such as [BL93] and [BGMN]; these issues are suppressed in the current paper. 5. HOW EXECUTION IS MEASURED. For a preliminary proof of concept of the model, we considered a few significant, and challenging, test problems, whose algorithms are so called irregular. The more the flow of control of an algorithm ....
R.D. Blumofe and C.E. Leiserson. Space-efficient scheduling of multi-threaded computations. Proc. 25th STOC, 362--371, 1993.
....Early attempts to reduce the memory usage of parallel computations were based on heuristics that limited the parallelism [10, 15, 28, 32] and are not guaranteed to be space efficient in general. These were followed by scheduling techniques that provide proven space bounds for parallel programs [6, 8, 9]. If S1 is the space required by the serial execution, these techniques generate executions for a multithreaded computation on p processors that require no more than p · S1 space. A scheduling algorithm that significantly improved this bound was recently proposed [2] and was used to prove ....
....the end of the iteration. Assuming that F(B,i,j) does not allocate any space, the serial execution requires O(n) space, since the space for array B is reused for each i iteration. Now consider the parallel implementation of this function on p processors, where p n. Previous scheduling systems [6, 8, 9, 19, 23, 28, 32], which include both heuristic based and provably space efficient techniques, would schedule the outer level of parallelism first. This results in all the p processors executing one i iteration each, and hence the total space allocated is O(p · n). Our scheduling algorithm also starts by ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symp. on Theory of Computing, pages 362--371, May 1993.
....Early attempts to reduce the memory usage of parallel computations were based on heuristics that limited the parallelism [10, 16, 32, 36] and are not guaranteed to be space efficient in general. These were followed by scheduling techniques that provide proven space bounds for parallel programs [6, 7, 8, 9]. If S1 is the space required by the serial execution, these techniques generate schedules for a multithreaded computation on p processors that require no more than p · S1 space. These ideas are used in the implementation of the Cilk programming language [5]. A recent scheduling algorithm ....
....the time required to execute the schedule for a parallel computation in terms of its work (total number of operations) and depth. For a parallel computation with D depth and W work, if the total space allocated is O(W) then the generated schedule runs in O(W/p + D) time, making it time efficient [6]. We note that a bigger K leads to a lower running time since it reduces scheduling costs, but also results in a larger space bound. The K parameter therefore provides a trade off between the running time and the memory requirement of a parallel computation. For the benchmarks used in our ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symp. on Theory of Computing, pages 362--371, May 1993.
....based on heuristics that limited the parallelism [Burton and Sleep 1981; Culler and Arvind 1988; Halstead 1985; Rugguero and Sargeant 1987] and are not guaranteed to be space efficient in general. These were followed by scheduling techniques that provide proven space bounds for parallel programs [Blumofe and Leiserson 1993; 1994; Burton 1988; Burton and Simpson 1994]. If S1 is the space required by the serial execution, these techniques generate schedules for a multithreaded computation on p processors that require no more than p · S1 space. These ideas are used in the implementation of the Cilk programming ....
....allocate any space, the serial execution requires O(n) space, since the space for array B is reused for each i iteration. Now consider the parallel implementation of this function on p processors, where p n. Previous scheduling systems [Blumofe and Leiserson 1993; Burton 1988; Burton and Simpson 1994; Chow and W. L. Harrison III 1990; Goldstein et al. 1995; Hummel and Schonberg 1991; Halstead 1985; Rugguero and Sargeant 1987], which include both heuristic based and provably space efficient techniques, would schedule the outer level of parallelism first. ....
[Article contains additional citation context not shown here]
Blumofe, R. D. and Leiserson, C. E. 1993. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symp. on Theory of Computing, pp. 362--371.
....universal implementations that guarantee performance bounds, both in terms of time and space. These are specified by placing upper bounds on the running time and the space of the implementation as a function of the work, depth and sequential space. As with the work of Blumofe and Leiserson [BL93, BL94] we formalize the notion of work, depth and space, by modeling computations as directed acyclic graphs (dags) that may unfold dynamically as the computation proceeds. The nodes in the dag represent unit work tasks, and the edges represent any ordering dependencies between the tasks that ....
....space for program variables and for task bookkeeping. Thus for programs with sufficient parallelism (i.e. S1/p ≫ D, recalling that S1 is at least the size of the input) this is within a factor of 1 + o(1) of the sequential space. Previously, the best known bound was S1 · p [Bur88, BL93, BL94, BS94], a factor of p from the sequential space. These bounds apply when individual tasks allocate at most a constant amount of memory. When unit work tasks can allocate memory in arbitrary amounts, the same space bound can be obtained using at most (W + S1)/p + D steps, and in general we ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symp. on Theory of Computing, pages 362--371, May 1993.
....may yield different space and time requirements for the computation. It can be shown that for general dags, no good scheduling policy exists, in the sense that a dag can be constructed for which any schedule that provides linear speedup also requires vastly more than linear expansion of space [6]. Fortunately, every Cilk program generates a dag that can be scheduled efficiently, because the dag is fully strict [7] meaning that the only dependency edges that leave a subtree go from a thread in the root of the subtree to a thread in its parent. The dag generated by a Cilk program is ....
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 362--371, San Diego, California, May 1993.
....thread is a ready thread. It is obvious from the preceding, that every multithreaded computation can be represented by a directed acyclic graph (DAG) of bounded degree. In such a DAG, every node is a task. The edges of the DAG are spawn, continue and data dependency edges. Leiserson and Blumofe [2] have proved that general multithreaded computations (that is with arbitrary data dependencies) are impossible to be scheduled efficiently. In this paper k strict multithreaded computations are studied, in which the kind of dependencies are restricted. A strict multithreaded computation is one in ....
....denotes the dag depth of a computation, since even with arbitrarily many processors, each task on a path must execute serially. It is T_P ≥ T_∞, since the tasks along any path must be executed in a serial order. Brent [3] and Graham [4, 5] proved the bound T_P ≤ T_1/P + T_∞. Leiserson and Blumofe [2] proved the following theorem for greedy schedules. Greedy schedules are those in which at each step of the execution, if at least P tasks are ready, then P tasks execute and if fewer than P tasks are ready, then all execute. Theorem 1. For any multithreaded computation with work T_1 and dag depth ....
Robert D. Blumofe and Charles E. Leiserson, "Space-Efficient Scheduling of Multithreaded Computations," in Proc. of the 25th Ann. ACM Symposium on the Theory of Computing (STOC '93), pp. 362--371, San Diego, California, May 1993.
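The greedy-schedule definition quoted in this context can be made concrete with a small sketch. It is only an illustration under stated assumptions, not code from any of the cited papers: tasks are unit work, the dag is a plain dictionary from each task to its successors, and the names `greedy_schedule`, `dag_depth`, and `fork_join` are hypothetical.

```python
from collections import deque


def greedy_schedule(dag, p):
    """Count the steps of a greedy schedule of unit-work tasks on p processors.

    dag maps every task to the list of its successors (hypothetical format).
    On each step, min(p, number of ready tasks) tasks are executed.
    """
    indegree = {v: 0 for v in dag}
    for successors in dag.values():
        for s in successors:
            indegree[s] += 1

    ready = deque(v for v in dag if indegree[v] == 0)
    steps = 0
    while ready:
        # Greedy rule: run P tasks if at least P are ready, otherwise run all.
        batch = [ready.popleft() for _ in range(min(p, len(ready)))]
        steps += 1
        for v in batch:
            for s in dag[v]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
    return steps


def dag_depth(dag):
    """Number of tasks on a longest path of the dag."""
    memo = {}

    def longest_from(v):
        if v not in memo:
            memo[v] = 1 + max((longest_from(s) for s in dag[v]), default=0)
        return memo[v]

    return max(longest_from(v) for v in dag)


# A small fork-join dag: one root spawns four tasks that all feed a join.
fork_join = {
    "root": ["a", "b", "c", "d"],
    "a": ["join"], "b": ["join"], "c": ["join"], "d": ["join"],
    "join": [],
}

work = len(fork_join)          # total number of unit tasks
depth = dag_depth(fork_join)   # length of the longest path
for p in (1, 2, 4):
    steps = greedy_schedule(fork_join, p)
    # The quoted bound: steps is at most work / p + depth.
    assert steps <= work / p + depth
```

On this example the work is 6 and the depth is 3, so with two processors the greedy schedule takes 4 steps, inside the bound 6/2 + 3 = 6. Which ready tasks are picked on each step does not affect the bound, so the FIFO queue here is an arbitrary choice.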
....may yield different space and time requirements for the computation. It can be shown that for general dags, no good scheduling policy exists, in the sense that a dag can be constructed for which any schedule that provides linear speedup also requires vastly more than linear expansion of space [7]. Fortunately, every Cilk program generates a dag that can be scheduled efficiently, because the dag is fully strict [8] meaning that the only dependency edges that leave a subtree go from a thread in the root of the subtree to a thread in its parent. The dag generated by a Cilk program is ....
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 362--371, San Diego, California, May 1993.
....Note that the standard depth first sequential schedule of this graph uses only Θ(n^2) space, counting the space for the input and output matrices. used [BS81, Hal85, RS87, CA88, JP92] but these are not guaranteed to be space efficient in general. There have been several recent works [BL93, BL94, BS94, Bur96] presenting scheduling algorithms with guaranteed performance bounds, both in terms of time and space. Blumofe and Leiserson [BL93, BL94] consider the class of fully strict computations, and show that a computation with w work and d depth that requires s1 space when executed ....
....matrices. used [BS81, Hal85, RS87, CA88, JP92] but these are not guaranteed to be space efficient in general. There have been several recent works [BL93, BL94, BS94, Bur96] presenting scheduling algorithms with guaranteed performance bounds, both in terms of time and space. Blumofe and Leiserson [BL93, BL94] consider the class of fully strict computations, and show that a computation with w work and d depth that requires s1 space when executed using a (standard) depth first sequential schedule can be implemented in O(w/p + d) time and s1 · p space on p processors. Similar space bounds ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symp. on Theory of Computing, pages 362--371, May 1993.
..... If neither memory space nor communication overheads are taken into account, such a schedule leads to provable performance [10]. However, in practice, due to the magnitude of the ratio between local and remote memory access costs on a distributed architecture [11] and memory space exhaustion [6], some significant improvements can be brought to such a schedule by some knowledge about the data flow graph corresponding to the execution. For instance, DSC [20] and Metis [14] make it possible to compute efficient schedules on distributed architectures for various numerical applications from the ....
....related serial program. So, the use of data access specification makes it possible to determine at run time the data flow between tasks in order to guide the parallelization. Furthermore, the validity of a non preemptive sequential schedule makes it possible to bound the memory space required for a parallel execution [6]. In the Cilk language [5], parallelism is described by spawning a procedure call. Synchronization is explicit: the execution of instructions following a sync statement is delayed until all previously spawned calls according to the sequential depth first order are completed. Then a ....
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202-229, 1998.
....may yield different space and time requirements for the computation. It can be shown that for general dags, no good scheduling policy exists, in the sense that a dag can be constructed for which any schedule that provides linear speedup also requires vastly more than linear expansion of space [6]. Fortunately, every Cilk program generates a dag that can be scheduled efficiently, because the dag is fully strict [7] meaning that the only dependency edges that leave a subtree go from a thread in the root of the subtree to a thread in its parent. The Cilk runtime system implements a ....
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 362--371, San Diego, California, May 1993.
....and the system implementer. Such cost models can be used to predict (or bound) execution times, memory utilization, or communication requirements for parallel programs running on real platforms. Other programming approaches that emphasize the use of a cost model include the NESL [9] and Cilk [10, 11] languages, which utilize a work-depth cost model, and the Split-C language [20], which utilizes the LogP cost model [21]. In comparison to other approaches that provide shared memory communication in a synchronous environment, BOS does not require preallocation for each shared data structure by ....
R. D. Blumofe and C. E. Leiserson, "Space-efficient scheduling of multithreaded computations," SIAM Journal on Computing, vol. 27, no. 1, pp. 202--229, Feb. 1998.
....users arrive and receive submachines, the various actual processors of the multiprocessor may find themselves managing quite disparate numbers of threads. The more heavily loaded processors are thus burdened by the nontrivial and nonproductive overhead of managing many threads as shown in [4, 5]. One avenue to alleviating this situation is to allow the processorallocation algorithm to reallocate users tasks so as to balance the numbers of threads across the machine s processors. This solution does not come without cost: process reallocation can require extensive communication cost ....
R. Blumofe and C.E Leiserson (1993): Space-efficient scheduling of multithreaded computations. 25th ACM Symp. on Theory of Computing, 362-371.
....Ts(p) factor in the mapping of our model to the PRAM comes from the allocation of tasks to processors and not by the pipelining itself; in the PRAM the processor allocation needs to be done by the user and often requires significant effort. 2 The Model As with the work of Blumofe and Leiserson [12, 13] we model a computation as a set of threads and the cost as a directed acyclic graph (DAG) Threads can fork new threads using a future, and can synchronize by requesting a value written by another thread. A computation begins with a single thread and completes when all threads have terminated. A ....
....(by definition) the whole step takes constant time. Since, on each step, the implementation processes min{|S|, p} actions, and S holds all the ready actions (by definition), the implementation will execute a greedy schedule of the computation DAG. The number of steps is therefore bounded by w/p + d [12] and the total time by O(w/p + d). We now outline how to handle the array split operation used in the 2-6 trees. We first consider implementing a simpler array scan which, given an array of integers of length n, returns the plus scan of the array in O(n) work and O(1) depth (remember that n could ....
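The greedy-schedule bound quoted in the excerpt above can be checked on a small example. The following is only a minimal sketch in Python, not code from any of the cited papers: the function names, the dictionary encoding of the DAG, and the example graph are all assumptions made for illustration. Each action takes unit time, w is the number of actions, d is the number of actions on the longest path, and each step executes min{|S|, p} ready actions.

```python
from collections import deque


def greedy_schedule_steps(dag, p):
    """Count the steps of a greedy schedule on p processors.

    dag maps each action to the list of its successors (hypothetical
    encoding); every action is assumed to take one unit of time.
    """
    indegree = {v: 0 for v in dag}
    for successors in dag.values():
        for s in successors:
            indegree[s] += 1

    ready = deque(v for v in dag if indegree[v] == 0)  # the set S
    steps = 0
    while ready:
        # Execute min(|S|, p) ready actions in this step.
        batch = [ready.popleft() for _ in range(min(len(ready), p))]
        for v in batch:
            for s in dag[v]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
        steps += 1
    return steps


def dag_depth(dag):
    """Number of actions on the longest path of the DAG."""
    memo = {}

    def longest_from(v):
        if v not in memo:
            memo[v] = 1 + max((longest_from(s) for s in dag[v]), default=0)
        return memo[v]

    return max(longest_from(v) for v in dag)


# Illustrative DAG: one fork into three actions, then a join.
example = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"], "e": []}
w, d = len(example), dag_depth(example)
for p in (1, 2, 3):
    assert greedy_schedule_steps(example, p) <= w / p + d
```

The bound holds because every step either runs p actions (at most w/p such steps) or runs every ready action, which shortens the longest remaining path by one (at most d such steps).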
[Article contains additional citation context not shown here]
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. ACM Symposium on the Theory of Computing, pages 362--371, May 1993.
....in the context of both garbage collection (e.g. [25]) and copy avoidance (e.g. [20]). None of this work, however, has considered the extra reachable space required by a parallel evaluation. There have been a sequence of studies that place space bounds on implementations of parallel languages [11, 10, 12, 2]. For a shared memory model, which is required to efficiently simulate the calculus because of shared pointers, the best results are those by Blelloch, Gibbons, and Matias [2], which are the results we use in this paper. Provable time bounds for mapping nested data parallel languages onto the ....
R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symp. on Theory of Computing, pages 362--371, May 1993.
....A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-size instructions. A processor takes one unit of time to execute one instruction. In the example computation of Figure 2.1, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, .... [Footnote: ....Science and was first published in [13].] [Figure 2.1: A multithreaded computation. This computation contains 20 instructions v1, v2, ..., v20 and 6 threads G1, G2, ..., G6.]
....35, 36] techniques, they were able to eliminate the useless parallelism with only a small decrease in the average parallelism. Their applications had only small amounts of useless parallelism. [Footnote: Some of the research reported in this chapter is joint work with Charles Leiserson of MIT's Laboratory for Computer Science and was first published in [13] and [14].] In this section we show that multithreaded computations may contain vast quantities of provably useless parallelism. In particular, we show that there exist depth-first multithreaded computations with large amounts of average ....
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 362--371, San Diego, California, May 1993.
....Cilk's runtime system takes care of details like load balancing and communication protocols. Unlike other multithreaded languages, however, Cilk is algorithmic in that the runtime system's scheduler guarantees provably efficient and predictable performance. Cilk grew out of theoretical work [1, 5, 6] on the scheduling of multithreaded computations. The basis of Cilk is a provably good scheduling algorithm that has been the cornerstone of Cilk system development. Cilk's provably good scheduler engendered a performance model that accurately predicts the efficiency of a Cilk program using two ....
....yield different space and time requirements for the computation. It can be shown that for general multithreaded dags, no good scheduling policy exists. That is, a dag can be constructed for which any schedule that provides linear speedup also requires vastly more than linear expansion of space [5]. Fortunately, every Cilk program generates a well-structured dag that can be scheduled efficiently [6]. The Cilk runtime system implements a provably efficient scheduling policy based on randomized work stealing. During the execution of a Cilk program, when a processor runs out of work, it asks ....
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 362--371, San Diego, California, May 1993.
....actually exploit, and since each living thread requires the use of a certain amount of memory, such schedulers can easily overrun the .... [Footnote: This research was supported in part by the Defense Advanced Research Projects Agency under Grant N00014-91-J-1698. An extended abstract of this paper appeared as [7].] [Footnote: Department of Computer Sciences, the University of Texas at Austin, Austin, Texas, 78712-1188 (rdb cs.utexas.edu). This research was conducted at the MIT Laboratory for Computer Science with additional support from a National Science Foundation Graduate Fellowship.] [Footnote: MIT Laboratory for ....]
....queues, one per processor. By making some generous modeling assumptions, we have been able to analyze this algorithm and to obtain similar bounds to those for Algorithm LDF. We are currently working on improving these results. Appendix. During the time between our results becoming publicly known [7] and this journal publication, we have explored multithreaded computing more fully. We have been able to characterize the performance of a distributed thread stealing algorithm [5, 8]. For the class of fully strict (well structured) computations, this ....
R. D. Blumofe and C. E. Leiserson, Space-efficient scheduling of multithreaded computations, in Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), San Diego, California, May 1993, pp. 362--371.
No context found.
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing (STOC '93), pages 362--371, San Diego, California, May 1993.
....may yield different space and time requirements for the computation. It can be shown that for general multithreaded dags, no good scheduling policy exists. That is, a dag can be constructed for which any schedule that provides linear speedup also requires vastly more than linear expansion of space [4]. Fortunately, every Cilk program .... [Footnote 1: Technically, procedure instances.] [Figure 3: The Cilk model of multithreaded computation. Each procedure, shown as a rounded rectangle, is broken into sequences of threads, shown as circles. A downward edge indicates the spawning of a subprocedure. A ....]
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 362--371, San Diego, California, May 1993.
....by an adversary. Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. The bounds on space and time are better than previous bounds for work-sharing schedulers [3], and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the (general) strict computations studied in [3]. Moreover, we are also able to provide a bound on the communication of fully ....
....on space and time are better than previous bounds for work-sharing schedulers [3], and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the (general) strict computations studied in [3]. Moreover, we are also able to provide a bound on the communication of fully strict computations which is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung [25] for communication in parallel divide and conquer. In contrast, work-sharing schedulers have near ....
[Article contains additional citation context not shown here]
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty Fifth Annual ACM Symposium on Theory of Computing, pages 362--371, San Diego, California, May 1993.
No context found.
Robert D. Blumofe and Charles E. Leiserson. 1993. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing (STOC '93), pages 362--371, San Diego, CA, USA, May. Also in SIAM Journal on Computing.
No context found.
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing (STOC '93), pages 362--371, San Diego, California, May 1993.