Results 1 - 10 of 213
Scheduling Multithreaded Computations by Work Stealing
, 1994
Cited by 572 (43 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time $T_P$ to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_P = O(T_1/P + T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space $S_P$ required by the execution satisfies $S_P \le S_1 P$. We also show that the expected total communication of the algorithm is at most $O(T_\infty S_{\max} P)$, where $S_{\max}$ is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
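The scheduling idea in the abstract above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's scheduler: every task takes one time step, the computation is a full binary tree of tasks, and the names `simulate`, `depth`, and `workers` are hypothetical. Each worker takes the newest task from its own deque; an idle worker tries to steal the oldest task from a randomly chosen victim.

```python
import random
from collections import deque


def simulate(depth, workers, seed=0):
    """Toy work-stealing run over a full binary tree of unit-time tasks.

    A task of height h spawns two tasks of height h - 1 (height 0 spawns none).
    Returns (tasks_executed, time_steps).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(workers)]
    deques[0].append(depth)  # the root task starts on worker 0
    executed = 0
    steps = 0
    while any(deques):
        steps += 1
        picked = []
        # Phase 1: every worker tries to obtain one task for this step.
        for w in range(workers):
            if deques[w]:
                picked.append((w, deques[w].pop()))  # owner takes its newest task
            else:
                victim = rng.randrange(workers)  # random victim (may be itself)
                if deques[victim]:
                    picked.append((w, deques[victim].popleft()))  # thief takes oldest
        # Phase 2: run the picked tasks; children become available next step.
        for w, height in picked:
            executed += 1
            if height > 0:
                deques[w].extend((height - 1, height - 1))
    return executed, steps


if __name__ == "__main__":
    depth = 10
    t1 = 2 ** (depth + 1) - 1  # total work: number of tasks in the tree
    t_inf = depth + 1          # critical path: tasks on the longest chain
    for p in (1, 2, 4, 8):
        _, steps = simulate(depth, p)
        print(p, steps, t1 / p + t_inf)
```

The demo prints, for each worker count, the simulated number of steps next to the quantity $T_1/P + T_\infty$, which is the shape of the bound stated in the abstract; the constant hidden in the $O(\cdot)$ is not modeled here.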
The Implementation of the Cilk-5 Multithreaded Language
 In PLDI ’98: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
, 1998
Cited by 493 (30 self)
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
Obstruction-Free Synchronization: Double-Ended Queues as an Example
 In preparation
, 2003
Cited by 215 (18 self)
We introduce obstruction-freedom, a new nonblocking property for shared data structure implementations. This property is strong enough to avoid the problems associated with locks, but it is weaker than previous nonblocking properties (specifically lock-freedom and wait-freedom), allowing greater flexibility in the design of efficient implementations. Obstruction-freedom admits substantially simpler implementations, and we believe that in practice it provides the benefits of wait-free and lock-free implementations. To illustrate the benefits of obstruction-freedom, we present two obstruction-free CAS-based implementations of double-ended queues (deques); the first is implemented on a linear array, the second on a circular array. To our knowledge, all previous nonblocking deque implementations are based on unrealistic assumptions about hardware support for synchronization, have restricted functionality, or have operations that interfere with operations at the opposite end of the deque even when the deque has many elements in it. Our obstruction-free implementations have none of these drawbacks, and thus suggest that it is much easier to design obstruction-free implementations than lock-free and wait-free ones. We also briefly discuss other obstruction-free data structures and operations that we have implemented.
The Power of Two Random Choices: A Survey of Techniques and Results
 in Handbook of Randomized Computing
, 2000
Cited by 139 (6 self)
To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that $n$ balls are thrown into $n$ bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately $\log n / \log\log n$ with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of $d \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is $\log\log n / \log d + \Theta(1)$ with high probability [ABKU99]. The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e.,...
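The balls-and-bins process described in the abstract above is easy to try out empirically. The following is a minimal sketch, assuming a fixed random seed and a hypothetical helper named `max_load`; it only measures one random run and proves nothing about the asymptotic bounds.

```python
import random


def max_load(n, d, seed=0):
    """Throw n balls into n bins sequentially.

    Each ball samples d bins independently and uniformly at random
    (with replacement) and goes into the least loaded of them.
    Returns the largest number of balls in any bin.
    """
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=bins.__getitem__)  # least loaded candidate
        bins[target] += 1
    return max(bins)


if __name__ == "__main__":
    n = 20_000
    for d in (1, 2, 3):
        print(d, max_load(n, d))
```

With `d = 1` this is the plain uniform process; with `d = 2` each ball already gets the "two random choices" the survey's title refers to, and the printed maximum load is typically noticeably smaller.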
The data locality of work stealing
 Theory of Computing Systems
, 2000
Cited by 113 (17 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, the locality-guided work stealing improves the performance of work stealing up to 80%.
A Java Fork/Join Framework
, 2000
Cited by 112 (0 self)
This paper describes the design, implementation, and performance of a Java framework for supporting a style of parallel programming in which problems are solved by (recursively) splitting them into subtasks that are solved in parallel, waiting for them to complete, and then composing results. The general design is a variant of the work-stealing framework devised for Cilk. The main implementation techniques surround efficient construction and management of task queues and worker threads. The measured performance shows good parallel speedups for most programs, but also suggests possible improvements.

1. INTRODUCTION

Fork/Join parallelism is among the simplest and most effective design techniques for obtaining good parallel performance. Fork/join algorithms are parallel versions of familiar divide-and-conquer algorithms, taking the typical form:

    Result solve(Problem problem) {
      if (problem is small)
        directly solve problem
      else {
        split problem into independent parts
        fork new subtas...
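The split, fork, join, and compose steps named in the abstract above can be sketched as follows. This is an illustrative Python sketch, not the Java framework's API: it forks plain threads instead of using a work-stealing pool, and `solve`, `THRESHOLD`, and `max_depth` are hypothetical names. In CPython the global interpreter lock means this shows the structure of fork/join rather than a real speedup for CPU-bound work.

```python
import threading

THRESHOLD = 1_000  # hypothetical cutoff below which a problem counts as "small"


def solve(data, lo, hi, depth=0, max_depth=3):
    """Sum data[lo:hi] with a fork/join style divide-and-conquer recursion."""
    # Small problem (or fork budget used up): solve it directly.
    if hi - lo <= THRESHOLD or depth >= max_depth:
        return sum(data[lo:hi])

    # Split the problem into two independent halves.
    mid = (lo + hi) // 2
    left = {}

    def run_left():
        left["value"] = solve(data, lo, mid, depth + 1, max_depth)

    # Fork: the left half runs in a new thread.
    forked = threading.Thread(target=run_left)
    forked.start()
    # The current thread keeps working on the right half.
    right = solve(data, mid, hi, depth + 1, max_depth)
    # Join: wait for the forked subtask to complete.
    forked.join()
    # Compose the two partial results.
    return left["value"] + right


if __name__ == "__main__":
    numbers = list(range(10_000))
    print(solve(numbers, 0, len(numbers)) == sum(numbers))
```

Capping the recursion with `max_depth` bounds the number of threads at roughly `2 ** max_depth`; a pool with per-worker task queues, as the abstract describes, avoids creating a thread per fork.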
A Pragmatic Implementation of Non-Blocking Linked-Lists
 Lecture Notes in Computer Science
, 2001
Cited by 93 (1 self)
We present a new non-blocking implementation of concurrent linked-lists supporting linearizable insertion and deletion operations.
Garbage-First Garbage Collection
, 2004
Cited by 72 (4 self)
Garbage-First is a server-style garbage collector, targeted for multiprocessors with large memories, that meets a soft real-time goal with high probability, while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with mutation, to prevent interruptions proportional to heap or live-data size. Concurrent marking both provides collection "completeness" and identifies regions ripe for reclamation via compacting evacuation. This evacuation is performed in parallel on multiprocessors, to increase throughput.
The Design of a Task Parallel Library
, 2008
Cited by 54 (3 self)
The Task Parallel Library (TPL) is a library for .NET that makes it easy to expose potential parallelism in a program. The library can be seen as an embedded domain-specific language, and relies heavily on generics and delegate expressions to provide a convenient interface with custom control structures for parallelism. In this article, we describe the design and implementation of the library. In particular, we show the use of 'replicable tasks' as an abstraction for implementing parallel iteration and aggregation, and the use of 'duplicating queues' as an alternative to the regular task queues based on the THE protocol.