Results 1 - 10 of 266
Scheduling Multithreaded Computations by Work Stealing
, 1994
"... This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing," in which processors needing work steal com ..."
Abstract
-
Cited by 568 (34 self)
- Add to MetaCart
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing,” in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time T_P to execute a fully strict computation on P processors using our work-stealing scheduler is T_P = O(T_1/P + T_∞), where T_1 is the minimum serial execution time of the multithreaded computation and T_∞ is the minimum execution time with an infinite number of processors. Moreover, the space S_P required by the execution satisfies S_P ≤ S_1·P. We also show that the expected total communication of the algorithm is at most O(T_∞·S_max·P), where S_max is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
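The deque discipline the abstract describes is easy to sketch. Below is an illustrative Python rendering, not the authors' scheduler: the names Worker and worker_loop are invented, and a coarse per-deque lock stands in for the lock-free deques real implementations use. Each processor pushes and pops work at the bottom of its own deque; an idle processor steals from the top of a randomly chosen victim's deque.

    import collections
    import random
    import threading

    class Worker:
        """One processor in the pool: owns a deque of ready tasks."""
        def __init__(self, workers):
            self.workers = workers             # shared list of all workers
            self.deque = collections.deque()
            self.lock = threading.Lock()       # coarse lock; real schedulers use lock-free deques

        def push(self, task):                  # owner pushes at the bottom (right end)
            with self.lock:
                self.deque.append(task)

        def pop(self):                         # owner pops at the bottom: depth-first, cache-friendly
            with self.lock:
                return self.deque.pop() if self.deque else None

        def steal(self):                       # thieves take from the top (left end)
            with self.lock:
                return self.deque.popleft() if self.deque else None

    def worker_loop(me, stop):
        """Run until the stop event is set (stop is a threading.Event)."""
        while not stop.is_set():
            task = me.pop()
            if task is None:                   # local deque empty: steal from a random victim
                victim = random.choice(me.workers)
                task = victim.steal() if victim is not me else None
            if task is not None:
                task(me)                       # running a task may me.push() child tasks

Stealing from the top hands a thief the oldest and typically largest piece of the computation, which is one intuition for why the number of steals, and hence the communication, stays low.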
APRIL: A Processor Architecture for Multiprocessing
- In Proceedings of the 17th Annual International Symposium on Computer Architecture
, 1990
"... Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-t ..."
Abstract
-
Cited by 283 (25 self)
- Add to MetaCart
Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.
Semantic foundations of concurrent constraint programming
, 1990
"... Concurrent constraint programming [Sar89,SR90] is a sim-ple and powerful model of concurrent computation based on the notions of store-as-constraint and process as information transducer. The store-as-valuation conception of von Neu-mann computing is replaced by the notion that the store is a constr ..."
Abstract
-
Cited by 276 (27 self)
- Add to MetaCart
(Show Context)
Concurrent constraint programming [Sar89,SR90] is a simple and powerful model of concurrent computation based on the notions of store-as-constraint and process as information transducer. The store-as-valuation conception of von Neumann computing is replaced by the notion that the store is a constraint (a finite representation of a possibly infinite set of valuations) which provides partial information about the possible values that variables can take. Instead of “reading” and “writing” the values of variables, processes may now ask (check if a constraint is entailed by the store) and tell (augment the store with a new constraint). This is a very general paradigm which subsumes (among others) nondeterminate data-flow and the (concurrent) (constraint) logic programming languages. This paper develops the basic ideas involved in giving a coherent semantic account of these languages. Our first contribution is to give a simple and general formulation of the notion that a constraint system is a system of partial information (a la the information systems of Scott). Parameter passing and hiding are handled by borrowing ideas from the cylindric algebras of Henkin, Monk and Tarski to introduce diagonal elements and “cylindrification” operations (which mimic the projection of information induced by existential quantifiers). The second contribution is to introduce the notion of determinate concurrent constraint programming languages. The combinators treated are ask, tell, parallel composition, hiding and recursion. We present a simple model for this language based on the specification-oriented methodology of [OH86]. The crucial insight is to focus on observing the resting points of a process: those stores in which the process quiesces without producing more information. It turns out that for the determinate language, the set of resting points of a process completely characterizes its behavior on all inputs, since each process can be identified with a closure operator over the underlying constraint system. Very natural definitions of parallel composition, communication and hiding are given. For example, the parallel composition of two agents can be characterized by just the intersection of the sets of constraints associated with them. We also give a complete axiomatization of equality in this model, present...
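The ask/tell interaction is small enough to caricature in code. The Python sketch below makes drastic simplifying assumptions: the store is a set of ground facts, entailment is plain set membership rather than a full cylindric constraint system, and the names Store, entails, and run are invented.

    class Store:
        """A monotonically growing set of constraints (toy: ground facts)."""
        def __init__(self):
            self.facts = set()

        def tell(self, c):
            self.facts.add(c)                  # augment the store; information only grows

        def entails(self, c):
            return c in self.facts             # entailment reduced to membership in this toy

    def run(agents, store):
        """Agents are (guard, action) pairs: when the store entails the
        guard, the action may tell new constraints. A None guard always fires."""
        agents = list(agents)
        while agents:
            runnable = [a for a in agents
                        if a[0] is None or store.entails(a[0])]
            if not runnable:                   # quiescence: no pending ask is entailed
                break
            agent = runnable[0]
            agents.remove(agent)
            agent[1](store)

    store = Store()
    run([("x=1", lambda s: s.tell("y=2")),     # ask x=1, then tell y=2
         (None, lambda s: s.tell("x=1"))],     # tell x=1 unconditionally
        store)
    print(store.facts)                         # {'x=1', 'y=2'}

The quiescence test mirrors the observable the paper builds on: when no pending ask is entailed, the run stops at a resting point.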
Programming Parallel Algorithms
, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract
-
Cited by 237 (11 self)
- Add to MetaCart
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages...
A Natural Semantics for Lazy Evaluation
, 1993
"... We define an operational semantics for lazy evaluation which provides an accurate model for sharing. The only computational structure we introduce is a set of bindings which corresponds closely to a heap. The semantics is set at a considerably higher level of abstraction than operational semantics f ..."
Abstract
-
Cited by 221 (3 self)
- Add to MetaCart
We define an operational semantics for lazy evaluation which provides an accurate model for sharing. The only computational structure we introduce is a set of bindings which corresponds closely to a heap. The semantics is set at a considerably higher level of abstraction than operational semantics for particular abstract machines, so it is more suitable for a variety of proofs. Furthermore, because a heap is explicitly modelled, the semantics provides a suitable framework for studying the space behaviour of terms under lazy evaluation.
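The sharing captured by the set of bindings can be illustrated with a memoizing thunk. The Python sketch below is an analogy rather than a rendering of the semantics' rules (Thunk and expensive are invented names): forcing a binding overwrites it with its value, so later uses reuse the result instead of re-evaluating the expression.

    class Thunk:
        """A heap binding: an unevaluated expression, overwritten by its
        value the first time it is forced. The update models sharing."""
        def __init__(self, expr):
            self.expr, self.value, self.forced = expr, None, False

        def force(self):
            if not self.forced:                # evaluate at most once
                self.value = self.expr()       # may force other bindings
                self.forced = True
                self.expr = None               # drop the closure, as a heap update would
            return self.value

    calls = 0
    def expensive():
        global calls
        calls += 1                             # count evaluations to observe sharing
        return 42

    x = Thunk(expensive)                       # let x = expensive in ...
    print(x.force() + x.force())               # 84
    print(calls)                               # 1: the binding was evaluated once and shared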
The MIT Alewife Machine: Architecture and Performance
- In Proceedings of the 22nd Annual International Symposium on Computer Architecture
, 1995
"... Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable a ..."
Abstract
-
Cited by 193 (22 self)
- Add to MetaCart
(Show Context)
Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help analyze the behavior of the system. Analysis shows that integrating message passing with sha...
Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine
- In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1991
"... Abstract: In this paper, we present a relatively primitive execution model for ne-grain parallelism, in which all synchronization, scheduling, and storage management is explicit and under compiler control. This is de ned by a threaded abstract machine (TAM) with a multilevel scheduling hierarchy. Co ..."
Abstract
-
Cited by 155 (8 self)
- Add to MetaCart
(Show Context)
In this paper, we present a relatively primitive execution model for fine-grain parallelism, in which all synchronization, scheduling, and storage management is explicit and under compiler control. This is defined by a threaded abstract machine (TAM) with a multilevel scheduling hierarchy. Considerable temporal locality of logically related threads is demonstrated, providing an avenue for effective register use under quasi-dynamic scheduling. A prototype TAM instruction set, TL0, has been developed, along with a translator to a variety of existing sequential and parallel machines. Compilation of Id, an extended functional language requiring fine-grain synchronization, under this model yields performance approaching that of conventional languages on current uniprocessors. Measurements suggest that the net cost of synchronization on conventional multiprocessors can be reduced to within a small factor of that on machines with elaborate hardware support, such as proposed dataflow architectures. This brings into question whether tolerance to latency and inexpensive synchronization require specific hardware support or merely an appropriate compilation strategy and program representation.
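The explicit, compiler-managed synchronization the model describes can be caricatured with entry counters. The Python sketch below is a loose structural analogy to TAM frames and enabled-thread vectors, not TL0; Thread, Frame, deliver, and run_quantum are invented names. A thread becomes runnable only when its count of outstanding inputs reaches zero, and a quantum drains every enabled thread of one frame before leaving it.

    class Thread:
        def __init__(self, body, inputs_needed):
            self.body = body                   # code to run once enabled
            self.counter = inputs_needed       # explicit synchronization counter

    class Frame:
        """An activation: its threads plus a vector of currently enabled threads."""
        def __init__(self, threads):
            self.threads = threads
            self.enabled = [t for t in threads if t.counter == 0]

        def deliver(self, thread):
            """An input (argument or result) arrives for a thread:
            decrement its counter and enable it when the counter hits zero."""
            thread.counter -= 1
            if thread.counter == 0:
                self.enabled.append(thread)

        def run_quantum(self):
            """Drain all enabled threads of this frame before leaving it,
            the scheduling policy behind the measured temporal locality."""
            while self.enabled:
                self.enabled.pop().body(self)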
Advances in dataflow programming languages
- ACM Computing Surveys
, 2004
"... Many developments have taken place within dataflow programming languages in the past decade. In particular, there has been a great deal of activity and advancement in the field of dataflow visual programming languages. The motivation for this article is to review the content of these recent develo ..."
Abstract
-
Cited by 124 (0 self)
- Add to MetaCart
Many developments have taken place within dataflow programming languages in the past decade. In particular, there has been a great deal of activity and advancement in the field of dataflow visual programming languages. The motivation for this article is to review the content of these recent developments and how they came...
NESL: A nested data-parallel language (Version 2.6)
, 1993
"... The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Wright Laboratory or the U. S. Government. Keywords: Data-parallel, parallel algorithms, supe ..."
Abstract
-
Cited by 112 (8 self)
- Add to MetaCart
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Wright Laboratory or the U.S. Government. Keywords: data-parallel, parallel algorithms, supercomputers, nested parallelism. This report describes Nesl, a strongly-typed, applicative, data-parallel language. Nesl is intended to be used as a portable interface for programming a variety of parallel and vector computers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences, including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences. Nesl fully supports nested sequences and nested parallelism: the ability to take a parallel function and apply it over multiple instances in parallel. Nested parallelism is important for implementing algorithms with irregular nested loops (where the inner loop lengths depend on the outer iteration) and for divide-and-conquer algorithms. Nesl also provides a performance model for calculating the asymptotic performance of a program on...
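The apply-to-each idea can be mimicked directly in a conventional language, though without the flattening that makes nested parallelism efficient in Nesl. In the Python sketch below, par_apply is an invented helper; the inner map's length depends on the outer element, which is exactly the irregular nesting the abstract mentions.

    from concurrent.futures import ThreadPoolExecutor

    def par_apply(f, seq, pool):
        """Apply f over the elements of seq in parallel: {f(x) : x in seq}."""
        return list(pool.map(f, seq))

    # Nested parallelism: a parallel map whose body is itself a parallel map.
    # The pool must exceed the number of blocked outer tasks, since this
    # sketch does not flatten the nesting the way Nesl's implementation does.
    with ThreadPoolExecutor(max_workers=8) as pool:
        rows = [[1], [1, 2], [1, 2, 3]]
        squares = par_apply(lambda row: par_apply(lambda x: x * x, row, pool),
                            rows, pool)
        print(squares)                         # [[1], [1, 4], [1, 4, 9]]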