Results 1 - 10
of
103
Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine
- in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages an Operating Systems
, 1991
"... Abstract: In this paper, we present a relatively primitive execution model for ne-grain parallelism, in which all synchronization, scheduling, and storage management is explicit and under compiler control. This is de ned by a threaded abstract machine (TAM) with a multilevel scheduling hierarchy. Co ..."
Abstract
-
Cited by 136 (6 self)
- Add to MetaCart
Abstract: In this paper, we present a relatively primitive execution model for ne-grain parallelism, in which all synchronization, scheduling, and storage management is explicit and under compiler control. This is de ned by a threaded abstract machine (TAM) with a multilevel scheduling hierarchy. Considerable temporal locality of logically related threads is demonstrated, providing an avenue for e ective register use under quasi-dynamic scheduling. A prototype TAM instruction set, TL0, has been developed, along with a translator to a variety of existing sequential and parallel machines. Compilation of Id, an extended functional language requiring ne-grain synchronization, under this model yields performance approaching that of conventional languages on current uniprocessors. Measurements suggest that the net cost of synchronization on conventional multiprocessors can be reduced to within a small factor of that on machines with elaborate hardware support, such as proposed data ow architectures. This brings into question whether tolerance to latency and inexpensive synchronization require speci c hardware support or merely an appropriate compilation strategy and program representation. 1
A Design Space Evaluation of Grid Processor Architectures
, 2001
"... In this paper, we survey the design space of a new class of architec-tures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on trad ..."
Abstract
-
Cited by 100 (31 self)
- Add to MetaCart
In this paper, we survey the design space of a new class of architec-tures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each with limited control, connected by a thin operand network. Pro-grams are executed by mapping blocks of statically scheduled instruc-tions to the ALU array and executing them dynamically in dataflow or-der This organization enables the critical paths of instruction blocks to be executed on chains of ALUs without transmitting temporary val-ues back to the register file, avoiding most of the large, unscalable structures that limit the scalability of conventional architectures. Fi-nally, we present simulation results of a preliminary design, the GPA-1. With a half-cycle routing delay, we obtain performance roughly equal to an ideal 8-way, 512-entry window superscalar core. With no inter-ALU delay, perfect memory, and perfect branch prediction, the 1PC of the GPA-1 is more than twice that of the ideal superscalar core, achieving an average of 11 IPC across nine SPEC CPU2000 and Mediabench benchmarks.
Lively Linear Lisp - 'Look Ma, No Garbage!'
- ACM Sigplan Notices
, 1992
"... Linear logic has been proposed as one solution to the problem of garbage collection and providing efficient "updatein -place" capabilities within a more functional language. Linear logic conserves accessibility, and hence provides a mechanical metaphor which is more appropriate for a distributed-me ..."
Abstract
-
Cited by 91 (6 self)
- Add to MetaCart
Linear logic has been proposed as one solution to the problem of garbage collection and providing efficient "updatein -place" capabilities within a more functional language. Linear logic conserves accessibility, and hence provides a mechanical metaphor which is more appropriate for a distributed-memory parallel processor in which copying is explicit. However, linear logic's lack of sharing may introduce significant inefficiencies of its own. We show an efficient implementation of linear logic called Linear Lisp that runs within a constant factor of non-linear logic. This Linear Lisp allows RPLACX operations, and manages storage as safely as a non-linear Lisp, but does not need a garbage collector. Since it offers assignments but no sharing, it occupies a twilight zone between functional languages and imperative languages. Our Linear Lisp Machine offers many of the same capabilities as combinator/graph reduction machines, but without their copying and garbage collection problems. Intr...
*T: A Multithreaded Massively Parallel Architecture
- IN PROCEEDINGS OF THE 19TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1992
"... What should the architecture of each node in a general purpose, massively parallel architecture (MPA) be? We frame the question in concrete terms by describing two fundamental problems that must be solved well in any general purpose MPA. From this, we systematically develop the required logical orga ..."
Abstract
-
Cited by 47 (1 self)
- Add to MetaCart
What should the architecture of each node in a general purpose, massively parallel architecture (MPA) be? We frame the question in concrete terms by describing two fundamental problems that must be solved well in any general purpose MPA. From this, we systematically develop the required logical organization of an MPA node, and present some details of *T (pronounced Start), a concrete architecture designed to these requirements. *T is a direct descendant of dynamic dataflow architectures, and unifies them with von Neumann architectures. We discuss a hand-compiled example and some compilation issues.
Advances in dataflow programming languages
- ACM Comput. Surv
, 2004
"... Abstract. Many developments have taken place within dataflow programming languages in the past decade. In particular, there has been a great deal of activity and advancement in the field of dataflow visual programming languages. The motivation for this article is to review the content of these recen ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
Abstract. Many developments have taken place within dataflow programming languages in the past decade. In particular, there has been a great deal of activity and advancement in the field of dataflow visual programming languages. The motivation for this article is to review the content of these recent developments and how they came
StarT the Next Generation: Integrating Global Caches and Dataflow Architecture
- CSG MEMO 354, COMPUTATION STRUCTURES GROUP, MIT LAB. FOR COMP. SCI
, 1994
"... The implicitly parallel programming model provides an attractive approach to deal with the complexity of parallel programming. Implementing this model efficiently, especially on stock processors, remains a big challenge, partly because of the fine granularity of the parallelism exploited. The Monsoo ..."
Abstract
-
Cited by 39 (1 self)
- Add to MetaCart
The implicitly parallel programming model provides an attractive approach to deal with the complexity of parallel programming. Implementing this model efficiently, especially on stock processors, remains a big challenge, partly because of the fine granularity of the parallelism exploited. The Monsoon[27] project was designed to address and investigate support for fine-grain parallelism, and has yielded very encouraging results[13]. Our experience with Monsoon and *T[24, 28], a followup project after Monsoon, suggests that provision for global shared memory is an area where both the Monsoon and *T architectures can be improved. Starting with the split-phase approach used in Monsoon and *T, we propose to augment global memory access by including coherent global caches. The rapid improvements in stock microprocessors, and the high cost and effort required to develop a competitive microprocessor, presents practical constraints on what can be built in any experimental architecture project. ...
Spatial Computation
- in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS
, 2004
"... This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the ..."
Abstract
-
Cited by 37 (10 self)
- Add to MetaCart
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.
Optimization of Array Accesses by Collective Loop Transformations
, 1990
"... In this paper, we investigate the problem of optimizing array accesses across a collection of loops. We demonstrate that a good solution to such a problem should be based on an optimization scheme, called collective loop transformations, that considers all loops simultaneously. In particular, loop r ..."
Abstract
-
Cited by 36 (2 self)
- Add to MetaCart
In this paper, we investigate the problem of optimizing array accesses across a collection of loops. We demonstrate that a good solution to such a problem should be based on an optimization scheme, called collective loop transformations, that considers all loops simultaneously. In particular, loop reversal, loop interchange and loop fusion are performed collectively on a set of loop nests. The main impact of these transformations is an optimization called array contraction, that saves space and time by converting an array variable into a scalar variable or a buffer containing a small number of scalar variables. This optimization is applicable to general-purpose high-performance architectures. For a multiprocessor architecture, array contraction is performed by executing the producer and consumer loops on separate processors, and by using a smaller buffer for the array communication. For a uniprocessor architecture, array contraction is performed by fusing the producer and consumer loop...
Compiling for EDGE architectures
- In International Symposium on Code Generation and Optimization
, 2006
"... Explicit Data Graph Execution (EDGE) architectures offer the possibility of high instruction-level parallelism with energy efficiency. In EDGE architectures, the compiler breaks a program into a sequence of structured blocks that the hardware executes atomically. The instructions within each block c ..."
Abstract
-
Cited by 33 (23 self)
- Add to MetaCart
Explicit Data Graph Execution (EDGE) architectures offer the possibility of high instruction-level parallelism with energy efficiency. In EDGE architectures, the compiler breaks a program into a sequence of structured blocks that the hardware executes atomically. The instructions within each block communicate directly, instead of communicating through shared registers. The TRIPS EDGE architecture imposes several restrictions on its blocks to simplify the microarchitecture: each TRIPS block has at most 128 instructions, issues at most 32 loads and/or stores, and executes at most 32 register bank reads and 32 writes. To detect block completion, each TRIPS block must produce a constant number of outputs (stores and register writes) and a branch decision. The goal of the TRIPS compiler is to produce TRIPS blocks full of useful instructions while enforcing these constraints. This paper describes a set of compiler algorithms that meet these sometimes conflicting goals, including an algorithm that assigns load and store identifiers to maximize the number of loads and stores within a block. We demonstrate the correctness of these algorithms in simulation on SPEC2000, EEMBC, and microbenchmarks extracted from SPEC2000 and others. We measure speedup in cycles over an Alpha 21264 on microbenchmarks. 1.

