Achieving 100 TeraOps performance within a ten-year horizon will require massively parallel architectures that exploit both commodity software and hardware technology for cost efficiency. Increasing clock rates, and system diameters that grow when measured in clock periods, will make efficient management of communication and coordination increasingly critical. Configurable logic presents a unique opportunity to customize the bindings, mechanisms, and policies that comprise the interaction of processing, memory, I/O, and communication resources. This programming flexibility, or customizability, can provide the key to achieving robust high performance. The MultiprocessOr with Reconfigurable Parallel Hardware (MORPH) uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections. This integrated configurability can improve the performance of the local memory hierarchy, increase the efficiency of interprocessor coordination, or better utilize the network bisection of the machine. MORPH provides a framework for exploring such integrated application-specific customizability. Rather than complicating the situation, MORPH's configurability supports component software and interoperability frameworks, allowing direct support for application-specified patterns, objects, and structures. This paper reports the motivation and initial design of the MORPH system.