56 citations found.
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture, pages 146--156, December 1995.

This paper is cited in the following contexts:

Symbiotic Jobscheduling with Priorities for a.. - Snavely, Tullsen.. (2002)   (3 citations)  (Correct)

....execution of instructions from multiple threads each cycle on a wide superscalar processor. This organization results in more than doubling the throughput of the processor without excessive increases in hardware [31]. The techniques described here also apply to other multithreaded architectures [3, 10, 2]; however, the SMT architecture is most interesting because threads interact at such a fine granularity in the architecture, and because it is closest to widespread commercial use. By contrast, the Tera MTA supercomputer [3] .... (Figure 8: Average percent improvement in turnaround time over Naive ....)

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, and W.S. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, November 1995.


Integrating User-Level Networks with SMT - Parker, Davis, Hsieh   (Correct)

....quickly by the processor, and acts as a staging area for outgoing messages. A zero copy message protocol allows messages to be delivered directly to user space without copying. Not all of these ideas are new. For example, previous research has explored the use of user level network interfaces [3, 9, 11, 13, 18]. However, this specific combination of features is unique, in that it exposes interrupts directly to user level programs. The important aspect of our architecture lies in its support for user level messaging (for both interprocessor communication and I/O) in a general purpose operating system ....

....message arrival notification. Sends, receives, and notifications all make passes through operating system code. Since the operating system code is unlikely to reside in the cache, these system calls result in cache misses. (Figure 1: Anatomy of a message for a kernel mode NI.) User level interfaces [3, 9, 11, 13, 18] and zero copy protocols [5, 7] significantly reduce the overhead of message sends and receives by eliminating operating system and copying overhead on the message send and receive sides. Notifications still have significant opportunity for optimization, as they remain the performance and ....

[Article contains additional citation context not shown here]

Marco Fillo, et al. The M-Machine Multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995, pp. 146-156.


Sparsely Faceted Arrays: A Mechanism Supporting Parallel.. - Brown (2002)   (2 citations)  (Correct)

....entire machine. The J machine suffers from the problem, common to early capability systems, that indirecting every memory access through a segment table is inefficient; [51] reports that in practice, an unacceptably large percentage of program time is spent engaged in translation. The M machine [15] multicomputer, a successor to the J machine, provides direct addressing. It supports a coarse grained mechanism for distributing resources over variable regions of the architecture. In particular, the M machine's page translation mechanism provides partitions over multiple adjacent nodes; ....

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO-28), pages 146--156, November 1995.


Symbiotic Jobscheduling with Priorities for a.. - Snavely, Tullsen.. (2002)   (3 citations)  (Correct)

....execution of instructions from multiple threads each cycle on a wide superscalar processor. This organization results in more than doubling the throughput of the processor without excessive increases in hardware [32]. The techniques described here also apply to other multithreaded architectures [3, 10, 2]; however, the SMT architecture is most relevant here because threads interact at such a fine granularity in the architecture, and because it is closest to widespread commercial use. By contrast, the Tera MTA supercomputer [3], which features fine grain multithreading, has fewer shared system ....

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, Nov. 1995.


Sparsely Faceted Arrays: A Mechanism Supporting Parallel.. - Brown (2002)   (2 citations)  (Correct)

....entire machine. The J machine suffers from the problem, common to early capability systems, that indirecting every memory access through a segment table is inefficient; [51] reports that in practice, an unacceptably large percentage of program time is spent engaged in translation. The M machine [15] multicomputer, a successor to the J machine, provides direct addressing. It supports a coarse grained mechanism for distributing resources over variable regions of the architecture. In particular, the M machine's page translation mechanism provides partitions over multiple adjacent nodes; ....

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO-28), pages 146-156, November 1995.


Symbiotic Jobscheduling for a Simultaneous Multithreading.. - Snavely, Tullsen (2000)   (16 citations)  (Correct)

....execution of instructions from multiple threads each cycle on a wide superscalar processor. This organization results in more than doubling the throughput of the processor without excessive increases in hardware [31]. The techniques described here also apply to other multithreaded architectures [3, 11, 2]; however, the SMT architecture is most interesting because threads interact at such a fine granularity in the architecture, and because it is closest to widespread commercial use, having been announced for the next Alpha processor [10]. By contrast, the Tera MTA supercomputer [3], which features ....

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, Nov. 1995.


Coping with Very High Latencies in Petaflop Computer.. - Ryan, Amaral, Gao.. (1998)   (Correct)

....by Prof. Gao and many of his students and research associates [16]. Many other architectures have been proposed to address the problem of tolerating inherent communication and synchronization latencies by switching to a new ready thread of control whenever a long latency operation is encountered [4, 5, 8, 10, 12, 18, 19, 23 25, 29]. Central ideas in the program execution model proposed in this document originate from the extensive experience that the Delaware team has acquired with the multi threading program execution model and the multi threaded language developed for EARTH [16, 31]. The initial design of the ....

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 146--156, Ann Arbor, Michigan, November 29--December 1, 1995. IEEE-CS TC-MICRO and ACM SIGMICRO.


Dynamic Prediction of Critical Path Instructions - Tune, Liang, Tullsen, Calder (2001)   (20 citations)  (Correct)

....a cluster. Instructions are assigned to a particular structure by hardware. This architecture is similar to that described in [15] and one of the machines described in [21]. A similar architecture is described by Farkas et al. [6], but instruction scheduling is done statically. The M machine [7] also features clusters, but their clusters are also not transparent to software. Performance on a clustered architecture is optimized when the instructions at both ends of key dependences are assigned to the same cluster. Even better, we'd like to send an entire critical dependence chain through ....

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, Nov. 1995.


Symbiotic Jobscheduling for a Simultaneous Multithreading.. - Snavely, Tullsen (2000)   (16 citations)  (Correct)

....execution of instructions from multiple threads each cycle on a wide superscalar processor. This organization results in more than doubling the throughput of the processor without excessive increases in hardware [15]. The techniques described here also apply to other multithreaded architectures [3, 6, 2]; however, the SMT architecture is most interesting because threads interact at such a fine granularity in the architecture, and because it is closest to widespread commercial use, having been announced for the next Alpha processor [5]. By contrast, the Tera MTA supercomputer [3], which features ....

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, and W.S. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, November 1995.


Extending Cache Coherence to Support Thread-Level Data.. - Steffan, Colohan, Mowry (1998)   (Correct)

....midst of speculative memory references, possibly implemented as an uncached store. Producer consumer style synchronization is also required, and could be implemented by synchronizing on specially allocated memory locations, or by implementing something with similar functionality to full/empty bits [4, 8, 11]. As shown in Figure 5(a), an inefficient way to forward a memory location from one epoch to another is simply to allow a data dependence violation to occur: the epoch which consumes the value is re-executed once the producing epoch has committed its speculative modifications. This causes some ....

M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine Multicomputer. In Proceedings of ISCA 28, December 1995.


Symbiotic Jobscheduling for a Simultaneous Multithreading.. - Allan Snavely University (2000)   (16 citations)  (Correct)

....of instructions from multiple threads each cycle on a wide superscalar processor. This organization has the potential to more than double the throughput of the processor without excessive increases in hardware [31]. The techniques described here also apply to other multithreaded architectures [3, 11, 2]; however, the SMT architecture is most interesting because threads interact at such a fine granularity in the architecture, and because it is closest to widespread commercial use, having been announced for the next Alpha processor [10]. By contrast, the Tera MTA supercomputer [3], which features ....

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, Nov. 1995.


The Need for Fast Communication in Hardware-Based.. - Krishnan, Torrellas (1999)   (6 citations)  (Correct)

....superscalar processor on the chip, many researchers have proposed decentralized architectures wherein multiple simpler processing units are configured on a single chip. Indeed, the chip multiprocessor (CMP) architecture has drawn great attention, with architects proposing various related designs [5, 10, 12, 16, 20, 22, 23, 24]. Though the CMP is an ideal platform to run multiple sequential applications or a fully parallel application, if it is to be fully accepted, it must also be able to give good performance when running a single sequential application or one that cannot be parallelized by the compiler effectively. ....

....shared L2 cache. If, in addition, the producer has just updated prev, the new value can be forwarded to the L2 cache in the same synchronization step. The second approach is to provide much faster inter processor communication. Specifically, we can register allocate prev and use an on chip network [5, 12, 20, 22] to communicate its value between processors. This approach allows very fast communication. In a CMP with wide issue dynamic superscalar processors, performing synchronization through memory is slow. Consider, for example, a CMP with four 4 issue processors where the L2 cache has advanced ....

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine Multicomputer. In 28th International Symposium on Computer Microarchitecture (MICRO-28), pages 146-156, November 1995.


Toward A Cost-Effective DSM Organization That Exploits.. - Torrellas, Yang, Nguyen (2000)   (6 citations)  (Correct)

....transistors that can be integrated on a VLSI chip are fueling the trend toward integration of processor and memory on a chip. It is widely expected that off the shelf microprocessor designs will exploit this trend to provide low latency and high bandwidth communication between processor and memory [1, 4, 6, 9, 12, 16, 19]. Since directory based, cache coherent Distributed Shared Memory (DSM) multiprocessors are typically built around the latest off the shelf microprocessors, they will be affected by the trend of progressive processor memory integration. Currently, the nodes in DSM systems are typically orga This ....

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, and W.S. Lee. The M-Machine Multicomputer. In 28th International Symposium on Microarchitecture, 1995.


Design and Evaluation of Network Interfaces for System Area.. - Mukherjee (1998)   (Correct)

....a particular network implementation and allows me to focus my attention purely on the NI. In Section 4.5, I study the impact of network latency on the overall performance of benchmarks. I model hardware flow control at the NIs using a scalable end to end flow control scheme called return to sender [39]. In this scheme, the sending NI allocates an empty buffer for a message and injects the message into the network. If the receiving NI has a free buffer to accept the message, it sends an acknowledgment to the sender to free up the sender's buffer. However, if the receiving NI cannot accept the ....

....5.4.3 Single Cycle NI 2w vs. CNI 32 Q m Figure 5 3 compares the performance of CNI 32 Q m with an NI 2w NI, whose memory can be accessed by the processor in a single cycle. Thus, my single cycle NI 2w approximates processor register mapped NIs in research machines, such as the MIT M machine [39]. 1 Figure 5 3 shows two interesting results. First, CNI 32 Q m the CNI with a cache outperforms my single cycle NI 2w for spsolve and em3d for small number of flow control buffers. Processor register mapped NIs are likely to have a small number of flow control buffers because of two ....

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine Multicomputer. Technical Memo A.I. Memo No. 1532, MIT, March 1995.


Compiling For Multithreaded Architectures - Tang (1999)   (1 citation)  (Correct)

....to enhance performance. To exploit such coarse grain thread level parallelism (TLP), a number of options have been proposed, including Single Chip Multiple Superscalar (SCMS) [99], Simultaneous MultiThreading (SMT) [138, 137, 63, 58], Multiscalar [118], SPSM [33], Superthreaded [135], M machine [37], and STAMPede [122]. The common theme is to put a number of processors, be it superscalar or multithreaded, on one chip. In order to generate multiple concurrent threads, aggressive speculation techniques such as control speculation and data (memory) speculation are widely used to derive threads ....

....the 1980s [69]. Recently strong push from VLSI technology and diminishing return of wide issue architectures such as Superscalar and VLIW make the multithreaded architecture even more attractive. More and more research projects have been started, building multithreaded processors on a single chip [138, 37, 118]. Furthermore IBM has announced a new PowerPC system supporting multithreading [123]. However, the multithreading concept applies beyond the processor architecture level. There is a familiar concept in operating systems, where multithreading has long been used to tolerate I/O latencies [124] ....

[Article contains additional citation context not shown here]

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 146--156, Ann Arbor, Michigan, November 29--December 1, 1995. IEEE-CS TC-MICRO and ACM SIGMICRO.


Speculative Multithreading Architectures - Krishnan (1998)   (Correct)

....sections. for (i = 0; i n; i ) a[x[i] a[x[i] end Figure 1.3 Example of a loop that will not be parallelized by a conservative compiler. 1.3.1 Chip Multiprocessor The chip multiprocessor (CMP) has drawn great attention, with architects proposing various related designs [21, 30, 69, 76, 78, 82]. The design simplicity of the CMP approach not only allows for a much faster clock in each of the processing units, but also allows efficient use of the resources. The interconnect problem discussed in the previous section would also be ameliorated since such a decentralized implementation would ....

....step. The second approach is for the threads to communicate via registers. It helps that variables like prev are typically allocated in registers by any optimizing compiler. Even though this approach requires special hardware support for synchronization and communication at the register level [21, 69, 76], it is very fast to perform. It could be argued that CMPs could be based on single issue processors thereby allowing more processors to be configured on chip. However, exploiting both thread-level and instruction level parallelism is critical for the performance of multithreaded applications [44] ....

[Article contains additional citation context not shown here]

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, and W.S. Lee. The M-Machine Multicomputer. In 28th International Symposium on Computer Microarchitecture (MICRO-28), pages 146-156, November 1995.


The Performance of Applications and Operating Systems on.. - Bowman, Cardwell, Romer   (Correct)

....the operating system, and in the kernel idle loop. There have also been several attempts at integrating processors and memory for parallel processing. The Execube [18], a chip with eight 16 bit processors each with 64 KB of DRAM, was targeted at massively parallel systems. Similarly, the M machine [13] features four superscalar processors and 128 KB per processor. The PPRAM project [3] at Kyushu University in Japan plans to fabricate a chip in 1999 with four 32 bit RISC processors, each with 24 KB of SRAM cache and 8 MB of DRAM [23]. Their simulations, though limited to SPEC 95 integer programs, ....

FILLO, M., ET AL. The M-Machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI, Nov. 1995), pp. 146--156.


Design and Analysis of Routed Inter-ALU Networks.. - Singh.. (2003)   Self-citation (Keckler)   (Correct)

No context found.

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture, pages 146--156, December 1995.


Routed Inter-ALU Networks for ILP Scalability and.. - Sankaralingam, Singh, .. (2003)   (3 citations)  Self-citation (Keckler)   (Correct)

No context found.

M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture, pages 146--156, December 1995.


Exploiting Fine-Grain Thread Level Parallelism on.. - Keckler, Dally.. (1998)   (14 citations)  Self-citation (Keckler Dally Chang)   (Correct)

....5, and concluding remarks are found in Section 6. 2 The MAP Chip The Multi ALU Processor (MAP) chip, designed for use in the M Machine Multicomputer, is intended to exploit parallelism at a spectrum of grain sizes, from instruction level parallelism to coarser grained multi node parallelism [5]. It employs a set of fast communication and synchronization mechanisms that enable execution of fine grain parallel programs. Figure 1 shows a block diagram of the MAP chip containing three execution clusters, a unified cache which is divided into two banks, an external memory interface, and a ....

FILLO, M., KECKLER, S. W., DALLY, W. J., CARTER, N. P., CHANG, A., GUREVICH, Y., AND LEE, W. S. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture (Ann Arbor, MI, December 1995), ACM, pp. 146-156.


Processor Mechanisms for Software Shared Memory - Carter   Self-citation (Dally Carter)   (Correct)

....To obtain maximum performance on a wide spectrum of programs, future systems will need to close this parallelism gap by providing low-latency communication mechanisms which allow small tasks with large communication requirements to be parallelized effectively. The M Machine Multicomputer [11] addresses this parallelism gap through an architecture which effectively exploits parallelism at all granularities. An M Machine consists of a two dimensional mesh of processing nodes, each of which is made up of a custom Multi ALU Processor (MAP) chip and five synchronous DRAM (SDRAM) ....

Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture, pages 146-156, Ann Arbor, MI, December 1995.


Exploiting Thread-Level Parallelism On . . . - Lo (1998)   (Correct)

No context found.

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, pages 146--156, November 1995.


Explicit Multi-Threading (XMT) Bridging Models for.. - Vishkin, Dascal.. (1998)   (Correct)

No context found.

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, and W.S. Lee. The M-Machine Multicomputer. In Proc. 28th Int. Symp. on Microarchitecture, Ann Arbor, MI, 1995. See also http://www.ai.mit.edu/projects/cva/cvammachine.html.


A Chip-Multiprocessor Architecture with Speculative.. - Krishnan, Torrellas (1999)   (22 citations)  (Correct)

No context found.

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine Multicomputer. In 28th International Symposium on Computer Microarchitecture (MICRO-28), pages 146-156, November 1995.


Incorporating Memory Management into User-Level Network.. - Welsh, Basu, von Eicken (1997)   (59 citations)  (Correct)

No context found.

M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine Multicomputer. In Proceedings of the 28th Annual International Symposium on Computer Microarchitecture, Ann Arbor, Michigan, 1995.
