| M. Fillo et al., "The M-Machine Multicomputer." Proc. 28th Ann. Int'l Symp. Microarchitecture, IEEE Press, 1995, pp. 146-156. |
....idempotent (cycles) slowdown buffered (cycles) idempotent (cycles) slowdown buffered (cycles) idempotent (cycles) slowdown
reverse 8198 30053 3.67 3333 13931 4.18 717 1421 1.98
quicksort 195356 285565 1.46 124230 147229 1.19 96745 113460 1.17
[6] Both link level [1] and end to end [2, 14] protocols have been proposed which improve system performance by limiting message injection rates; the end to end flow control protocol uses a two-message handshake identical to the first two messages of our three-message handshake. By contrast, discarding networks do not experience traffic jams ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, Whay S. Lee, "The M-Machine Multicomputer", Proc. MICRO-28, 1995, pp. 146-156.
....Threads of execution within a program are typically loop iterations or multiple paths of a control structure. These threads most often have cross iteration data dependences that are difficult, if not impossible, for the compiler to detect at compile time. Therefore, multithreaded architectures [6, 8, 14, 18, 21] require hardware to support data dependence checking and speculative execution. In many multithreaded architectures, the compiler identifies possible data dependences and then special hardware determines, at runtime, whether these data dependences are true dependences or simply false alarms. True ....
....work and in Section 7 we conclude with our final recommendation of the best cache structure for multithreaded computer systems. 2. Background and Motivation Several multithreaded architectures have been proposed which support synchronization and communication between threads. In the M-Machine [6], XIMD [21], Elementary Multithreading [8], and Multiscalar approaches [14], data values are speculated and then, if the speculations turn out to be incorrect, the thread which speculated incorrectly is terminated and restarted. This requires each read and write to memory to be checked and ....
M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, W. Lee, "The M-Machine Multicomputer," International Symposium on Microarchitectures (MICRO), November 1995.
....Work There are two primary works of interest on the topic of multithreading for horizontal architectures. In processor coupling multiple threads are scheduled statically and interleaved into execution clusters, consisting of a set of function units and a common register file, at run time [6] [7]. Threads, which are generated by the compiler through explicit fork and forall operations, communicate through registers and memory and are non speculative, unlike the dual thread Weld model. XIMD [8] has multiple functional units and a large global register file similar to VLIW EPIC ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich and W. S. Lee, "The M-Machine Multicomputer," in Proc. 28th Ann. Int'l Symp. Microarchitecture, Ann Arbor, MI, Dec. 1995.
....instructions across all issue slots and therefore there is one to one mapping between functional units. Weld has neither of these limitations. Also, threads are the different benchmarks or programs in their experiments unlike multithreading from a single program in Weld. In Processor Coupling [8][9], several threads are scheduled statically and interleaved into clusters at run time. A cluster consists of a set of functional units that share a register file. Operations from different threads compete for a functional unit within a cluster. Interleaving does not occur across all issue slots, ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich and W. S. Lee, "The M-Machine Multicomputer," in Proc. 28th Ann. Int'l Symp. Microarchitecture, Ann Arbor, MI, Dec. 1995.
....instructions across all issue slots and therefore there is one to one mapping between functional units. Weld has neither of these limitations. Also, threads are the different benchmarks or programs in their experiments unlike multithreading from a single program in Weld. In Processor Coupling [7][8], several threads are scheduled statically and interleaved into clusters at run time. A cluster consists of a set of functional units that share a register file. Operations from different threads compete for a functional unit within a cluster. Interleaving does not occur across all issue slots. ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich and W. S. Lee, "The M-Machine Multicomputer," in Proc. 28th Ann. Int'l Symp. Microarchitecture, Ann Arbor, MI, Dec. 1995.
....the first thread is rescheduled, its communication operations have concluded. Multithreading can be done in software or hardware. Software multithreading is very expensive. Some hardware multithreading research architectures for message passing systems such as the J-Machine [35] and the M-Machine [52] have been reported. In precommunication, communication operations are pulled up from the place that communications naturally occur in the program so that it is partially or entirely completed before data is needed. This can be done in software by inserting a precommunication operation, or in ....
....major sources of the communication overhead. The communication hardware aspect includes the architecture and placement of the network interface, and the interconnection network and its services. Many architectures have been proposed for the network interfaces. They are classified as (1) direct [52, 7, 63, 80, 97, 88] and (2) memory based [48, 112, 126, 23]. Direct network interfaces allow a processor to directly access the network queue. However, they mostly ignore the issue of multiprogramming. That is, only a single thread can use the network interface at a time. Memory based interfaces provide protection ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich and W. S. Lee, "The M-Machine Multicomputer", Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture, 1995.
....instructions from a single thread until the thread issues an instruction that causes a long latency memory access. Alewife also tries to avoid memory delays by using a data cache in addition to context switching. To increase parallelism, the XIMD [30], Elementary Multithreading [31], M-Machine [32], Simultaneous Multithreading [33], Multiscalar [34], and SPSM [35] architectures execute instructions from multiple threads simultaneously. The XIMD, Elementary Multithreading, M-Machine, and Multiscalar architectures support synchronization and communication between threads to allow the execution ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee, "The M-Machine multicomputer," in Proceedings of the 28th Annual International Symposium on Microarchitecture, November 29--December 1, 1995, pp. 146--156.
....it is filled with an operation from another thread through a dynamic interleaver. Dynamic interleaver does not interleave instructions across all issue slots and therefore there is one to one mapping between functional units. SST has neither of these limitations. In Processor Coupling [1] [13], several threads are scheduled statically and interleaved into clusters at run time. A cluster consists of a set of functional units that share a register file. Operations from different threads compete for a functional unit within a cluster. Interleaving does not occur across all issue slots, ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee, "The M-Machine Multicomputer," in Proc. 28th Ann. Int'l Symp. Microarchitecture, (Ann Arbor, MI), pp. 146--156, Dec. 1995.
....there has been interest at studying single program speculative execution at the thread level. The Multiscalar [7, 20] work introduced and popularized this idea. The Multiscalar processor favors a hardware centric approach and synchronizes register flow between tasks. The XIMD [27], M-Machine [6], Simultaneous Multithreading [23], SPSM [5], Hydra [17], Stampede [21], Raw [25], Impact [10], and Superthreading [22] architectures all propose a single chip concurrent multithreaded architecture. These architectures either require independent threads, or speculate on control dependencies and/or ....
Marco Fillo, Stephen W. Keckler, William Dally, Nicholas Carter, Andrew Chang, Yevgeny Gurevich, and Whay Lee, "The M-machine multicomputer," in MICRO-28, pp. 146-156, 1995.
....both the basic and advanced schemes. The cost benefit scheme used could be applied or modified for use with VLIW architectures as well. There are also novel cluster based architectures that are less heavily focused on compile time cluster assignment, such as Multiscalar [16] and the M-Machine [17]. Since our work focuses on compile time assignment, these works are not discussed further. Future work is planned for adapting UAS to dynamically scheduled microarchitectures. 3 Unified Assign and Schedule Schedule time resource availability is not checked in BUG or Limited Connectivity VLIW since ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee, "The M-Machine multicomputer," in Proc. 28th Ann. Int'l Symp. Microarchitecture, (Ann Arbor, MI), pp. 146--156, Dec. 1995.
....IRAM, 32:1 memory 1.5 2.0 2.5 2.5 4.5 4. Related Work 18 3) Multiprocessors. This category includes chips intended exclusively to be used as a building block in a multiprocessor, IRAMs that include a MIMD (Multiple Instruction streams, Multiple Data streams) multiprocessor within a single chip [Fil95][Kog95][Mur97], and IRAMs that include a SIMD (Single Instruction stream, Multiple Data streams) multiprocessor, or array processor, within a single chip [Aim96][Ell92]. This category is the most popular research area for IRAMs. Figure 6 places uniprocessor and multiprocessor chips on a chart ....
Fillo, M.; Keckler, S.W.; Dally, W.J.; Carter, N.P.; and others. "The M-Machine multicomputer". Proceedings of MICRO'95: 28th Annual IEEE/ACM International Symposium on Microarchitecture, Ann Arbor, MI, USA, 29 Nov.-1 Dec. 1995. p. 146-56.
....via interconnection network [2]. In addition to the above, Computational RAM [3] and PIP RAM [1] are proposed as SIMD (single instruction, multiple data) approaches. MIMD (multiple instruction, multiple data) approaches such as PPRAM R and Execube are also proposed [5]. Without merged DRAM logic, M-Machine [4] and Hydra [10] are proposed as on chip multiprocessors. 6 Conclusion We show an overview of PPRAM, PPRAM R, PPRAM R 256 4 under development, the result of preliminary evaluation, and related works. Only a few studies have so far been made at VLSI architecture that take account of merged ....
Fillo, M., Keckler, S. W., Dally, W. J., Carter, N. P., Chang, A., Gurevich, Y., Lee, W. S., "The M-Machine Multicomputer," Proceedings of the 28th Annual International Symposium on Microarchitecture, pp. 146--156, 1995.
....aggressive data communication and synchronization mechanisms between threads to exploit more fine-grained parallelism. In addition, multiple functional units can be shared among threads for better utilization. Many concurrent multiple threaded processor architectures have been proposed and studied [1, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 19]. Some of them [4, 7, 11, 14] are primarily for increasing system throughput by allowing multiple programs (one program for each thread) to be run concurrently. In this paper, we focus on models that are primarily for speeding up the execution of one single program. Among them, models such as ....
....[16] and SPSM [3] allow tasks that are independent, such as the iterations of a do all loop, to be executed in parallel. This restriction can simplify the design, but limits the exploitable parallelism. Models such as HEP [10], Tera [1], XIMD [19], Elementary Multithreading [8], and M-Machine [5, 12] allow data synchronization and communication between threads. These models rely on compilers to detect dependences between threads, and to insert explicit data synchronization and communication commands in a program. They do not support run time dependence checking. Hence, the compilers must be ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 146--156, November 29--December 1, 1995.
....performance to Latency, Occupancy and Grain size for measuring the impact of communication overhead on large scale, fine grain parallel computing. The thesis is focused on the interface between user level programs and the communication system hardware. The multithreaded, multi ALU MIT M-Machine [1] is used as the experimental platform for this study. (Footnote 1: Processor occupancy refers to the amount of productive computation displaced by message-related operations.) In the resulting M-Machine system, only five cycles of processor occupancy are consumed in sending a null message. It thus ....
....necessary. 1.2 Approach This thesis focuses on the message interface. Conventional message interfaces that take tens to thousands of cycles to send a message present a bottleneck in the advent of advanced processors that are capable of generating multiple results on chip every cycle (e.g. [1, 2, 3]) and high speed signaling pads [11, 12] that are able to carry that data off chip quickly. In terms of robustness, the message interface is also the weakest link in the communication system. As user programs share the resources such as message buffers, care must be taken to shield them from one ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, Whay S. Lee, "The M-Machine Multicomputer," in Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995, pp. 146--156.
....such as the size of the memory buffer, the bandwidth requirement of the communication links between thread processing units, and the bandwidth requirement of the shared data cache. 1 Introduction Recent concurrent multithreaded architectures (CMAs) such as the multiscalar [3, 11], the M-Machine [2], the simultaneous multithreaded architecture [16], and others [4, 8, 13] have shown that exploiting thread-level parallelism is a viable approach to improve the scalability of existing single threaded superscalar architectures. Using multiple threads of control to fetch This work is supported ....
....locations of a program simultaneously allows compilers and processors to exploit more parallelism from multiple instruction windows. Among these CMA models, some of them [16, 8] only support concurrent execution of loosely coupled threads similar to a multiprocessor on a chip, while others [4, 11, 2] allow more tightly coupled threads to execute in parallel with hardware support for direct data transfer between threads. Some of them also provide thread level control and data speculation to exploit high level program structures such as DO While loops and objects referenced through pointers. In ....
M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 146--156, November 29--December 1, 1995.
....represent the past and the current research efforts in the multithreading community. The architectures included in the discussion are Tera [Alverson90], MIT's StarT project ([Ang92], [Chiou95], [Nikhil92]), Electrotechnical Lab's EM-X ([Kodama95], [Sakane95]), MIT's Alewife [Agrawal95], M-Machine [Filo95], and Simultaneous Multithreading [Tullsen95, 96]. 5.1. Tera MTA Tera MTA (MultiThreaded Architecture) computer is a multistream MIMD system developed by Tera Computer Company [Alverson90]. It is the only commercially available multithreaded architecture that will become available in 1997. The ....
....increased circuit density by devoting more chip area to the processor. It is claimed that a 32 node M-Machine system with 256 MBytes of memory has 128 times the peak performance of a uniprocessor with the same memory capacity at 1.5 times the area, an 85 times improvement in peak performance/area [Filo95]. The M-Machine consists of a collection of computing nodes interconnected by a bidirectional 3-D mesh network. Each node consists of a multi ALU processor (MAP) and 8 MBytes of synchronous DRAM. A MAP contains four execution clusters, four cache banks, a network interface, and a router. Each of ....
Fillo, M. et al., "The M-Machine Multicomputer," Proceedings of MICRO-28, 1995 (Also available as MIT AI Lab Memo 1532).
....and processed at their destinations with very low overhead, possibly by directly accessing processor registers. The time needed to process an active message at its destination is kept to a minimum by placing the address of an interrupt handler, or even an opcode, within the message. See [7,22,10] for active message implementation for the J- and M-Machines, [14] for FLASH, and [15] for EM-X. 5.2 Remote Operation Discussion Exo ops offer greater flexibility than systems providing atomic operations and the simple smart memory operations described above. Although systems providing protocol ....
M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, and W.S. Lee, "The M-Machine multicomputer," in Proc. of the 28th Annual Int. Symp. on Microarchitecture, Nov. 1995, pp. 146--156.
....limit in available instruction level parallelism; that is, our ability to extract ILP is growing less rapidly than our ability to integrate larger numbers of functional units on a single processor. This has led to several innovative paradigms in recent years to use functional units in novel ways [7, 9, 10, 23, 25, 27, 34, 36], as opposed to simply increasing the issue width of traditional superscalar designs. To address these problems, we present an architecture that, like the multiscalar paradigm [10], the M-Machine architecture [9], Raw processors [36], single chip multiprocessors [27], and simultaneous ....
....in recent years to use functional units in novel ways [7, 9, 10, 23, 25, 27, 34, 36], as opposed to simply increasing the issue width of traditional superscalar designs. To address these problems, we present an architecture that, like the multiscalar paradigm [10], the M-Machine architecture [9], Raw processors [36], single chip multiprocessors [27], and simultaneous multi threading [34], maps well to future process technologies that are dominated by interconnect overheads and thus demand decentralization at an architecture level. Like the recent investigations of vector processing to ....
M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. "The M-Machine multicomputer." In Proc. 28th Annual International Symposium on Microarchitecture (MICRO-28), Ann Arbor MI, November 1995, pp. 146--156.
....must be partitioned into independent physical regions, and the latency for communicating among partitions must be exposed to the microarchitecture and possibly to the ISA. This observation is not new; a number of researchers and product teams have proposed or implemented partitioned architectures [8, 9, 10, 12, 15, 21, 25, 29]. However, many of these architectures use conventional communications mechanisms, or rely too heavily on software to perform the application partitioning. The best combination of static and dynamic communication and partitioning mechanisms, which lend themselves to the high bandwidth, high-latency ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture, pages 146--156, December 1995.
....must be partitioned into independent physical regions, and the latency for communicating among partitions must be exposed to the microarchitecture and possibly to the ISA. This observation is not new; a number of researchers and product teams have proposed or implemented partitioned architectures [8, 9, 10, 12, 15, 21, 25, 29]. However, many of these architectures use conventional communications mechanisms, or rely too heavily on software to perform the application partitioning. The best combination of static and dynamic communication and partitioning mechanisms, which lend themselves to the high bandwidth, ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Microarchitecture, pages 146--156, December 1995.
....incorporates a mix of mechanisms that boost performance without compromising protection. Performance evaluation results are presented in Section 3. In Section 4, we discuss additional issues when the message interface is integrated into multithreaded systems. 2 The MIT M Machine The MIT M Machine [1] is an experimental multicomputer designed to exploit parallelism with a wide range of granularity. It consists of an array of Multi ALU Processor (MAP) nodes connected in a 2 D network supporting two message priorities. The network links have a 1 word cycle raw bandwidth. 2.1 The MAP Chip Figure ....
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, Whay S. Lee, "The M-Machine Multicomputer," in Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995, pp. 146--156.
M. Fillo et al., "The M-Machine Multicomputer." Proc. 28th Ann. Int'l Symp. Microarchitecture, IEEE Press, 1995, pp. 146-156.
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, Whay S. Lee, "The M-Machine Multicomputer", Proc. MICRO-28, 1995, pp. 146-156.
M. Fillo et al., "The M-Machine Multicomputer," Proc. 28th Ann. Int'l Symp. Microarchitecture, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 146-156.
M. Fillo et al., "The M-Machine Multicomputer," Proc. MICRO 95: 28th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE, 1995, pp. 146-156.