W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Flyer. The J-Machine: a Fine-Grain Concurrent Computer. In Information Processing `89, 1989.


This paper is cited in the following contexts:


Information Hiding in Parallel Programs - Foster (1992)   (5 citations)

....has so far focused on concepts. We now examine how virtual topologies, virtual channels, lightweight processes, and port arrays can be used to develop parallel programs. Although some multicomputers and operating systems incorporate certain of these abstractions as primitive mechanisms [25, 11, 28], it will in general be necessary to provide compile time or run time support. There is much to be gained from standardizing this support so that it can be reused in many applications. It is also desirable to define interfaces that encourage or enforce correct usage. One viable approach is to ....

....cessors. An annotation L on a block location denotes invocation of location function L; it causes the block to execute on the virtual processor with index returned by L. Port Arrays. A port declaration creates a one dimensional distributed array of definitional variables. A declaration port P [11] ; creates a port array P with 11 elements, distributed blockwise across the nodes of the virtual topology in which the port array is declared. For example, a declaration port p [2*nodes()] creates a port array p with 2*nodes() elements; p [2*i] and p [2*i+1] are located on the ith node ....
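
The blockwise distribution described in this excerpt can be illustrated with a short sketch. It is not taken from the cited paper: the function name owner_node and the requirement that the element count divide evenly across nodes are assumptions made for illustration, and the excerpt's example is read here as two port elements per node.

```python
def owner_node(index, num_elements, num_nodes):
    """Return the node holding element `index` under a blockwise distribution.

    Hypothetical helper: assumes num_elements is an exact multiple of
    num_nodes, so every node holds one contiguous block of equal size.
    """
    if num_nodes <= 0 or num_elements % num_nodes != 0:
        raise ValueError("num_elements must be a positive multiple of num_nodes")
    if not 0 <= index < num_elements:
        raise IndexError("index out of range")
    block_size = num_elements // num_nodes  # elements per node
    return index // block_size


# With two elements per node, elements 2*i and 2*i+1 share node i.
nodes = 4
for i in range(nodes):
    assert owner_node(2 * i, 2 * nodes, nodes) == i
    assert owner_node(2 * i + 1, 2 * nodes, nodes) == i
```

Integer division by the block size is the whole mapping; an uneven distribution would need a rule for the remainder, which the excerpt does not specify.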

Dally, W. J., et al., The J-Machine: A fine-grain concurrent computer, Information Processing 89, G. X. Ritter (ed.), Elsevier Science Publishers B.V., North Holland, IFIP, 1989.


A Compiler Approach to Scalable Concurrent Program Design - Foster, Taylor (1992)   (11 citations)

....[Figure 1: Compilation Strategy] architecture provides high performance message handling and fine grain process scheduling [36]. The J machine also provides high performance variable and code manipulation hardware [15]. All of these features may be used to replace unique components of the emulator design, providing high performance, native code versions of the system. Implementations of this type are currently under construction. 1.5 Summary The important characteristics of this approach are as follows. We ....

Dally, W. J., et al., The J-Machine: A fine-grain concurrent computer, Information Processing 89, G. X. Ritter (ed.), Elsevier Science Publishers B.V., North Holland, IFIP, 1989.


Fine-Grain Distributed Shared Memory on Clusters of Workstations - Schoinas (1997)   (3 citations)

....by suspending the computation and invoking a user level handler. A typical handler performs the actions dictated by a coherence protocol to allow the access and then resumes the computation. The fine grain access control mechanism is similar to full empty bits of dataflow architectures [DCF 89] but it is tailored to support the implementation of shared memory protocols. For this reason, it extends the two state model of the full empty bits to a three state model that includes a readonly state. More specifically, Tempest's fine grain access control is based on tagged memory blocks. ....

....with low latencies than the former. Among the key proposals that emerged from the multicomputer community have been the Berkeley active messages. The design has been heavily influenced by earlier work in message directed computation in the context of dataflow architectures and the J machine [DCF 89, PC90]. Berkeley active messages sought to reduce latencies by eliminating the software complexity associated with traditional multicomputer messaging interfaces. Tempest's messaging interface is based on the Berkeley active messages [vECGS92]. It differs from Berkeley active messages in ....

[Article contains additional citation context not shown here]

William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Nuth, Scott Wills, Paul Carrick, and Greg Flyer. The j-machine: A fine-grain concurrent computer. In G. X. Ritter, editor, Proc. Information Processing 89. Elsevier North-Holland, Inc., 1989.


Planar-Adaptive Routing (par) :low-Cost Adaptive Networks For.. - Jae Kim Eng

....memory is distributed across the processing nodes; only the data in local memory can be accessed directly. Access to data in remote memories is supported by message passing between processors. Direct networks, represented by grid or mesh networks, have been used predominantly in multicomputers [47, 45, 23, 46]. However, as direct networks gain acceptance in shared memory multiprocessor designs, distinguishing these machines by network topology is less appropriate. Though indirect networks provide several advantages such as the topological equidistance property, they suffer from a significant drawback: ....

....and the MIT Alewife [1] The number of memory references to distant memory units can be dramatically reduced by exploiting locality of reference. Another approach is to hide or tolerate the latency by overlapping useful work with communication latency. Multicomputers such as the MIT J machine [23] use context switching to tolerate remote object access. The TERA machine [5] uses fine grain multithreading to hide the latency, and the Stanford DASH and MIT Alewife also use the multithreading to complement remote memory access due to cache misses. However, the techniques we have described ....

[Article contains additional citation context not shown here]

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler. The J-Machine: A Fine-Grain Concurrent Computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


L'insegnamento di Nuove Tecnologie di Programmazione: Alcune.. - Briot

....we can highlight the following advantages: high level of expression, modularity, dynamicity, and openness. Although OOCP is currently a new and continually expanding field, decisive influences can already be seen on multiprocessor architectures (such as the J Machine [Dally et al. 89]) and in numerous applications in signal analysis systems [Barry 89], process control, office automation systems, and even animation. 2 Teaching In this section we will discuss the advantages brought by this new programming methodology to ....

W.J. Dally et al., "The J-Machine: a Fine-Grain Concurrent Computer", Proceedings of Information Processing Congress (IFIP'89), pages 1147-1153, August 1989.


Training in New Programming Technologies: an Experience - Briot (1992)

....of cooperative modules, and to execute them onto parallel computer architectures. Advantages may be summarized as following: high levelness, modularity, dynamicity, and openness. OOCP is a new growing field, but has already main impact on new multi processor architectures (like the J Machine [Dally et al. 89]) and various applications like signal processing [Barry 89], process control, office information systems, animation. 2 Teaching In this section we will discuss the issue of introducing this new programming methodology to students and conventional programmers. This discussion is based on our ....

W.J. Dally et al., "The J-Machine: a Fine-Grain Concurrent Computer", Proceedings of Information Processing Congress (IFIP'89), pages 1147-1153, August 1989.


FUGU: Implementing Translation and Protection in a .. - Mackenzie.. (1994)   (9 citations)

....Hybrid Deposit [19] proposes hardware to interpret messages as operations on pre negotiated buffer areas. FUGU's approach is to add protection while maintaining existing, well defined user level communication mechanisms and efficient, distributed shared memory. The J machine multicomputer [9] provides two levels of network priorities, user level access to the network hardware and the ability to relaunch incoming messages from memory transparently. The J machine is a single user machine with no support for shared memory or DMA on messages. The CM 5 multicomputer provides multiuser ....

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the IFIP (International Federation for Information Processing), 11th World Congress, pages 1147--1153, New York, 1989. Elsevier Science Publishing.


Adaptive routing on the Recursive Diagonal Torus - Funahashi And Hanawa

....exerted to implement Massively Parallel Computers (MPCs) with tens of thousands nodes. In these systems, the connection topology often dominates the system performance. Instead of hypercube used in first generation multicomputers, most recent machines take the 2 D or 3 D mesh (torus) network [1][2][3]. Although the diameter of a mesh network is large ($O(\sqrt{M})$ or $O(\sqrt[3]{M})$ for $M$ nodes), it only requires four or six links per node unlike the hypercube which requires $\log_2 M$ links per node. However, in an MPC with more than ten thousands nodes, the large diameter of the mesh network is ....
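
The trade-off in this excerpt (mesh diameter versus hypercube link count) can be checked with a little arithmetic. The sketch below is illustrative only and not from the cited paper: the function names are hypothetical, the mesh is assumed to have equal side lengths and no wraparound links, and the node count is assumed to be a perfect power.

```python
def mesh_diameter(nodes, dims):
    """Hop count between opposite corners of a dims-dimensional mesh.

    Assumes a mesh with equal sides and no wraparound (not a torus),
    so `nodes` must equal side ** dims for some integer side.
    """
    side = round(nodes ** (1.0 / dims))
    if side ** dims != nodes:
        raise ValueError("nodes must be a perfect dims-th power")
    return dims * (side - 1)


def mesh_links_per_node(dims):
    """Links on an interior mesh node: two neighbours per dimension."""
    return 2 * dims


def hypercube_links_per_node(nodes):
    """Links per node in a hypercube: log2 of the node count."""
    degree = nodes.bit_length() - 1
    if nodes < 1 or (1 << degree) != nodes:
        raise ValueError("nodes must be a power of two")
    return degree


M = 4096
print(mesh_diameter(M, 2), mesh_links_per_node(2))  # 2-D mesh, 64 x 64
print(mesh_diameter(M, 3), mesh_links_per_node(3))  # 3-D mesh, 16 x 16 x 16
print(hypercube_links_per_node(M))                  # hypercube, 12 dimensions
```

For 4096 nodes the 2-D mesh has diameter 126 with four links per node, the 3-D mesh has diameter 45 with six, and the hypercube needs twelve links per node, which is the asymmetry the excerpt points to.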

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, and S. Wills. The J-machine: A Fine-Grain Concurrent Computer. In IFIP 11th Computer Congress, pages 1147--1153, August 1989.


Bandwidth-Optimal Complete Exchange on Wormhole-Routed.. - Tseng, Lin, Gupta, Panda (1997)   (6 citations)

....Science, Duke University, Durham N.C. 27708, U.S.A, sandeep cs.duke.edu] x A preliminary version of this paper appeared in Int l Parallel Processing Symp. 1995 [19] 2 as hypercubes [1] Examples of machines with such topologies include the MasPar MP 1 [3] Intel Paragon, MIT J Machine [6], Tera HORIZON [17] Cray T3D [4, 13] and Polymorphic Torus [9] A torus is a mesh with wrap around links. Although meshes and tori are generally regarded as close families, there are still some distinctions: i) As opposed to meshes, all nodes of a torus are topologically symmetric, ii) a torus ....

W. J. Dally, et al. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, IFIP, pages 1147--1153, 1989.


Synchronization and Pipeline Design for a Multithreaded Massively.. - Sakai (1992)   (2 citations)

....pipeline. The concept of multithreading is not exclusive to the extension of dataflow architectures. For instance, the Denelcor HEP [1] and the Tera Computing System [6] are multithreaded computers in the sense that they execute and control multiple threads in a single pipeline. Dally's J machine [16] does not interleave multiple threads, but it can switch between threads very quickly; thus, we can say that it actually supports the multithreaded computation. In addition, Dally's new machine, called the M machine, has a mechanism of thread interleaving [17] where many threads can exist inside a ....

Dally, W., Chien, A., Fiske, S., Horwat, W., Keen, J., Larivee, M., Lethin, R., Nuth, P. and Wills, S.: The J-Machine: A Fine-Grain Concurrent Computer, Proc. of IFIP 89, pp.1147-1153 (1989).


Software Overhead in Messaging Layers: Where Does the Time Go? - Karamcheti, Chien (1994)   (32 citations)

....Second, the nodes, NI, and the network all have finite buffering, so software buffer management is required. Third, the CM 5 network provides error detection at the packet level, but no error correction, requiring a software 1 While this is not the most efficient type of network interface [12, 6], it requires no changes to the processor. Many researchers believe that this type of interface is representative of future network interfaces. protocol to ensure reliable delivery. And finally, the CM5 network hardware only supports packets with five 32 bit words, so a typical message is broken ....

....remains significant over the range of packet sizes. For finite sequence multi packet deliveries, the messaging overhead is lower, but still significant, accounting for 9 11 of the total cost. Improved network interfaces and DMA hardware If network interfaces can be integrated on chip, as in [12, 6], the basic cost of communication can be reduced, but this will not reduce protocol costs in the messaging layer on which our study focuses. If the base cost is reduced, that increases the importance of the costs in the rest of the messaging layer. Similarly, while DMA hardware can reduce the cost ....

[Article contains additional citation context not shown here]

W. J. Dally et al. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


Design and Evaluation of Network Interfaces for System Area.. - Mukherjee (1998)

....[Table 3.5: Comparison of CNI with other network interfaces] communicate through the cachable memory accesses, for which most processors and buses are optimized. Henry and Joerg [50] and Dally, et al. [34] advocate changes to a processor's registers. MIT Alewife [2] and Fugu [72] rely on a custom cache controller. MIT StarT NG [22] requires a co processor interface at the same level as the L2 cache. AP1000 [110] requires integrated cache and DMA controllers. Stanford FLASH [64, 48] uses a custom ....

William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Nuth, Scott Wills, Paul Carrick, and Greg Flyer. The J-Machine: A Fine-Grain Concurrent Computer. In G. X. Ritter, editor, Proc. Information Processing 89. Elsevier North-Holland, Inc., 1989.


How Much Adaptivity is Required for Bursty Traffic? - Ludmila Cherkasova Al

....may be more difficult to hide in a more conventional PE design, low latency message traffic becomes the primary goal. Adaptivity is costly [2] both in terms of router complexity and in terms of latency when suboptimal paths are chosen. Several low latency deterministic routers have been developed [7, 3] but we are still interested in the potential use of limited adaptivity to bypass temporary congestion in the fabric rather than the added latency required to just wait for the resource. A major concern we will address is how much routing adaptivity is enough for efficient transfer of different ....

Dally, W. J. et al.: The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the IFIP Conference, North-Holland, pp. 1147--1153, 1989.


Limits on Interconnection Network Performance - Agarwal (1991)   (70 citations)

....mappings, two or three dimensional networks are favored because they scale better than high dimensional networks, they are modular, and they are easy to implement. Examples of machine designs that use such networks are the MuNet [12] Ametek 2010 [26] the Caltech Mosaic [3] the MIT J machine [9], and the CMU Intel iWarp [4] Some recent distributed shared memory designs are also planning to use low dimensional direct networks, e.g. HORIZON [18] the Stanford DASH Multiprocessor [20] and the MIT Alewife machine [2, 6] The choice of the optimal network for a multiprocessor is highly ....

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In IFIP Congress, 1989.


Near-Minimum Parallel Time Reduction on Processor Arrays - Robert Wagner (1993)

....each dimension, a PE has two neighbors. Each PE can accept a single operand from any of its neighbors and perform an operation with that operand (with a stored result) in one timestep. We call this model the mesh PE SO model. The best current practical manifestation of this model is the J machine [2], which is a 3D mesh of MIMD processors with communication provided by a fast wormhole routing network. A timestep in our model includes both computation and communication; we will be measuring the communication time required to compute the reduction. 2 Wagner [5] has shown how to do reduction ....

....since 3D architectures are particularly feasible. In this case we seek a closed form solution to the linear programming problem in terms of the Z x i and Z n i constants. We will eliminate the variables to come up with this solution. Let our center point be c = x,y,z) c[0] x, c[1] y and c[2] = z. We have four direction vectors, s 0 = 1, 1, 1] s 2 = 1,1, 1] s 1 = 1, 1,1] s 3 = 1, 1, 1] Also we will let the notation Z i (c) x,y,z) be the value of Z i (c) when c = x,y,z] which is s i [x,y,z] The notation Z i (c) x,y,z) makes clear that we are solving for c in terms of ....

[Article contains additional citation context not shown here]

Dally, W., Chien, A., Fiske, S., Horwat, W., Keen, J., Larivee, M., Lethin, R., Nuth, P., and Wills, S., The J-Machine: A Fine-Grain Concurrent Computer, IFIP Congress '89 (August 1989).


A Lower Bound For Order-Preserving Broadcast In The Postal Model - Philip Mackenzie (1992)   (1 citation)

....is the job of the communication network to deliver the message from its source to its destination. In many cases, the actual topology of the network can be ignored, since passing a message between any pair of processors takes roughly the same time. For some examples of networks of this type, see [1,2,3,4]. To analyze systems of this type, Bar Noy and Kipnis [5] have developed the postal model, in which it is assumed that processors are connected in a complete network and there is simply a communication latency factor which measures the inverse ratio of the time it takes for a processor to send ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler, The J-Machine: a fine-grain concurrent computer, in Information Processing 89, 1989, 1147--1153.


Object-Oriented Concurrent Programming: Introducing a New.. - Briot (1992)

.... [Agh88] office information systems [Hew86] real time systems and (distributed) process control [Bar89] Significant results may also be found in other fields, e.g. text parsing [YO90] and computer music [CBS87] OOCP has also main impact on new multi processor architectures, like the J Machine [Dal89]. Object oriented concurrent programming fits specially well with the new area of artificial intelligence, called distributed artificial intelligence (DAI in short) DAI88] which takes much input from sociology and organization theories. DAI intends at solving problems in a distributed way ....

W.J. Dally et al., The J-Machine: a Fine-Grain Concurrent Computer. Proceedings of Information Processing Congress (IFIP'89), pages 1147--1153, August 1989.


Emulation of a Virtual Shared Memory Architecture - Raina (1993)   (3 citations)

....the communication and computation capabilities on a single chip similar to the Transputer. The first is the Texas Instruments TI 320C40 [170] which features six byte wide bidirectional links offering a total bandwidth of 120 Mbps. The other device is the message driven processor of the J Machine [52]. The hardware scheduler schedules processes by time slicing. Each time slice period lasts for 1024 high priority ticks (or 1 millisecond). If a low priority process has been running continuously for two time slice periods, the scheduler ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Flyer. The J-Machine: a Fine-Grain Concurrent Computer. In Information Processing `89, 1989.


Exploiting Two-Case Delivery for Fast Protected Messaging - Mackenzie, Kubiatowicz, .. (1998)   (14 citations)

....it only to let the operating system clear the network. A polling watchdog mode could be implemented in the FUGU system. Direct Network Interfaces. Several machines have provided direct network interfaces. These include the CM 5, the J machine, iWarp, the T interface, Alewife, and Wisconsin's CNI [20, 8, 5, 26, 1, 24]. These interfaces feature low latency by allowing the processor direct access to the network queue. Direct NIs can be inefficient unless placed close to the processor. Anticipating continued system integration, we place our NI on the processor cache bus. The CNI work shows how to partly ....

....a second logical network reserved to the operating system as a guaranteed path to backing store. The second network is used infrequently for this purpose so its performance is not critical. The network might be shared with some other use, such as supporting shared memory. An extra virtual channel [8] in the main network, a LAN or a service network would serve the purpose. Our emulator hardware provides a very simple, bit serial network. The second network provides a guarantee of deadlock avoidance, but performance would degrade severely if we were to routinely block the main network ....

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the IFIP (International Federation for Information Processing), 11th World Congress, pages 1147--1153, New York, 1989. Elsevier Science Publishing.


Crystal Scheme A Language for Massively Parallel Machines - Queinnec

....language itself, advanced concurrent constructs such as futures. Eventually we comment some simulations, with various topologies and migration policies, which enables to appreciate our previous linguistical choices and confirms the viability of the model. Massively parallel computers [Hewitt 80, Dally et al. 89, Germain et al. 90] are large ensembles comprising thousands of conventional but powerful processors equipped with independent memories. Their total throughput confers them tremendous computing potential but they still remain to be tamed. Such machines usually have a crystalline structure where a ....

William J. Dally et al., The J-Machine: A Fine-Grain Concurrent Computer, Proceedings of the IFIPS Conference 1989.


UDM: User Direct Messaging for General-Purpose.. - Mackenzie.. (1996)   (4 citations)

.... passing can use the same memory protection mechanisms as are used in uniprocessors [3, 22] On the other hand, systems supporting fine grain message passing are currently either single user machines, at best resorting to hard partitioning or strict gang scheduling to permit multiprogramming [17, 6, 1, 11] or use alternate techniques that generally add restrictions or overhead (see related work in Section 6) This paper develops a model of messaging called User Direct Messaging (UDM) which allows the application of the techniques of modern operating systems to multiprocessors without sacrificing ....

....mechanisms including DMA for bulk transfer and hardware synthesized messages for accelerating shared memory. [Figure 1: Direct message timing. Sender: compose (7 cycles). Network: transit time ( 1 uS). Receiver: receive occupancy (65 cycles, trap, 9 cycles, poll).] memory system [17, 6, 11, 8, 1]. Low overhead and latency are achieved by avoiding the memory system, so that message overheads scale with processor performance rather than with memory performance. Figure 1 represents the one way message latency and overheads predicted for our prototype system, FUGU, operating at 20MHz. By ....

[Article contains additional citation context not shown here]

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the IFIP (International Federation for Information Processing), 11th World Congress, pages 1147--1153, New York, 1989. Elsevier Science Publishing.


An Efficient Virtual Network Interface in the FUGU Scalable.. - Mackenzie (1998)   (1 citation)

.... advantage of the characteristics of the so called System Area Network (SAN) environment [77, 64, 81, 6, 19, 17, 29, 13] Higher performance network interfaces suitable for significantly finer grain parallel problems have been demonstrated in massively parallel processors as research prototypes [70, 7, 16, 1, 61, 2, 56] and as commercial machines [45, 69, 72] However, MPP work has largely ignored issues of mixed workloads that require multiprogramming, demand paging and interactive scheduling. A scalable workstation represents one vision of the convergence of SMP, cluster and MPP goals and technologies that ....

....et al. s CNI 16 Qm [56, 57] interface provides both a fast path and a (potentially virtual) buffered path by using the network interface to buffer messages. Hybrid solutions will be discussed in more detail in Chapter 8. Direct network interfaces, Figure 8 1a have been used in research machines [16, 7, 61, 1, 56] and one commercial machine, the CM 5 [45] These interfaces feature low latency by allowing the processor direct access to the network queue. Direct NIs can be inefficient unless placed close to the processor. Anticipating continued system integration, we place our NI on the processor cache bus. ....

[Article contains additional citation context not shown here]

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the IFIP (International Federation for Information Processing), 11th World Congress, pages 1147--1153, New York, 1989. Elsevier Science Publishing.


Cost Modeling and Analysis: Towards Optimal Resource Utilization.. - Moritz

....of semiconductor technology over the next 15 years. The SIA predicted that by 2010, industry would be manufacturing 800 million transistor processors with thousands of pins, a 1000 bit bus, and clock speeds over 2GHz. Several research groups are working today on billion transistor architectures [18] and have dramatically different views on how to use these chip level resources most efficiently. Suggested architectures range from advanced superscalar processors capable of issuing 16 to 32 instructions per cycle, chip multiprocessors 29 20 with 4 16 processor elements on a chip, to highly ....

....[17] Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal Baring it all to Software: Raw Machines. IEEE Computer, September 1997, pp. 86 93. [18] The future of microprocessors. IEEE Computer, September 1997 [19] J. Babb and R. Tessier and M. Dahl and S. Hanono and D. Hoki and A. Agarwal. Logic Emulation with Virtual Wires. IEEE Transactions on Computer Aided Design, VOL. 16, No.6, June 1997, pp. 609 626. 20] H. T. Kung. Memory ....

[Article contains additional citation context not shown here]

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer, Proceedings of the IFIP (International Federation for Information Processing), 11th World Congress, Elsevier Science Publishing, New York, 1989. pp. 1147-1153.


A Survey of User-Level Network Interfaces for System Area Networks - Mukherjee (1997)   (7 citations)

....cache bus interfaces to independent vendors nor promise to preserve that interface across different generations of processors. Processor register mapped ULNIs are much harder to design because these ULNIs are tightly coupled with the microprocessor. A few research projects such as MIT J machine [12] and the MIT M Machine [17] have explored register mapped ULNIs. Unfortunately, no microprocessor manufacturer have felt the need to provide a ULNI in their microprocessors because microprocessors are produced primarily for the uniprocessor PC market, and not for the multiprocessor market, and ....

William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Nuth, Scott Wills, Paul Carrick, and Greg Flyer. The J-Machine: A Fine-Grain Concurrent Computer. In G. X. Ritter, editor, Proc. Information Processing 89. Elsevier North-Holland, Inc., 1989.


Anatomy of a Message in the Alewife Multiprocessor - Kubiatowicz, Agarwal (1993)   (40 citations)

....at the source and, ideally, delivered directly to processor registers at the destination. Thus, efficient messaging facilities should permit direct transfer of information from registers to the network interface. Direct register to register transmission has been suggested by a number of architects [9, 10, 11, 12]. 2. Blocks of data that reside in memory often accompany such header information. Consequently, efficient messaging facilities should allow direct memory access (DMA) mechanisms to be invoked inexpensively, possibly on multiple blocks of data. This is important for a number of reasons, including ....

....(where a task frame or portion of the calling stack may be transmitted along with the continuation) 13] and distributed block I O (where both a buffer header structure and data may reside in memory) 3. Some modern processors, such as the Alewife s Sparcle processor [14] MOSAIC [15] and the MDP [10], can respond rapidly to interrupts. In particular, vectored interrupts permit dispatch directly to appropriate code segments, and reserved hardware contexts can remove the need for saving and restoring registers in interrupt handlers. This couples with efficient DMA to provide another advantage: ....

[Article contains additional citation context not shown here]

William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the IFIP (International Federation for Information Processing), 11th World Congress, pages 1147--1153, New York, 1989. Elsevier Science Publishing.


A Comparison of Architectural Support for Messaging in the TMC.. - Karamcheti (1995)   (27 citations)

....all these areas, evaluating the hardware support and messaging protocols required to provide robust performance for a range of dynamic and irregular traffic patterns. Research on specialized hardware support for messaging has focused primarily on integrating message processing within the processor [10, 14, 1, 25]. These approaches are effective in reducing point to point costs, but provide no solutions for network and output contention. In contrast, we have investigated messaging atop shared address space primitives and demonstrated that it can deliver performance robust over output contention. Research ....

William J. Dally et al. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


PROTEUS: A High-Performance Parallel-Architecture.. - Brewer, Dellarocas.. (1991)   (144 citations)

....coherence protocols for large scale multiprocessors. Technical Report MIT LCS TR 489, MIT Laboratory for Computer Science, September 1990. Che89] D. K. Chen. MaxPar: An execution driven simulator for studying parallel systems. Technical Report CSRD 917 and UILU ENG 89 8013, University of Illinois, October 1989. 24 ....

W. J. Dally et al. The J-machine: A fine-grain concurrent computer. In G.X. Ritter, editor, Proceedings of the IFIP Congress, pages 1147--1153. North-Holland, August 1989.


Expressing Fine-Grained Parallelism Using Distributed Data.. - Suresh Jagannathan

....multiprocessor hardware. A distributed data structure (also referred to as a distributed object) is a device that serves as a communication and synchronization repository for a collection of asynchronously executing processes. Explicitly parallel languages such as FCP[16] Concurrent Smalltalk[10, 12], C Linda[6] MultiLisp[11] etc. permit many producers and consumers simultaneously to modify and read the contents of shared distributed objects. Consumers that access a component of such an object block until a producer provides a value. Advocates argue that programming with distributed data ....

....parallelism. This has not been the case, however, for two reasons. First, the generality afforded by the semantics of most distributed object proposals makes generating efficient low level representations difficult in the absence of advanced compile time analysis or specialized hardware[2, 10]. No system to our knowledge has seriously pursued the former approach; the latter alternative suffers from lack of portability and high cost. Consequently, many languages that implement some form of distributed data structure do so by requiring user annotations to aid the compiler in generating a ....

William Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In Proceedings of the 1989 IFIP Conference, 1989.


Computing Global Combine Operations in the.. - Bar-Noy, Bruck, Ho, .. (1996)   (Correct)

....emerging trends in modern distributed memory parallel computers and high speed communication networks. A related model, the LogP model [18] was also proposed recently to address similar goals. Systems that are well modeled by the postal model include parallel computers like the J Machine [19], the CM 5 [29] the Vulcan system [34] and communication networks such as PARIS [16] and AURORA [17] Regarding the topology of the message passing system, as in the postal model, we assume that the communication between any two processors has the same characteristics. Namely, we assume that the ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler, The J-Machine: a fine-grain concurrent computer, Information Processing 89, Elsevier Science Publishers, IFIP, 1989.


Efficient Implementations of Concurrent.. - Yonezawa, Matsuoka, .. (1992)   (1 citation)  (Correct)

....active objects. A recent breed of COOP (Concurrent Object Oriented Programming) languages attempt to provide maximum computational and modeling power through concurrency of objects[2, 3, 19, 20] Several research projects including those at University of Illinois[3] University of Tokyo[19] MIT[5], ETL[13] and MCC[11] have been actively pursuing the concurrent object approach in designing and implementing languages, designing hardware architectures, developing applications, and laying theoretical foundations as well. The major hindrance against the wide spread use of the approach had ....

....performance results on a real hardware to achieve the performance of up to nearly 10 µseconds (130 clocks) total for a remote object creation followed by a request message send to the created object and a reply reception from the object. This can be favorably compared to the Cosmos J Machine[5], which is highly optimized for Concurrent OO computation as our required machine cycles are considerably smaller. Having completed the preliminary phase of the ABCL onEM 4 project, we then initiated the second implementation project ABCL onAP1000. Since AP1000 does not have special hardware ....

[Article contains additional citation context not shown here]

William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Lethin, Peter Nuth, and Scott Wills. The J-Machine: A fine-grain concurrent computer. Proc. of the IFIP 11th World Computer Congress, pages 1147--1153, San Francisco, August/September 1989.


Cyclic-Cubes: A New Family of Interconnection Networks of Even.. - Fu, Chau (1998)   (2 citations)  (Correct)

....node, hence we are interested in fixed degree networks. Some fixed degree Cayley graphs are known, e.g. n cycles in [1] the connected cycles in [5] the Cube Connected Cycles in [6] the Cayley graphs proposed in [7] and the k ary n cube which has been used in the design of a number of machines [8, 9, 10, 11, 12]. The k ary n cube graphs have even fixed degrees and the other graphs mentioned above have fixed degrees at most 4. In this paper, we propose a new family of Cayley graphs with fixed degree of any even number greater than or equal to 4. Since for each graph in this new family, after contracting ....

W.J. Dally et al., The J-Machine: A fine-grain concurrent computer, Elsevier Science Publishers B.V., 1989.


General Purpose Parallel Computing - McColl (1993)   (64 citations)  (Correct)

....objects and can be passed in messages. The graph of possible interactions between actors can thus change dynamically. The actor model provides a convenient framework for concurrent object oriented programming [10] Dally has developed an interesting parallel architecture, called the J Machine [69, 70, 72], which supports the actor model. The dataflow model has evolved considerably over the last decade. Modern designs for dataflow architectures [128, 129, 200, 201, 206] emphasise the importance of ideas such as efficient multithreading and the exploitation of parallel slackness, in the same way as ....

W J Dally, A Chien, S Fiske, W Horwat, J Keen, M Larivee, R Lethin, P Nuth, and S Wills. The J-Machine: A fine-grain concurrent computer. In G X Ritter, editor, Proc. Information Processing 89, pages 1147--1153. Elsevier Science Publishers, B. V., 1989.


The M-Machine Multicomputer - Fillo, Keckler, Dally, Carter.. (1995)   (22 citations)  Self-citation (Dally)   (Correct)

....which are copied into the buffer and sent again later. Discussion: The M Machine provides direct register to register communication, avoiding the overhead of memory copying at both the sender and the receiver, and eliminating the dedicated memory for message arrival, as is found on the J Machine [8]. Register mapped network interfaces have been used previously in the Mars Machine [2] J Machine, and iWarp [4] and have been described by T [26] as well as Henry and Joerg [15] However, none of these systems provide protection for user level messages. Systems, like the J Machine, that provide ....

DALLY, W. J., ET AL. The J-Machine: A fine-grain concurrent computer. In Proceedings of the IFIP Congress (Aug. 1989), G. Ritter, Ed., North-Holland, pp. 1147--1153.


Execution of Dataflow Programs on General-Purpose Hardware - Spertus (1992)   (1 citation)  Self-citation (William)   (Correct)

....1. Each level has its own set of registers, and priorities 0 and 1 have separate message queues. Background execution is interrupted by a priority 0 message, which in turn will be interrupted by any priority 1 messages. Several J Machines have been built, including one with 128 processors. See [9, 11] for a complete description of the MDP and the J Machine. 1.2 Previous Experiments in Executing Dataflow Programs on the J Machine 1.2.1 Dataflow Graphs Dataflow compilers convert programs into dataflow graphs, where the nodes of the graph represent operators, and the arcs represent ....

Dally, William J., et al. The J-Machine: A Fine-Grain Concurrent Computer. Information Processing 89, Proceedings of the IFIP Congress, 1989.


Planar-Adaptive Routing: Low-cost Adaptive Networks for.. - Chien, Kim (1992)   (136 citations)  Self-citation (Chien)   (Correct)

....routing [15] the ideas apply to virtual cut through [21] and store and forward networks as well. Overloaded Channels Figure 1: Four packets and their routing paths under deterministic, dimension order routing. 2 The Problem Most existing multicomputer routing networks use deterministic routing [32, 30, 13, 31]. Although there are numerous paths between any source and destination, in order to avoid deadlock, deterministic routing defines a single path from source to destination. Fixed, single path routing prevents effective use of the network s density of physical interconnection because the physical ....

....significantly reduces the amount of hardware required and should reduce the time to setup and drive data across the switches. Low connectivity requirements also make it possible to use organizations which allow the router performance to be further optimized for high speed, low latency performance [14, 13]. In planar adaptive routers, the routing function prevents deadlock, completely independent of the flow control. No routing decisions depend on the presence or absence of flits in particular network buffers. This allows routing and flow control decisions to be made separately, decoupling the ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


The Cost of Adaptivity and Virtual Lanes in a Wormhole Router - Aoyama, Chien (1995)   (13 citations)  Self-citation (Chien)   (Correct)

....Router latency is approximately 50 ns and channel data rates are as high as 90 MB/s, using byte wide links. Derivatives of MRCs are used in several research machines [1, 20, 30]. The J Machine Router: The J Machine is a fine grained concurrent computer developed at MIT [16, 29]. The J machine network is a three dimensional mesh, with bidirectional 9 bit channels, and dimension order, wormhole routing. The J Machine network uses two virtual channels to support two logically independent message priorities and a globally synchronous clock. The data throughput is 36 MB/s ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


An Evaluation of Planar-Adaptive Routing (PAR) - Jae Kim Andrew (1992)   (4 citations)  Self-citation (Chien)   (Correct)

....been touted as scalable parallel architectures, in fact their scalability is limited by the performance of their interconnection networks. One reason why networks do not achieve their full potential bandwidth is restrictive routing policies. Most existing multicomputers use deterministic routing [13, 11, 6, 12] due to its simplicity. Any deter 1 The research described in this paper was supported in part by National Science Foundation grant CCR 9209336, Office of Naval Research grant N00014 92 J 1961, and National Aeronautics and Space Administration grant NAG 1 613. Additional support has been ....

William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Lethin, Peter Nuth, Scott Wills, Paul Carrick, and Greg Fyler. The J-Machine: A Fine-Grain Concurrent Computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


Using Attributed Flow Graph Parsing to Recognize Programs - Wills (1994)   (6 citations)  Self-citation (Wills)   (Correct)

....other existing recognition system is a 300 line database program recognized by CPU[12] All other systems work with toy programs on the order of tens of lines. We empirically and analytically studied the computational cost of GRASPR s parsing algorithm with respect to the simulator programs [4]. Since the algorithm is essentially constrained search, it is exponential in the worst case. However, in the practical application of graph parsing to recognizing complete instances of clichés, constraints are strong enough to prevent exponential behavior in practice. In particular, structural ....

W. Dally, A. Chien, S. Fiske, W. Horwat, J. Keene, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler. The J-Machine: A fine-grain concurrent computer. In Int. Fed. of Info. Processing Societies, 1989.


Do Faster Routers Imply Faster Communication? - Karamcheti, Chien (1994)   (6 citations)  Self-citation (Chien)   (Correct)

....transfers, and is implemented using CMAM xfer function which splits up the transfer into a sequence of hardware packets at the source, and CMAM handle left xfer function which reassembles the packets at the destination. 1 While this is not the most efficient type of network interface [13, 8, 4], it has the significant virtue that no changes to the processor are required. Many researchers believe that this type of interface is basically representative of future network interfaces. 2 The CM 5 NI also supports an interrupt driven interface for reception; however, the cost is very high ....

....exploring what impact advanced network features (adaptive routing, virtual channels) have on network interface complexity and software overhead. Our work addresses some of these issues. Research on network interfaces has focused primarily on reducing message injection (and reception) overhead [13, 8, 19, 4] or offloading the communication onto a coprocessor [14, 16, 3] Such efforts are complementary to our goal of software protocol overhead reduction. Improvements in network interface can reduce the basic communication cost in our studies. While reducing the basic cost is important, as can be seen ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


The Concert System -- Compiler and Runtime Support for.. - Andrew Chien Vijay (1993)   (11 citations)  Self-citation (Chien)   (Correct)

.... drawback of fine grained, concurrent object oriented languages to date has been their inefficiency (compared to their competitors such as parallel FORTRAN dialects) In addition, the most efficient implementations of such languages have relied on specialized hardware to achieve high performance [15, 36, 42]. The primary goal of the Concert project is to develop compiler and runtime techniques to make fine grained concurrent object oriented languages portable and efficient. By portable and efficient, we mean that the programs should run efficiently both on uniprocessors and on parallel computers ....

....basic thread scheduling, etc. is also discussed. Efficient concurrent object oriented language implementations must provide a global object namespace, communication services for remote method invocation, and support for scheduling method invocations. Though implementations on custom hardware [15, 36, 42] focus on providing a few general purpose primitives, runtime systems on stock hardware require a different approach. The hardware structure of such systems necessarily implies a hierarchy of costs for many basic runtime operations. These cost distinctions must be recognized and managed to obtain ....

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


Planar-Adaptive Routing: Low-cost Adaptive Networks for.. - Chien, Kim (1992)   (136 citations)  Self-citation (Chien)   (Correct)

....(virtual cut through or wormhole routing) the ideas apply to store and forward networks as well. 2 The Problem Overloaded Channels Figure 1: Four packets and their routing paths under deterministic, dimension order routing. Most existing multicomputer routing networks use deterministic routing [22, 20, 8, 21]. Although there are numerous paths between any source and destination, in order to avoid deadlock, deterministic routing defines a single path from source to destination. This means that the interconnection networks cannot make effective use of the density of their physical interconnection ....

....significantly reduces the amount of hardware required and should reduce the time to setup and drive data across the switches. Low connectivity requirements also make it possible to use organizations which allow the router performance to be further optimized for high speed, low latency performance [9, 8]. The extensions to planar adaptive routing for fault tolerance and packet ordering add only slightly to the router complexity. A single bit in message headers can be used to tag packets that are currently undergoing misrouting. The routing logic examines the tag when selecting paths and uses that ....

William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Lethin, Peter Nuth, Scott Wills, Paul Carrick, and Greg Fyler. The J-Machine: A fine-grain concurrent computer. In Information Processing 89, Proceedings of the IFIP Congress, pages 1147--1153, August 1989.


Emulation of a Virtual Shared Memory Architecture - Raina (1993)   (3 citations)  (Correct)

No context found.

W. J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Flyer. The J-Machine: a Fine-Grain Concurrent Computer. In Information Processing `89, 1989.


Analyzing NIC Overheads in Network-Intensive Workloads - Binkert, Hsu, Saidi.. (2005)   (Correct)

No context found.

William J. Dally et al. The J-Machine: A fine-grain concurrent computer. In G. X. Ritter, editor, Information Processing 89, pages 1147--1153. Elsevier North-Holland, Inc., 1989.


The Performance Potential of an Integrated Network.. - Binkert, Dreslinski.. (2004)   (Correct)

No context found.

W. J. Dally et al. The J-Machine: A fine-grain concurrent computer. In G. X. Ritter, editor, Information Processing 89, pages 1147--1153. Elsevier North-Holland, Inc., 1989.


Analyzing NIC Overheads in Network-Intensive Workloads - Binkert, Hsu, Saidi.. (2004)   (Correct)

No context found.

William J. Dally et al. The J-Machine: A fine-grain concurrent computer. In G. X. Ritter, editor, Information Processing 89, pages 1147--1153. Elsevier North-Holland, Inc., 1989.


Distributed Paging for General Networks - Awerbuch, Bartal, Fiat (1996)   (36 citations)  (Correct)

No context found.

William J. Dally et al. The J-Machine: A fine-grain concurrent computer. In G.X. Ritter, editor, Proceedings of the IFIP Congress, pages 1147--1153. North-Holland, August 1989.


Issues In Software Support For Parallel I/O - Bordawekar (1993)   (Correct)

No context found.

W.J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, and M. Larivee. The J-Machine: A Fine-Grain Concurrent Computer. Information Processing 89, Proceedings of the IFIP Conference, pages 1147--1153, August 1989.


Distributively-Competitive Online Paging for.. - Awerbuch, Bartal, Fiat (1994)   (Correct)

No context found.

William J. Dally et al. The J-Machine: A fine-grain concurrent computer. In G.X. Ritter, editor, Proceedings of the IFIP Congress, pages 1147--1153. North-Holland, August 1989.

CiteSeer.IST - Copyright Penn State and NEC