A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared-memory Parallel Machine. pages 27--43, June 1982.

This paper is cited in the following contexts:
Toward The Design Of Large-Scale, Shared-Memory Multiprocessors - Scott (1992)   (3 citations)  (Correct)

....A similar phenomenon occurs for omega networks, where even small to medium scale networks have long wires. Although the length of the longest wire is O(N) if the network is laid out as shown in Figure 2.2, a good three-dimensional layout can result in maximum wire lengths of O(N^(1/2)) [Gott83]. The remainder of this thesis will consider only k-ary n-cube networks, although much of the analysis would be germane to multistage networks as well. This choice is partially because k-ary n-cubes provide a very flexible and general interconnect to study (allowing many configurations, from rings ....

....appealing in larger systems, where the contention for shared data may be higher. Contention for a particular memory location or module in a large system is known as a hot spot, and has been the focus of much research. Hardware mechanisms for combining requests in the network have been proposed [Gott83, Pfis85], as well as mechanisms to improve network performance in the face of contention [Tami88, Scot90, Lang88]. These mechanisms are not necessary under the uniform workload assumption, as the average rate of read requests to a line is independent of system size. However, certain workloads (those using ....

[Article contains additional citation context not shown here]

Gottlieb, A., R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, The NYU Ultracomputer -- Designing a MIMD, Shared Memory Parallel Machine, IEEE Transactions on Computers C-32(2), February 1983, 175-189.


On Bounding Time and Space for Multiprocessor Garbage.. - Guy Blelloch Perry (1999)   (1 citation)  (Correct)

....parallelized. The TestAndSet and FetchAndAdd primitives can be implemented in parallel using a combining network, and it has been argued that they can be made no slower than a memory reference to the shared memory [14, 28]. Furthermore, several machines have supported these primitives in hardware [13, 27, 20, 31, 18]. Most current symmetric multiprocessors support the TestAndSet operation directly, but not the FetchAndAdd. On the Sun Enterprise Server, for example, TestAndSet only requires about the same time as a memory reference (15 cycles if in second level cache and 60 cycles if in shared memory) [23] ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer---designing a MIMD, shared-memory parallel machine. IEEE Transactions on Computers, C--32:175--189, 1983.


Hardware Spatial Forwarding for Widely Shared Data - Pirvu, Bhuyan   (Correct)

....The widely shared data must be identified (either statically or dynamically) and the GLOW agents are informed about it. From this moment on, multiple requests for a cache line generate only a new request towards the home node, thus reducing the network traffic. Hence, an effect similar to combining [11] is achieved. In this paper, we study an alternative solution: Rather than reducing the cache miss penalty we concentrate on improving the cache hit ratio. We enhance the memory controller with a simple, yet efficient, forwarding engine. Based on the full map directory, the engine anticipates the ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU ultracomputer - designing a MIMD, shared-memory parallel machine. IEEE Trans. on Computers, 32(2):175, Feb. 1983.


Scalable Load Balancing Techniques for Parallel Computers - Kumar, Grama, Rao (1994)   (62 citations)  (Correct)

....are combined at intermediate processors. Thus the total number of requests that have to be handled by processor 0 is greatly reduced. This technique of performing atomic increment operations on a shared variable, TARGET, is essentially a software implementation of the fetch-and-add operation of [6]. To the best of our knowledge, GRR-M has not been used for load balancing by any other researcher. We illustrate this scheme by describing its implementation for a hypercube architecture. Figure 4.5 describes the operation of GRR-M for a hypercube multicomputer with P = 8. Here, we embed a ....

A. Gottlieb et al. The NYU ultracomputer - designing a MIMD, shared memory parallel computer. IEEE Transactions on Computers, pages 175--189, February 1983.


Synchronization, Coherence, and Consistency for High Performance .. - Dwarkadas (1992)   (Correct)

....Read-modify-write (RMW) primitives are provided by most existing processors. The most common set of primitives consists of test&set and reset. The microcode or software will usually repeat the test&set until the returned value is zero. Other RMW primitives include compare&swap [83], fetch&add [43], and fetch&Φ [63] operations, where Φ is a function such as add, or, store, increment, or and. Typical implementations of these basic synchronization primitives cause a performance bottleneck in the presence of contention. If N processes attempt to access a critical section at the same time, ....

....is required for user access to these lock gates. The 80386 also has the ability to lock the bus during the read and write operations of any instruction, a feature that can be exploited in the implementation of barriers. To overcome the O(N^2) complexity bottleneck, the NYU Ultracomputer [43] provides hardware combining for the fetch&add operation. When a single processor executes a fetch&add, the old value of the variable is returned, and the contents of the memory location is incremented by the amount specified. The implementation is such that N processors can all access the same ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared Memory Parallel Machine. IEEE Transactions on Computers, pages 175--189, Feb 1983.
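
As an illustration of the fetch&add semantics described in the Dwarkadas excerpt above (the old value is returned and the location is incremented by the specified amount), here is a minimal Python sketch. It is not the Ultracomputer's hardware mechanism: atomicity is simulated with a lock, and the names SharedCell, fetch_and_add, and demo are hypothetical.

```python
import threading


class SharedCell:
    """A shared memory location with a simulated atomic fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        # Assumption: a lock stands in for the atomicity that hardware would provide.
        self._lock = threading.Lock()

    def fetch_and_add(self, increment):
        """Return the old value and add `increment` to the cell in one atomic step."""
        with self._lock:
            old = self._value
            self._value = old + increment
            return old

    def read(self):
        with self._lock:
            return self._value


def demo(n_threads=8):
    """Each thread draws one ticket with fetch_and_add(1); tickets are distinct."""
    cell = SharedCell(0)
    tickets = []

    def worker():
        tickets.append(cell.fetch_and_add(1))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tickets, cell.read()
```

Because every caller receives a different old value, the primitive can hand out work indices or queue slots without a separate critical section, which is how the load-balancing excerpts on this page use it.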


Emulation of a Virtual Shared Memory Architecture - Raina (1993)   (3 citations)  (Correct)

....sum in parallel. A number of processors can simultaneously succeed in this instruction and the shared variable eventually gets updated by combining all the requests without involving any locking or unlocking. This primitive can be found in the NYU Ultracomputer [81] and the IBM RP3 [142]. Load-linked/Store-conditional: The load-linked and store-conditional primitive pair [89] can be efficiently implemented in cache-coherent architectures [98, 100]. The load-linked operation copies the contents of a shared variable to a local variable and the ....

....a separate request above. This is helpful in reducing hot spots in applications with high degrees of contention the most common manifestation of which is contention due to synchronisation. Combining in MIMD multiprocessors has been extensively studied by the NYU Ultracomputer research group [81]. The other feature of the tree organisation is the ability to handle race conditions correctly. Two or more simultaneous write requests to the same shared data item travel towards the root of the enclosing sub tree. The first request to reach the sub tree root succeeds and cancels all other erase ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared-memory Parallel Machine. In Proceedings of the 9th Annual International Symposium on Computer Architecture, pages 27--43, June 1982.
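
The combining of requests mentioned in the Raina excerpts above can be sketched in software. The sketch assumes the commonly described scheme in which fetch-and-add requests to the same location are merged on the way to memory, memory is touched once with the summed increment, and the reply is split on the way back. It models only the arithmetic, not the network, switch queues, or timing, and the function names are hypothetical.

```python
def decombine(base, increments):
    """Split one reply into per-request replies.

    `base` is the value the combined request observed. The left half of the
    requests is served as if it ran first, so the right half observes
    `base` plus the sum of the left increments.
    """
    if not increments:
        return []
    if len(increments) == 1:
        return [base]
    mid = len(increments) // 2
    left, right = increments[:mid], increments[mid:]
    return decombine(base, left) + decombine(base + sum(left), right)


def combined_fetch_and_add(memory, addr, increments):
    """Serve many fetch-and-add requests to `addr` with a single memory update."""
    old = memory[addr]
    memory[addr] = old + sum(increments)  # the only access to the memory cell
    return decombine(old, increments)
```

For example, with memory = {"X": 10} and increments [1, 2, 3, 4], the replies are [10, 11, 13, 16] and X ends at 20, the same result as some serial order of the four requests.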


A Design of Performance-optimized Control-based Synchronization - Min, Hsu, Kim (1991)   (Correct)

.... where hundreds or thousands of processors and memory modules are interconnected through a multistage interconnection network (MIN). Examples of the above type of architecture include the University of Illinois Cedar machine [4], the BBN Butterfly multiprocessor [2], the NYU Ultracomputer [5], and the IBM RP3 machine [12]. A typical MIN-based shared memory multiprocessor is depicted in Figure 1 with a memory hierarchy consisting of private caches C, local memory LM, and shared global memory M. Barrier synchronization implements global synchronization among processors participating ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared-Memory Parallel Machine. In Proceedings of the 9th Annual International Symposium on Computer Architecture, pages 27--42, April 1982.


Synchronized MIMD Computing - Kuszmaul (1994)   (3 citations)  (Correct)

....and MIMD (multiple instruction path, multiple data path) machines [Fly66]. Each processor in the CM-5 executes its own instructions, providing the flexibility of a typical MIMD machine. And, like many MIMD machines, the CM-5 is a distributed memory machine (as opposed to shared memory machine [DT90, GGK+83]) in which processors communicate among themselves by sending messages [Sei85, SAD+86] through the data network of the machine. A deficiency of typical MIMD machines, especially as compared with their SIMD cousins, however, is that they provide little or no support for coordinating and ....

....wide communications channels. Like the Cosmic Cube and its descendants, the CM-5 uses a distributed memory organization, and the processors communicate among themselves using message passing techniques [Sei85, SAD+86]. Another way to organize the memory of a MIMD computer is as a shared memory [LB80, GGK+83, DT90]. Some recent machines, such as Alewife [ACD+91], have tried to merge the shared memory and distributed memory architectures. Other machines with global synchronization include the SIMD machines (discussed above), the proposed Burroughs Flow Model Processor (FMP) [LB80], the DADO machine [SS82] ....

Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU ultracomputer --- designing a MIMD, shared-memory parallel machine. IEEE Transactions on Computers, C-32 (2), pages 175--189, February 1983.


Parallel Depth First Search, Part II: Analysis - Vipin Kumar   (8 citations)  (Correct)

....asymptotically, W should grow as O(N^2 log N) to avoid contention for TARGET. But for 15-puzzle, this limitation does not take effect for the range of processors we experimented with ( 120). On shared memory network architectures that use message combining (e.g. RP3 [10], the Ultracomputer [2]) this problem does not arise at all. In such systems, simultaneous atomic add requests to TARGET are combined at intermediate nodes of network (where 7 This new work distribution algorithm is obtained by replacing the second line of GETWORK( in [12] by the line TARGET = atomic add(I,1) ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU ultracomputer - designing a MIMD, shared memory parallel computer. IEEE Transactions on Computers, C--32, No. 2:175--189, February 1983.


Notification And Multicast Networks For Synchronization.. - Andrews, Beckmann.. (1992)   (9 citations)  (Correct)

....As the scale of multiprocessors increases, bus-based architectures [25] must be abandoned in favor of high-bandwidth interconnections such as packet-switched multistage interconnection networks (MINs) [18, 21, 26] that provide better scalability and higher performance. Two issues which are critical to the performance of MIN-based systems are efficient synchronization and cache coherence. The wait phase of synchronization operations often involves polling (spin waiting) on a synchronization variable. In ....

Gottlieb A., Grishman R., Kruskal C., McAuliffe K., Rudolph L., and Snir M. The NYU Ultracomputer -- Designing a MIMD, Shared-Memory Parallel Machine. IEEE Trans. Comput., C32, 2 (February 1983), pp. 175-189.


Efficient Barriers for Distributed Shared Memory Computers - Grunwald, Vajracharya (1994)   (13 citations)  (Correct)

....by busy waiting, or spinning. Once the rendezvous has been achieved, the counter is reset to zero for the next rendezvous. There are two disadvantages to this approach. First, the counter must be updated atomically, either via explicit locking or hardware operations such as fetch-and-Φ [3]. Second, all processes must contend with each other to read and write a single memory location. As mentioned, this causes hot spots, or points of high traffic congestion. Consequently, this barrier is not scalable since each read and a write involves serialized actions. The hotspot problem can be ....

Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer: Designing a MIMD, shared-memory parallel machine. Proceedings of 9th Annual International Symposium on Computer Architecture, 10(3):27--42, April 1982.
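
The centralized counter barrier that the Grunwald and Vajracharya excerpt above criticizes can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the atomic counter update is done with explicit locking (one of the two options the excerpt names), a sense flag is added so the barrier can be reused after the counter is reset to zero, and the names CounterBarrier and run_phases are hypothetical.

```python
import threading
import time


class CounterBarrier:
    """Centralized barrier: one shared counter that every thread updates and spins on."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False              # flipped by the last arriver in each episode
        self._lock = threading.Lock()   # assumption: lock replaces hardware fetch-and-op

    def wait(self):
        my_sense = not self.sense       # the value that will signal "everyone arrived"
        with self._lock:
            self.count += 1
            if self.count == self.n:    # last arriver
                self.count = 0          # reset for the next rendezvous
                self.sense = my_sense   # release the spinners
        while self.sense != my_sense:   # busy-wait on one shared location (the hot spot)
            time.sleep(0)               # yield so other threads can make progress


def run_phases(n_threads=4, n_phases=3):
    """Run threads through several phases and return the order of logged events."""
    barrier = CounterBarrier(n_threads)
    log = []

    def worker(tid):
        for phase in range(n_phases):
            log.append((phase, tid))
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every thread reads and writes the same two fields, which is the single-location contention the excerpt identifies as the source of hot spots.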


General Purpose Parallel Computing - McColl (1993)   (64 citations)  (Correct)

....concurrent access to a memory location, as in the CRCW PRAM model. An important practical case is that of broadcasting, where all processors simultaneously require the value of a single memory location. One approach to the implementation of concurrent memory access is to use combining networks [107], i.e. networks that can combine and replicate messages in addition to delivering them in a point-to-point manner. The Fluent machine of Ranade [220, 221] provides an excellent example of how a CRCW PRAM can be efficiently implemented on a distributed memory architecture equipped with a combining ....

A Gottlieb, R Grishman, C P Kruskal, K P McAuliffe, L Rudolph, and M Snir. The NYU Ultracomputer - Designing an MIMD, shared-memory parallel machine. IEEE Transactions on Computers, 32:175--189, 1983.


Scans as Primitive Parallel Operations - Blelloch (1987)   (97 citations)  (Correct)

....though the data must fan in on any real hardware and therefore take time that increases with the memory size. In spite of this inaccuracy in the model, the unit time assumption has served as an excellent basis for the analysis of algorithms. In the parallel random access machine (P-RAM) models [16, 40, 42, 19, 20], memory references are again assumed to take unit time. In these parallel models, this unit time is large since there is no practical hardware known that does better than deterministic O(lg^2 n) or probabilistic O(lg n) bit times for an arbitrary memory reference from n processors. 1 ....

....implementation and they can be used to implement many other useful scan operations. On a P-RAM, each element a_i is placed in a separate processor, and the scan executes over a fixed order of the processors; the prefix operation on a linked list [48, 27] and the fetch-and-op type instructions [21, 20, 37] are not considered. 1 The AKS sorting network [1] takes O(lg n) time deterministically, but is not practical. 2 The appendix gives a short history of the scan operations. Model Algorithm EREW CRCW Scan Graph Algorithms (n vertices, m edges, m processors) Minimum Spanning Tree O(lg^2 n) ....

[Article contains additional citation context not shown here]

Allan Gottlieb, R. Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer---designing a MIMD, shared-memory parallel machine. IEEE Transactions on Computers, C-32:175--189, 1983.


Parallel Processing of Discrete Optimization Problems - Ananth, Kumar, Pardalos (1992)   (7 citations)  (Correct)

....a bottleneck [34]. Consequently, when the number of processors increases, its performance degrades. On the other hand, random polling does not suffer from such a drawback. However, on machines that have hardware support for concurrent access to a global pointer (e.g. the hardware fetch-and-add [16]), the performance of the global round robin scheme would be better than random polling. When a work transfer is made, work in the donor's stack is split into two stacks, one of which is given to the requester. In other words, some of the nodes (i.e. alternatives) from the donor's stack are ....

A. Gottlieb et al. The NYU ultracomputer - designing a MIMD, shared memory parallel computer. IEEE Transactions on Computers, pages 175--189, February 1983.


A Survey of Parallel Search Algorithms for Discrete.. - Grama, Kumar (1993)   (5 citations)  (Correct)

....of processors increases, its performance degrades. In contrast, random polling results in more work requests but does not suffer from contention over shared data structures. However, on machines that have hardware support for concurrent access to a global pointer (e.g. the hardware fetch-and-add [38]), the global round robin scheme would perform better than random polling. Parallel DFS using sender-initiated subtask distribution has been proposed by a number of researchers [30, 119, 35, 139, 125]. These use different techniques for task generation and transfer as shown in Table 1. Ichiyoshi ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU ultracomputer - designing a MIMD, shared memory parallel computer. IEEE Transactions on Computers, C--32, No. 2:175--189, February 1983.


Implementing the Data Diffusion Machine using Crossbar Routers - Muller, Stallard, al. (1996)   (1 citation)  (Correct)

....are essentially tree structured and can therefore access data in a time bounded by the logarithm of the number of processors. However, simple tree networks have a bottleneck in the top of the tree, so we have chosen to use a Banyan graph [6] or fat tree [15]. The New York Ultracomputer (NYU) [16] uses a network that can be seen as a special case of the Banyan graph. The NYU is constructed with an Omega network, which is identical to a Banyan graph with a fanout and split factor of two at each level (although the nodes must be permuted). To support data migration in the DDM, routing in the ....

.... data migration in the DDM, routing in the DDM network is a more complex operation than on the NYU (which binds data to physical memory locations). However, the routing operation is still one that is performed locally in the network, meeting the criteria for scalability given by Gottlieb et al. [16]. The work presented here forms part of ongoing research to evaluate various design options for the DDM. Other publications can be found on our WWW site, http://www.pact.srf.ac.uk/DDM/. 7 Conclusions The DDM is a scalable VSM architecture that allows data to migrate around the machine to ....

A. Gottlieb et al. The NYU Ultracomputer - Designing a MIMD, Shared-Memory Parallel Machine. IEEE Trans. on Comp., C-32(2):175--189, February 1983.


Design and Analysis of a Scalable Cache Coherence Scheme based.. - Min, Baer (1992)   (14 citations)  (Correct)

....hierarchy consists of private caches C, local memories LM, and a shared global memory M. Examples of such architectures (although possibly without the complete memory hierarchy) include the University of Illinois Cedar machine [15], the BBN Butterfly multiprocessor [5], the NYU Ultracomputer [18], and the IBM RP3 machine [31]. One of the major problems associated with these architectures is the slow global memory access; thus the efficient management of local memory and private caches is very important. Local memory is generally used to store code and private data although shared data can ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared-Memory Parallel Machine. In Proceedings of the 9th Annual International Symposium on Computer Architecture, pages 27--42, April 1982.


A Timestamp-based Cache Coherence Scheme - Min, Baer (1989)   (18 citations)  (Correct)

....is depicted in Figure 1 with a memory hierarchy consisting of private caches C, local memory LM, and shared global memory M. Examples of such architectures include the University of Illinois Cedar machine [10], the BBN Butterfly multiprocessor [5] (no C component), the NYU Ultracomputer [12], and the IBM RP3 machine [18]. One of the major problems associated with these architectures is the slow global memory access; this makes an efficient management of private caches very important. But the presence of multiple private caches introduces the well-known cache coherence problem [6] ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared-Memory Parallel Machine. In Proceedings of the 9th Annual International Symposium on Computer Architecture, pages 27--42, April 1982.


The Network Architecture of the Connection Machine CM-5 - Leiserson, Abuhamdeh.. (1994)   (162 citations)  (Correct)

....path) and MIMD (multiple instruction path, multiple data path) machines [7]. Each processor in the CM-5 executes its own instructions, providing the flexibility of a typical MIMD machine. And, like many MIMD machines, the CM-5 is a distributed memory machine (as opposed to shared memory machine [6, 8]) in which processors communicate among themselves by sending messages [18, 19] through the data network of the machine. A deficiency of typical MIMD machines, especially as compared with their SIMD cousins, however, is that they provide little or no support for coordinating and synchronizing sets ....

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU ultracomputer --- designing a MIMD, shared-memory parallel machine. IEEE Transactions on Computers, C-32(2):175--189, February 1983.


A Unified Theory Of Interconnection Network Structure - Kruskal, Snir (1986)   (37 citations)  Self-citation (Kruskal Snir)   (Correct)

....2) In packet switching networks, when messages are sent from inputs to outputs, replies are often returned to the sender. We remark, in passing, that in a packet switching network that has labels on both the input edges and the output edges, a message need not (initially) carry the sender address [8,18]. Rather, this address can be created on the fly when the message is routed: whenever one digit from the forward path descriptor is discarded, it is replaced by one digit that identifies the edge through which the message has arrived. When the message arrives at its ....

A. GOTTLIEB, R. GRISHMAN, C. P. KRUSKAL, K. P. MCAULIFFE, L. RUDOLPH, and M. SNIR, The NYU Ultracomputer --- Designing an MIMD, Shared-Memory Parallel Machine, IEEE Trans. Comput. C-32 (1983), 175-189.


Emulation of a Virtual Shared Memory Architecture - Raina (1993)   (3 citations)  (Correct)

No context found.

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing a MIMD, Shared-memory Parallel Machine. pages 27--43, June 1982.


Class Notes : Programming Parallel Algorithms - Cs Fall Guy (1993)   (1 citation)  (Correct)

No context found.

Allan Gottlieb, R. Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer---designing a MIMD, shared-memory parallel machine. IEEE Transactions on Computers, C-32:175--189, 1983.


Distributed Counting: How to Bypass Bottlenecks - Wattenhofer (1998)   (Correct)

No context found.

Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU ultracomputer: Designing a MIMD, shared memory parallel computer. IEEE Trans. Computs., C-32(2):175--189, 1983.


An Empirical Comparison of the Kendall Square Research.. - Singh, Joe, Gupta.. (1993)   (37 citations)  (Correct)

No context found.

Allan Gottlieb et al. The NYU Ultracomputer - Designing a MIMD, shared memory parallel machine. IEEE Transactions on Computers, 32(2):175-189, February 1983.


Hardware And Software Mechanisms For Reducing Load Latency - Austin (1996)   (1 citation)  (Correct)

No context found.

A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU ultracomputer -- designing a MIMD, shared memory parallel machine. IEEE Transactions on Computers, 32(2):175--189, February 1983.
