107 citations found.
D. Lenoski et al. The DASH prototype: Implementation and performance. In Proc. 19th Intl. Symp. on Computer Architecture, pages 92--103, Gold Coast, Australia, May 1992.

This paper is cited in the following contexts:

The M-Machine Multicomputer - Fillo, Keckler, Dally, Carter.. (1995)   (22 citations)

....within the remote read and write handlers described in Section 4.2. Using local memory as a repository will allow more remote data to be cached locally than could fit in the on-chip cache alone. Discussion: Directory-based, cache-coherent multiprocessors such as Alewife [1] and DASH [20] implement coherence policies in hardware. This improves performance at the cost of flexibility. Like the M-Machine, FLASH [19] implements remote memory access and cache coherence in software, but uses a coprocessor. However, this system does not provide block status bits in the TLB to support ....

LENOSKI, D., LAUDON, J., JOE, T., NAKAHIRA, D., STEVENS, L., GUPTA, A., AND HENNESSY, J. The DASH prototype: Implementation and performance. In Proceedings of 19th Annual International Symposium on Computer Architecture (1992), IEEE, pp. 92-103.


Sparsely Faceted Arrays: A Mechanism Supporting Parallel.. - Brown (2002)   (2 citations)

....(NUMA) times: Access to memory within a node is low latency; access to memory in other nodes is higher latency, often varying depending on the relative locations of the two nodes within the machine. No inter node data caching: Unlike cache coherent NUMA (ccNUMA) architectures such as DASH [37], FLASH [31] and the SGI Origin [67] these explicitly NUMA architectures do not attempt to conceal inter node memory latency with a complex data caching strategy. I shall refer to this class of architectures as the NUMA DSM class; NUMA DSM machines are the most common type of shared memory ....

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH prototype: Implementation and performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 92--103, Gold Coast, Australia, May 1992. ACM.


Scalable and Portable Computing Using the WPRAM Model - Nash, Dew, Dyer (1996)

....receipt of a packet causes an interrupt handler to execute one of a number of very simple operations. Software overheads are typically reduced to 10s of machine cycles, or less. Machines such as the forthcoming Silicon Graphics cache-coherent multiprocessor (based on the Stanford DASH computer [17]) may further simplify the programming model. The integration of the local memories (and caches) of each node into the shared address space, by automatically maintaining the coherency of multiple copies of a shared variable, removes the need to distinguish between local and shared data. 3 The ....

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH Prototype: Implementation and Performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 92--103, 1992.


High-Level Prototyping for the HTMT Petaflop Machine - Yerosheva (2001)

.... passing architecture; a distributed shared memory (scalable) with additional level of (non-uniform accesses) memory, shared among a subset of processors, that can be accessed without using the bus-based network (SGI Cray Origin [81], HP Convex Exemplar [82]); a cache-only memory execution model (DASH [79], Alewife [76]); distributed virtual memory or shared virtual memory models where the software supports the shared memory architecture; and a logical disjointed distributed shared memory model such as multicomputers (clusters of machines, like IBM SP2 [77]). Widely used programming models such as ....

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, J. Hennessy, "The DASH Prototype: Implementation and Performance," Proc. of the 19th Intl. Symposium on Computer Architecture, pp. 92-103, Australia, May 1992.


Processor Management Policies for Multiprocessors - Yu (1994)

....same general mechanism but from two different viewpoints: consumer's (process) and resource's (processor) [1]. However, different selection of processors (processor allocation) may affect system performance, as will be discussed in this thesis. 2 architectures such as hypercubes and meshes [3] [11]. Since the path length between processors is different in a distributed memory architecture, the cooperating processes of a job need to be allocated in nearby locations to reduce the interconnection latency. Usually first come first served (FCFS) scheduling is used to assign the incoming jobs. ....

.... node can have enough local memory to support multiple processes, (iii) recent development of the concept of distributed shared memory [28], [29] that allows a shared memory programming environment in distributed memory systems and finally (iv) the concept of shared memory hypercubes [30] and meshes [11] that is becoming attractive to take advantage of the shared memory design on direct interconnection networks. Putting all of these concepts together, we believe that the future distributed memory multiprocessor operating systems (OSs) will implement variations of the M 2 policy. We study the M ....

[Article contains additional citation context not shown here]

D.Lenoski, J.Laudon, T.Joe, et al, "The DASH Prototype: Implementation and Performance," Proc. Int. Symp. Comput. Arch., pp.92-103, 1992.


Performance Experiences on Sun's WildFire Prototype - Noordergraaf, van der Pas (1999)   (7 citations)

....on the basis of refetch statistics were first described in [Falsafi97]. Page migration has been previously investigated in the context of cc-NUMA machines. Chandra, et al. [Chandra94] explored the impact of migration on both sequential and parallel workloads; their work utilized the Stanford DASH [Lenoski92] 16-CPU multiprocessor for sequential workload evaluation, and a 16-CPU simulator for parallel workloads. Information on TLB misses was used to trigger page migration. This work is focused on scheduling algorithms rather than performance benefits to individual applications. Several different ....

D. Lenoski, et al., The DASH Prototype: Implementation and Performance, in Proceedings of the 19th International Symposium on Computer Architecture, 1997.


The Design, Implementation, and Evaluation of Jade - Rinard, Lam (1998)

....explicitly parallel systems also directly expose the programmer to a host of program development and maintenance problems. Existing parallel machines present two fundamentally different programming models: the shared memory model [Hagersten et al. 1992; Kendall Square Research Corporation 1992; Lenoski et al. 1992] and the message passing model [Intel Supercomputer Systems Division 1991; Thinking Machines Corporation 1991]. Even machines that support the same basic model of computation may present interfaces with significantly different functionality and performance characteristics. Developing the same ....

....implementation to check the index on all local pointer accesses. This would significantly degrade the performance of tasks that repeatedly accessed shared objects. It is worth noting that standard flat shared memory systems decouple the units of allocation and synchronization [Amza et al. 1996; Lenoski 1992; Shoinas et al. 1994]. Communication takes place using a flat shared address space, and synchronization takes place using locks and barriers. The locks and barriers are not explicitly coupled to any memory location, although of course such a coupling exists implicitly if the program is correctly ....

[Article contains additional citation context not shown here]

Lenoski, D., Laudon, J., Joe, T., Nakahira, D., Stevens, L., Gupta, A., and Hennessy, J. 1992. The DASH prototype: Implementation and performance. In Proceedings of the 19th International Symposium on Computer Architecture. ACM, New York.


Volume Rendering on Scalable Shared-Memory MIMD Architectures - Nieh, Levoy (1992)   (56 citations)

....rendering time due to memory overhead 20 30 24 38 increase in memory overhead over 1 proc. SGI 0 84 0 142 percentage of rendering time due to idle synchronization 0 3 0 8 Table 3: Memory and synchronization overhead. cache misses, we did use the hardware performance monitor on DASH [11] to measure the difference in the number of second-level cache read misses with and without interframe effects. The vast majority of these read references were for data whose home cluster was not the local cluster. We measured these read misses on remote data and found that interframe effects do ....

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH Prototype: Implementation and Performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 92--103, May 1992.


Fine Grain Parallel Communication on General Purpose LANs - Mummert, Kosak.. (1996)   (6 citations)

....(e.g. [21]). Table 1 shows how the basic operations were implemented in some example systems that implement remote write on distributed memory hardware. The left hand columns represent systems that implement a shared memory programming model based on a weak consistency coherency protocol (e.g. [20]); remote memory operations are completely supported in hardware. The distributed shared memory column shows systems that implement the same programming model over workstations connected by a network. The main idea is to rely on the virtual memory system to catch memory accesses to memory regions ....

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH Prototype: Implementation and Performance. In 19th Annual International Symposium in Computer Architecture, pages 92--102. IEEE, May 1992.


Eliminating Useless Messages in Write-Update Protocols .. - Bianchini, LeBlanc.. (1994)   (3 citations)

....number of transactions produced by WU and (2) the cost of an update transaction on the bus is roughly the same as the cost of rereading an invalidated cache block, and there are likely to be many more transactions under WU than under WI. Scalable, network based machines (such as the Stanford DASH [Lenoski et al. 1992]) offer a very different environment for comparing WU and WI. These machines may incorporate relaxed consistency [Lenoski et al. 1990] and write buffers (which reduce the cost of writes) or may use page based coherence [Bisiani and Ravishankar, 1990; Carter et al. 1991; Wilson and LaRowe, 1992] ....

D. Lenoski, J. Laudon, L. Stevens, T. Joe, D. Nakahira, A. Gupta, and J. Hennessy, "The DASH Prototype: Implementation and Performance," In Proceedings of the 19th International Symposium on Computer Architecture, May 1992.


Critical Performance Path Analysis, and Efficient.. - Bright, Fineberg, ..

....if multiprocessor systems are doomed to evolutionary cycles too long to permit commercial successes. This is done by directly addressing the obvious pitfalls of using commodity components in Seamless, a system with the same latency-tolerance goals as machines employing custom CPU technology [AgC91, AlC90, DaC89, LeL92]. The paper is organized as follows. Section 2 overviews the major elements of the Seamless hardware. As the purpose of this paper is to provide an analysis of critical performance characteristics of Seamless, a detailed description of Seamless itself is not possible due to space limitations. ....

D. Lenoski et al, "The DASH Prototype: Implementation and Performance," The 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992, pp. 92-103.


On Memory Models and Cache Management for Shared-Memory.. - Dennis, Gao (1995)   (3 citations)

....significant modification. This philosophy of multiprocessor architecture is represented by two main styles: the cache-coherent nonuniform memory access architecture (CC-NUMA) and the cache-only memory architecture (COMA). Examples of CC-NUMA machines are the Stanford DASH architecture [Lenoski 90, Lenoski 92] and the MIT Alewife Machine [Agarwal 90], while examples of COMA machines include the KSR-1 [KSR 1992] and the DDM [Hagersten 92]. In these systems, cache coherence is achieved by maintaining a directory at each processor. The directory contains, for each block of memory in its portion of the ....

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH prototype: Implementation and performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 92--103. IEEE and ACM, May 1992.


Commit-Reconcile Fences (CRF): A New Memory Model for.. - Shen, Arvind, Rudolph (1999)   (1 citation)

....have been tolerable in small SMPs using snoopy bus protocols, but this may not be so in future. Release consistency (RC) allows non-atomic memory accesses since the execution of memory accesses between acquire and release operations does not have to be visible immediately to other processors [12, 17]. The essence of RC is that memory accesses before a release must be globally performed before the synchronization lock can be released. Lazy release consistency (LRC) goes a step further; it allows a synchronization lock to be released to another processor even before previous memory accesses ....

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH Prototype: Implementation and Performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 92--103, May 1992.


Minimal Adaptive Routing on the Mesh with Bounded Queue Size - Chinn, Leighton, Tompa (1994)   (4 citations)

....use of space when physically realized. Examples of machines that use the mesh or torus topology include the MPP from Goodyear Aerospace [2], the MP-1 from MasPar [21], the Paragon from Intel Scientific, the J-machine from MIT [23], the Touchstone DELTA from Intel [11], the DASH from Stanford [19], and the Mosaic from Cal Tech [26]. One of the simplest benchmarks for a router's performance is how it performs in the worst case on static one-to-one (or partial permutation) routing problems, where each processor sends at most one message and receives at most one message. At the very least, a ....

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH prototype: Implementation and performance. In Proc. 19th Annual Symposium on Computer Architecture, pages 92--103, June 1992.


Data Locality and Load Balancing in COOL - Chandra, Gupta, Hennessy (1993)   (51 citations)  Self-citation (Gupta Hennessy)

....to the remote memory of another cluster take about 100-150 cycles. We measure the time spent in the parallel portion of the code, and plot its speedup with respect to the time taken by a serial version of the code running on one processor. We also use the hardware performance monitor on DASH [11], that enables us to monitor the bus and network activity in a non-intrusive manner. class region c . data declarations . parallel void laplace (region c ) class grid c Grid composed of regions. region c region; void laplace (grid c p) int i; waitfor for (i=0; i N; i ) ....

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. The DASH prototype: Implementation and performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 92--103, May 1992.


Array Data Layout for the Reduction of Cache Conflicts - Naraig Manjikian And (1995)   (9 citations)

No context found.

D. Lenoski et al. The DASH prototype: Implementation and performance. In Proc. 19th Intl. Symp. on Computer Architecture, pages 92--103, Gold Coast, Australia, May 1992.


RC23764 (W0510-231) October 28, 2005 - Computer Science Ibm

No context found.

Lenoski, D., Laudon, J., Joe, T., Nakahira, D., Stevens, L., Gupta, A., and Hennessy, J. The DASH prototype: Implementation and performance. Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA), pages 92--103, Gold Coast, Australia, May 1992.


Emulation of a Virtual Shared Memory Architecture - Raina (1993)   (3 citations)

No context found.

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennesy. The DASH Prototype Implementation and Performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 92--105, ACM Press, Gold Coast, Australia, May 1992.


Assessment of Cache Coherence Protocols in Shared-memory.. - Grbic (2003)

No context found.

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John L. Hennessy. The DASH Prototype: Implementation and Performance. In 92--103, Gold Coast, Queensland, Australia, May 1992.


Multistriped Addressing - Grossman, Brown, Huang, Knight (2000)

No context found.

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, John Hennessy, "The DASH Prototype: Implementation and Performance"


Data Locality Optimization of Shared Memory Programs on NUMA.. - Tao

No context found.

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH Prototype: Implementation and Performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 92--103, Gold Coast, Australia, May 1992.


Design and Evaluation of the Hamal Parallel Computer - Grossman (2002)   (1 citation)

No context found.

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, John Hennessy, "The DASH Prototype: Implementation and Performance", Proc. ISCA '92, pp. 92-103.


Alleviating Memory Contention in Matrix Computations on.. - Bianchini, al. (1993)   (2 citations)

No context found.

D. Lenoski, J. Laudon, L. Stevens, T. Joe, D. Nakahira, A. Gupta, and J. Hennessy, "The DASH Prototype: Implementation and Performance," In Proceedings of the Nineteenth International Symposium on Computer Architecture, May 1992.


Packet Routing in Multiprocessor Networks - Chinn (1995)

No context found.

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH prototype: Implementation and performance. In Proc. 19th Annual Symposium on Computer Architecture, pages 92--103, June 1992.
