D. E. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam, "The Stanford Dash multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.
....Shared Memory) systems. Researchers in computer architecture and DSM have been studying the cache coherence problem for a long time. Many concepts and theories such as several consistency models [35, 22, 16, 21, 7] and numerous practical systems such as the MIT Alewife [2], the Stanford Dash [36], CRL [30] and so on, have been developed to attack the problem. This problem is no different in the field of distributed virtual environments, so we simply borrow the idea from the DSM literature and modify it slightly to fit our specific requirements. This technique is relatively mature due to ....
....LP_j have a shared copy of s_k. To solve these problems, we present a coherence protocol in the following section. 3.7 Cache Coherence We employ a fixed-owner, directory-based invalidate protocol similar to that used in many hardware- or software-based DSM (distributed shared memory) systems [2, 36, 30]. Directory-based coherence reduces network traffic because it does not use a broadcast scheme for one LP to send invalidate update messages to all other LPs, which usually generates network traffic that is proportional to the number of LPs squared (M^2). Invalidate coherency protocols [28] ....
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford Dash multiprocessor. IEEE Computer, pages 63--79, March 1992.
....in a scalable topology, such as a mesh or a hypercube. Each node contains a few processors, a portion of the globally distributed memory, a node controller, and possibly some I/O devices. The node controller handles all memory coherency and I/O traffic going through the node. Several research [2, 35, 37] and commercial [18, 36, 39] projects have built shared-memory multiprocessors based on the above-mentioned design. These machines have been available for several years, and are becoming a popular platform in the server market. Besides traditional computation-intensive workloads, such as raytrace ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam. The Stanford DASH multiprocessor. Computer, 25(3):63--79, March 1992.
....a directory controller, a network interface, and a portion of the main memory of the system (Figure 2) The processor is a 6 issue dynamic superscalar. The caches are non blocking and write back. The system uses a full map directory and a cache coherence protocol similar to that used in DASH [12]. The directory controller is extended to support logging and distributed parity needed for ReVive, as described in Section 3.2. Contention is accurately modeled in the entire system, including the busses, the network and the main memory. Table 3 lists the main characteristics of the ....
D. Lenoski et al. The Stanford Dash Multiprocessor. IEEE Computer, pages 63--79, Mar. 1992. It is Dash, not DASH.
....1(b) Cache coherence is maintained by a bus based snoopy protocol within an SMP and enforced by CCs using a directory based cache coherence protocol across the machine. Our directory based protocol uses an invalidation based approach. Each CC also connects to either a Remote Access Cache (RAC) [10] or an L3 cache. A RAC keeps recently accessed copies of remote memory lines. An L3 cache keeps both local and remote memory lines. If the RAC or L3 can satisfy a local request to a remote memory line, the request does not need to traverse the network in order to fetch that memory line from the ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. IEEE Computer, pages 63--79, March 1992.
....memory modules in a transparent way, although it may suffer increased latencies when accessing memory located on remote clusters. SMPs with this type of physical memory organization are called Non-Uniform Memory Access (NUMA) SMPs. Examples of such NUMA SMP architectures include Stanford's Dash [21] and Flash [17] architectures, University of Toronto's Hector [42] and NUMAchine [41] architectures, Sequent's NUMA-Q [32] architecture and SGI's Cray Origin2000 [19]. NUMA SMPs that implement cache coherency in hardware are called CC-NUMA SMPs. In contrast, multiprocessors based on a single bus ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam. The Stanford Dash multiprocessor. Computer, 25(3):63--79, March 1992.
....with low overhead, reaping the communication reduction and lower latency benefits without computational overhead. Of course, there are a wealth of cache system optimizations proposed within parallel machines which could be applied in an application specific manner to achieve best performance [43, 23, 25]. 4.4 Custom Prefetching The pointer based data access to sparse matrix data structures in current day memory hierarchies yields poor performance because the indirection introduces main memory and memory hierarchy latencies into the innermost computational loop. Techniques such as software ....
Lenoski, D., et al. The Stanford DASH Multiprocessor. IEEE Computer (Mar 1992), 63--79.
....transfer a nonuniform communication architecture. A NUCA is an architecture in which the unloaded latency for a processor accessing data recently modified by another processor differs at least by a factor of two, depending on where that processor is located. DASH was the first NUCA machine [13]. Each DASH node consists of four processors connected by a snooping bus. A cache to cache transfer from a cache in a remote node is 4.5 times slower than a transfer from a cache in the same node. We call this the NUCA ratio. Sequent s NUMAQ has a similar topology, but its NUCA ratio is closer to ....
....locks to provide a hardware queued lock behavior without requiring any software support or new instructions [21]. The load-linked/store-conditional instructions are used to demonstrate a possible implementation. Stanford DASH uses directories to indicate which processors are spinning on the lock [13]. When the lock is released, one of the waiting nodes is chosen at random and is granted the lock. The grant request invalidates only that node's caches and allows one processor in that node to acquire the lock with a local operation. This scheme lowers both the traffic and the latency involved in ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash Multiprocessor. IEEE Computer, 25(3):63--79, Mar. 1992.
....owner. When the holder releases the lock, it sends the corresponding cache block directly to the next processor, thus transferring the lock in exactly one network message. Following the QOLB proposal, the Stanford DASH prototype implemented a variant of the queue-based synchronization primitive [110]. Unlike QOLB, their proposal stored the queue at the directory rather than the caches. Doing so introduced an indirection in transferring locks: the lock can no longer be transferred directly from the releaser to the next waiter; instead the lock must go through the directory. Lee and ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John L. Hennessy, Mark Horowitz, and Monica Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
....[3]. Several memory consistency protocols other than the standard sequential consistency [4] have been proposed [5, 6] to overcome the communication overhead and to increase the scalability degree of the DSM systems. Weak consistency [7, 8], processor consistency [9], and release consistency [10] are the relaxed consistency models, which have focused on methods to weaken memory consistency so that the overhead to preserve the memory consistency can be reduced. Weak consistency and processor consistency models have redefined the consistency so that the updated data pages do not have to be ....
D.E. Lenoski, J. Laudon, K. Gharachorloo, W.D. Weber, A. Gupta, J.L. Hennessy, M. Horowitz, and M.S. Lam, "The Stanford Dash multiprocessor," in IEEE Computer, Vol. 25, No. 3, pages 63-79, Mar. 1992.
....and lastly, works addressing the evaluation of a multiprocessor system. In the domain of trade off analysis for multiprocessor topologies, we mostly encounter works related to exploration of different on chip communication networks [2] and different approaches to memory consistency and coherency [20][21] Much of these works address the issue of how to design an efficient on chip communication network or techniques to avoid invalidated memory accesses, given remaining architectural components. In addition, past work has focused on fine grained multiprocessor systems like systolic arrays and ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, M. S. Lam, "The Stanford DASH multiprocessor," IEEE Computer, Vol. 25(3), pages 63-79, March 1992.
....processors to minimize communication. Data Consistency. The data consistency problem occurs when data is spread across multiple address spaces in a distributed memory architecture. The existing techniques for maintaining data consistency across multiple address spaces range from hardware [72], to automated software data consistency, to manually ensured (application level) data consistency [8] Automatically ensuring data consistency, by hardware or software solutions is often related to simulating a shared memory on top of a distributed memory architecture [16, 54, 72, 94] Ensuring a ....
.... range from hardware [72] to automated software data consistency, to manually ensured (application level) data consistency [8] Automatically ensuring data consistency, by hardware or software solutions is often related to simulating a shared memory on top of a distributed memory architecture [16, 54, 72, 94]. Ensuring a consistent view of the shared memory built on top of a physically distributed memory usually requires replication of physical pages or data objects and invalidation or updating of remote copies of data that is locally written. The hardware coherence schemes are triggered on read and ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam. The Stanford Dash multiprocessor. Computer, pages 63--79, March 1992.
....and network throughput. In this paper we evaluate the performance of a 2D torus network with wormhole routing and virtual channel flow control in shared memory multiprocessors. We selected a 2 D torus network with bidirectional links for our performance study, because it is a popular topology [5, 6, 7, 8]. Also, mesh networks without end around connections have significant performance degradations at the boundary nodes, even under uniform communication [6, 9] The performance of wormhole networks with virtual channels has been evaluated in various studies [4, 7, 10] in a message passing ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam, "The Stanford DASH Multiprocessor," Computer, pp. 63--79, March 1992.
....than 90 . Less time is used for most of the protocols, also. Our new method also complements the symmetry reduction strategy [9, 2, 5] allowing for additional reductions when the two methods are combined. 2 An Example In this section, we illustrate our method through a cache coherence protocol [12]. Cache coherence is a way of implementing a shared memory abstraction on top of a message passing network. Whenever a processor wants to load a cache entry into its cache, it sends a request to the memory, which keeps track of which processors have read only copies or writable copies of the ....
.... description of the system, similar to the one in [9] The verification results for the following protocols are presented in Table 2: ffl an industrial directory based cache coherence protocol (ICCP) ffl the Stanford DASH multiprocessor cache coherence protocol (DASHC) and lock protocol (DASHL) [12]; ffl distributed linked list protocols (LIST1,LIST2) For these protocols, a processor typically issues a request on the network, and becomes blocked until a message arrives from the network. Since the messages in the network are received one by one, these transition rules form a symmetric, ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH multiprocessor. Computer, 25(3), 1992.
....Each event queue has a fixed owner, which is defined as the pilot LP of the associated avatar. To implement the shared event queue structure, we use a fixed-owner, directory-based invalidate protocol similar to that used in many hardware- or software-based DSM (distributed shared memory) systems [13, 14, 15]. Directory-based coherency reduces network traffic because it does not use a broadcast scheme for one LP to send invalidate update messages to all other LPs, which usually generates network traffic that is proportional to the number of LPs squared (M^2). Invalidate coherency protocols [21] ....
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam: "The Stanford Dash Multiprocessor" IEEE Computer, pages 63--79, March 1992.
....processor counts. The SGI Origin 2000, although not the first commercially available scalable multiprocessor system, was nonetheless one of the most ambitious designs when it was introduced in the market. Its design came out of two research efforts at Stanford University, the DASH multiprocessor [23] and its follow on, the FLASH project [22] Both projects explored the limits of a particular multiprocessor design, the cache coherent nonuniform memory architecture. The Origin 2000 family of multiprocessor systems was introduced by SGI in 1996. The modular design allows for scalability from two ....
....hierarchy where the inner protocol keeps the data coherent within a node and the outer protocol maintains global coherence. The inner and outer protocols need not be the same. A common organization is for the outer protocol to be a directory protocol and the inner one to be a snooping protocol [23, 25]. Other combinations, such as snooping snooping [13] and directory directory [9] are also possible. On a machine with physically distributed memory, nonlocal data may be replicated either in the processor s caches or in the local main memory. The systems that replicate data only in processor ....
LENOSKI, D. The Stanford DASH Multiprocessor. PhD thesis, Computer Systems Laboratory, Stanford University, 1992.
....of an entire large program. Therefore, simulation is hardly an effective tool for performance debugging. It is used more for detailed analysis of architectural tradeoffs and is important because it allows evaluation without real hardware. Simulation has been used extensively in the Stanford DASH [17] project, as well as in Alewife during the architectural design phase. 1.1.3 Emulation Emulation is a method of hardware system debugging that is becoming increasingly popular. Field programmable gate arrays have made possible an implementation technology that is ideal for full system ....
Daniel Lenoski et. al. The Stanford Dash Multiprocessor. In IEEE Computer,pp 63-79, March 1992.
....other. There are two primary mechanisms of inter processor communication: distributed shared memory (DSM) and message passing. Originally, parallel architecture designs focused on one particular communication paradigm. To be certain, distributed shared memory machines such as the Stanford DASH [6] and MIT Alewife [4] are capable of emulating message passing functionality while message passing machines like the Cray T3D [10] can implement shared memory functionality; however, the software overhead required to do so makes program execution considerably less efficient. Hence, while ....
....NESAddress and NESData encoded within the sPBusAddress. NESAddress is an internal addressing scheme used to access BIU or Ctrl state; bit 6 of the sPBusAddress selects between the two. NESData contains the data for a state write if one is specified by the Update field. [0:5] 011010 QPtr Space [6] 0 NESAddress[0] (Fixed) [7:9] X NESAddress[3:5] Comm Group [10:11] 01 NESAddress[1:2] [12:13] X Unused [14:16] 100 NESAddress[6,9:10] (Fixed) [17] X NESAddress[11] [18] X Unused [19:25] X NESData[0:6] [26] 0 No Update 1 Update [27:28] X NESAddress[7:8] [29] X High Low ....
Daniel Lenoski et al. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
....the problem of coherence. In the remainder of this dissertation I use data migration to include the use of replication, and explicitly state when replication is not involved. Data migration can take the form of hardware caching in shared memory multiprocessors such as Alewife [1] and DASH [66]. Data migration can also be implemented in software. For example, the Munin system [9] allows the programmer to choose among several different coherence protocols for shared data; some of the protocols use replication, and some do not. The Emerald system [55] is one of the few systems that uses ....
.... Memory model Coherence Example compiler or library object object Emerald [55], Amber [24], Prelude [100], Orca [6], Midway [102], CRL [51] operating system flat page Ivy [67], Munin [9], TreadMarks [59] hardware flat cache line NYU Ultracomputer [38], Cedar [37], IBM RP3 [78], DASH [66], Alewife [1] Table 8.1. Characterization of DSM systems. Implementation refers to the level of the system at which shared memory is implemented. Memory model describes the organization of the global address space. Coherence indicates the unit of coherence. Example lists a few ....
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M.S. Lam. The Stanford Dash Multiprocessor. IEEE Computer, pages 63--79, March 1992.
....DSM may lag somewhat behind that provided by a static software DSM. A hardware DSM system is one in which all interprocessor communication is effected through loads and stores to locations in a shared global address space. Examples include the NYU Ultracomputer [26], IBM RP3 [62], Stanford DASH [49], and KSR 1 [39]. Other communication mechanisms (e.g. message passing) are synthesized in software using the shared memory interface. Like dynamic software DSMs, hardware DSM systems support a very general programming model. Current hardware DSMs typically provide automatic migration and ....
....yields the following breakdown of the spectrum of dynamic DSM systems and implementation techniques that have been discussed in the literature. All Hardware In all hardware DSM systems, all three of these mechanisms are implemented in specialized hardware; the Stanford DASH multiprocessor [49] and KSR1 [39] are typical all hardware systems. Mostly Hardware As discussed in Section 5.2, the MIT Alewife machine implements a mostly hardware DSM system processor side mechanisms are always implemented in hardware, but memory side support is handled in software when widespread sharing is ....
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford Dash Multiprocessor. IEEE Computer, pages 63--79, March 1992.
....is no longer consistent. When the parallel call terminates, LCM reconciles multiple versions of a block to a single consistent value. LCM provides consistent memory as a default and is similar in many respects to protocols providing sequentially consistent distributed shared memory such as DASH [17], Alewife [1] and Stache[23] but it differs in several key respects. Most importantly, LCM allows global memory to become temporarily inconsistent under program control. During these phases, a given data item may have different values on different processors, making correct management of shared ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
....use direct networks, meaning that the computing nodes are embedded in the network topology, and as a result, some nodes are closer than others. In addition to use in multicomputers, direct networks are gaining acceptance in shared memory machines such as the MIT Alewife [13], Stanford DASH [14], and Tera Computer's TERA machine [10]. Some recent parallel machines such as the Thinking Machines CM5 [15], Meiko CS 2 [16], and Kendall Square Research KSR1 [17] use indirect networks in which computing nodes are separated from networks. In contrast to previous multistage interconnection ....
....computers, only a few features for fault tolerance have been introduced in commercial multiprocessor routing networks. For example, a number of machines include parity on each physical channel to detect errors, but can do little but kill the process or reboot the machine when an error occurs [8, 12, 11, 14, 13]. More aggressive machines support checksums or error correcting codes for each packet on each link [25, 26]. In all of these machines, faulty channels require reconfiguration of the network and machine with loss of some working processors and network channels. Generally, data errors cannot be ....
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford Dash multiprocessor," IEEE Computer, pp. 63--79, March 1992.
....Thus, multiprocessor programmers are left with the challenge of understanding precisely the logical behavior of the underlying machine. Multiprocessor machines have been built or proposed without a clear and unambiguous description of their memory models. Examples include SPARC V8 [13] and DASH [6]. Because existing descriptions use different types and degrees of formalism, they are difficult to compare. To date, there is no unified formalization of the different memory models provided by several existing machines. Such a unifying framework, based on partial order constraints on the ....
Lenoski D., Laudon J., Gharachorloo K., Weber W.-D., Gupta A., Hennessy J., Horowitz M., and Lam MS. "The Stanford DASH Multiprocessor", IEEE Computer, 25:3, pp. 63-79, 1992.
....systems due to its inherent advantages like low latency communication and reduced communication hardware overhead [21]. In addition to the basic wormhole routing switching, systems are gradually incorporating multiple communication ports. Intel Paragon [18], Cray T3D [ and Stanford DASH [13] are some early representative systems in this trend. These systems provide low latency communication when the traffic in the system is low. However, with increase in communication traffic, messages undergo severe link contention and the system starts performing poorly. Similarly, when a single ....
D. Lenoski et al. The Stanford DASH Multiprocessor. IEEE Computer, pages 63--79, Mar. 1992.
....NCUBE ten [25] and Intel's iPSC/860 [15]. A more implementable version of the k-ary N-cube is referred to as a mesh. In this case N=2 is used to characterize the topology. Meshes have been used commercially in Intel's Paragon [7] and experimentally in a number of machines, including Stanford's Dash [8] and MIT's Alewife [9]. A somewhat more aggressive topology is the 3D torus, which is an extension of a three-dimensional k-ary N-cube. In this topology the first and the last node in each dimension are connected together, creating a topology which is similar to a three-dimensional circular array. ....
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford DASH Multiprocessor," IEEE Computer, 25(3), March 1992, pp. 63-79.
....is non uniform across clusters. The extent to which memory access latency is non uniform depends on the interconnection network. Remote memory access latency can range anywhere from one to two orders of magnitude higher than local memory access latency. Examples of SSMMs include the Stanford Dash [1] and Flash [2] the University of Toronto Hector [3] and NUMAchine [4] the KSR1 [5] the HP Convex Exemplar [6] and the SGI Origin 2000 [7] A common approach to improving the performance of applications on SSMMs has been to parallelize the applications manually, which is complex and tedious. ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
....Solutions to this cache coherence problem distinguish several classes of parallel computers. Multis [6] are bus-based computers in which all processors watch memory accesses occurring on a shared bus and modify their caches appropriately. Directory-based computers such as Stanford DASH [27] and MIT Alewife [1] eliminate the non-scalable bus by having hardware and sometimes software maintain a directory that records which processors hold copies of a cache block. A cache coherence protocol uses the directories to serialize conflicting updates and to invalidate copies at updates. ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
....evident in designs such as Sun s Enterprise Server 6000, the scalability of bus based systems is ultimately limited. When a processor supports a bus based coherence scheme, a separate bus snooping agent can perform a lookup similar to that performed by a memory controller. Stanford DASH [LLG 92] and Typhoon [RLW94] among experimental designs, employ this approach. Many recent commercial shared memory machines, such as Sequent STING [LC96] and SGI Origin [LL97] also follow this approach. 2.1.2 Protocol Action 19 Custom Hardware. High performance shared memory systems use dedicated ....
.... protocol events ranges from 500-650 cycles for the fastest fine grain tag implementation (Blizzard S) to 1000-4200 cycles for the slowest (Blizzard E). In hardware shared memory systems, the fine grain access control overhead is an order of magnitude smaller than the best Blizzard system [K 94,LLG 92,LL97,LC96,SGC93]. Blizzard's fine grain access control cannot reach this performance level. However, the Blizzard techniques are based on mostly commodity software and hardware. Page based software shared memory systems have fine grain access control overheads that are an order of magnitude ....
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
....to handle basic loads and stores specifically, no provision is made in the memory system to maintain lists of which nodes have or want a copy of a cache line. However, over the past five years, directory based multiprocessors have emerged as the dominant scalable shared memory architecture [1, 7, 12, 13, 17]. On these machines, communication is expensive and the distributed directory controllers maintain a list of nodes that have copies of each cache line so that they can be invalidated or updated. Furthermore, the recent trend has been to introduce greater intelligence and flexibility to the node ....
....so that all reads and writes can be snooped off of the Runway by the Widget node controller. 3.2. 1 Simple Centralized Hardware Lock Mechanism The DASH multiprocessor s directory controllers maintain a 64 bit copyset for each cache line, where each bit represents one of the 64 DASH processors [13]. This full map directory design can be adapted to support locking by converting acquire requests for remote locks into lock requests to the relevant directory controller. When a directory controller receives a lock request message, it either immediately replies with a lock grant message if the ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
....improvements in processor speeds, the memory system is increasingly the limiting factor in the performance of computer systems. The latency of memory accesses can be especially high in large scale shared memory multiprocessors that have deep memory hierarchies. For example, in the Stanford DASH [10], while the local cache access takes only a single clock cycle, a miss serviced by the local portion of shared memory takes about thirty clock cycles, and a remote miss takes over a hundred clock cycles. Improving data locality in parallel programs can reduce the time spent by a processor waiting ....
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford Dash multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
D. E. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam, "The Stanford Dash multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam. The Stanford Dash multiprocessor. IEEE Computer, 25(3):63--79, 1992.
D. Lenoski et al. The Stanford Dash Multiprocessor. IEEE Computer, pp. 63--79, March 1992.
D. Lenoski, et al.: "The Stanford DASH multiprocessor", IEEE Computer, 25(3), pp. 63--79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M.S. Lam. The Stanford Dash multiprocessor. IEEE Computer, 25(3):63--79, 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, Mar. 1992.
D. Lenoski et al. The Stanford Dash Multiprocessor. IEEE Computer, pages 63--79, Mar. 1992. It is Dash, not DASH.
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John L. Hennessy, Mark Horowitz, and Monica S. Lam. The Stanford Dash Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
Lenoski, D., et al.: The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH multiprocessor. IEEE COMPUTER, 25(3):63--79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. L. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63--79, Mar. 1992.
Lenoski, D. et al. (1992) The Stanford Dash Multiprocessor. IEEE Comput. Mag., 25, 63--79.
D. Lenoski, et al. The Stanford DASH Multiprocessor. IEEE Computer 25, 3 (March 1992), pp. 63--79.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH multiprocessor. IEEE Computer, March 1992.
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz and Monica S. Lam, "The Stanford Dash Multiprocessor", Proc. 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH multiprocessor. IEEE COMPUTER, 25(3):63--79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, Mar. 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The stanford dash multiprocessor. Computer, pages 63-79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash Multiprocessor. IEEE Computer, 25(3):63--79, March 1992.
D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, and J. Hennessy. The stanford dash multiprocessor. IEEE Computer, 25(3):63--79, March 1992.