| D. Culler, J. P. Singh, and A. Gupta. "Parallel Computer Architecture: A Hardware/Software Approach". Morgan Kaufmann Publishers, 1998. |
....model, but in a way that is strictly stronger than necessary, disallowing certain behavior allowed by the memory model. 1 Introduction A shared memory multiprocessor machine is characterized by a collection of processors that exchange information with one another through a global address space [1, 6]. In such a machine, processors access memory locations concurrently through standard read and write instructions. Shared memory machines have various buffers where data written by a processor can be stored before it is shared with other processors. Thus, multiple values written to a single memory ....
.... (which previous write operations are currently visible) A memory consistency model is a contract between a program and the underlying machine architecture that constrains the order in which memory operations appear to be performed with respect to one another (i.e. become visible to processors) [6]. By constraining the order of operations, a memory consistency model determines which values can legally be returned by each read operation. The implementation of a memory consistency model in a shared memory machine with caches requires a cache protocol, that invalidates or updates cached values ....
D.E. Culler and J.P. Singh, with A. Gupta (1999). Parallel computer architecture: a hardware/software approach. Morgan Kaufmann.
....turned on or off. Multiplexed inputs require a switching schedule (to enable demux). Internally nonblocking (the big win). Constraints: for N inputs, needs N^2 crosspoints; O(N^2) time to set each cross point; susceptible to single faults. Used in High Performance Computing [1, 3]. Computer Communication Networks, CSI 516, W. A. Maniatty, Dept. of Computer Science, SUNYA. Limits of TSI Size. 33.2 Multistage Crossbar Switching: Pioneered by Clos, sometimes ....
D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., San Francisco, CA, first edition, 1999.
....As a result of these high-level properties there are many implications for the hardware implementation that deal with the issues of deadlock, livelock, fairness and starvation. These high-level correctness properties provide a set of sufficient conditions for memory coherence and consistency [24,25,16,2,3,4]. Data coherence. A memory system is coherent if the value returned by a load is always the value from the latest store to the same memory location. Preservation of program order. The memory system will impose a serial order on all memory operations to the same address. The order in which memory ....
David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach, pages 273--305. Morgan Kaufmann Publishers, 1998.
....is obvious that they were in the same type of region (same). The symbol is used as a wildcard and refers to both categories. We further extend each classification by breaking them down into cache hits and misses. Our classification is based upon a 4-state (MESI) Write Back Invalidation Protocol [9]. In a write invalidation protocol, such as MESI, misses can be divided into compulsory misses, capacity conflict misses, and coherence misses [6]. A compulsory miss occurs at the first reference to a data block by a given processor. A capacity conflict miss is a replacement miss, and a coherence ....
David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
.... Afsahi, and Ghodsieh Sakouie for their support and encouragement through the years. Chapter 1: Introduction. Research in the area of advanced computer architecture has been primarily focused on how to improve the performance of computers in order to solve computationally intensive problems [32, 62, 69]. Some of these problems are called grand challenges. A grand challenge is a fundamental problem in science or engineering that has a broad economic and/or scientific impact; coupled fields, geophysical, and astrophysical fluid dynamics (GAFD) turbulence, modeling the global climate system, ....
.... whether message passing or distributed shared memory (DSM) is a collection of complete computers, including processor and memory, that communicate through a general-purpose, high performance, scalable interconnection network using a communication assist (CA) and/or a network interface (NI) [32], as shown in Figure 1.1. [Figure 1.1: A generic parallel computer] Message passing multicomputers, among all known parallel architectures, are the best to achieve such computing ....
D. E. Culler, J. P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.
....machine could be prohibitive. For example, for a simple Full Map sharing code and for a 128 byte line size, the directory overhead (measured as sharing code size divided by memory line size) for a system with 256 nodes is 25%, but when the node count reaches 1024 this overhead becomes 100% [6]. Several sharing code schemes have been proposed in the literature with a variety of sizes. On the one hand, Dir 0 (None in this work) does not use any bit. Thus, for an N-node system, it always sends N-1 coherence messages (invalidations or cache to cache transfer orders) when the home node ....
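The overhead figures in this excerpt can be reproduced with simple arithmetic, assuming a full-map sharing code stores one presence bit per node (an assumption; the excerpt does not state the bit count). A minimal Python sketch, in which the helper name `full_map_overhead` is hypothetical:

```python
def full_map_overhead(num_nodes: int, line_size_bytes: int) -> float:
    """Directory overhead = sharing-code size / memory-line size.

    Assumes one presence bit per node (full-map sharing code).
    """
    sharing_code_bits = num_nodes          # assumption: 1 bit per node
    line_bits = line_size_bytes * 8        # memory line size in bits
    return sharing_code_bits / line_bits


for nodes in (256, 1024):
    # 128-byte lines, as in the excerpt
    print(f"{nodes} nodes: {full_map_overhead(nodes, 128):.0%}")
# prints "256 nodes: 25%" and "1024 nodes: 100%"
```

Under this assumption the sharing code grows linearly with node count while the line size stays fixed, which is why the overhead reaches the size of the line itself at 1024 nodes.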
.... SGI Origin 2000 multiprocessor [15], Tristate [1], Gray Tristate [16] and Home [16]. Other proposals reduce directory width by having a limited number of pointers per entry to keep track of sharers [1], [4], [22]. Differences between them are mainly found in the way they handle overflow situations [6]. A comparison with such directory schemes is out of the scope of this paper. A third alternative way of keeping track of sharers is the Chained directory protocol, such as the IEEE Standard Scalable Coherent Interface (SCI) [10]. It relies on distributing the sharing code between them. Each ....
D.E. Culler, J.P. Singh and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
....coherency optimizer: The number of Read and Write stream buffers was set to 32, each of these buffers being of depth 1. The address buffer's depth was set to 32. The size of the SD table was set to 256. 4.4 Benchmarks. All of the benchmarks used are taken from the SPLASH and SPLASH 2 [CSG98] suites, except the CG code, which is a NAS benchmark. This last code was tested using two different preconditioners, a diagonal one (CG DIA) and a polynomial one (CG POLY). Benchmarks can be sorted into four categories, depending upon two parameters: memory access regularity (spatial locality) ....
David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.
....0. Figure 5. OpenMP classes for dynamic information • non parallelized code: time needed to execute non parallelized code • seq fraction: non parallelized code duration • nr remote accesses: number of accesses to remote memory by load and store operations in ccNUMA machines [CuSiGu 99] • scheduling: time needed for scheduling operations (e.g. scheduling of threads) • additional calc: time needed for additional computations in parallelized code (e.g. to enforce a specific distribution of loop iterations) or for additional computations (e.g. where it is cheaper for all ....
....in bytes. This information can be measured if address range specific monitoring is supported, e.g. [KaLeObWa 98]. The last attribute of this class is page sums, which is a set of page level remote access counters. For example, the remote access counters on SGI Origin 2000 provide such information [CuSiGu 99]. With the help of additional mapping information, i.e. mapping variables to addresses, this information can be related back to program variables. Each object of class PageRemoteAccesses determines the page no and the number of remote accesses. The second attribute of SmRegionSummary is given by ....
D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
....Verilog implementation level. 1 Introduction Shared memory multiprocessors provide both scalability and a flexible programming model. These features, however, come at the expense of additional hardware complexity in the coherent memory subsystem. The Distributed Shared Memory (DSM) architecture [1, 2, 3] provides a logically shared address space, although the physical memory is distributed among the computing nodes. This organization creates an extended memory hierarchy that spans from the load store unit of a given processor through multiple levels of cache, and possibly across multiple nodes ....
David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach, pages 273--305 and 589--610. Morgan Kaufmann Publishers, 1998.
....parallel environments in which we experimented, the elements of a quadrant of a matrix are spread out in shared memory, and a single shared memory block can contain elements from two quadrants, and thus be written by the two processors computing those quadrants. This leads to false sharing [9]. In a message passing parallel environment such as those used in implementations of High Performance Fortran [28], typical array distributions would again spread a matrix quadrant over many processors, thereby increasing communication costs. The dilation effect can compromise single node ....
D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.
....that the communication layer provides a real overlap and an asynchronous execution of the communication is not obvious. Several papers have presented some ways to hide communication latency [1] or to use asynchronous communications to improve the implementation of parallel algorithms [3, 5]. In [4], a good presentation of communication latency hiding is given, including active messages [6]. In this paper, we study the possibility of communication overlap on a cluster of PCs interconnected with the high speed Myrinet network through two different communication layers: BIP (Basic ....
D.E. Culler, J. Pal Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998. ISBN 1-55860-343-3.
....The FSB is assumed to be a pipelined snooping bus that forces transactions to complete in the same order in which they were initiated. It is assumed that the cache coherence among the processors is implemented through the popular MESI (Modified, Exclusive, Shared, Invalid) coherence protocol [5]. Such a protocol results in two additional sources of traffic on the processor bus: 1. Invalidation of potential entries in other caches when a processor attempts to gain exclusive access over data currently in shared mode in its cache. 2. Attempt by a processor to access data in modified state ....
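The two traffic sources named in this excerpt can be illustrated with a highly simplified model of textbook MESI for a single cache line. This is a sketch under stated assumptions, not the citing paper's bus model: the function names `local_write` and `local_read` and the message labels ("invalidate", "bus_read", "flush") are hypothetical, and timing, arbitration, and write-back details are ignored.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def local_write(states, writer):
    """Processor `writer` writes the line; returns the bus messages generated."""
    messages = []
    if states[writer] in (State.SHARED, State.INVALID):
        # Traffic source 1: gaining exclusive access invalidates other copies.
        messages.append("invalidate")
        for p, st in enumerate(states):
            if p != writer and st is not State.INVALID:
                if st is State.MODIFIED:
                    messages.append("flush")  # owner supplies its dirty data
                states[p] = State.INVALID
    # From E or M the write is silent: no other cache holds a copy.
    states[writer] = State.MODIFIED
    return messages


def local_read(states, reader):
    """Processor `reader` reads the line; returns the bus messages generated."""
    messages = []
    if states[reader] is State.INVALID:
        messages.append("bus_read")
        holders = [p for p, st in enumerate(states)
                   if p != reader and st is not State.INVALID]
        for p in holders:
            if states[p] is State.MODIFIED:
                # Traffic source 2: access to data held modified elsewhere.
                messages.append("flush")
            states[p] = State.SHARED
        states[reader] = State.SHARED if holders else State.EXCLUSIVE
    return messages


line = [State.SHARED, State.SHARED, State.INVALID]
print(local_write(line, 0))  # ['invalidate']
print(local_read(line, 1))   # ['bus_read', 'flush']
```

In this model a write from the Shared state produces an invalidation, and a read of a line that another cache holds in Modified state forces that cache to flush, matching the two cases the excerpt lists.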
D.E. Culler and J.P. Singh, Parallel Computer Architecture -- A hardware/software approach, Morgan Kaufmann, 1999.
....main memory. Both local memories and main memory are interconnected through one or several buses (that are called memory buses) As the cache is physically partitioned among the clusters, coherence among the local caches and the main memory has to be kept. For this reason, a snoopy MSI protocol [5] has been implemented. This protocol is completely transparent to the ISA, and further, both the coherence and the bus arbitration are managed by the hardware. When a memory access misses in its local cache, the miss request is queued in a local MSHR (Miss information Status Handling Register) ....
D. Culler and J.P. Singh, "Parallel Computer Architecture. A Hardware/Software Approach", Morgan Kaufmann Publishers, Inc., 1999
....expanded further if the remote node memory is classified according to the distance in hops between the accessing processor and the accessed node. Table 1 shows the base contented memory access latency by one processor to the different levels of the Origin2000 memory hierarchy on a 16 node system [14]. The nodes of the Origin2000 are organized in a fat hypercube topology with two nodes on each edge. The difference in the access latency between the L1 and the L2 caches is one order of magnitude. The difference between the access latency of the L2 cache and local memory accounts for another ....
D. Culler, J. Pal Singh and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, August 1998.
....parallel environments in which we experimented, the elements of a quadrant of a matrix are spread out in shared memory, and a single shared memory block can contain elements from two quadrants, and thus be written by the two processors computing those quadrants. This leads to false sharing [9]. In a message passing parallel environment such as those used in implementations of High Performance Fortran [25] typical array distributions would again spread a matrix quadrant over many processors, thereby increasing communication costs. The dilation effect can compromise single node memory ....
D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.
.... a variety of machines, including the bus based SMP (Symmetric Multi Processors) such as the Silicon Graphics Challenge series[59] and the CC NUMA (Cache Coherent Non Uniform Memory Access) architectures such as the Stanford DASH[75] the Stanford FLASH[72] and the Silicon Graphics Origin series[29]. While parallel speedups measure the overall effectiveness of a parallel system, they are also highly machine dependent. Since parallel reductions incur more overhead than simple parallelization, not only do speedups depend on the number of processors, they are sensitive to many aspects of the ....
....set associative. The cache line size for the second level cache is 128 bytes. The Origin system consists of 780 MB/s SysAD bus (which is the memory bus of the two R10000 processors), 670 MB/s bandwidth for local memory access, and 780 MB/s bandwidth for node-to-network access each way [29]. 6.5.2. Commutative Updates. To evaluate the applicability of our reduction recognition algorithm, we apply our algorithm on all the programs in the SPEC92 floating point benchmark suite [105]. The suite is a set of 14 floating point programs. Figure 6-2 provides the program description and the ....
D. Culler, J. P. Singh, A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1999.
....apparently balanced cases where the overhead may be expected to be substantial. Let us proceed now to application performance in general. 2.3 Evaluation Using Workloads: Methods, Models and Metrics. This section discusses some key methodological issues for evaluation with real workloads (see also [16]). To evaluate a given machine, we need to choose (i) workloads, (ii) problem sizes for a given number of processors, and (iii) a scaling model for the application parameters (as processor count changes) with its associated performance ....
.... Both spatial and temporal locality interactions with machine parameters are often threshold effects, so they may require us to add problem sizes (to include situations representing both sides of the threshold when these situations are realistic, even if inherent characteristics don't change much [16]). We do so in this chapter, based on known application characteristics as in [84]. We also try to use a problem size that is as large as the system will let us run, to stress memory system and TLB interactions where relevant. 2.3.3 Metrics and Scaling Methods. Of the two metrics by which we may ....
D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
....such a prefetching request will be of no use to an executing program. A possible solution to this problem is not to discard a prefetching request on a TLB miss but to prefetch the translation information from a page table if it is not available in the TLB and then use it to prefetch the data [29, 13]. Obviously, such TLB prefetching can only work for the data residing on pages that can be accessed and are available in memory. 4.8 Classification of data related misses. We classify the cause of a miss to TLB and L2 cache that occurs on an access to heap allocated data into the following ....
D. Culler, J. P. Singh, and A. Gupta. "Parallel Computer Architecture: A Hardware/Software Approach". Morgan Kaufmann Publishers, 1998.
....for representative network sensor platform desire a clean open platform to explore alternatives. The problem we must tackle is strikingly similar to that of building efficient network interfaces, which also must maintain a large number of concurrent flows and juggle numerous outstanding events [20]. This has been tackled through physical parallelism [21] and virtual machines [27]. We tackle it by building an extremely efficient multithreading engine. As in TAM [22] and CILK [23], it maintains a two level scheduling structure, so a small amount of processing associated with hardware events can ....
D. Culler, J. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach, 1999.
No context found.
Culler, D., and Singh, J. P., with Gupta, A., Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.
No context found.
D. E. Culler, J. P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999
No context found.
D. E. Culler and J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.
No context found.
D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
No context found.
D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, Inc., 1998.
No context found.
D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, - http://www.cs.berkeley.edu/culler/book.alpha/
CiteSeer.IST - Copyright Penn State and NEC