66 citations found.
D. Culler, J. P. Singh, and A. Gupta. "Parallel Computer Architecture: A Hardware/Software Approach". Morgan Kaufmann Publishers, 1998.

This paper is cited in the following contexts:

The Location Consistency Memory Model and Cache.. - Wallace, Tremblay.. (2001)

....model, but in a way that is strictly stronger than necessary, disallowing certain behavior allowed by the memory model. 1 Introduction A shared memory multiprocessor machine is characterized by a collection of processors that exchange information with one another through a global address space [1, 6]. In such a machine, processors access memory locations concurrently through standard read and write instructions. Shared memory machines have various buffers where data written by a processor can be stored before it is shared with other processors. Thus, multiple values written to a single memory ....

.... (which previous write operations are currently visible) A memory consistency model is a contract between a program and the underlying machine architecture that constrains the order in which memory operations appear to be performed with respect to one another (i.e. become visible to processors) [6]. By constraining the order of operations, a memory consistency model determines which values can legally be returned by each read operation. The implementation of a memory consistency model in a shared memory machine with caches requires a cache protocol, that invalidates or updates cached values ....

D.E. Culler and J.P. Singh, with A. Gupta (1999). Parallel computer architecture: a hardware/software approach. Morgan Kaufmann.


List of Slides - Switching And Multiplexing

....turned on or off. Multiplexed inputs require a switching schedule (to enable demux) Internally nonblocking (the big win) Constraints: For N inputs, needs N^2 crosspoints O(N^2) time to set each cross point Susceptible to single faults. Used in High Performance Computing [1, 3]. Computer Communication Networks, CSI 516 W. A. Maniatty Limits of TSI Size Dept. of Computer Science, SUNYA 33.2 Multistage Crossbar Switching Pioneered by Clos, sometimes ....

D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., San Francisco, CA, first edition, 1999.


Dimensions of Verifying the Hardware-Software.. - Abts, Lilja.. (1999)

....As a result of these high-level properties there are many implications for the hardware implementation that deal with the issues of deadlock, livelock, fairness and starvation. These high level correctness properties provide a set of sufficient conditions for memory coherence and consistency [24,25,16,2,3,4]. Data coherence. A memory system is coherent if the value returned by a load is always the value from the latest store to the same memory location. Preservation of program order. The memory system will impose a serial order on all memory operations to the same address. The order in which memory ....

David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach, pages 273--305. Morgan Kaufmann Publishers, 1998.


Characterization of Locality in Loop-Parallel Programs - Kim, Voss, Eigenmann

....is obvious that they were in the same type of region (same) The symbol is used as a wildcard and refers to both categories. We further extend each classification by breaking them down into cache hits and misses. Our classification is based upon a 4 state (MESI) Write Back Invalidation Protocol [9]. In a write invalidation protocol, such as MESI, misses can be divided into compulsory misses, capacity conflict misses, and coherence misses [6]. A compulsory miss occurs at the first reference to a data block by a given processor. A capacity conflict miss is a replacement miss, and a coherence ....

David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.


Design and Evaluation of Communication Latency Hiding/Reduction.. - Afsahi (2000)

.... Afsahi, and Ghodsieh Sakouie for their support and encouragement through the years 1 Chapter 1 Introduction Research in the area of advanced computer architecture has been primarily focused on how to improve the performance of computers in order to solve computationally intensive problems [32, 62, 69]. Some of these problems are called grand challenges. A grand challenge is a fundamental problem in science or engineering that has a broad economic and/or scientific impact; coupled fields, geophysical, and astrophysical fluid dynamics (GAFD) turbulence, modeling the global climate system, ....

.... whether message passing or distributed shared memory (DSM) is a collection of complete computers, including processor and memory, that communicate through a general-purpose, high performance, scalable interconnection network using a communication assist (CA) and/or a network interface (NI) [32], as shown in Figure 1.1. [Figure 1.1: A generic parallel computer (processor P, cache, memory, communication assist, network interface, interconnection network)] Message passing multicomputers, among all known parallel architectures, are the best to achieve such computing ....

[Article contains additional citation context not shown here]

D. E. Culler, J. P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.


A New Scalable Directory Architecture for Large-Scale .. - Acacio, González..

....machine could be prohibitive. For example, for a simple Full Map sharing code and for a 128 byte line size, the directory overhead (measured as sharing code size divided by memory line size) for a system with 256 nodes is 25%, but when the node count reaches 1024 this overhead becomes 100% [6]. Several sharing code schemes have been proposed in the literature with a variety of sizes. On the one hand, Dir 0 (None in this work) does not use any bit. Thus, for a N node system, it always sends N-1 coherence messages (invalidations or cache to cache transfer orders) when the home node ....

.... SGI Origin 2000 multiprocessor [15] Tristate [1] Gray Tristate [16] and Home [16] Others proposals reduce directory width by having a limited number of pointers per entry to keep track of sharers [1], [4], [22]. Differences between them are mainly found in the way they handle overflow situations [6]. A comparison with such directory schemes is out of the scope of this paper. A third alternative way of keeping track of sharers is the Chained directory protocol, such as the IEEE Standard Scalable Coherent Interface (SCI) [10]. It relies on distributing the sharing code between them. Each ....

[Article contains additional citation context not shown here]

D.E. Culler, J.P. Singh and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.


Hardware Prediction for Data Coherency of Scientific Codes on.. - Acquaviva, Jalby (2000)

....coherency optimizer: The number of Read and Write stream buffers was set to 32, each of these buffers being of depth 1. The address buffer's depth was set to 32. The size of the SD table was set to 256. 4.4 Benchmarks All of the benchmarks used, are extracted coming from the SPLASH, SPLASH 2 [CSG98] suite except the CG code which is a NAS benchmarks. This last code was tested using two different preconditionners, a diagonal one (CG DIA) and a polynomial one (CG POLY). Benchmarks can be sorted into four categories, depending upon two parameters: memory access regularity (spatial locality) ....

David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture a Hardware/Software approach. Morgan Kaufmann, 1998.


Formalizing OpenMP Performance Properties with ASL - Fahringer, Gerndt, Riley.. (1999)

....0. Figure 5. OpenMP classes for dynamic information • non parallelized code: time needed to execute non parallelized code • seq fraction: non parallelized code duration • nr remote accesses: number of accesses to remote memory by load and store operations in ccNUMA machines [CuSiGu 99] • scheduling: time needed for scheduling operations (e.g. scheduling of threads) • additional calc: time needed for additional computations in parallelized code (e.g. to enforce a specific distribution of loop iterations) or for additional computations (e.g. where it is cheaper for all ....

....in bytes. This information can be measured if address range specific monitoring is supported, e.g. [KaLeObWa 98]. The last attribute of this class is page sums which is a set of page level remote access counters. For example, the remote access counters on SGI Origin 2000 provide such information [CuSiGu 99]. With the help of additional mapping information, i.e. mapping variables to addresses, this information can be related back to program variables. Each object of class PageRemoteAccesses determines the page no and the number of remote accesses. The second attribute of SmRegionSummary is given by ....

D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publisher Inc., 1999.


Toward Complexity-Effective Verification: A Case Study of.. - Abts, Lilja, Scott (2000)   (1 citation)

....Verilog implementation level. 1 Introduction Shared memory multiprocessors provide both scalability and a flexible programming model. These features, however, come at the expense of additional hardware complexity in the coherent memory subsystem. The Distributed Shared Memory (DSM) architecture [1, 2, 3] provides a logically shared address space, although the physical memory is distributed among the computing nodes. This organization creates an extended memory hierarchy that spans from the load store unit of a given processor through multiple levels of cache, and possibly across multiple nodes ....

David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach, pages 273--305 and 589--610. Morgan Kaufmann Publishers, 1998.


Recursive Array Layouts and Fast Matrix Multiplication - Chatterjee, Lebeck.. (1999)   (13 citations)

....parallel environments in which we experimented, the elements of a quadrant of a matrix are spread out in shared memory, and a single shared memory block can contain elements from two quadrants, and thus be written by the two processors computing those quadrants. This leads to false sharing [9]. In a message passing parallel environment such as those used in implementations of High Performance Fortran [28] typical array distributions would again spread a matrix quadrant over many processors, thereby increasing communication costs. The dilation effect can compromise single node ....

D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.


Asynchronous Communications in MPI - the BIP/Myrinet.. - Chaussumier, Desprez..

....that the communication layer provides a real overlap and an asynchronous execution of the communication is not obvious. Several papers have presented some ways to hide communication latency [1] or to use asynchronous communications to improve the implementation of parallel algorithms [3, 5]. In [4], a good presentation of communication latency hiding is presented, including active messages [6]. In this paper, we study the possibility of communication overlap on a cluster of PCs interconnected with the high speed Myrinet network through two different communication layers: BIP (Basic ....

D.E. Culler, J. Pal Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998. ISBN 1-55860-343-3.


A Server Performance Model for Static Web Workloads - Kant, Sundaram (2000)

....The FSB is assumed to be a pipelined snooping bus that enforces the transactions to complete in the same order in which they were initiated. It is assumed that the cache coherence among the processors is implemented through the popular MESI (Modify Exclusive Shared Invalidate) coherence protocol [5]. Such a protocol results in two additional sources of traffic on the processor bus: 1. Invalidation of potential entries in other caches when a processor attempts to gain exclusive access over data currently in shared mode in its cache. 2. Attempt by a processor to access data in modified state ....

D.E. Culler and J.P. Singh, Parallel Computer Architecture -- A hardware/software approach, Morgan Kaufmann, 1999.


Modulo Scheduling for a Fully-Distributed Clustered . . . - Sánchez, al. (2000)

....main memory. Both local memories and main memory are interconnected through one or several buses (that are called memory buses). As the cache is physically partitioned among the clusters, coherence among the local caches and the main memory has to be kept. For this reason, a snoopy MSI protocol [5] has been implemented. This protocol is completely transparent to the ISA, and further, both the coherence and the bus arbitration are managed by the hardware. When a memory access misses in its local cache, the miss request is queued in a local MSHR (Miss information Status Handling Register) ....

D. Culler and J.P. Singh, "Parallel Computer Architecture. A Hardware/Software Approach", Morgan Kaufmann Publishers, Inc., 1999


Is Data Distribution Necessary in OpenMP? - Nikolopoulos, Papatheodorou.. (2000)   (1 citation)

....expanded further if the remote node memory is classified according to the distance in hops between the accessing processor and the accessed node. Table 1 shows the base contented memory access latency by one processor to the different levels of the Origin2000 memory hierarchy on a 16 node system [14]. The nodes of the Origin2000 are organized in a fat hypercube topology with two nodes on each edge. The difference in the access latency between the L1 and the L2 caches is one order of magnitude. The difference between the access latency of the L2 cache and local memory accounts for another ....

D. Culler, J. Pal Singh and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, August 1998.


Recursive Array Layouts and Fast Parallel Matrix.. - Chatterjee, Lebeck.. (1999)   (17 citations)

....parallel environments in which we experimented, the elements of a quadrant of a matrix are spread out in shared memory, and a single shared memory block can contain elements from two quadrants, and thus be written by the two processors computing those quadrants. This leads to false sharing [9]. In a message passing parallel environment such as those used in implementations of High Performance Fortran [25] typical array distributions would again spread a matrix quadrant over many processors, thereby increasing communication costs. The dilation effect can compromise single node memory ....

D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.


SUIF Explorer: An Interactive and Interprocedural Parallelizer - Liao (2000)   (31 citations)

.... a variety of machines, including the bus based SMP (Symmetric Multi Processors) such as the Silicon Graphics Challenge series[59] and the CC NUMA (Cache Coherent Non Uniform Memory Access) architectures such as the Stanford DASH[75] the Stanford FLASH[72] and the Silicon Graphics Origin series[29]. While parallel speedups measure the overall effectiveness of a parallel system, they are also highly machine dependent. Since parallel reductions incur more overhead than simple parallelization, not only do speedups depend on the number of processors, they are sensitive to many aspects of the ....

....set associative. The cache line size for the second level cache is 128 bytes. The Origin system consists of 780 MB/s SysAD bus (which is the memory bus of the two R10000 processors), 670 MB/s bandwidth for local memory access, and 780 MB/s bandwidth for node-to-network access each way[29]. 6.5.2. Commutative Updates To evaluate the applicability of our reduction recognition algorithm, we apply our algorithm on all the programs in the SPEC92 floating point benchmark suite[105]. The suite is a set of 14 floating point programs. Figure 6-2 provides the program description and the ....

[Article contains additional citation context not shown here]

D. Culler, J. P. Singh, A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1999.


Performance Portability and Scalability in Shared-Address-Space.. - Jiang (2000)

....apparently balanced cases where the overhead may be expected to be substantial. Let us proceed now to application performance in general. 2.3 Evaluation Using Workloads: Methods, Models and Metrics This section discusses some key methodological issues for evaluation with real workloads (see also [16]). To evaluate a given machine, we need to choose (i) workloads, (ii) problem sizes for a given number of processors, and (iii) a scaling model for the application parameters (as processor count changes) with its associated performance ....

.... Both spatial and temporal locality interactions with machine parameters are often threshold effects, so they may require us to add problem sizes (to include situations representing both sides of the threshold when these situations are realistic, even if inherent characteristics don't change much [16]). We do so in this chapter, based on known application characteristics as in [84]. We also try to use a problem size that is as large as the system will let us run, to stress memory system and TLB interactions where relevant. 2.3.3 Metrics and Scaling Methods Of the two metrics by which we may ....

[Article contains additional citation context not shown here]

D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.


Characterizing the Memory Behavior of Java Workloads: A .. - Shuf, Serrano, Gupta.. (2000)   (15 citations)  Self-citation (Singh Gupta)

....such a prefetching request will be of no use to an executing program. A possible solution to this problem is not to discard a prefetching request on a TLB miss but to prefetch the translation information from a page table if it is not available in the TLB and then use it to prefetch the data [29, 13]. Obviously, such TLB prefetching can only work for the data residing on pages that can be accessed and are available in memory. 4.8 Classification of data related misses We classify the cause of a miss to TLB and L2 cache that occurs on an access to heap allocated data into the following ....

D. Culler, J. P. Singh, and A. Gupta. "Parallel Computer Architecture: A Hardware/Software Approach". Morgan Kaufmann Publishers, 1998.


System Architecture Directions for Networked Sensors - Hill, Szewczyk, Woo..   (296 citations)  Self-citation (Culler)

....for representative network sensor platform desire a clean open platform to explore alternatives. The problem we must tackle is strikingly similar to that of building efficient network interfaces, which also must maintain a large number of concurrent flows and juggle numerous outstanding events [20]. This has been tackled through physical parallelism [21] and virtual machines [27]. We tackle it by building an extremely efficient multithreading engine. As in TAM [22] and CILK [23] it maintains a two level scheduling structure, so a small amount of processing associated with hardware events can ....

D. Culler, J. Singh, and A. Gupta. Parallel computer architecture a hardware/software approach, 1999.


-740: Computer Architecture Fall 2000 Syllabus - Course Details At   Self-citation (Culler Singh With)

No context found.

Culler, D., and Singh, J. P., with Gupta, A., Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.


A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue .. - Tsigas, Zhang (2000)

No context found.

D. E. Culler, J. P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999


On the Design and Implementation of a Parallel.. - Kamath, Baldwin.. (2000)

No context found.

D. E. Culler and J. P. Singh, Parallel Computer Architectures A Hardware/Software Approach, Morgan Kaufmann, 1999.


Advances in Design and Implementation of Optimization Software - Maros, al. (2000)

No context found.

D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.


Computer Architecture Support for Database Applications - Keeton (1999)   (3 citations)

No context found.

D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, Inc., 1998.


Cluster Computing: The Commodity Supercomputing - Baker, Buyya (1988)

No context found.

D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, - http://www.cs.berkeley.edu/culler/book.alpha/

CiteSeer.IST - Copyright Penn State and NEC