30 citations found.
Kuck, D. et al. (1993) The Cedar system and an initial performance study. In Proc. 20th Ann. Int. Symp. on Computer Architecture, San Diego, CA, May 16--19, pp. 213--223. IEEE Computer Society Press, Los Alamitos, CA.


This paper is cited in the following contexts:


Architectural Support for Parallel Reductions in.. - Garzaran.. (2001)   (1 citation)  (Correct)

.... 7 Related Work. Nearly all of the past work on reduction parallelization has been based on software-only transformations [8, 27]. The most related architectural work that we are aware of is the work of Larus et al. [20], Zhang et al. [28], and the work on advanced synchronization mechanisms [3, 9, 10, 16, 17, 18, 23, 24, 25, 29]. Larus et al. briefly mention an idea similar to PCLR as one application of their Reconcilable Shared Memory (RSM) [20]. RSM is a family of memory systems whose behavior can be controlled by the compiler. They use RSM to support programming language constructs. The paper only mentions the ....

.... Such work includes the Full/Empty bit of the HEP multiprocessor [25], the atomic Fetch&Add primitive of the NYU Ultracomputer [10], the Fetch&Op synchronization primitives of the IBM RP3 [3, 23], support for combining trees [16, 24], the memory-based synchronization primitives in Cedar [17, 18, 29], and the set of synchronization primitives proposed by Goodman et al. [9]. 8 Summary. In this paper, we have proposed new architectural support to speed up parallel reductions in scalable shared-memory multiprocessors. The support consists of architectural modifications that are mostly confined ....

D. J. Kuck et al. The Cedar System and an Initial Performance Study. In Proc. 20th Annual Intl. Symp. on Computer Architecture, pages 213--224, May 1993.


SurfBoard - A Hardware Performance Monitor for SHRIMP - Karlin, Clark, Martonosi (1999)   (3 citations)  (Correct)

....responses based on observed per-page statistics, but did not allow for general interrupts based on any observed statistic. In contrast, our hardware monitor supports both such specific studies as well as more general monitoring. The performance monitoring system for Cedar used simple histograms [14], while IBM RP3 used a small set of hardware event counters [6]. The Intel Paragon includes rudimentary per-node counters [18] but cannot measure message latency. Histogram-based hardware monitors were also used to measure uniprocessor performance in the VAX models 11/780 and 8800 [8, 11]. These ....

D. Kuck, E. Davidson, D. Lawrie, et al. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 213--223, May 1993.


Maintaining Cache Coherence through Compiler-Directed Data.. - Lim, Yew (1998)   (Correct)

.... the implementation and performance evaluation of software prefetching algorithms on real systems [2, 30] Since data prefetching is a well established technique, hardware support for prefetching has been provided in several experimental and commercial multiprocessors, such as the Illinois Cedar [18], the KSR1 [17] and the Cray T3D [7] However, most of these efforts focused only on the traditional application of data prefetching for memory latency hiding, particularly in the context of hardware cache coherent systems. When used in this manner, data prefetching will not violate program ....

D. Kuck et al. The Cedar system and an initial performance study. In Proceedings of the 20th International Symposium on Computer Architecture, pages 213--223, May 1993.


Chief: A Simulation Environment for Studying Parallel Systems - Pavlos Konas (1994)   (1 citation)  (Correct)

.... for execution driven simulation, and critical path simulation can all be controlled by the execution of benchmark codes instrumented at the source level [9] Another useful instrumentation tool adds library routine calls for hardware or software performance monitoring on the Cedar multiprocessor [10]. 2.3 Types of Simulations A variety of types of simulations can be performed in Chief depending on the level of detail required, taking cost accuracy tradeoffs into account. The simulation capabilities of Chief are divided into three major categories: critical path simulation, trace driven ....

....serial or parallel codes produce traces when executed on serial or parallel host machines and critical path simulation produces optimistic parallel traces from serial codes. Architecture dependent traces are acquired using hardware or software monitoring, for example on the Cedar multiprocessor [10], or using other tools such as Pixie [13] EPG sim provides execution driven simulation capabilities in Chief. Serial or parallel application codes are instrumented to form execution driven event generators. The resulting event generators are coupled with parallel system simulators, using a ....

D. Kuck et al., "The Cedar System and an Initial Performance Study," in Proceedings of International Symposium on Computer Architecture, pp. 213--223, 1993.


Benefits of Processor Clustering in Designing Large Parallel.. - Basak Panda (1995)   (Correct)

....research on clustered systems have focused mostly on proposing and proving different interconnection topologies. Examples of hierarchical configurations proposed by researchers in the last decade to build scalable systems using processor clusters include cluster of processors with buses and MINs[9], local and global meshes[7] and two level systems based on hypercube and other network topologies[12, 17] Other researchers have studied the design problem under very realistic packaging constraints [14, 5, 20] and proved that under such conditions clustering becomes more useful. However, these ....

D. J. Kuck et al. The Cedar System and an Initial Performance Study. In Proc. of the Int'l Symposium on Computer Architecture, pages 213--223, 1993.


Data Prefetching And Data Forwarding In Shared Memory.. - Poulsen, Yew (1994)   (13 citations)  (Correct)

.... prefetching schemes have the disadvantage of introducing instruction overhead for prefetching (and related) instructions; this can be a significant performance issue [2, 3] Although data prefetching has been shown to be effective in reducing memory latency in shared memory multiprocessors [6, 7], few multiprocessor studies have considered the implementation of compiler algorithms for multiprocessor prefetching [8] and the performance impact of these algorithms on large, numerical applications with loop level and vector parallelism. Data prefetching may not be the most effective technique ....

....for the latter accesses in each block. An example of this type of prefetching transformation for a simple vector statement is given in Figure 1. This algorithm is inspired by techniques employed in the Cedar multiprocessor for parallelizing vector operations and fetching vectors from global memory [7]. a(i:j) b(i:j) c(m:n) do k = 0,j i,N itmp = min(k N 1,j i) prefetch (b( k:itmp i) prefetch (c( k:itmp m) a( k:itmp i) b( k:itmp i) c( k:itmp m) end do Figure 1 Blocked Vector Prefetching Strategy The multiprocessor software pipelined prefetching algorithm supports ....

[Article contains additional citation context not shown here]

Kuck, D., et al., "The Cedar System and an Initial Performance Study", Proceedings of ISCA, 1993, pp. 213-223.


The Performance of the Cedar Multistage Switching Network - Torrellas, Zhang (1997)   (12 citations)  (Correct)

....switching networks perform in real systems and how to compare their performance [13] In this paper, we present an in depth empirical analysis of a multistage switching network in a vector multiprocessor. We use hardware probes to monitor the omega network [8] of the Cedar shared memory machine [7] executing real applications. The machine is configured with 16 CPUs. We examine each individual queue, switch element, and link in the network. The analysis suggests that the performance of multistage switching networks is limited by traffic non uniformities. We identify two major ....

....setup, and the workloads run. 2.1 The Cedar Machine and Its Network In this research, we use a hardware performance monitor to examine the network of Cedar. Cedar is a shared memory multiprocessor developed at the Center for Supercomputing Research and Development, University of Illinois [7]. The machine has 32 Alliant FX 8 vector processors. Limited performance monitoring hardware, however, forces us to disable 16 of them. Processors do not stall on writes; however, they stall with two pending scalar reads. Messages in a vector access are pipelined. All processors share 64 Mbytes of ....

D. Kuck et al. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 213--224, May 1993.


Codesign for Real-Time Video Applications - Wilberg (1996)   (1 citation)  (Correct)

....processor. But automatic partitioning on a coarse-grained level would require some kind of parallelizing compiler that automatically extracts parts of the specification which can be executed in parallel. This has been a research topic in multiprocessing since a long time [210] [158] [90] [77] [113]. But good performance is only achieved if the designer controls the parallelization [29] [110]. Thus extending the automatic partitioning approaches to complex, high-performance applications seems at least very difficult. A performance analysis based on the pixie tool is used for the software ....

D. Kuck et al. The cedar system and an initial performance study. In Proc. 20th Annual Int. Symp. Computer Architecture, pages 213--223, San Diego, May 1993.


The SHRIMP Performance Monitor: Design and Applications - Martonosi, Clark, Mesarina (1996)   (10 citations)  (Correct)

.... [Figure 3: Multiplexing metrics to form a 20 bit histogram address.] In ....

[Article contains additional citation context not shown here]

D. Kuck, E. Davidson, D. Lawrie, et al. The Cedar System and an Initial Performance Study. In Proc. 20th Int'l Symp. on Computer Architecture, pages 213--223, May 1993.


Scalable Architectures with k-ary n-cube cluster-c Organization - Basak, Panda (1993)   (2 citations)  (Correct)

....interconnected together to build large scale systems. A variety of hierarchical configurations have been proposed by researchers in the last decade to build scalable systems, using either single processor or processor cluster per node. Some examples include cluster of processors with buses and MINs [7], local and global meshes [5], two-level systems based on hypercube and other network topologies [9, 15], and combination of intra cluster bus and inter cluster mesh hypercube networks [12]. Two desired features in parallel architectures are scalability of the system and its ability to exploit the ....

D. J. Kuck et al. The Cedar System and an Initial Performance Study. In Proc. of the Int'l Symposium on Comp. Arch., pp. 213--223, 1993.


Hardware And Compiler Support For Cache Coherence In Large-Scale.. - Choi (1996)   (5 citations)  (Correct)

....many commercially available large scale multiprocessors, such as the Cray T3D [31] and the Intel Paragon [23], do not provide hardware coherent caches. In several early multiprocessor systems, such as the CMU C.mmp [50], the NYU Ultracomputer [29], the IBM RP3 [6], and the Illinois Cedar [34], compiler directed techniques were used to solve the cache coherence problem. In this approach, cache coherence is maintained locally without the need for interprocessor communication or hardware directories. The C.mmp was the first to allow read only shared data to be kept in private caches ....

....space, or volatile data can be invalidated. Thus, the cache coherence mechanism on the RP3 is more flexible than that of the Ultracomputer, since the compiler can minimize the amount of over invalidation by selecting the most suitable invalidation granularity. Illinois Cedar The Illinois Cedar [34] is a cluster based, shared memory multiprocessor. It consists of four clusters (each with eight processors) that are connected to a globally shared memory. In each cluster, the processors are connected to a 4 way interleaved shared cache, which is in turn connected to an interleaved cluster ....

[Article contains additional citation context not shown here]

D. Kuck, E. Davidson, et al. The Cedar System and an Initial Performance Study. Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 213--223, May 1993.


Extracting data flow information for parallelizing FORTRAN nested .. - Walker (1994)   (1 citation)  (Correct)

....the Meiko CS 2 can be described with fM; c local g j 1 and fN; c distant , COMPUTE g assuming typical values. Shared memory multiprocessors like the Cray Y MP will have fM; c local ; c distant g j 1 with fN; COMPUTEg assuming typical values. Experimental hierarchical multiprocessors like Cedar [35] and virtual shared memory architectures like KSR1 and BBN GP1000 can also be modelled with fM , N; c local ; c distant ; COMPUTEg assuming their associated typical values. 5.3 Tracking statements and shadow variables In our performance ....

....j 1 which simulates multiprocessors like the BBN GP 1000. In the hierarchical memory model, M 1 and the processors and memory modules in each cluster are assumed to be connected by a bus. The memory model, therefore, simulates hierarchical architectures like the experimental Cedar multiprocessor [35]. As was noted in chapter 5, our processor model is concisely described by the parametric set f M;N; c local ; c distant COMPUTE g. We summarise the parametric sets used to define our performance estimation experiments in tables 6.2, 6.3 and 6.4. DO ....

D. Kuck, E. Davidson, D. Lawrie, A. Sameh, D. Padua, and P. Yew, "The Cedar System and an Initial Performance Study", University of Illinois at Urbana-Champaign, CSRD Tech. Report No. 1261, 1993.


Performance Analysis Of Multiprocessor Interconnection Networks.. - Turner (1995)   (Correct)

....and use traces of application kernels to drive simulations of variations in its design. These simulations are used to determine system bottlenecks, suggest improvements in the design, verify our simulation methodology, and confirm the results of our previous work regarding the performance of MINs [1, 2, 3]. This examination of a real machine forms the basis of our later evaluation of a more modern shared memory multiprocessor design and provides evidence that valid estimations of system performance are possible through use of detailed traffic and hardware models. The next section discusses the ....

D. Kuck, et al., "The cedar system and an initial performance study," in Proceedings 20th International Symposium on Computer Architecture, pp. 213--223, May 1993.


The Performance of the Cedar Multistage Switching Network - Torrellas, Zhang (1997)   (12 citations)  (Correct)

....multistage switching networks in real systems and how to compare their performance [17]. In this paper, we present an in depth empirical analysis of a multistage switching network in a vector multiprocessor. We use hardware probes to monitor the omega network [9] of the Cedar shared memory machine [8] executing real applications. The machine is configured with 16 CPUs. The analysis of queues, switch elements, and links suggests that the performance of multistage switching networks is limited by traffic non uniformities. We identify two major non uniformities that degrade Cedar's performance ....

....setup, and the workloads run. 2.1 The Cedar Machine and Its Network In this research, we use a hardware performance monitor to examine the network of Cedar. Cedar is a shared memory multiprocessor developed at the Center for Supercomputing Research and Development (CSRD) University of Illinois [8]. The machine has 32 Alliant FX 8 vector processors. Limited performance monitoring hardware, however, forces us to disable 16 of them. Processors do not stall on writes; however, they stall with two pending scalar reads. Messages in a vector access are pipelined. All processors share 64 Mbytes of ....

D. Kuck et al. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 213--224, May 1993.


Comparing the Performance and Programmability of the DASH.. - Torrellas, Koufaty   (Correct)

....Program Performance. 1 Introduction. Scalable shared memory multiprocessors are attractive because they achieve the benefits of large scale processing without surrendering much programmability. Several such machines have been built or are currently being built, for example RP3 [15], Cedar [11], KSR1 [10], DASH [12], DDM1 [8], Alewife [5], NYU Ultracomputer [7], or the Tera computer system [1]. While all these machines support the shared-memory paradigm, they have substantial differences. For example, some of them use hardware schemes to maintain the caches coherent, while others rely on ....

....to increase the effectiveness of the clusters. Such cache is called remote access cache (RAC) [12]. We will study its effectiveness. 2.2 The Cedar Machine. Cedar is a 4 cluster vector multiprocessor developed at the Center for Supercomputing Research and Development, University of Illinois [11]. Each cluster is an 8 processor bus based Alliant FX 8. Each processor has a 16 Kbyte direct mapped instruction cache. In addition, all processors in a cluster share a 512 Kbyte direct mapped cache and 64 Mbytes of memory visible only to the cluster. All processors in the machine are connected to ....

[Article contains additional citation context not shown here]

D. Kuck et al. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 213--224, May 1993.


Hardware and Compiler-Directed Cache Coherence in Large-Scale.. - Choi, Yew (1996)   (1 citation)  (Correct)

....instead, provide software mechanisms while relying mostly on users to maintain data coherence either through language extensions or message passing paradigms. In several early multiprocessor systems, such as the CMU C.mmp [38], the NYU Ultracomputer [23], the IBM RP3 [6], and the Illinois Cedar [27], compiler directed techniques were used to solve the cache coherence problem. In this approach, cache coherence is maintained locally without the need for interprocessor communication or hardware directories. The C.mmp was the first to allow read only shared data to be kept in private caches ....

....compiler directed (HSCD) cache coherence scheme, called the two phase invalidation (TPI) scheme which relies mostly on compiler analysis, yet also provides a reasonable amount of hardware support. This approach has a long history of predecessors, including C.mmp [38], IBM's RP3 [6], Illinois Cedar [27], and several recently proposed schemes [10, 14, 12, 13, 18, 21, 29, 30]. The TPI scheme can be implemented on a large scale multiprocessor using off the shelf microprocessors, and can be adapted to various cache organizations, including multi word cache lines and byte addressable architectures. ....

D. Kuck, E. Davidson, et al. The Cedar System and an Initial Performance Study. Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 213--223, May 1993.


Scalability of the Cedar System - Turner, Veidenbaum (1994)   (4 citations)  (Correct)

....improving the scalability of performance. We do so by examining the behavior of Cedar via hardware monitoring and by simulating larger versions of it. These simulations are used to assess improvements to the design and to confirm the results of our previous work regarding the performance of MINs [1, 2, 3, 4]. This research was supported by the Department of Energy under Grant No. DE FG02 85ER25001 and by the National Science Foundation under Grants No. US NSF MIP 8410110 and NSF MIP 89 20891, IBM Corporation, and the State of Illinois. A major issue in any simulation study is verification: do the ....

D. Kuck, et al., "The cedar system and an initial performance study," in Proceedings 20th International Symposium on Computer Architecture, pp. 213--223, May 1993.


Comparing the Performance of the DASH and Cedar.. - Torrellas, Koufaty.. (1994)   (Correct)

....is supported in hardware: a snoopy based protocol within each cluster and a directory based one across clusters. We examine a machine configured with 8 clusters. Cedar is a 4 cluster vector multiprocessor developed at the Univ. of Illinois Center for Supercomputing Research and Development [3]. Each cluster is an 8 CPU bus based Alliant FX 8. All processors in a cluster share a 512 Kbyte direct mapped cache and 64 Mbytes of memory visible only to the cluster. Fast synchronization is possible via a per-cluster synchronization bus. Each processor has a 4 Kbyte prefetch buffer. All ....

D. Kuck et al. The Cedar System and an Initial Performance Study. In Proc. of ISCA '93, pages 213--224, May 1993.


An Efficient Algorithm for the Run-time Parallelization of .. - Chen, Torrellas, Yew (1994)   (25 citations)  (Correct)

....[14] it speeds up execution by significantly reducing the amount of communication required and by increasing the overlap among dependent iterations. The effectiveness of this algorithm is evaluated via measurements of parameterized loops in the 32 processor Cedar shared memory multiprocessor [5]. The results show speedups that reach up to 14 with the full overhead of the analysis and up to 27 if part of the analysis work is reused across loop invocations. Moreover, our algorithm outperforms the older scheme with the same generality in nearly all cases, reaching a 37 fold speedup when the ....

....phases over the Zhu Yew algorithm. In addition, it removes redundant operations in the inspector. It may, however, increase the spinlocking during execution since the processor may have to wait for A(I2(i) key to become valid. 4 The Experimental System. Our experiments are timing runs on Cedar [5], a 32-processor scalable shared memory multiprocessor designed at the Center for Supercomputing Research and Development. The machine has 4 clusters of 8 processors each. The latency to access the cache is 170 ns, while accessing the cluster memory takes 1190 ns and accessing the global memory ....

D. Kuck et al. The Cedar system and an initial performance study. In 20th Int'l Symp. on Computer Architecture, May 1993.


Integrating Fine-Grained Message Passing In Cache Coherent.. - Yew, Poulsen (1996)   (1 citation)  Self-citation (Kuck)   (Correct)

....misses. Multiprocessor caches can hide memory latency for sharing accesses by exploiting spatial locality, but increasing cache block sizes may lead to undesirable false sharing [3] Data prefetching has been shown to be effective in reducing memory latency in shared memory multiprocessors [4, 5]; however, while data prefetching has the ability to hide memory latency for both sharing and nonsharing accesses, data forwarding [6, 7] may be a more effective technique than data prefetching for reducing the latency of sharing accesses. While many of the data forwarding mechanisms previously ....

....the concomitant cost of this support [8] Adding processor support for prefetching and forwarding instructions is straightforward. Several machines already support these types of operations, including the Stanford Dash multiprocessor [9] the Kendall Square KSR1 [22] and the Cedar multiprocessor [5]. The prefetch and Forwarding Write operations assumed in this work most closely resemble the Dash Prefetch and Dash Deliver instructions. A prefetch instruction is similar to a regular load, except that it is non blocking and is dropped on exceptions [23, 24] A Forwarding Write is similar to a ....

Kuck, D., et al., "The Cedar System and an Initial Performance Study", Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993, pp. 213-223.


Integrating Fine-Grained Message Passing In Cache Coherent.. - Poulsen, Yew (1996)   (3 citations)  Self-citation (Kuck)   (Correct)

....misses. Multiprocessor caches can hide memory latency for sharing accesses by exploiting spatial locality, but increasing cache block sizes may lead to undesirable false sharing [3] Data prefetching has been shown to be effective in reducing memory latency in shared memory multiprocessors [4, 5]; however, while data prefetching has the ability to hide memory latency for both sharing and nonsharing accesses, data forwarding [6, 7] may be a more effective technique than data prefetching for reducing the latency of sharing accesses. This paper studies and compares the performance advantages ....

....the concomitant cost of this support [8] Adding processor support for prefetching and forwarding instructions is straightforward. Several machines already support these types of operations, including the Stanford Dash multiprocessor [14] the Kendall Square KSR1 [22] and the Cedar multiprocessor [5]. The prefetch and Forwarding Write operations assumed in this work most closely resemble the Dash Prefetch and Dash Deliver instructions. A prefetch instruction is similar to a regular load, except that it is non blocking and is dropped on exceptions [23, 24] A Forwarding Write is similar to a ....

Kuck, D., et al. The Cedar system and an initial performance study. Proc. 20th Annual International Symposium on Computer Architecture. 1993, pp. 213-223. (30)


Fault-Tolerant Hierarchical Networks for Shared Memory .. - Mahmud, Samaratunga.. (2002)   (Correct)

No context found.

Kuck, D. et al. (1993) The Cedar system and an initial performance study. In Proc. 20th Ann. Int. Symp. on Computer Architecture, San Diego, CA, May 16--19, pp. 213--223. IEEE Computer Society Press, Los Alamitos, CA.


Compiler Optimizations For Parallel Loops With Fine-Grained.. - Chen (1994)   (5 citations)  (Correct)

No context found.

D. Kuck et al. The Cedar system and an initial performance study. In 20th Int'l Symp. on Computer Architecture, May 1993.


Run-time parallelization of irregular DOACROSS loops - Thulasiraman, Krothapalli, .. (1995)   (Correct)

No context found.

D. Kuck et al., The Cedar system and an initial performance study, in 20th Int'l Symp. on Computer Architecture, May 1993.

