Results 1 - 10 of 161
Token flow control
"... As companies move towards many-core chips, an efficient on-chip communication fabric to connect these cores assumes critical importance. To address limitations to wire delay scalability and increasing bandwidth demands, state-of-the-art on-chip networks use a modular packet-switched design with route ..."
Cited by 635 (35 self)
synthetic traffic and traces from the SPLASH-2 benchmark suite show reduction in packet latency by up to 77.1% with up to 39.6% reduction in average router energy consumption as compared to a state-of-the-art baseline packet-switched design. For the same saturation throughput as the baseline network, TFC
The SGI Origin: A ccNUMA highly scalable server
- In Proceedings of the 24th International Symposium on Computer Architecture (ISCA’97), 1997
"... The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from the ground up as a multiprocessor capable of scaling to both small and large processor counts without any bandwidth, laten ..."
Cited by 497 (0 self)
the Origin 2000 and then describes its architecture and implementation. In addition, performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH-2 applications. Finally, the Origin system is compared to other contemporary commercial ccNUMA systems.
TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems
- In Proceedings of the 1994 Winter USENIX Conference, 1994
"... TreadMarks is a distributed shared memory (DSM) system for standard Unix systems such as SunOS and Ultrix. This paper presents a performance evaluation of TreadMarks running on Ultrix using DECstation-5000/240's that are connected by a 100-Mbps switch-based ATM LAN and a 10-Mbps Ethernet. Ou ..."
Cited by 526 (17 self)
of Water from the SPLASH benchmark suite, we achieved only moderate speedups (4.0) due to the high communication and synchronization rate. Speedups decline on the 10-Mbps Ethernet (5.5 for Jacobi, 6.5 for TSP, 4.2 for Quicksort, 5.1 for ILINK, and 2.1 for Water), reflecting the bandwidth limitations
LogTM: Log-based transactional memory
- In HPCA, 2006
"... Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (r ..."
Cited by 282 (11 self)
detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32-way multiprocessor support our decision to optimize for commit by showing that only 1-2
PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on Chip-Multiprocessors
- Proceedings of the IEEE International Symposium on Workload Characterization (IISWC ’08), 2008
"... The PARSEC benchmark suite was recently released and has been adopted by a significant number of users within a short amount of time. This new collection of workloads is not yet fully understood by researchers. In this study we compare the SPLASH-2 and PARSEC benchmark suites with each other to gai ..."
Cited by 51 (3 self)
The PARSEC benchmark suite was recently released and has been adopted by a significant number of users within a short amount of time. This new collection of workloads is not yet fully understood by researchers. In this study we compare the SPLASH-2 and PARSEC benchmark suites with each other
Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems
- In Proceedings of the Operating Systems Design and Implementation Symposium, 1996
"... This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large a ..."
Cited by 160 (20 self)
overlapping to the base LRC protocol, with similar results. Our experiments were done using five of the Splash-2 benchmarks. We report overall execution times, as well as detailed breakdowns of elapsed time, message traffic, and memory use for each of the protocols.
A Communication Characterisation of Splash-2 and Parsec
"... Recent benchmark suite releases such as Parsec specifically utilise the tightly coupled cores available in chip multiprocessors to allow the use of newer, high performance, models of parallelisation. However, these techniques introduce additional irregularity and complexity to data sharing and are en ..."
Cited by 12 (0 self)
are presented for the full collection of Splash-2 and Parsec benchmarks. Our results aim to support the design of future communication systems for CMPs, encompassing coherence protocols, network-on-chip and thread mapping.
Neighborhood Prefetching on Multiprocessors Using Instruction History
, 2000
"... A multiprocessor prefetch scheme is described in which a miss is followed by a prefetch of a group of lines, a neighborhood, surrounding the demand-fetched line. The neighborhood is based on the data address and the past behavior of the instruction that missed the cache. A neighborhood for an instru ..."
Cited by 18 (0 self)
access patterns. Neighborhood prefetching was compared to adaptive sequential prefetching using execution-driven simulation. Results show more useful prefetches and lower execution time for neighborhood prefetching for six of eight SPLASH-2 benchmarks. On eight SPLASH-2 benchmarks the average normalized
The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures
, 1996
"... Most publicly-available simulation tools only simulate RISC architectures. These tools cannot capture the instruction mix and memory reference patterns of CISC architectures. In this paper, we present an overview of Augmint, an execution-driven multiprocessor simulation toolkit that fills this gap b ..."
Cited by 61 (6 self)
by supporting Intel x86 architectures. Augmint also supports trace-driven simulation for uniprocessors as well as multiprocessors, with minor effort on the part of simulator developers. Augmint runs m4-macro-extended C and C++ applications such as those in the SPLASH and SPLASH-2 benchmark suites. Augmint