Results 1 - 10 of 34
Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors
- International Symposium on Computer Architecture, 2005
"... In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile 2) all s ..."
Abstract
-
Cited by 163 (3 self)
- Add to MetaCart
(Show Context)
In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile, or 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.
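As a rough illustration of the replica-placement decision summarized above, the following minimal sketch shows one way a local L2 slice might decide whether to keep a copy of a primary-cache victim. The class and function names, the set layout, and the exact eviction priority order are illustrative assumptions, not the paper's design.

```python
# Sketch of a victim-replication decision for one local L2 slice.
# Names, fields, and the priority order below are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class L2Line:
    tag: int
    valid: bool = False
    is_replica: bool = False   # replica of a line whose home is another slice
    sharers: int = 0           # sharer count tracked for lines homed here

def choose_replica_slot(set_lines: List[L2Line]) -> Optional[L2Line]:
    """Pick a line that may be overwritten by a replica.

    Assumed preference: an invalid line, then an existing replica, then a
    home ("global") line with no remote sharers.  If only actively shared
    home lines remain, give up and do not replicate.
    """
    for keep_candidate in (lambda l: not l.valid,
                           lambda l: l.is_replica,
                           lambda l: not l.is_replica and l.sharers == 0):
        for line in set_lines:
            if keep_candidate(line):
                return line
    return None

def on_primary_cache_victim(local_set: List[L2Line], victim_tag: int) -> bool:
    """Try to keep a copy of an evicted L1 line in the local L2 slice."""
    slot = choose_replica_slot(local_set)
    if slot is None:
        return False           # actively shared lines win; no replica is made
    slot.tag, slot.valid, slot.is_replica, slot.sharers = victim_tag, True, True, 0
    return True
```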
Cooperative caching for chip multiprocessors
- In Proceedings of the 33rd Annual International Symposium on Computer Architecture, 2006
"... Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these ch ..."
Abstract
-
Cited by 145 (1 self)
- Add to MetaCart
Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these challenges, we propose CMP cooperative caching (CC), a unified framework to efficiently organize and manage on-chip cache resources. By forming a globally managed, shared cache using cooperative private caches, CC can effectively support two important caching applications: (1) reduction of average memory access latency and (2) isolation of destructive inter-thread interference. CC reduces the average memory access latency by balancing between cache latency and capacity optimizations. Based on private caches, CC naturally exploits their access latency benefits. To improve the effective cache capacity, CC forms a "shared" cache using replication control and LRU-based global replacement policies. Via cooperation throttling, CC provides a spectrum of caching behaviors between the two extremes of private and shared caches, thus enabling dynamic adaptation to suit workload requirements. We show that CC can achieve a robust performance advantage over private and shared cache schemes across different processor, cache and memory configurations, and a wide selection of multithreaded and multiprogrammed workloads.
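The cooperation-throttling idea can be sketched as a simple probabilistic spill of "singlet" victims (the only on-chip copy) into a peer's private cache. The cache structure, the random peer choice, and the parameter name are assumptions used only to show how a single probability can slide behavior between the private and shared extremes.

```python
# Sketch of cooperation throttling over private caches.
# Structures and the peer-selection policy are illustrative assumptions.

import random
from typing import Dict, Set

class PrivateCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines: Set[int] = set()        # block addresses; LRU omitted for brevity

    def accept_spill(self, addr: int) -> bool:
        if len(self.lines) < self.capacity:
            self.lines.add(addr)
            return True
        return False

def evict(caches: Dict[int, PrivateCache], owner: int, addr: int,
          cooperation_prob: float) -> None:
    """Handle an eviction from `owner`'s private cache under cooperative caching."""
    caches[owner].lines.discard(addr)
    is_singlet = all(addr not in c.lines for c in caches.values())
    if is_singlet and random.random() < cooperation_prob:
        peer = random.choice([i for i in caches if i != owner])
        caches[peer].accept_spill(addr)     # keep the only on-chip copy alive
    # cooperation_prob = 0 behaves like plain private caches;
    # cooperation_prob = 1 approximates one aggregate shared cache.
```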
ASR: Adaptive selective replication for CMP caches
- In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006
"... ..."
Victim migration: Dynamically adapting between private and shared cmp caches
- 2005
"... Future CMPs will have more cores and greater onchip cache capacity. The on-chip cache can either be divided into separate private L2 caches for each core, or treated as a large shared L2 cache. Private caches provide low hit latency but low capacity, while shared caches have higher hit latencies but ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
(Show Context)
Future CMPs will have more cores and greater on-chip cache capacity. The on-chip cache can either be divided into separate private L2 caches for each core, or treated as a large shared L2 cache. Private caches provide low hit latency but low capacity, while shared caches have higher hit latencies but greater capacity. Victim replication was previously introduced as a way of reducing the average hit latency of a shared cache by allowing a processor to make a replica of a primary cache victim in its local slice of the global L2 cache. Although victim replication performs well on multithreaded and single-threaded codes, it performs worse than the private scheme for multiprogrammed workloads where there is little sharing between the different programs running at the same time. In this paper, we propose victim migration, which improves on victim replication by adding an additional set of migration tags on each node, which are used to implement an exclusive cache policy for replicas. When a replica has been created on a remote node, it is not also cached on the home node, but only recorded in the migration tags. This frees up space on the home node to store shared global lines or replicas for the local processor. We show that victim migration performs better than private, shared, and victim replication schemes across a range of single-threaded, multithreaded, and multiprogrammed workloads, while using less area than a private cache design. Victim migration provides a reduction in average memory access latency of up to 10% over victim replication.
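The home-node bookkeeping described above can be pictured with a short sketch of a home slice that keeps migration tags alongside its data: when a remote node replicates a line, the home drops its copy and only remembers where the line went. The dict-based structures and method names are illustrative assumptions, not the paper's hardware organization.

```python
# Sketch of a home L2 slice with migration tags (exclusive policy for replicas).
# Structures and names are illustrative assumptions.

from typing import Dict, Optional

class HomeSlice:
    def __init__(self):
        self.data: Dict[int, bytes] = {}          # lines cached at the home node
        self.migration_tags: Dict[int, int] = {}  # addr -> node holding the migrated copy

    def migrate_to(self, addr: int, node: int) -> None:
        """A remote node made a local replica: keep the line exclusively there."""
        self.data.pop(addr, None)                 # free home capacity for other lines
        self.migration_tags[addr] = node

    def lookup(self, addr: int) -> Optional[object]:
        if addr in self.data:
            return self.data[addr]                # ordinary home hit
        if addr in self.migration_tags:
            return ("forward_to", self.migration_tags[addr])  # fetch from the replica's node
        return None                               # miss: go to the directory / memory
```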
The Effectiveness of SRAM Network Caches in Clustered DSMs
- 1998
"... The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
(Show Context)
The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off: hit ratio versus speed. Some recent commercial systems have opted for large and slow (S)DRAM network caches (NC), but others completely avoid them because of their damaging effects on the remote/local latency ratio. In this paper, we explore small and fast SRAM network caches as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters. The major appeal of SRAM NCs is that they add less penalty to the latency of NC hits and remote accesses. Their small capacity can handle conflict misses and a limited amount of capacity misses. However, they can be coupled with main memory...
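To make the structure concrete, here is a tiny sketch of a small, direct-mapped network cache probed on the remote-access path before the request travels to the home node. The sizes, the indexing, and the function names are assumptions chosen only to show where such a cache sits, not the paper's configuration.

```python
# Sketch of a small, direct-mapped SRAM network cache on the remote-read path.
# NC_SETS, LINE_BITS, and fetch_from_home are illustrative assumptions.

NC_SETS = 1024        # small: SRAM-sized, not DRAM-sized
LINE_BITS = 6         # 64-byte lines

nc_tags = [None] * NC_SETS
nc_data = [None] * NC_SETS

def remote_read(addr: int, fetch_from_home) -> bytes:
    index = (addr >> LINE_BITS) % NC_SETS
    tag = addr >> LINE_BITS
    if nc_tags[index] == tag:
        return nc_data[index]        # NC hit: short SRAM latency, no remote traversal
    data = fetch_from_home(addr)     # NC miss: pay the full remote latency
    nc_tags[index], nc_data[index] = tag, data   # allocate on fill
    return data
```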
Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures
- In Proceedings of SPAA-19, 2007
"... As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
(Show Context)
As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures. Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs and propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory, and minimizing the distance over which communicated data moves across the network. The proposed schemes result in speedups of up to 74.9% for our workloads.
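The core selection step, picking the spatially closest sharer to source a cache-to-cache transfer, can be shown with a short sketch over a 2D mesh using Manhattan hop count. The mesh size, core numbering, and function names are assumptions for illustration.

```python
# Sketch of proximity-aware sharer selection on a 2D mesh.
# The 4x4 mesh and row-major core numbering are illustrative assumptions.

MESH_W = 4

def coords(core: int):
    return core % MESH_W, core // MESH_W

def hops(a: int, b: int) -> int:
    ax, ay = coords(a)
    bx, by = coords(b)
    return abs(ax - bx) + abs(ay - by)    # Manhattan distance on the mesh

def pick_source(requester: int, sharers: set[int]) -> int:
    """Directory chooses the spatially closest sharer to forward the miss to."""
    return min(sharers, key=lambda s: hops(requester, s))

# Example: core 0 misses on a line cached at cores 3, 5, and 15.
assert pick_source(0, {3, 5, 15}) == 5   # core 5 is 2 hops away; core 3 is 3 hops
```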
The Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols
- 1998
"... that I have read this dissertation and that in my opinion it is ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
(Show Context)
Cache-only memory architectures
- IEEE Computer Magazine, 1999
"... A Cache-Only Memory Architecture (COMA) is a type of cache-coherent nonuniform memory access (CC-NUMA) architecture. Unlike in a conventional CC-NUMA architecture, in a COMA, every sharedmemory module in the machine is a cache, where each memory line has a tag with the line’s address and state. As a ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
A Cache-Only Memory Architecture (COMA) is a type of cache-coherent nonuniform memory access (CC-NUMA) architecture. Unlike in a conventional CC-NUMA architecture, in a COMA, every shared-memory module in the machine is a cache, where each memory line has a tag with the line’s address and state. As a processor references a line, it transparently brings it to both its private cache(s) and its nearby portion of the NUMA shared memory (Local Memory) – possibly displacing a valid line from its local memory. Effectively, each shared-memory module acts as a huge cache memory, giving the name COMA to the architecture. Since the COMA hardware automatically replicates the data and migrates it to the memory module of the node that is currently accessing it, COMA increases the chances of data being available locally. This reduces the possibility of frequent long-latency memory accesses. Effectively, COMA dynamically adapts the shared data layout to the application’s reference patterns.
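The "attraction memory" behavior described above, local memory managed as a tagged cache that pulls in whatever the node references, can be sketched briefly. The class, the fully associative dict, and the random displacement are deliberate simplifications; a real COMA must also guarantee that the last copy of a line is never lost, which this sketch only notes in a comment.

```python
# Sketch of a COMA attraction memory: local memory lines carry a tag and a
# coherence state, and referenced lines migrate into the local module.
# Structures and the displacement policy are illustrative assumptions.

import random
from typing import Dict, Tuple

class AttractionMemory:
    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines: Dict[int, Tuple[str, bytes]] = {}   # line address -> (state, data)

    def reference(self, line_addr: int, fetch_remote) -> bytes:
        if line_addr in self.lines:
            return self.lines[line_addr][1]              # local attraction-memory hit
        state, data = fetch_remote(line_addr)            # migrate/replicate the line here
        if len(self.lines) >= self.num_lines:
            victim = random.choice(list(self.lines))     # displace some valid line
            self.lines.pop(victim)                       # (real COMA must preserve the last copy)
        self.lines[line_addr] = (state, data)
        return data
```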
Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors
- Fifth International Conference on High Performance Computer Architecture (HPCA-5), 1999
"... Cache coherent non-uniform memory access (CC-NUMA) multiprocessors continue to suffer from remote memory access latencies due to comparatively slow memory technology and data transfer latencies in the interconnection network. In this paper, we propose a novel hardware caching technique, called switc ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
Cache coherent non-uniform memory access (CC-NUMA) multiprocessors continue to suffer from remote memory access latencies due to comparatively slow memory technology and data transfer latencies in the interconnection network. In this paper, we propose a novel hardware caching technique, called switch cache. The main idea is to implement small fast caches in crossbar switches of the interconnect medium to capture and store shared data as they flow from the memory module to the requesting processor. This stored data acts as a cache for subsequent requests, thus reducing the latency of remote memory accesses tremendously. The implementation of a cache in a crossbar switch needs to be efficient and robust, yet flexible for changes in the caching protocol. The design and implementation details of a CAche Embedded Switch ARchitecture, CAESAR, using wormhole routing with virtual channels is presented. Using detailed execution-driven simulations, we find that the CAESAR switch cache is capable ...
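The core idea, a switch that snoops read replies passing through it and answers later requests itself, can be shown with a minimal sketch. The message handling, the dict-backed store, and the eviction of the oldest inserted line are illustrative assumptions, not the CAESAR design.

```python
# Sketch of a switch-embedded cache that captures memory replies in flight.
# Structures and the eviction policy are illustrative assumptions.

from typing import Dict, Optional

class SwitchCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store: Dict[int, bytes] = {}

    def on_reply(self, addr: int, data: bytes) -> None:
        """Capture shared data as the memory reply is routed through the switch."""
        if addr not in self.store and len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))       # evict the oldest inserted line
        self.store[addr] = data

    def on_request(self, addr: int) -> Optional[bytes]:
        """If the line was captured earlier, answer from the switch itself."""
        return self.store.get(addr)                      # None: forward to the home memory
```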
PRISM: An Integrated Architecture for Scalable Shared Memory
- In Proceedings of the Fourth IEEE Symposium on High-Performance Computer Architecture, 1998
"... This paper describes PRISM, a distributed shared-memory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance. PRISM's hardware provides mechanisms for flexible management and dynamic configuration of shared-memory pages with ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
This paper describes PRISM, a distributed shared-memory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance. PRISM's hardware provides mechanisms for flexible management and dynamic configuration of shared-memory pages with different behaviors. As an example, PRISM can configure individual shared-memory pages in both CC-NUMA and Simple-COMA styles, maintaining the advantages of both without incorporating any of their disadvantages. PRISM's operating system is structured as multiple independent kernels, where each kernel manages the resources on its local node. PRISM's system structure minimizes the amount of global coordination when managing shared-memory: page faults do not involve global TLB invalidates, and pages can be replicated and migrated without requiring global coordination. The structure also provides natural fault containment boundaries around each node because physical addresses do not address re...
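A per-page behavior attribute of the kind described above can be sketched as follows: each virtual page carries a mode, and a load either goes to the remote home (CC-NUMA style) or is served from a locally materialized copy of the page (Simple-COMA style). The enum, the page-table dict, and the callback names are assumptions about one way to express the mechanism, not PRISM's implementation.

```python
# Sketch of per-page shared-memory behavior (CC-NUMA vs. Simple-COMA style).
# All names and the page-copy shortcut are illustrative assumptions.

from enum import Enum
from typing import Dict

PAGE = 4096

class Mode(Enum):
    CC_NUMA = 1     # remote data lives only in the hardware caches
    S_COMA = 2      # page is backed by local memory on first touch

page_mode: Dict[int, Mode] = {}           # virtual page number -> configured behavior
local_backing: Dict[int, bytearray] = {}  # S-COMA pages materialized on this node

def load(vaddr: int, remote_read_byte, remote_read_page) -> int:
    vpn, off = divmod(vaddr, PAGE)
    mode = page_mode.get(vpn, Mode.CC_NUMA)
    if mode is Mode.CC_NUMA:
        return remote_read_byte(vaddr)                   # misses always go to the home node
    if vpn not in local_backing:                         # first touch: copy the page locally
        local_backing[vpn] = bytearray(remote_read_page(vpn))
    return local_backing[vpn][off]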