Results 1 - 10 of 145
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches
IEEE/ACM International Symposium on Microarchitecture, 2006
"... This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resource ..."
Abstract
-
Cited by 260 (5 self)
- Add to MetaCart
(Show Context)
This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
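To make the utility idea concrete, the following is a minimal sketch of a greedy way-allocation pass driven by per-way hit curves of the kind the paper's monitoring circuits collect; the paper itself uses a lookahead algorithm to handle non-convex utility curves, and the curves and names below are invented for illustration only.

```python
def partition_ways(hits_with_ways, total_ways):
    """Greedy utility-based way allocation (simplified sketch).

    hits_with_ways[app][w] = hits the app would get with w ways
    (w = 0..total_ways), e.g. derived from shadow-tag monitors.
    Returns the number of ways granted to each app.
    """
    num_apps = len(hits_with_ways)
    alloc = [0] * num_apps
    for _ in range(total_ways):
        # Marginal utility of one more way = extra hits (misses removed).
        best_app = max(
            range(num_apps),
            key=lambda a: hits_with_ways[a][alloc[a] + 1] - hits_with_ways[a][alloc[a]],
        )
        alloc[best_app] += 1
    return alloc

# Hypothetical per-way hit curves for two co-running applications.
curves = [
    [0, 50, 90, 120, 140, 150, 155, 158, 160],  # app 0: diminishing returns
    [0, 10, 20, 30, 40, 50, 60, 70, 80],        # app 1: streaming-like
]
print(partition_ways(curves, 8))  # -> [5, 3] on these assumed curves
```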
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
IEEE/ACM International Symposium on Microarchitecture, 2006
"... This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-c ..."
Abstract
-
Cited by 134 (11 self)
- Add to MetaCart
This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect for overcoming non-uniform cache access latency for good program performance and for reducing on-chip network traffic and related power consumption. Unlike previously studied hardware-based private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.
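As a sketch of how OS page allocation can steer data to particular L2 banks, the fragment below assumes the bank is selected by the address bits just above the page offset; the field widths and the simple preferred-bank policy are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of OS-level page allocation steering data to L2 banks.
PAGE_SHIFT = 12          # 4 KiB pages (assumed)
NUM_BANKS = 16           # assumed number of L2 banks/tiles

def bank_of(phys_addr: int) -> int:
    """L2 bank implied by a physical address (bits right above the page offset)."""
    return (phys_addr >> PAGE_SHIFT) & (NUM_BANKS - 1)

def allocate_page(free_pages_by_bank, preferred_bank):
    """Pick a free physical page whose color maps to the preferred bank,
    falling back to any bank that still has free pages."""
    if free_pages_by_bank[preferred_bank]:
        return free_pages_by_bank[preferred_bank].pop()
    for bank, pages in enumerate(free_pages_by_bank):
        if pages:
            return pages.pop()
    raise MemoryError("no free pages")
```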
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
IEEE/ACM International Symposium on Microarchitecture, 2007
"... A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor perfor- mance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performa ..."
Abstract
-
Cited by 122 (18 self)
- Add to MetaCart
(Show Context)
A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model Non-uniform Cache Access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, we also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and present a case study to showcase how the tool can improve architecture research methodologies.
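The loop below sketches the shape of such a design space exploration: sweep bank counts and wire types and keep the configuration minimizing a weighted delay/energy cost. The cost model and every number are invented placeholders, not CACTI 6.0's actual models or parameters.

```python
# Illustrative design-space sweep in the spirit of a NUCA exploration tool.
from itertools import product

wire_types = {            # (relative latency per mm, relative energy per mm)
    "global_rc": (1.0, 1.0),
    "fat_low_latency": (0.6, 1.4),
    "low_swing_diff": (1.3, 0.3),
}
bank_counts = [16, 32, 64, 128]

def evaluate(banks, wire, cache_mm2=100.0):
    """Toy cost model: more banks -> faster per-bank access but more hops."""
    lat_mm, en_mm = wire_types[wire]
    hop_len = (cache_mm2 / banks) ** 0.5          # assumed tile edge length
    avg_hops = banks ** 0.5                        # rough average hop count
    latency = 5.0 / banks ** 0.5 + avg_hops * hop_len * lat_mm
    energy = 1.0 / banks ** 0.5 + avg_hops * hop_len * en_mm
    return latency, energy

best = min(
    product(bank_counts, wire_types),
    key=lambda cfg: 0.7 * evaluate(*cfg)[0] + 0.3 * evaluate(*cfg)[1],
)
print("chosen (banks, wire):", best)
```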
Virtual hierarchies to support server consolidation
ISCA '07, 2007
"... Server consolidation is becoming an increasingly popular technique to manage and utilize systems. This paper develops CMP memory systems for server consolidation where most sharing occurs within Virtual Machines (VMs). Our memory systems maximize shared memory accesses serviced within a VM, minimize ..."
Abstract
-
Cited by 74 (2 self)
- Add to MetaCart
Server consolidation is becoming an increasingly popular technique to manage and utilize systems. This paper develops CMP memory systems for server consolidation where most sharing occurs within Virtual Machines (VMs). Our memory systems maximize shared memory accesses serviced within a VM, minimize interference among separate VMs, facilitate dynamic reassignment of VMs to processors and memory, and support content-based page sharing among VMs. We begin with a tiled architecture where each of 64 tiles contains a processor, private L1 caches, and an L2 bank. First, we reveal why single-level directory designs fail to meet workload consolidation goals. Second, we develop the paper’s central idea of imposing a two-level virtual (or logical) coherence hierarchy on a physically flat CMP that harmonizes with VM assignment. Third, we show that the best of our two virtual hierarchy (VH) variants performs 12-58% better than the best alternative flat directory protocol when consolidating Apache, OLTP, and Zeus commercial workloads on our simulated ...
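A rough sketch of the two-level lookup idea, where a VM-local first-level directory is consulted before escalating to the global second level; the data structures and protocol states below are simplified assumptions, not the paper's coherence protocol.

```python
# Two-level (virtual hierarchy) lookup sketch: requests are first resolved
# inside the requesting core's VM, and only misses escalate globally.
class VirtualHierarchy:
    def __init__(self, vm_of_core):
        self.vm_of_core = vm_of_core          # core id -> VM id
        self.l1_dir = {}                      # (vm, block) -> sharer cores
        self.l2_dir = {}                      # block -> VMs holding a copy

    def read(self, core, block):
        vm = self.vm_of_core[core]
        sharers = self.l1_dir.get((vm, block))
        if sharers:                           # satisfied inside the VM
            sharers.add(core)
            return "intra-VM"
        # Escalate to the global level-2 directory, then install in the VM.
        self.l2_dir.setdefault(block, set()).add(vm)
        self.l1_dir[(vm, block)] = {core}
        return "inter-VM/memory"

vh = VirtualHierarchy({0: "vmA", 1: "vmA", 2: "vmB"})
print(vh.read(0, 0x80))   # inter-VM/memory (first touch)
print(vh.read(1, 0x80))   # intra-VM (same VM already has it)
```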
Computation spreading: Employing hardware migration to specialize CMP cores on-the-fly
In Proc. of the 12th ASPLOS, 2006
"... In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among di ..."
Abstract
-
Cited by 71 (8 self)
- Add to MetaCart
(Show Context)
In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45–65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors. We present Computation Spreading (CSP), which employs hardware migration to distribute a thread’s dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS intensive server applications: separating application level (user) computation from the OS calls it makes. When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27–58%, private L2 load misses by 0–19%, and branch mispredictions by 9–25%.
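The toy dispatcher below illustrates the fragment-steering idea: OS fragments from all threads are grouped onto OS-specialized cores, while user fragments stay on each thread's home core. The core sets and the modulo assignment rule are assumptions for illustration, not the paper's two policies.

```python
# Computation-spreading sketch: steer OS vs. user fragments to different cores.
OS_CORES = [0, 1]                     # cores specialized for OS computation
USER_HOME = {10: 2, 11: 3, 12: 2}     # thread id -> user-specialized home core

def core_for(thread_id, in_syscall):
    """Pick the core for a thread's next computation fragment."""
    if in_syscall:
        # Group OS fragments onto the OS cores so their caches and branch
        # predictors stay warm with OS code.
        return OS_CORES[thread_id % len(OS_CORES)]
    # User fragments run on the thread's home core.
    return USER_HOME[thread_id]

print(core_for(10, in_syscall=False))  # -> 2 (user home core)
print(core_for(10, in_syscall=True))   # -> 0 (an OS-specialized core)
```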
ASR: Adaptive Selective Replication for CMP Caches
In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006
WCET analysis for multi-core processors with shared L2 instruction caches
In RTAS '08: Proceedings of the 2008 IEEE Real-Time and Embedded Technology and Applications Symposium, 2008
"... Multi-core chips have been increasingly adopted by microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very c ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
(Show Context)
Multi-core chips have been increasingly adopted by the microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very challenging due to the possible runtime inter-core interferences in using shared resources such as the shared L2 caches. As the first step toward time-predictable multi-core computing, this paper presents a novel approach to bounding the worst-case performance for threads running on multi-core processors with shared L2 instruction caches. The idea of our approach is to compute the worst-case instruction access interferences between different threads based on the program control flow information of each thread, which can be statically analyzed. Our experiments indicate that the proposed approach can reasonably estimate the worst-case shared L2 instruction cache misses by considering inter-thread instruction conflicts. Also, the WCET of applications running on multi-core processors estimated by our approach is much better than the estimation obtained by simply assuming all L2 instruction accesses are misses.
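A simplified sketch of the conservative conflict check such an analysis might perform: any instruction block that shares an L2 set with a block from a co-running thread loses its always-hit classification. The cache geometry and the classification rule are illustrative; a real WCET analysis combines a check like this with per-thread control-flow and abstract cache-state analysis.

```python
# Conservative inter-thread conflict classification for a shared L2 I-cache.
L2_SETS = 1024
BLOCK_SIZE = 64

def l2_set(addr):
    return (addr // BLOCK_SIZE) % L2_SETS

def conflict_sets(other_threads_blocks):
    """Union of L2 sets touched by instruction blocks of co-running threads."""
    return {l2_set(b) for blocks in other_threads_blocks for b in blocks}

def classify(my_blocks, other_threads_blocks):
    """Return {block_addr: 'always-hit' | 'possible-miss'} for one thread."""
    dirty_sets = conflict_sets(other_threads_blocks)
    return {
        b: "possible-miss" if l2_set(b) in dirty_sets else "always-hit"
        for b in my_blocks
    }

print(classify([0x0, 0x40, 0x1000], [[0x0], [0x80040]]))
```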
Interconnect Design Considerations for Large NUCA Caches
In Proceedings of the 34th International Symposium on Computer Architecture (ISCA-34), 2007
"... The ever increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to ..."
Abstract
-
Cited by 47 (10 self)
- Add to MetaCart
(Show Context)
The ever increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures – NUCA). Most studies on NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search. Since wire/router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power. We extend the widely-used CACTI cache modeling tool to take network design parameters into account. With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies. To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network. The careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC, and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement.
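The back-of-the-envelope model below shows the kind of latency accounting this line of work argues for, where per-hop router and link delays are added to the bank access time and a heterogeneous (faster-link) option is compared against a generic baseline; all cycle counts are invented placeholders, not the paper's results.

```python
# Toy NUCA access latency model: hops * (router + link delay) + bank access.
def nuca_latency(hops, router_cycles, link_cycles, bank_cycles):
    return hops * (router_cycles + link_cycles) + bank_cycles

# Homogeneous baseline vs. a hypothetical option with faster (e.g. fatter
# or low-swing) links on part of the inter-bank network.
baseline = nuca_latency(hops=8, router_cycles=3, link_cycles=2, bank_cycles=6)
fast_links = nuca_latency(hops=8, router_cycles=3, link_cycles=1, bank_cycles=6)
print(baseline, fast_links)   # 46 vs 38 cycles under these assumed numbers
```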
Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches
In HPCA, 2009
"... ..."
(Show Context)
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors
International Symposium on High Performance Computer Architecture, 2007
"... The significant speed-gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache archite ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
The significant speed-gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme continuously estimates the effect of increasing or decreasing the shared partition size on the overall performance. We show that our scheme outperforms both private and shared cache organizations as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighboring core partitions.
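A hill-climbing sketch of the adaptive control loop the abstract describes: each interval, estimate performance with one more or one fewer shared way and move in the better direction. The performance estimator here is a stand-in; the paper's runtime estimation mechanism is not detailed in this listing.

```python
# Hill-climbing adjustment of the shared partition size, in cache ways.
def adjust_shared_ways(current, estimate_perf, min_ways=0, max_ways=16):
    """Return the next shared-partition size given a performance estimator."""
    candidates = {current}
    if current < max_ways:
        candidates.add(current + 1)
    if current > min_ways:
        candidates.add(current - 1)
    return max(candidates, key=estimate_perf)

# Hypothetical performance estimator peaking at 6 shared ways.
perf = lambda ways: -(ways - 6) ** 2
size = 2
for _ in range(6):
    size = adjust_shared_ways(size, perf)
print(size)  # converges to 6 under the assumed estimator
```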