3D-Stacked Memory Architectures for Multi-Core Processors
- In International Symposium on Computer Architecture
Cited by 33 (3 self)
Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75× speedup over previously proposed 3D-DRAM approaches on our memory-intensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.
PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-core Shared Caches
- In Proc. of the 36th Intl. Symp. on Computer Architecture, 2009
Cited by 13 (0 self)
Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
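The combination of per-core insertion positions and incremental promotion described above can be illustrated with a minimal sketch. It models a single cache set as a priority list and is not the paper's exact mechanism: the class name `PseudoPartitionedSet`, the `insert_pos` mapping, and the `promote_prob` value are all assumptions made for illustration.

```python
import random


class PseudoPartitionedSet:
    """One cache set kept as a priority list; index 0 is the next victim."""

    def __init__(self, ways, insert_pos, promote_prob=0.75, rng=None):
        self.ways = ways
        self.insert_pos = insert_pos      # hypothetical per-core insertion position
        self.promote_prob = promote_prob  # hypothetical promotion probability
        self.rng = rng or random.Random(0)
        self.lines = []                   # (core, tag) pairs, lowest priority first

    def access(self, core, tag):
        """Return True on a hit and False on a miss."""
        for i, entry in enumerate(self.lines):
            if entry == (core, tag):
                # Hit: promote by a single position instead of jumping to the top.
                if i + 1 < len(self.lines) and self.rng.random() < self.promote_prob:
                    self.lines[i], self.lines[i + 1] = self.lines[i + 1], self.lines[i]
                return True
        # Miss: evict the lowest-priority line if the set is full.
        if len(self.lines) == self.ways:
            self.lines.pop(0)
        # Insert at the core's own position; a low position limits its footprint.
        pos = min(self.insert_pos[core], len(self.lines))
        self.lines.insert(pos, (core, tag))
        return False


# Core 0 inserts near the top; core 1 (a streaming core) inserts at the bottom.
s = PseudoPartitionedSet(ways=4, insert_pos={0: 3, 1: 0}, promote_prob=1.0)
for t in "abcd":
    s.access(0, t)
for t in "xyz":
    s.access(1, t)
print([tag for _, tag in s.lines])
```

In this toy run, the streaming core displaces one of core 0's lines on its first miss and afterwards only replaces its own line, which is the partition-like behavior that the insertion position is meant to approximate.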
Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy
Cited by 8 (1 self)
3D-integration is a promising technology to help combat the “Memory Wall” in future multi-core processors. Past work has considered using 3D-stacked DRAM as a large last-level cache (LLC). While significant performance benefits can be gained with such an approach, there remain additional opportunities beyond the simple integration of commodity DRAM chips. In this work, we leverage the hardware organization typical of DRAM architectures to propose new cache management policies that would otherwise not be practical for standard SRAM-based caches. We propose a cache where each set is organized as multiple logical FIFO or queue structures that simultaneously provide performance isolation between threads as well as reduce the number of entries occupied by dead lines. Our results show that beyond the simplistic approach of stacking DRAM as cache, such tightly-integrated 3D architectures enable new opportunities for optimizing and improving system performance.
Criticality-based optimizations for efficient load processing
- In HPCA-15, 2009
Cited by 3 (0 self)
Some instructions have more impact on processor performance than others. Identification of these critical instructions can be used to modify and improve instruction processing. Previous work has shown that the criticality of instructions can be dynamically predicted with high accuracy, and that this information can be leveraged to optimize the performance of load value prediction and instruction steering for clustered architectures. In this work, we revisit the idea of criticality, but we propose several processor enhancements that can exploit criticality information and can be directly applied to modern x86 microarchitectures. For the investment of a small (less than 1KB) criticality predictor, we can make a conventional single-read-port data cache achieve the performance of an ideal dual-read-port cache, yielding an average 10% performance improvement. Our remaining techniques can reuse the predictor (i.e., no additional overhead) to further optimize other aspects of load processing (e.g., caching decisions, store-to-load forwarding, etc.), yielding an overall performance improvement of 16% over a conventional processor. Some of these techniques also allow us to decrease power and area costs for several related hardware structures.
PEEP: Exploiting Predictability of Memory Dependences in SMT Processors
Cited by 1 (0 self)
Simultaneous Multithreading (SMT) attempts to keep a dynamically scheduled processor’s resources busy with work from multiple independent threads. Threads with long-latency stalls, however, can lead to a reduction in overall throughput because they occupy many of the critical processor resources. In this work, we first study the interaction between stalls caused by ambiguous memory dependences and SMT processing. We then propose the technique of Proactive Exclusion (PE) where the SMT fetch unit stops fetching from a thread when a memory dependence is predicted to exist. However, after the dependence has been resolved, the thread is delayed waiting for new instructions to be fetched and delivered down the front-end pipeline. So we introduce an Early Parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves. We show that combining these two techniques (PEEP) yields a 16.9% throughput improvement on a 4-way SMT processor that supports speculative memory disambiguation. These strong results indicate that a fetch policy that is cognizant of future stalls considerably improves the throughput of an SMT machine.
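The timing idea behind Early Parole can be shown with a small arithmetic sketch. It assumes a fixed front-end latency and a predicted resolution cycle that is known in advance; the function names and the cycle values are hypothetical and do not come from the paper.

```python
def early_parole_cycle(predicted_resolve_cycle, frontend_depth, now):
    """Cycle at which to resume fetch so that new instructions reach the
    execution core as the dependence resolves (never earlier than `now`)."""
    return max(now, predicted_resolve_cycle - frontend_depth)


def arrival_cycle(restart_cycle, frontend_depth):
    """Cycle at which instructions fetched at `restart_cycle` reach the core,
    assuming a fixed front-end pipeline depth."""
    return restart_cycle + frontend_depth


# Hypothetical numbers: dependence predicted to resolve at cycle 100,
# a 12-stage front end, and the thread excluded at cycle 60.
restart = early_parole_cycle(100, 12, now=60)
print(restart, arrival_cycle(restart, 12))   # early parole
print(100, arrival_cycle(100, 12))           # restarting only at resolution
```

Under these assumptions, restarting fetch only once the dependence resolves would leave the thread waiting for a full front-end traversal, while the early restart hides that delay.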
A Modular 3D Processor for Flexible Product Design and Technology Migration
- CF'08, 2008
Cited by 1 (0 self)
The current methodology used in mass-market processor design is to create a single base microarchitecture (e.g., Intel’s “Core” or AMD’s “K8”) that is used throughout all of the PC market segments from laptops to servers. To differentiate the products, manufacturers rely on speed binning, different cache sizes, and varying the number of cores. In this paper, we propose using 3D integration to provide a new, but complementary, approach to providing product differentiation. Past research on using 3D to improve performance has focused on the construction of “fully 3D” circuits where functional blocks are partitioned across two or more layers. This approach forces one of two undesirable situations: (1) all products must be implemented in, and therefore pay the cost of, 3D or (2) a 3D-implemented processor is designed for the high-end/high-performance markets and a separate 2D microarchitecture must be designed for the lower-cost markets, thereby incurring significant additional design effort and engineering cost. We present a modular processor architecture where 3D can be used to enhance performance within a single unified design and that also provides a more gradual migration path toward fully 3D-integrated designs. To make this work, we describe a generic technique of using “phantom” components where the baseline processor may believe that 3D-stacked resources exist, but are currently unavailable. Simply using 3D to stack more L2 cache provides a 15.1% average performance benefit, but our proposal increases performance by 25.4%.
The Case for Holistic Query Evaluation
We present the case for holistic query evaluation: a novel model for query processing by generating specific, highly optimized code at runtime, that is then compiled and executed. The model is based on the notion of embedding various query operations in a single nested-loops construct. This leads to a hardware-friendly query representation that can be executed with minimal instruction and data cache accesses and cache miss rates, resulting in greatly improved response times. We have implemented a prototype system adhering to these principles and have conducted a detailed experimental study. We compared our approach to existing database technology, using both response time and hardware performance events as metrics. The results demonstrate a clear performance advantage for our system, exhibiting the potential of adopting the proposed holistic model as the kernel of a database query engine.
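The idea of generating query-specific code that fuses several operations into one nested-loops construct can be sketched as follows. This is only an illustration, not the prototype's code generator: it uses Python's built-in `compile` and `exec` as a stand-in for a real compiler, and the function `generate_query`, the table layout, and the example query are assumptions.

```python
def generate_query(filter_expr, join_left, join_right, agg_expr):
    """Emit and compile source for a query of the form
    SELECT SUM(agg) FROM r, s WHERE filter AND join_left = join_right.
    Selection, join, and aggregation share one nested-loops construct."""
    src = (
        "def query(r_rows, s_rows):\n"
        "    total = 0\n"
        "    for r in r_rows:\n"
        f"        if not ({filter_expr}):\n"
        "            continue\n"
        "        for s in s_rows:\n"
        f"            if {join_left} == {join_right}:\n"
        f"                total += {agg_expr}\n"
        "    return total\n"
    )
    namespace = {}
    # Compile the generated source at runtime and pull out the function.
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["query"], src


# Hypothetical tables, represented as lists of dicts.
orders = [
    {"id": 1, "cust": 10, "amount": 5},
    {"id": 2, "cust": 11, "amount": 50},
    {"id": 3, "cust": 10, "amount": 70},
]
customers = [{"cid": 10, "region": "EU"}, {"cid": 11, "region": "US"}]

query, source = generate_query(
    "r['amount'] > 20", "r['cust']", "s['cid']", "r['amount']"
)
print(source)
print(query(orders, customers))
```

The generated function contains no generic operator calls or per-tuple interpretation, only the loops and predicates this particular query needs, which is the property the abstract associates with fewer instruction and data cache accesses.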
Deconstructing the Inefficacy of Global Cache Replacement Policies
In a conventional two-level cache hierarchy, L1 cache hits do not propagate to the L2 cache; as a result, the L2 cache only observes a “filtered” memory access stream. A frequently accessed address may hit in the L1, but since these accesses never make it to the L2, the corresponding copy in the L2 will “decay” with respect to its replacement policy state and may eventually get evicted. Previous studies have advocated the use of global replacement policies where the L1 access information propagates to the L2 to maintain a replacement policy state that is consistent with the overall global memory access stream. We first attempt to duplicate previously reported results on global cache replacement policies. Despite the intuitive explanation for why a global scheme should work, our experimental results show that the performance potential of global replacement is very limited. We deconstruct the problem with reuse-distance analysis and show that only under very specific reuse-distance profiles will a program be able to benefit from global replacement. Our experiments include the evaluation of multi-core shared caches, inclusive cache hierarchies, and a wide spectrum of cache sizes and associativities; we show that global replacement fails to provide significant performance benefits for any of these scenarios.
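Two notions in the abstract above, the filtered access stream seen by the L2 and reuse distance, can be made concrete with a minimal sketch. It assumes a fully associative LRU L1 and defines reuse distance as the number of distinct other addresses touched between two accesses to the same address; the function names and the toy traces are illustrative and are not taken from the paper.

```python
def filter_through_lru(trace, capacity):
    """Return the misses of a fully associative LRU cache, i.e. the
    access stream that the next cache level gets to observe."""
    stack, misses = [], []            # most recently used address is last
    for addr in trace:
        if addr in stack:
            stack.remove(addr)        # hit: never reaches the next level
        else:
            misses.append(addr)
            if len(stack) == capacity:
                stack.pop(0)          # evict the least recently used address
        stack.append(addr)
    return misses


def reuse_distances(trace):
    """For each access, count the distinct other addresses touched since the
    previous access to the same address (None on first use)."""
    stack, out = [], []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            out.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            out.append(None)
        stack.append(addr)
    return out


trace = list("abacabad")
print(filter_through_lru(trace, capacity=2))   # what the L2 observes
print(reuse_distances(list("abcabb")))
```

In the toy trace, address `a` is accessed four times but reaches the second level only once, which is the decay effect that global replacement policies try to counter.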
Improving Memory Bank-Level Parallelism in the Presence of Prefetching
DRAM systems achieve high performance when all DRAM banks are busy servicing useful memory requests. The degree to which DRAM banks are busy is called DRAM Bank-Level Parallelism (BLP). This paper proposes two new cost-effective mechanisms to maximize DRAM BLP. BLP-Aware Prefetch Issue (BAPI) issues prefetches into the on-chip Miss Status Holding Registers (MSHRs) associated with each core in a multi-core system such that the requests can be serviced in parallel in different DRAM banks. BLP-Preserving Multi-core Request Issue (BPMRI) does the actual loading of the DRAM controller’s request buffers so that requests from the same core can be serviced in parallel, minimizing the serialization of each core’s concurrent requests. When combined, BAPI and BPMRI improve system performance by 11.7% on a 4-core CMP system for a wide variety of multiprogrammed workloads. BAPI and BPMRI also complement various existing DRAM scheduling and prefetching algorithms, and can be used in conjunction with them.

