Results 1 - 10 of 226
Cache-Conscious Structure Layout, 1999
"... Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary appro ..."
Abstract
-
Cited by 207 (9 self)
- Add to MetaCart
(Show Context)
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and, consequently, their performance. It explores two placement techniques, clustering and coloring, that improve cache performance by increasing a pointer structure's spatial and temporal locality and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies, cache-conscious reorganization and cache-conscious allocation, and describes two semi-automatic tools, ccmorph and ccmalloc, that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits, in most cases significantly outperforming state-of-the-art prefetching.
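ccmalloc's co-location idea can be sketched in a few lines of C. The block size, arena layout, and fallback policy below are assumptions for illustration, not the paper's implementation; the typical call pattern passes a just-allocated neighbor as the hint, so the bump pointer is usually still in or near that object's cache block:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical sketch of a ccmalloc-style allocator: place a new
     * object near an object it will be accessed with (the "hint").
     * Block size and arena policy are assumptions, not the paper's. */
    #define CACHE_BLOCK 64                /* assumed cache block size */
    #define ARENA_SIZE  (1 << 20)

    static char arena[ARENA_SIZE] __attribute__((aligned(CACHE_BLOCK)));
    static size_t arena_top = 0;

    void *cc_malloc(size_t size, const void *hint)
    {
        size_t pos = arena_top;
        /* If the hint was allocated recently, the bump pointer is still
         * in its cache block and the new object lands beside it.  If
         * not, start on a fresh block boundary so the object is not
         * split across a block holding unrelated data. */
        if (hint) {
            uintptr_t hint_block = (uintptr_t)hint / CACHE_BLOCK;
            uintptr_t pos_block  = (uintptr_t)(arena + pos) / CACHE_BLOCK;
            if (hint_block != pos_block)
                pos = (pos + CACHE_BLOCK - 1) & ~((size_t)CACHE_BLOCK - 1);
        }
        if (pos + size > ARENA_SIZE)
            return malloc(size);          /* arena full: plain malloc */
        void *p = arena + pos;
        arena_top = pos + size;
        return p;
    }

A tree builder would then allocate each child with its parent as the hint, e.g. child = cc_malloc(sizeof *child, parent), so that parent and child tend to share a cache block.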
Active Pages: A Computation Model for Intelligent Memory
In International Symposium on Computer Architecture, 1998
"... Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can ope ..."
Abstract
-
Cited by 125 (7 self)
- Add to MetaCart
Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator [BA97] demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies.
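The model is easiest to see as a page of data bound to co-located functions. The following is a plain-C software analogy of one Active Page, with an invented filter operation; in RADram the function would run in reconfigurable logic beside the DRAM rather than on the CPU:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_WORDS 1024

    /* Hypothetical software model of an Active Page: a page of data
     * plus a dispatch slot for an operation the memory system could
     * apply in place, without moving the page across the memory bus. */
    typedef struct active_page {
        uint32_t data[PAGE_WORDS];
        uint32_t (*count_matches)(struct active_page *, uint32_t key);
    } active_page;

    static uint32_t count_matches(active_page *p, uint32_t key)
    {
        uint32_t n = 0;
        for (int i = 0; i < PAGE_WORDS; i++)
            n += (p->data[i] == key);   /* scan stays in "memory" */
        return n;
    }

    int main(void)
    {
        static active_page pg = { .count_matches = count_matches };
        for (int i = 0; i < PAGE_WORDS; i++)
            pg.data[i] = (uint32_t)(i % 7);
        /* The processor issues one request; the page-local function
         * filters the whole page and returns only the result. */
        printf("%u words match\n", pg.count_matches(&pg, 3));
        return 0;
    }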
A Performance Comparison of Contemporary DRAM Architectures
In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999
"... In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organiz ..."
Abstract
-
Cited by 116 (11 self)
- Add to MetaCart
(Show Context)
In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM ...
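Studies like this one rest on timing models in which a bank's latency depends on whether an access hits the currently open row. A toy open-page model, with made-up cycle counts rather than any figures from the paper:

    #include <stdio.h>

    /* Toy open-page DRAM latency model: a bank pays only the column
     * access time on a row-buffer hit, and precharge + activate +
     * column access on a miss.  Cycle counts are illustrative. */
    enum { T_RP = 3, T_RCD = 3, T_CAS = 3 };

    typedef struct { int open_row; } bank;

    static int access_latency(bank *b, int row)
    {
        if (b->open_row == row)
            return T_CAS;                 /* row-buffer hit */
        b->open_row = row;
        return T_RP + T_RCD + T_CAS;      /* row-buffer miss */
    }

    int main(void)
    {
        bank b = { .open_row = -1 };
        printf("row 0, first access: %d cycles (miss)\n",
               access_latency(&b, 0));
        printf("row 0, second access: %d cycles (hit)\n",
               access_latency(&b, 0));
        printf("row 1: %d cycles (miss)\n", access_latency(&b, 1));
        return 0;
    }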
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design
In Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA-7), 2001
"... In this papel; we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half o ..."
Abstract
-
Cited by 109 (7 self)
- Add to MetaCart
(Show Context)
In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses. Large cache blocks can improve performance, but only when coupled with wide memory channels. DRAM address mappings also affect performance significantly. We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks, without degrading performance on the others. With eight Rambus channels, these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.
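The issue policy in the abstract reduces to three checks. A schematic sketch, with hypothetical channel and request structures standing in for the paper's hardware:

    #include <stdbool.h>
    #include <stddef.h>

    /* Schematic of the abstract's policy: prefetches issue only when a
     * Rambus channel is idle, are ordered to favor DRAM row-buffer
     * hits, and are filled with low replacement priority so a wrong
     * guess evicts little of value.  Structures are hypothetical. */
    typedef struct {
        bool     idle;       /* no demand request in flight */
        unsigned open_row;   /* currently open DRAM row */
    } channel;

    typedef struct {
        unsigned long addr;
        unsigned row;
        bool low_replace_priority;   /* set for all prefetch fills */
    } request;

    /* Pick the pending prefetch to issue on this channel, or NULL. */
    request *pick_prefetch(channel *ch, request *pending, size_t n)
    {
        if (!ch->idle)
            return NULL;             /* never compete with demand misses */
        request *best = NULL;
        for (size_t i = 0; i < n; i++) {
            pending[i].low_replace_priority = true;  /* policy 3 */
            if (pending[i].row == ch->open_row)
                return &pending[i];  /* policy 2: row-buffer hit first */
            if (!best)
                best = &pending[i];
        }
        return best;                 /* else oldest pending prefetch */
    }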
Exploring the Design Space of Future CMPs, 2001
"... In this paper, we study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the pe ..."
Abstract
-
Cited by 107 (13 self)
- Add to MetaCart
(Show Context)
In this paper, we study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible.
Impulse: Building a Smarter Memory Controller
In Proceedings of the Fifth Annual Symposium on High Performance Computer Architecture, 1999
"... Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accesse ..."
Abstract
-
Cited by 99 (21 self)
- Add to MetaCart
(Show Context)
Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper ...
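Remapping is the part worth a sketch: the application walks a dense shadow region while the controller gathers the real, strided data, so every fetched cache line is fully used. A software analogy with an assumed stride descriptor (Impulse performs this translation inside the memory controller):

    #include <stdio.h>
    #include <stddef.h>

    /* Software analogy of Impulse-style remapping: a descriptor tells
     * the "controller" to gather every stride-th element of the real
     * array when the application reads a dense shadow index. */
    typedef struct {
        const double *base;   /* real, strided data */
        size_t        stride; /* elements between touched items */
    } remap_desc;

    /* What the controller would do for shadow index i. */
    static double shadow_read(const remap_desc *d, size_t i)
    {
        return d->base[i * d->stride];
    }

    int main(void)
    {
        double m[64];
        for (int i = 0; i < 64; i++)
            m[i] = i;
        /* Gather column 0 of an 8x8 row-major matrix as if dense. */
        remap_desc d = { .base = m, .stride = 8 };
        double sum = 0;
        for (size_t i = 0; i < 8; i++)
            sum += shadow_read(&d, i);
        printf("sum = %g\n", sum);
        return 0;
    }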
A Case for Intelligent RAM, 1997
"... This article reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs. ..."
Abstract
-
Cited by 96 (6 self)
- Add to MetaCart
This article reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs.
Fair Queuing Memory Systems
In MICRO, 2006
"... We propose and evaluate a multi-thread memory scheduler that targets high performance CMPs. The proposed memory scheduler is based on concepts originally developed for network fair queuing schedul-ing algorithms. The memory scheduler is fair and pro-vides Quality of Service (QoS) while improving sys ..."
Abstract
-
Cited by 95 (2 self)
- Add to MetaCart
We propose and evaluate a multi-thread memory scheduler that targets high performance CMPs. The proposed memory scheduler is based on concepts originally developed for network fair queuing scheduling algorithms. The memory scheduler is fair and provides Quality of Service (QoS) while improving system performance. On a four processor CMP running workloads containing a mix of applications with a range of memory bandwidth demands, the proposed memory scheduler provides QoS to all of the threads in all of the workloads, improves system performance by an average of 14% (41% in the best case), and reduces the variance in the threads' target memory bandwidth utilization from 0.2 to 0.0058.
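Network fair queuing serves the flow whose head packet has the earliest virtual finish time; the same selection rule carries over to per-thread memory request queues. A minimal sketch, with the service shares and one-number cost model as assumptions (the abstract does not give the paper's exact algorithm):

    /* Minimal virtual-finish-time scheduler in the spirit of network
     * fair queuing, applied to per-thread memory requests.  Shares,
     * the cost model, and tie-breaking are illustrative assumptions. */
    #define NTHREADS 4

    typedef struct {
        double vtime;    /* virtual time this thread has consumed */
        double share;    /* guaranteed fraction of memory bandwidth */
        int    pending;  /* nonzero if a request is waiting */
        double cost;     /* service time of the head request */
    } thread_q;

    /* Serve the thread whose head request has the earliest virtual
     * finish time, vtime + cost/share, and advance its clock. */
    int schedule(thread_q q[NTHREADS])
    {
        int best = -1;
        double best_finish = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            if (!q[t].pending)
                continue;
            double finish = q[t].vtime + q[t].cost / q[t].share;
            if (best < 0 || finish < best_finish) {
                best = t;
                best_finish = finish;
            }
        }
        if (best >= 0)
            q[best].vtime = best_finish;
        return best;    /* -1 means no thread has a pending request */
    }

A thread that momentarily floods the queue only advances its own virtual clock, so competing threads still receive their guaranteed shares; that is the QoS property the paper targets.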
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse
In Proceedings of the International Parallel and Distributed Processing Symposium, 2001
"... While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent o ..."
Abstract
-
Cited by 84 (18 self)
- Add to MetaCart
While CPU speed has improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfers and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy ...
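The two-step strategy, fusing computations that touch the same data and then grouping data touched by the same computation, can be shown on a pair of array loops. A hand-applied example (not compiler output):

    #include <stddef.h>

    /* Step 1 (computation fusion), by hand.  Before: a[] is traversed
     * twice, far apart in time, so the second loop re-fetches every
     * cache line the first loop already brought in. */
    void separate(double *a, const double *b, double *sum, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] *= 2.0;
        for (size_t i = 0; i < n; i++)
            *sum += a[i] + b[i];
    }

    /* After: both uses of a[i] happen while its line is resident,
     * roughly halving memory traffic on a[]. */
    void fused(double *a, const double *b, double *sum, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            a[i] *= 2.0;
            *sum += a[i] + b[i];
        }
    }

    /* Step 2 (data grouping) would then interleave the two arrays so
     * each fetched line holds only data this computation uses: */
    typedef struct { double a, b; } pair;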
Exploiting Spatial Locality in Data Caches Using Spatial Footprints
In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
"... Modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line result in cache hits. Although this approach benefits from spatial locality, less than half of the data brought into the cache ..."
Abstract
-
Cited by 83 (4 self)
- Add to MetaCart
(Show Context)
Modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line result in cache hits. Although this approach benefits from spatial locality, less than half of the data brought into the cache gets used before eviction. The unused portion of the cache line negatively impacts performance by wasting bandwidth and polluting the cache by replacing potentially useful data that would otherwise remain in the cache. This paper describes an alternative approach to exploit spatial locality available in data caches. On a cache miss, our mechanism, called Spatial Footprint Predictor (SFP), predicts which portions of a cache block will get used before getting evicted. The high accuracy of the predictor allows us to exploit spatial locality exhibited in larger blocks of data yielding better miss ratios without significantly impacting the memory access latencies. Our evaluation of this mechanism ...
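The predictor amounts to remembering, per miss context, a bitmask of which words of a block were touched before eviction, and fetching only those words on the next miss. A schematic sketch with an assumed PC-indexed table (sizes and indexing are illustrative, not the paper's):

    #include <stdint.h>

    /* Schematic footprint predictor: on eviction, record which words
     * of the block were touched, keyed by the missing load's PC; on
     * the next miss from that PC, fetch only the predicted words. */
    #define WORDS_PER_BLOCK 8     /* e.g. 64-byte block, 8-byte words */
    #define TABLE_SIZE      1024  /* assumed predictor table size */

    static uint8_t footprint[TABLE_SIZE];  /* one bit per block word */

    static unsigned table_index(uintptr_t miss_pc)
    {
        return (unsigned)((miss_pc >> 2) % TABLE_SIZE);
    }

    /* While the block is resident: mark each word actually touched. */
    void record_use(uint8_t *live_mask, unsigned word)
    {
        *live_mask |= (uint8_t)(1u << word);
    }

    /* On eviction: train the table with the observed footprint. */
    void train(uintptr_t miss_pc, uint8_t live_mask)
    {
        footprint[table_index(miss_pc)] = live_mask;
    }

    /* On a miss: which words should the fill fetch?  An untrained
     * entry falls back to fetching the whole block. */
    uint8_t predict(uintptr_t miss_pc)
    {
        uint8_t m = footprint[table_index(miss_pc)];
        return m ? m : 0xFF;
    }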