Results 1 - 10 of 109
Fair Queuing Memory Systems
- In MICRO, 2006
"... We propose and evaluate a multi-thread memory scheduler that targets high performance CMPs. The proposed memory scheduler is based on concepts originally developed for network fair queuing schedul-ing algorithms. The memory scheduler is fair and pro-vides Quality of Service (QoS) while improving sys ..."
Abstract - Cited by 95 (2 self)
We propose and evaluate a multi-thread memory scheduler that targets high-performance CMPs. The proposed memory scheduler is based on concepts originally developed for network fair queuing scheduling algorithms. The memory scheduler is fair and provides Quality of Service (QoS) while improving system performance. On a four-processor CMP running workloads containing a mix of applications with a range of memory bandwidth demands, the proposed memory scheduler provides QoS to all of the threads in all of the workloads, improves system performance by an average of 14% (41% in the best case), and reduces the variance in the threads' target memory bandwidth utilization from 0.2 to 0.0058.
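As a rough illustration of the fair-queuing concept the abstract refers to, the Python sketch below assigns each thread's requests virtual finish times weighted by a per-thread bandwidth share and services the request with the earliest finish time. The share values, service-time units, and request format are illustrative assumptions, not the paper's actual design.

from collections import deque

class FairQueueMemoryScheduler:
    # Sketch only: per-thread shares and unit service times are assumed values.
    def __init__(self, shares):
        self.shares = shares                            # thread_id -> bandwidth share
        self.queues = {t: deque() for t in shares}
        self.virtual_finish = {t: 0.0 for t in shares}  # per-thread virtual clock

    def enqueue(self, thread_id, service_time, virtual_now):
        # Start tag: later of the thread's last finish tag and global virtual time.
        start = max(self.virtual_finish[thread_id], virtual_now)
        finish = start + service_time / self.shares[thread_id]
        self.virtual_finish[thread_id] = finish
        self.queues[thread_id].append((finish, service_time))

    def schedule(self):
        # Service the queued request with the smallest virtual finish time.
        candidates = [(q[0][0], t) for t, q in self.queues.items() if q]
        if not candidates:
            return None
        _, thread_id = min(candidates)
        return thread_id, self.queues[thread_id].popleft()

# Example: thread 0 gets half the bandwidth, threads 1 and 2 a quarter each.
sched = FairQueueMemoryScheduler({0: 0.5, 1: 0.25, 2: 0.25})
for t in (0, 1, 2):
    sched.enqueue(t, service_time=1.0, virtual_now=0.0)
while (pick := sched.schedule()) is not None:
    print("service thread", pick[0])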
Guided Region Prefetching: A Cooperative Hardware/Software Approach
- In Proceedings of the 30th International Symposium on Computer Architecture, 2003
"... Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but i ..."
Abstract - Cited by 65 (9 self)
Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches far enough in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure-hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic, nearly tripling bandwidth requirements. GRP achieves performance close to SRP but with only an eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache to under 20%.
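To make the hint-gating idea concrete, here is a hedged Python sketch in which a compiler-supplied hint attached to a missing load decides whether the hardware issues no prefetches, a few strided prefetches, or prefetches for the rest of a region. The hint names, region size, and line size are assumptions for illustration and do not reflect the paper's actual encoding.

LINE = 64           # assumed cache-line size in bytes
REGION_LINES = 16   # assumed region size: 16 cache lines

def guided_region_prefetch(miss_addr, hint, stride=LINE):
    """Return the addresses a hint-gated prefetch engine would issue on a miss."""
    base = miss_addr - (miss_addr % LINE)
    if hint == "none":          # compiler expects no reuse: stay quiet
        return []
    if hint == "stride":        # compiler-identified strided access
        return [base + i * stride for i in range(1, 5)]
    if hint == "region":        # dense region access: prefetch the rest of the region
        region_start = miss_addr - (miss_addr % (REGION_LINES * LINE))
        return [region_start + i * LINE
                for i in range(REGION_LINES)
                if region_start + i * LINE != base]
    return []

print(len(guided_region_prefetch(0x1234, "region")))  # 15 remaining lines in the region
print(guided_region_prefetch(0x1234, "none"))         # [] -> bandwidth saved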
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers
- In HPCA-13, 2007
"... High performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others. Also, prefetching can significantly increas ..."
Abstract - Cited by 63 (13 self)
High-performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others. Also, prefetching can significantly increase memory bandwidth requirements. This paper proposes a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce its negative performance and bandwidth impact. Our mechanism estimates prefetcher accuracy, prefetcher timeliness, and prefetcher-caused cache pollution to adjust the aggressiveness of the data prefetcher dynamically. We introduce a new method to track cache pollution caused by the prefetcher at run-time. We also introduce a mechanism that dynamically decides where in the LRU stack to insert the prefetched blocks in the cache, based on the cache pollution caused by the prefetcher. The proposed dynamic mechanism improves average performance by 6.5% on 17 memory-intensive benchmarks in the SPEC CPU2000 suite compared to the best-performing conventional stream-based data prefetcher configuration, while consuming 18.7% less memory bandwidth. Compared to a conventional stream-based data prefetcher configuration that consumes a similar amount of memory bandwidth, feedback directed prefetching provides 13.6% higher performance. Our results show that feedback-directed prefetching eliminates the large negative performance impact incurred on some benchmarks due to prefetching, and that it is applicable to stream-based prefetchers, global-history-buffer-based delta correlation prefetchers, and PC-based stride prefetchers.
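The sketch below illustrates one way such feedback-directed throttling can work: per-interval counters estimate prefetch accuracy and pollution, and a controller steps the prefetcher's (degree, distance) aggressiveness up or down. The thresholds and the five aggressiveness levels are illustrative assumptions rather than the paper's tuned values.

class FeedbackController:
    # Assumed aggressiveness ladder: (prefetch degree, prefetch distance).
    LEVELS = [(1, 4), (1, 8), (2, 16), (4, 32), (4, 64)]

    def __init__(self):
        self.level = 2
        self.sent = self.useful = self.pollution = 0

    def on_prefetch_sent(self):
        self.sent += 1

    def on_prefetch_hit(self):
        self.useful += 1                 # a prefetched block was used by a demand access

    def on_pollution_miss(self):
        self.pollution += 1              # a demand miss caused by a prefetch-induced eviction

    def end_of_interval(self, demand_misses):
        accuracy = self.useful / self.sent if self.sent else 0.0
        pollution = self.pollution / demand_misses if demand_misses else 0.0
        if accuracy > 0.75 and pollution < 0.05:
            self.level = min(self.level + 1, len(self.LEVELS) - 1)   # more aggressive
        elif accuracy < 0.40 or pollution > 0.25:
            self.level = max(self.level - 1, 0)                      # throttle down
        self.sent = self.useful = self.pollution = 0
        return self.LEVELS[self.level]

ctl = FeedbackController()
for _ in range(100):
    ctl.on_prefetch_sent()
for _ in range(20):
    ctl.on_prefetch_hit()
print(ctl.end_of_interval(demand_misses=50))  # low accuracy -> throttled to (1, 8)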
Memory Controller Optimizations for Web Servers
"... This paper analyzes memory access scheduling and virtual channels as mechanisms to reduce the latency of main memory accesses by the CPU and peripherals in web servers. Despite the address filtering effects of the CPU's cache hierarchy, there is significant locality and bank parallelism in the ..."
Abstract - Cited by 62 (2 self)
This paper analyzes memory access scheduling and virtual channels as mechanisms to reduce the latency of main memory accesses by the CPU and peripherals in web servers. Despite the address filtering effects of the CPU's cache hierarchy, there is significant locality and bank parallelism in the DRAM access stream of a web server, which includes traffic from the operating system, application, and peripherals. However, a sequential memory controller leaves much of this locality and parallelism unexploited, as serialization and bank conflicts affect the realizable latency. Aggressive scheduling within the memory controller to exploit the available parallelism and locality can reduce the average read latency of the SDRAM. However, bank conflicts and the limited ability of the SDRAM's internal row buffers to act as a cache hinder further latency reduction. Virtual channel SDRAM overcomes these limitations by providing a set of channel buffers that can hold segments from rows of any internal SDRAM bank. This paper presents memory controller policies that can make effective use of these channel buffers to further reduce the average read latency of the SDRAM.
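As a minimal illustration of in-controller scheduling that exploits row-buffer locality, the sketch below prefers the oldest pending read that hits a currently open row and otherwise falls back to the oldest request overall. The address-to-bank/row mapping is an assumed toy layout, not the configuration studied in the paper.

def bank_of(addr):
    return (addr >> 13) & 0x7    # assumed 8 banks, toy address mapping

def row_of(addr):
    return addr >> 16            # assumed row-index bits

def pick_next(pending, open_rows):
    """pending: list of (arrival_order, addr); open_rows: bank -> currently open row."""
    for order, addr in sorted(pending):           # oldest-first among row-buffer hits
        if open_rows.get(bank_of(addr)) == row_of(addr):
            return (order, addr)
    return min(pending) if pending else None      # no row hit: fall back to oldest request

open_rows = {0: 5, 1: 9}
pending = [(0, (9 << 16) | (1 << 13)),   # row hit in bank 1
           (1, (7 << 16) | (0 << 13))]   # row miss in bank 0
print(pick_next(pending, open_rows))     # the row hit is serviced first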
Using the Compiler to Improve Cache Replacement Decisions
- In Proceedings of the Conf. on Parallel Architectures and Compilation Techniques, 2002
"... Memory performance is increasingly determining microprocessor performance and technology trends are exacerbating this problem. Most architectures use set-associative caches with LRU replacement policies to combine fast access with relatively low miss rates. To improve replacement decisions in set-as ..."
Abstract - Cited by 51 (5 self)
Memory performance increasingly determines microprocessor performance, and technology trends are exacerbating this problem. Most architectures use set-associative caches with LRU replacement policies to combine fast access with relatively low miss rates. To improve replacement decisions in set-associative caches, we develop a new set of compiler algorithms that predict which data will and will not be reused and provide these hints to the architecture. We prove that the hints either match or improve hit rates over LRU. We describe a practical one-bit cache-line tag implementation of our algorithm, called evict-me. On a cache replacement, the architecture will replace a line for which the evict-me bit is set, or, if none is set, it will use the LRU bits. We implement our compiler analysis and its output in the Scale compiler. On a variety of scientific programs, using the evict-me algorithm in both the level 1 and level 2 caches improves simulated cycle times by up to 34% over the LRU policy by increasing hit rates. In addition, a combination of simple hardware prefetching and evict-me works to further improve performance.
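A small sketch of the replacement decision described above, assuming a toy representation of a cache set: the victim is any way whose evict-me bit is set, falling back to LRU otherwise.

class CacheLine:
    def __init__(self, tag, evict_me=False):
        self.tag = tag
        self.evict_me = evict_me      # one-bit compiler hint: no reuse expected
        self.last_use = 0             # timestamp used for the LRU fallback

def choose_victim(cache_set):
    flagged = [line for line in cache_set if line.evict_me]
    if flagged:
        return flagged[0]                              # evict-me hint overrides LRU
    return min(cache_set, key=lambda l: l.last_use)    # otherwise plain LRU

ways = [CacheLine(0xA), CacheLine(0xB, evict_me=True), CacheLine(0xC)]
ways[0].last_use, ways[1].last_use, ways[2].last_use = 3, 2, 1
print(hex(choose_victim(ways).tag))   # 0xb: the hinted line is evicted, not the LRU line 0xc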
Analysis of a Memory Architecture for Fast Packet Buffers
- In Proceedings of IEEE High Performance Switching and Routing, 2001
"... All packet switches contain packet buffers to hold packets ..."
Abstract - Cited by 41 (4 self)
All packet switches contain packet buffers to hold packets
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
- In HPCA-11, 2005
"... Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions over their effectiveness in the new context. In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRA ..."
Abstract - Cited by 38 (3 self)
Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions about their effectiveness in the new context. In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRAM systems in SMT systems and search for new thread-aware DRAM optimization techniques. Our major findings are: (1) in general, increasing the number of threads tends to increase memory concurrency and thus the pressure on DRAM systems, but some exceptions do exist; (2) application performance is sensitive to memory channel organizations, e.g., independent channels may outperform ganged organizations by up to 90%; (3) DRAM latency reduction through improving row buffer hit rates becomes less effective due to increased bank contention; and (4) thread-aware DRAM access scheduling schemes may improve performance by up to 30% on workload mixes of memory-intensive applications. In short, the use of SMT techniques has somewhat changed the context of DRAM optimizations but does not make them obsolete.
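The abstract does not detail its thread-aware scheduling schemes; as one hedged illustration of the general idea, the picker below favors the thread with the fewest in-flight DRAM requests, breaking ties by arrival order. This is a generic thread-aware heuristic, not necessarily the scheme evaluated in the paper.

from collections import Counter

def thread_aware_pick(pending, in_flight):
    """pending: list of (arrival, thread_id, addr); in_flight: Counter of requests per thread."""
    if not pending:
        return None
    # Prefer the thread with the least outstanding DRAM work, then the oldest request.
    return min(pending, key=lambda r: (in_flight[r[1]], r[0]))

pending = [(0, "A", 0x100), (1, "B", 0x200)]
print(thread_aware_pick(pending, Counter({"A": 8, "B": 1})))  # picks thread B's request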
Mini-rank: Adaptive DRAM architecture for improving memory power efficiency
- In Proceedings of the 40th International Symposium on Microarchitecture, 2008
"... The widespread use of multicore processors has dramatically increased the demand on high memory bandwidth and large memory capacity. As DRAM subsystem designs stretch to meet the demand, memory power consumption is now approaching that of processors. However, the conventional DRAM architecture preve ..."
Abstract - Cited by 33 (1 self)
The widespread use of multicore processors has dramatically increased the demand for high memory bandwidth and large memory capacity. As DRAM subsystem designs stretch to meet this demand, memory power consumption is now approaching that of processors. However, the conventional DRAM architecture prevents any meaningful power and performance trade-offs for memory-intensive workloads. We propose a novel idea called mini-rank for DDRx (DDR/DDR2/DDR3) DRAMs, which uses a small bridge chip on each DRAM DIMM to break a conventional DRAM rank into multiple smaller mini-ranks, so as to reduce the number of devices involved in a single memory access. The design dramatically reduces memory power consumption with only a slight increase in memory idle latency. It does not change the DDRx bus protocol, and its configuration can be adapted for the best performance-power trade-offs. Our experimental results using four-core multiprogramming workloads show that, on average for memory-intensive workloads, using x32 mini-ranks reduces memory power by 27.0% with a 2.8% performance penalty, and using x16 mini-ranks reduces memory power by 44.1% with a 7.4% performance penalty.
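A back-of-the-envelope sketch of the mini-rank trade-off: a narrower mini-rank drives fewer DRAM devices per access but needs more bursts to move the same cache line. The device width, line size, and burst length below are assumed values for illustration, not figures from the paper.

LINE_BITS   = 64 * 8      # 64-byte cache line
DEVICE_BITS = 8           # assumed x8 DRAM devices
BURST_LEN   = 8           # assumed DDRx burst length (transfers per burst)

def devices_and_bursts(rank_width_bits):
    devices = rank_width_bits // DEVICE_BITS               # devices driven per access
    bursts  = LINE_BITS // (rank_width_bits * BURST_LEN)   # bursts needed to move one line
    return devices, bursts

for width in (64, 32, 16):   # full rank, x32 mini-rank, x16 mini-rank
    d, b = devices_and_bursts(width)
    print(f"{width}-bit rank: {d} devices active, {b} burst(s) per 64B line")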
Dynamic cluster resource allocations for jobs with known and unknown memory demands
- IEEE Trans. on Parallel and Distributed Systems, 2002
"... AbstractÐThe cluster system we consider for load sharingis a compute farm which is a pool of networked server nodes providing high-performance computing for CPU-intensive, memory-intensive, and I/O active jobs in a batch mode. Existing resource management systems mainly target at balancing the usage ..."
Abstract - Cited by 25 (6 self)
The cluster system we consider for load sharing is a compute farm: a pool of networked server nodes providing high-performance computing for CPU-intensive, memory-intensive, and I/O-active jobs in a batch mode. Existing resource management systems mainly target balancing CPU load among server nodes. With the rapid advancement of CPU chips, memory and disk access speed improvements significantly lag behind the advancement of CPU speed, increasing the penalty for data movement, such as page faults and I/O operations, relative to normal CPU operations. Aiming to reduce the memory resource contention caused by page faults and I/O activities, we have developed and examined load sharing policies that consider effective usage of global memory in addition to CPU load balancing in clusters. We study two types of application workloads: (1) memory demands are known in advance or are predictable, and (2) memory demands are unknown and change dynamically during execution. Besides using workload traces with known memory demands, we have also instrumented the kernel to collect different types of workload execution traces to capture dynamic memory access patterns. Through different groups of trace-driven simulations, we show that our proposed policies can effectively improve overall job execution performance by utilizing both CPU and memory resources well, with known and unknown memory demands. Index Terms: Cluster computing, distributed systems, load sharing, memory-intensive workloads, trace-driven simulations.
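As a hedged illustration of memory-aware load sharing, the sketch below places a job on the least CPU-loaded node among those with enough free memory to avoid paging, falling back to the least-loaded node overall. The node model and selection rule are illustrative assumptions, not the paper's exact policies.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_load: float      # e.g., run-queue length
    free_mem_mb: int

def place_job(nodes, job_mem_mb):
    # Prefer nodes where the job fits entirely in free memory (no page faults).
    fits = [n for n in nodes if n.free_mem_mb >= job_mem_mb]
    pool = fits if fits else nodes
    return min(pool, key=lambda n: n.cpu_load)

nodes = [Node("n1", cpu_load=0.5, free_mem_mb=256),
         Node("n2", cpu_load=2.0, free_mem_mb=4096)]
print(place_job(nodes, job_mem_mb=1024).name)   # "n2": avoids paging despite higher CPU load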
Designing Packet Buffers for Router Linecards
, 2002
"... All routers contain buffers to hold packets during times of congestion. When designing a high-capacity router (or linecard) it is challenging to design buffers because of the buffer's speed and size, both of which grow linearly with line-rate, R. With today's DRAM technology, it is bar ..."
Abstract - Cited by 25 (1 self)
All routers contain buffers to hold packets during times of congestion. When designing a high-capacity router (or linecard), it is challenging to design buffers because of the buffer's speed and size, both of which grow linearly with line rate, R. With today's DRAM technology, it is barely possible to design buffers for a 40Gb/s linecard in which packets are written to (read from) memory at the rate at which they arrive (depart). Over time, the problem will get harder: link rates will increase, line cards will connect to more lines, and buffers will get larger. Ideally, we would like a memory with the density of DRAM and the speed of SRAM. So some commercial routers use hybrid packet buffers built from a combination of small, fast SRAM and large, slow DRAM. The SRAM holds ("caches") the heads and tails of packet FIFOs, allowing arriving packets to be written quickly to the tail and departing packets to be read quickly from the head. The large DRAMs are used for bulk storage, to hold the majority of packets in each FIFO that are neither at the head nor the tail. Because of the relatively long time to write to (or read from) the DRAMs, data is transferred between SRAM and DRAM in large fixed-size blocks, consisting of perhaps many packets at a time. A memory manager shuttles packets between the SRAM cache and the DRAM with two goals: (1) arriving packets are written to DRAM before the SRAM overflows, and (2) departing packets are guaranteed to be in the SRAM when it is their turn to leave. In this paper we find optimal memory managers that achieve both goals while minimizing the size of the SRAM cache. When the delay through the buffer is minimized, the size of the SRAM cache is proportional to Q ln Q, where Q is the number of FIFOs t...
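To illustrate the tail side of such a hybrid buffer, the sketch below accumulates arriving bytes in per-FIFO SRAM tails and flushes a full block of b bytes to DRAM with one bulk write. This is a simple illustrative manager under assumed sizes, not the optimal manager derived in the paper.

class HybridTailBuffer:
    # Sketch only: per-FIFO byte counters stand in for the real SRAM tail cache.
    def __init__(self, num_fifos, block_bytes):
        self.b = block_bytes
        self.tails = [0] * num_fifos          # bytes currently cached in SRAM per FIFO
        self.dram_blocks = [0] * num_fifos    # blocks already moved to DRAM per FIFO

    def arrive(self, fifo, nbytes):
        self.tails[fifo] += nbytes
        while self.tails[fifo] >= self.b:     # a full block is ready: one bulk DRAM write
            self.tails[fifo] -= self.b
            self.dram_blocks[fifo] += 1

    def sram_bytes(self):
        return sum(self.tails)

buf = HybridTailBuffer(num_fifos=4, block_bytes=512)
for _ in range(10):
    buf.arrive(0, 64)                          # ten 64B packets into FIFO 0
print(buf.dram_blocks[0], buf.sram_bytes())    # 1 block written to DRAM, 128 B left in SRAM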