Results 1 - 10 of 226
Cooperative caching for chip multiprocessors
- In Proceedings of the 33rd Annual International Symposium on Computer Architecture
, 2006
"... Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these ch ..."
Abstract - Cited by 145 (1 self)
Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these challenges, we propose CMP cooperative caching (CC), a unified framework to efficiently organize and manage on-chip cache resources. By forming a globally managed, shared cache using cooperative private caches, CC can effectively support two important caching applications: (1) reduction of average memory access latency and (2) isolation of destructive inter-thread interference. CC reduces the average memory access latency by balancing between cache latency and capacity optimizations. Based on private caches, CC naturally exploits their access latency benefits. To improve the effective cache capacity, CC forms a “shared” cache using replication control and LRU-based global replacement policies. Via cooperation throttling, CC provides a spectrum of caching behaviors between the two extremes of private and shared caches, thus enabling dynamic adaptation to suit workload requirements. We show that CC can achieve a robust performance advantage over private and shared cache schemes across different processor, cache and memory configurations, and a wide selection of multithreaded and multiprogrammed workloads.
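The abstract describes cooperative caching only at a high level; the sketch below is a rough, hedged illustration of the cooperation-throttling idea (an evicted block is spilled into a peer's private cache with some probability, so the aggregate behaves anywhere between isolated private caches and one shared cache). The class names, the random choice of spill target, and the spill_prob parameter are illustrative assumptions, not details taken from the paper.

```python
import random

class PrivateCache:
    """Toy private cache slice: a fixed-capacity list of block addresses in LRU order."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []                      # least recently used block first

    def access(self, addr):
        if addr in self.blocks:               # local hit: move to MRU position
            self.blocks.remove(addr)
            self.blocks.append(addr)
            return True
        return False

    def insert(self, addr):
        victim = None
        if len(self.blocks) >= self.capacity:
            victim = self.blocks.pop(0)       # evict the local LRU block
        self.blocks.append(addr)
        return victim

def cooperative_access(caches, core, addr, spill_prob=0.5):
    """Look up locally, then in peers, then 'memory'. spill_prob throttles cooperation:
    1.0 keeps evicted blocks on chip (shared-like), 0.0 drops them (private-like)."""
    local = caches[core]
    if local.access(addr):
        return "local hit"
    for peer in caches:
        if peer is not local and addr in peer.blocks:
            local.insert(addr)                # replicate into the local cache
            return "remote hit"
    victim = local.insert(addr)               # miss: block comes from off-chip memory
    peers = [c for c in caches if c is not local]
    if victim is not None and peers and random.random() < spill_prob:
        random.choice(peers).insert(victim)   # cooperation: spill the victim to a peer
    return "miss"
```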
Adaptive insertion policies for high performance caching
- In Proceedings of the 34th International Symposium on Computer Architecture
, 2007
"... The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulti ..."
Abstract - Cited by 114 (5 self)
The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits. We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP) which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in a close to optimal hit rate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
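A minimal sketch of the insertion policies described above, modeling a single cache set as a recency-ordered list: LIP inserts incoming lines at the LRU position, and BIP occasionally inserts at MRU. The class name and the bip_epsilon value are illustrative assumptions; DIP's mechanism for choosing between LRU and BIP is omitted.

```python
import random

class CacheSet:
    """One set of a set-associative cache kept in recency order:
    index 0 is the LRU position, the last index is the MRU position."""
    def __init__(self, ways, policy="lru", bip_epsilon=1 / 32):
        self.ways = ways
        self.lines = []
        self.policy = policy                  # "lru", "lip", or "bip"
        self.bip_epsilon = bip_epsilon        # fraction of misses inserted at MRU under BIP

    def access(self, tag):
        if tag in self.lines:                 # hit: promote to MRU as usual
            self.lines.remove(tag)
            self.lines.append(tag)
            return True
        if len(self.lines) >= self.ways:      # miss in a full set: evict the LRU line
            self.lines.pop(0)
        if self.policy == "lru":
            self.lines.append(tag)            # traditional MRU insertion
        elif self.policy == "lip":
            self.lines.insert(0, tag)         # LIP: place the incoming line at LRU
        else:                                 # BIP: mostly LIP, occasionally MRU
            if random.random() < self.bip_epsilon:
                self.lines.append(tag)
            else:
                self.lines.insert(0, tag)
        return False
```

Under a cyclic reference pattern larger than the set, LRU misses on every access, while LIP retains part of the working set; DIP would simply run whichever of LRU or BIP is currently incurring fewer misses.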
BPLRU: A Buffer Management Scheme for Improving Random Writes in Flash Storage
"... Flash memory has become the most important storage media in mobile devices, and is beginning to replace hard disks in desktop systems. However, its relatively poor random write performance may cause problems in the desktop environment, which has much more complicated requirements than mobile devices ..."
Abstract - Cited by 112 (2 self)
Flash memory has become the most important storage medium in mobile devices, and is beginning to replace hard disks in desktop systems. However, its relatively poor random write performance may cause problems in the desktop environment, which has much more complicated requirements than mobile devices. While a RAM buffer has been quite successful in hard disks at masking the low efficiency of random writes, managing such a buffer to fully exploit the characteristics of flash storage has still not been resolved. In this paper, we propose a new write buffer management scheme called Block Padding Least Recently Used (BPLRU), which significantly improves the random write performance of flash storage. We evaluate the scheme using trace-driven simulations and experiments with a prototype implementation. It shows about 44% improved performance for the workload of an MS Office 2003 installation.
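The abstract does not spell out the scheme, so the following is only a rough sketch of what the name suggests: the write buffer is kept in LRU order at erase-block granularity, and an evicted block is padded with the pages it is missing so the whole block can be flushed at once. The block size, class name, and flush interface are assumptions made for illustration.

```python
from collections import OrderedDict

PAGES_PER_BLOCK = 4                            # illustrative; real erase blocks hold far more pages

class BlockLRUWriteBuffer:
    """Toy RAM write buffer managed at erase-block granularity."""
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.blocks = OrderedDict()            # block number -> buffered page offsets (LRU block first)
        self.buffered = 0

    def write(self, page_no):
        blk, off = divmod(page_no, PAGES_PER_BLOCK)
        offsets = self.blocks.setdefault(blk, set())
        if off not in offsets:
            offsets.add(off)
            self.buffered += 1
        self.blocks.move_to_end(blk)           # touching any page makes the whole block MRU
        while self.buffered > self.capacity:
            self.flush_lru_block()

    def flush_lru_block(self):
        blk, offsets = self.blocks.popitem(last=False)       # victim is the LRU block
        self.buffered -= len(offsets)
        padding = set(range(PAGES_PER_BLOCK)) - offsets      # pages read back ("padded") from flash
        return blk, sorted(offsets), sorted(padding)         # flushed together as one sequential block write
```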
Reducing Energy Consumption of Disk Storage Using Power-Aware Cache Management
- In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), February
, 2004
"... User News) [32], today's data centers have power require-ments that range from 75 W/ft ..."
Abstract - Cited by 103 (7 self)
... User News) [32], today's data centers have power requirements that range from 75 W/ft ...
CAR: Clock with Adaptive Replacement
- In Proceedings of the USENIX Conference on File and Storage Technologies (FAST)
, 2004
"... CLOCK is a classical cache replacement policy dating back to 1968 that was proposed as a low-complexity approximation to LRU. On every cache hit, the policy LRU needs to move the accessed item to the most recently used position, at which point, to ensure consistency and correctness, it serializes c ..."
Abstract - Cited by 62 (0 self)
CLOCK is a classical cache replacement policy dating back to 1968 that was proposed as a low-complexity approximation to LRU. On every cache hit, the policy LRU needs to move the accessed item to the most recently used position, at which point, to ensure consistency and correctness, it serializes cache hits behind a single global lock. CLOCK eliminates this lock contention, and, hence, can support high concurrency and high throughput environments such as virtual memory (for example, Multics, UNIX, BSD, AIX) and databases (for example, DB2). Unfortunately, CLOCK is still plagued by disadvantages of LRU such as disregard for "frequency", susceptibility to scans, and low performance.
As our main contribution, we propose a simple and elegant new algorithm, namely, CLOCK with Adaptive Replacement (CAR), that has several advantages over CLOCK: (i) it is scan-resistant; (ii) it is self-tuning and it adaptively and dynamically captures the "recency" and "frequency" features of a workload; (iii) it uses essentially the same primitives as CLOCK, and, hence, is low-complexity and amenable to a high-concurrency implementation; and (iv) it outperforms CLOCK across a wide range of cache sizes and workloads. The algorithm CAR is inspired by the Adaptive Replacement Cache (ARC) algorithm, and inherits virtually all advantages of ARC including its high performance, but does not serialize cache hits behind a single global lock. As our second contribution, we introduce another novel algorithm, namely, CAR with Temporal filtering (CART), that has all the advantages of CAR, but, in addition, uses a certain temporal filter to distill pages with long-term utility from those with only short-term utility.
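For reference, here is a minimal sketch of the baseline CLOCK mechanism the abstract contrasts with LRU: a hit only sets a reference bit, so no list reordering (and hence no global lock) is needed, and the clock hand gives referenced pages a second chance. This models plain CLOCK, not CAR's adaptive lists; all names are illustrative.

```python
class Clock:
    """Toy CLOCK cache: pages live in a circular buffer with a reference bit each."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []                                 # list of [page, ref_bit]
        self.index = {}                                 # page -> slot position
        self.hand = 0

    def access(self, page):
        if page in self.index:
            self.slots[self.index[page]][1] = 1         # hit: just set the reference bit
            return True
        if len(self.slots) < self.capacity:             # cold miss while the cache fills up
            self.index[page] = len(self.slots)
            self.slots.append([page, 0])
            return False
        while self.slots[self.hand][1] == 1:            # second chance: clear bit and advance the hand
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        victim, _ = self.slots[self.hand]               # first unreferenced page under the hand
        del self.index[victim]
        self.slots[self.hand] = [page, 0]
        self.index[page] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False
```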
QPipe: A Simultaneously Pipelined Relational Query Engine
- In Proceedings of the ACM SIGMOD International Conference on Management of Data
, 2005
"... Relational DBMS typically execute concurrent queries indepen-dently by invoking a set of operator instances for each query. To exploit common data retrievals and computation in concurrent queries, researchers have proposed a wealth of techniques, ranging from buffering disk pages to constructing mat ..."
Abstract - Cited by 61 (15 self)
Relational DBMSs typically execute concurrent queries independently by invoking a set of operator instances for each query. To exploit common data retrievals and computation in concurrent queries, researchers have proposed a wealth of techniques, ranging from buffering disk pages to constructing materialized views and optimizing multiple queries. The ideas proposed, however, are inherently limited by the query-centric philosophy of modern engine designs. Ideally, the query engine should proactively coordinate same-operator execution among concurrent queries, thereby exploiting common accesses to memory and disks as well as common intermediate result computation. This paper introduces on-demand simultaneous pipelining (OSP), a novel query evaluation paradigm for maximizing data and work sharing across concurrent queries at execution time. OSP enables proactive, dynamic operator sharing by pipelining the operator’s output simultaneously to multiple parent nodes. This paper also introduces QPipe, a new operator-centric relational engine that effortlessly supports OSP. Each relational operator is encapsulated in a micro-engine serving query tasks from a queue, naturally exploiting all data and work sharing opportunities. Evaluation of QPipe built on top of BerkeleyDB shows that QPipe achieves a 2x speedup over a commercial DBMS when running a workload consisting of TPC-H queries.
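As a loose illustration of the operator-centric idea (each relational operator runs as a micro-engine serving queued tasks and sharing work among them), the toy sketch below batches concurrent table-scan requests and pipelines one shared scan to all of their consumers. The batching heuristic and every name here are assumptions for illustration, not QPipe's actual design.

```python
import queue
import threading
from collections import defaultdict

class ScanMicroEngine:
    """Toy table-scan micro-engine: one worker drains a request queue and lets
    requests for the same table share a single pass over the data."""
    def __init__(self, tables):
        self.tables = tables                          # table name -> list of rows
        self.requests = queue.Queue()                 # (table name, per-query callback)
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, table, callback):
        self.requests.put((table, callback))

    def _run(self):
        while True:
            table, cb = self.requests.get()           # wait for at least one request
            batch = defaultdict(list, {table: [cb]})
            while not self.requests.empty():          # opportunistically batch waiting requests
                t, c = self.requests.get()
                batch[t].append(c)
            for name, callbacks in batch.items():
                for row in self.tables[name]:         # one shared scan per table
                    for cb in callbacks:              # pipeline the output to every consumer
                        cb(row)
```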
Performance of Compressed Inverted List Caching in Search Engines
- In WWW
, 2008
"... Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy th ..."
Abstract - Cited by 58 (12 self)
Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination. We focus on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. We perform a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that have not been studied before. We then evaluate different inverted list caching policies on large query traces, and finally study the possible performance benefits of combining compression and caching. The overall goal of this paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.
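As a concrete example of one widely used inverted-list compression technique, the sketch below variable-byte encodes the gaps between sorted docIDs. Variable-byte coding is only one of the several algorithms the paper compares, and this code is an illustrative sketch rather than the authors' implementation.

```python
def vbyte_encode(doc_ids):
    """Encode a sorted postings list as docID gaps, 7 payload bits per byte,
    with the high bit set on the final byte of each gap."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        gap, chunk = doc_id - prev, []
        prev = doc_id
        while True:
            chunk.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        chunk[-1] += 128                      # mark the end of this gap
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    doc_ids, n, prev = [], 0, 0
    for b in data:
        if b < 128:
            n = n * 128 + b                   # continuation byte
        else:
            n = n * 128 + (b - 128)           # terminating byte: gap is complete
            prev += n                         # convert gaps back to absolute docIDs
            doc_ids.append(prev)
            n = 0
    return doc_ids

assert vbyte_decode(vbyte_encode([3, 7, 150, 1000])) == [3, 7, 150, 1000]
```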
Cflru: a replacement algorithm for flash memory
- In CASES ’06: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems
, 2006
"... In most operating systems which are customized for disk-based storage system, the replacement algorithm concerns only to maximize the number of memory hits. However, flash memory has different read and write cost in aspects of time and energy so the replacement algorithm with flash memory should con ..."
Abstract - Cited by 56 (5 self)
In most operating systems that are customized for disk-based storage systems, the replacement algorithm is concerned only with maximizing the number of memory hits. However, flash memory has different read and write costs, in terms of both time and energy, so a replacement algorithm for flash memory should consider not only the hit count but also the replacement cost incurred by selecting dirty victims. The replacement cost of a dirty page is higher than that of a clean page with regard to both access time and energy consumption. In this paper, we propose the Clean-First LRU (CFLRU) replacement algorithm, which also takes the replacement cost of flash memory into consideration. CFLRU splits the LRU list into a working region and a clean-first region, and adopts a policy that preferentially evicts clean pages in the clean-first region as long as the number of page hits in the working region is preserved at a suitable level. Using trace-driven simulation, the proposed algorithm reduces the average replacement cost by 28.4% in the swap system and by 26.2% in the buffer cache, in comparison with the LRU algorithm. We also implement the CFLRU algorithm in the Linux kernel and present some optimization issues.
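A minimal sketch of the eviction rule described above, treating the clean-first region as a fixed-size window at the LRU end of the list and preferring clean victims inside it. The window parameter and the omitted write-back of dirty victims are simplifying assumptions.

```python
from collections import OrderedDict

class CFLRUCache:
    """Toy CFLRU buffer cache: an LRU list whose least-recently-used tail
    (the clean-first region, `window` pages long) is searched for a clean
    victim before a dirty page is evicted."""
    def __init__(self, capacity, window):
        self.capacity = capacity
        self.window = window                         # size of the clean-first region
        self.pages = OrderedDict()                   # page -> dirty flag, LRU entry first

    def access(self, page, is_write=False):
        hit = page in self.pages
        dirty = is_write or (hit and self.pages[page])
        if hit:
            del self.pages[page]                     # will be reinserted at the MRU end
        elif len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page] = dirty
        return hit

    def _evict(self):
        for page, dirty in list(self.pages.items())[: self.window]:
            if not dirty:                            # prefer a clean page in the clean-first region
                del self.pages[page]
                return
        self.pages.popitem(last=False)               # otherwise evict the dirty LRU page
        # a real implementation would write the dirty victim back to flash here
```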
C-Miner: Mining Block Correlations in Storage Systems
- In Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST ’04)
, 2004
"... systems. These correlations can be exploited for improving the effectiveness of storage caching, prefetching, data layout and disk scheduling. Unfortunately, information about block correlations is not available at the storage system level. Previous approaches for discovering file correlations in fi ..."
Abstract - Cited by 56 (3 self)
Block correlations are common semantic patterns in storage systems. These correlations can be exploited for improving the effectiveness of storage caching, prefetching, data layout and disk scheduling. Unfortunately, information about block correlations is not available at the storage system level. Previous approaches for discovering file correlations in file systems do not scale well enough to be used for discovering block correlations in storage systems.
I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance
"... Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity for improving I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication cons ..."
Abstract - Cited by 48 (5 self)
Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity for improving I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our observations with I/O workload traces obtained from actively-used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluation of a prototype implementation using these workloads showed an overall improvement in disk I/O performance of 28% to 47% across these workloads. Further breakdown also showed that each of the three techniques contributed significantly to the overall performance improvement.
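As a rough sketch of the content-based caching technique named above, the toy model below caches blocks by a hash of their contents, so sectors holding identical data share a single cache entry. It omits eviction as well as the paper's other two techniques, and every name and interface here is an illustrative assumption.

```python
import hashlib

class ContentAddressedCache:
    """Toy content-based read cache: data is cached by content hash, and a
    sector-to-hash map lets duplicate sectors hit on the same cached copy."""
    def __init__(self):
        self.sector_to_hash = {}                     # sector number -> content hash
        self.hash_to_data = {}                       # content hash -> block contents (never evicted here)

    def read(self, sector, fetch_from_disk):
        h = self.sector_to_hash.get(sector)
        if h is not None and h in self.hash_to_data:
            return self.hash_to_data[h]              # hit, possibly via another sector's identical data
        data = fetch_from_disk(sector)               # miss: perform the disk I/O
        h = hashlib.sha1(data).hexdigest()
        self.sector_to_hash[sector] = h
        self.hash_to_data[h] = data                  # one copy serves every duplicate sector
        return data

    def note_write(self, sector, data):
        """Writes also populate the maps, so later reads of any duplicate
        of this data can be served from the cache."""
        h = hashlib.sha1(data).hexdigest()
        self.sector_to_hash[sector] = h
        self.hash_to_data[h] = data
```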