Results 1 - 10 of 26
Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design
- In ASPLOS, 2012
"... Main memory is one of the leading hardware causes for machine crashes in today’s datacenters. Designing, evaluating and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of errors in DRAM in the field. While there have recently ..."
Abstract - Cited by 36 (0 self)
Main memory is one of the leading hardware causes for machine crashes in today’s datacenters. Designing, evaluating and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of errors in DRAM in the field. While there have recently been a few initial studies on DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions on DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors, or the analysis of typical patterns of hard errors. In this paper, we study data on DRAM errors collected on a diverse range of production systems, in total covering nearly 300 terabyte-years of main memory. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we characterize these in detail. As a second contribution, the paper uses the results from the measurement study to identify a number of promising directions for designing more resilient systems and evaluates the potential of different protection mechanisms in the light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems, while sacrificing only a negligible fraction of the total DRAM in the system.
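The page-retirement policy the abstract evaluates is simple enough to sketch. Below is a minimal Python illustration of such a policy, assuming a hypothetical retirement threshold and 4 KB pages; the names and numbers are illustrative, not taken from the paper.

    from collections import defaultdict

    PAGE_SIZE = 4096   # bytes (assumed)
    THRESHOLD = 1      # retire a page on repeated errors (assumed policy knob)

    error_counts = defaultdict(int)
    retired_pages = set()

    def record_error(phys_addr):
        page = phys_addr // PAGE_SIZE
        error_counts[page] += 1
        if error_counts[page] > THRESHOLD:
            retired_pages.add(page)   # the OS stops allocating this frame

    record_error(0x1000)   # first error on page 1
    record_error(0x1fff)   # second error on the same page triggers retirement
    print(sorted(retired_pages))   # [1]

Because hard errors tend to cluster on a few pages, masking them this way costs only a small fraction of total DRAM, which is the abstract’s point.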
Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores
"... DRAM vendors have traditionally optimized the cost-perbit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bitlines in many DRAM chips, only to return a single cache line to the CPU. The ..."
Abstract - Cited by 35 (5 self)
DRAM vendors have traditionally optimized the cost-per-bit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bitlines in many DRAM chips, only to return a single cache line to the CPU. The focus on cost-per-bit is questionable in modern-day servers where operating costs can easily exceed the purchase cost. Modern technology trends are also placing very different demands on the memory system: (i) queuing delays are a significant component of memory access time, (ii) there is a high energy premium for the level of reliability expected for business-critical computing, and (iii) the memory access stream emerging from multi-core systems exhibits limited locality. All of these trends necessitate an ...
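The overfetch problem the abstract names reduces to simple arithmetic. The numbers below are typical DDR3-era values assumed for illustration, not figures from the paper.

    # One activate opens a full row in every chip of the rank, but the
    # request only needs one cache line.
    row_buffer_per_chip = 8 * 1024   # bytes opened per chip (assumed)
    chips_per_rank = 8               # chips activated together (assumed)
    cache_line = 64                  # bytes actually returned to the CPU

    opened = row_buffer_per_chip * chips_per_rank
    print(opened, opened // cache_line)   # 65536 bytes opened, 1024x the demand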
Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput
- In Proc. the 38th Ann. Int’l Symp. Computer Architecture (ISCA), 2011
"... We propose adaptive granularity to combine the best of finegrained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and errortolerance tradeoffs. We use sector caches and sub-ranked memory system ..."
Abstract - Cited by 11 (3 self)
We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and error-tolerance tradeoffs. We use sector caches and sub-ranked memory systems to implement adaptive granularity. We also show how to incorporate adaptive granularity into memory access scheduling. We evaluate our architecture with and without ECC using memory-intensive benchmarks from the SPEC, Olden, PARSEC, SPLASH2, and HPCS benchmark suites and micro-benchmarks. The evaluation shows that performance is improved by 61% without ECC and 44% with ECC in memory-intensive applications, while the reduction in memory power consumption (29% without ECC and 14% with ECC) and traffic (78% without ECC and 66% with ECC) is significant.
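A minimal sketch of the per-page granularity attribute the abstract describes: each virtual page records whether it prefers coarse (whole cache line) or fine (sub-ranked word) accesses. The 50% locality threshold and the sector size are assumptions for illustration, not the paper’s parameters.

    COARSE, FINE = "coarse", "fine"
    page_granularity = {}   # virtual page number -> preferred granularity

    def choose_granularity(vpn, spatial_hits, accesses):
        # High spatial locality -> whole cache lines; low -> sub-ranked words.
        page_granularity[vpn] = COARSE if spatial_hits / accesses > 0.5 else FINE

    def fetch_bytes(vpn, line=64, sector=8):
        return line if page_granularity.get(vpn, COARSE) == COARSE else sector

    choose_granularity(vpn=42, spatial_hits=3, accesses=10)
    print(fetch_bytes(42))   # 8 -> fine-grained access for this page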
Generative Software-based Memory Error Detection and Correction for Operating System Data Structures
"... Abstract—Recent studies indicate that the number of system failures caused by main memory errors is much higher than expected. In contrast to the commonly used hardware-based countermeasures, for example using ECC memory, softwarebased fault-tolerance measures are much more flexible and can exploit ..."
Abstract - Cited by 8 (2 self)
Recent studies indicate that the number of system failures caused by main memory errors is much higher than expected. In contrast to the commonly used hardware-based countermeasures, for example ECC memory, software-based fault-tolerance measures are much more flexible and can exploit application knowledge, such as the criticality of specific data structures. This paper presents a software-based memory error protection approach, which we used to harden the eCos operating system in a case study. The main benefits of our approach are the flexibility to choose from an extensible toolbox of easily pluggable error detection and correction schemes, as well as its very low runtime overhead, which lies in the range of 0.09–1.7%. The implementation is based on aspect-oriented programming and exploits the object-oriented program structure of eCos to identify well-suited code locations for the insertion of generative fault-tolerance measures.
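The paper’s implementation uses aspect-oriented programming on eCos; as a language-neutral illustration of the underlying idea only, the Python sketch below guards an object with a software check word that is verified on every read. Correction schemes (e.g., replication or Hamming codes) would plug into the same read/write hooks. Everything here is an assumed stand-in, not the paper’s code.

    import zlib

    class GuardedObject:
        """Kernel object hardened with a software checksum (detection only)."""
        def __init__(self, payload: bytes):
            self.payload = bytearray(payload)
            self._crc = zlib.crc32(self.payload)

        def read(self) -> bytes:
            if zlib.crc32(self.payload) != self._crc:
                raise RuntimeError("memory error detected")
            return bytes(self.payload)

        def write(self, payload: bytes):
            self.payload = bytearray(payload)
            self._crc = zlib.crc32(self.payload)

    obj = GuardedObject(b"thread control block")
    obj.payload[0] ^= 0x01   # simulate a bit flip in the underlying memory
    # obj.read() now raises instead of silently returning corrupted data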
A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures
"... As GPU’s compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory acce ..."
Abstract - Cited by 6 (1 self)
As GPUs’ compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective at improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without sacrificing control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy-efficiency, and memory throughput for a large range of applications.
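One way to picture the selective fine-grained access: a sectored cache line where a locality prediction decides whether to fetch the whole line or only the requested sector. The geometry below is an assumed simplification, not the paper’s design.

    SECTORS_PER_LINE = 8
    SECTOR_BYTES = 16

    def fetch(wanted_sector, predict_high_locality):
        if predict_high_locality:
            sectors = range(SECTORS_PER_LINE)   # coarse: bring in the whole line
        else:
            sectors = [wanted_sector]           # fine: only what was asked for
        return len(list(sectors)) * SECTOR_BYTES   # bytes moved on-chip

    print(fetch(3, predict_high_locality=True))    # 128
    print(fetch(3, predict_high_locality=False))   # 16, saves bandwidth/energy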
Resilient die-stacked DRAM caches
"... Die-stacked DRAM can provide large amounts of in-package, high-bandwidth cache storage. For server and high-performance com-puting markets, however, such DRAM caches must also provide sufficient support for reliability and fault tolerance. While con-ventional off-chip memory provides ECC support by ..."
Abstract - Cited by 4 (1 self)
Die-stacked DRAM can provide large amounts of in-package, high-bandwidth cache storage. For server and high-performance computing markets, however, such DRAM caches must also provide sufficient support for reliability and fault tolerance. While conventional off-chip memory provides ECC support by adding one or more extra chips, this may not be practical in a 3D stack. In this paper, we present a DRAM cache organization that uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures, from traditional bit errors to large-scale row, column, bank, and channel failures. With only a modest performance degradation compared to a DRAM cache with no ECC support, our proposal can correct all single-bit failures, and 99.9993% of all row, column, and bank failures, providing more than a 54,000× improvement in the FIT rate of silent-data corruptions compared to basic SECDED ECC protection.
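The layered protection can be summarized as a read-path fallback chain: the strong checksum detects, ECC corrects small errors, and the duplicate of dirty data covers large-granularity failures. The sketch below is an assumed control flow, with stub helpers standing in for the real SECDED/CRC hardware.

    import zlib

    def crc_ok(data, crc):
        return zlib.crc32(data) == crc

    def ecc_correct(data, ecc):
        return None   # stub: a real design corrects single-bit flips per word

    def read_line(line):
        data = bytes(line["data"])
        if crc_ok(data, line["crc"]):
            return data                        # common case: clean read
        fixed = ecc_correct(data, line["ecc"])
        if fixed is not None and crc_ok(fixed, line["crc"]):
            return fixed                       # small error corrected by ECC
        if line["dirty"] and line["dup"] is not None:
            return line["dup"]                 # large failure: use the duplicate
        return None                            # clean data: refetch from memory

    line = {"data": b"payload", "crc": zlib.crc32(b"payload"),
            "ecc": None, "dirty": False, "dup": None}
    print(read_line(line))   # b'payload'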
ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates
- In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), ACM
"... DRAM scaling has been the prime driver for increasing the capac-ity of main memory system over the past three decades. Unfor-tunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller ge-ometries, coupled with the problems of device ..."
Abstract - Cited by 3 (3 self)
DRAM scaling has been the prime driver for increasing the capacity of main memory systems over the past three decades. Unfortunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller geometries, coupled with the problems of device variation and leakage. Future DRAM devices are likely to experience significantly higher error rates. Techniques that can tolerate errors efficiently can enable DRAM to scale to smaller technology nodes. However, existing techniques such as row/column sparing and ECC become prohibitive at high error rates. To develop cost-effective solutions for tolerating high error rates, this paper advocates a cross-layer approach. Rather than hiding the faulty cell information within the DRAM chips, we expose it to the architectural level. We propose ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells. ArchShield tolerates these faults using two components, a Fault Map that keeps information about faulty words in a cache line, and Selective Word-Level Replication (SWLR) that replicates faulty words for error resilience. Both the Fault Map and SWLR are integrated in a reserved area in DRAM memory. Our evaluations with an 8GB DRAM DIMM show that ArchShield can efficiently tolerate error rates as high as 10^-4 (100x higher than ECC alone), causes less than 2% performance degradation, and still maintains 1-bit error tolerance against soft errors.
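The read path implied by the Fault Map plus SWLR combination is easy to sketch; the data layouts below are assumptions for illustration, not ArchShield’s actual reserved-region format.

    fault_map = {0x40: {2}}           # cache-line address -> faulty word indices
    replicas = {(0x40, 2): 0xCAFE}    # (line, word) -> replicated copy

    def read_word(line_addr, word_idx, raw_read):
        if word_idx in fault_map.get(line_addr, set()):
            return replicas[(line_addr, word_idx)]   # serve from the replica
        return raw_read(line_addr, word_idx)          # normal DRAM read

    print(hex(read_word(0x40, 2, lambda l, w: 0xDEAD)))   # 0xcafe, from replica
    print(hex(read_word(0x40, 0, lambda l, w: 0xDEAD)))   # 0xdead, direct read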
RAMpage: Graceful degradation management for memory errors in commodity Linux servers
- In 17th IEEE Pacific Rim Int’l Symp. on Dependable Computing (PRDC ’11), 2011
"... Abstract—Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously ..."
Abstract - Cited by 2 (1 self)
Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.
Keywords: Fault tolerance, DRAM chips, Operating systems
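At its core, such a scrubber claims one physical frame at a time, runs a march-style test pattern over it, and withdraws frames that fail. In the sketch below a byte array stands in for a claimed frame; the patterns and frame size are assumed, and the real RAMpage works through a kernel interface rather than on Python buffers.

    PATTERNS = [0x00, 0xFF, 0xAA, 0x55]

    def test_frame(frame: bytearray) -> bool:
        for p in PATTERNS:
            for i in range(len(frame)):
                frame[i] = p                      # write the pattern
            if any(b != p for b in frame):        # read back and verify
                return False                      # stuck-at or coupling fault
        return True

    frame = bytearray(4096)
    print("return to allocator" if test_frame(frame) else "withdraw page")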
Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures
"... Abstract—Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these modules operate reliably, where failure can require replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to new failure modes (for example, due to ..."
Abstract - Cited by 1 (1 self)
Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these modules operate reliably, since a failure can require replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to new failure modes (for example, due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data-striping reduces memory-level parallelism, causing significant slowdown and higher memory power consumption. This paper proposes Citadel, a robust memory architecture that allows the memory system to store each cache line entirely within one bank, enabling high performance, low power, and efficient protection from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data-TSVs and faulty address-TSVs; Three Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically replacing faulty memory regions with spares, either at a row granularity or at a bank granularity. Our evaluations with real-world DRAM failure data show that Citadel performs within 1% of, and uses only 4% more power than, a memory system optimized for performance and power, yet provides reliability that is 7x–700x higher than symbol-based ECC.
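One dimension of the 3DP scheme can be illustrated with XOR parity across banks: losing an entire bank is recoverable from the surviving banks plus the bank-dimension parity. The geometry below is a toy, assumed for illustration; the paper maintains parity across rows and columns as well.

    import functools, operator

    banks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 3 banks x 3 words (toy)
    parity = [functools.reduce(operator.xor, col) for col in zip(*banks)]

    failed = 1                                   # pretend bank 1 is lost
    survivors = [b for i, b in enumerate(banks) if i != failed]
    rebuilt = [functools.reduce(operator.xor, vals)
               for vals in zip(parity, *survivors)]
    print(rebuilt)                               # [4, 5, 6] -- bank recovered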
The Dynamic Granularity Memory System
"... Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), w ..."
Abstract
Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.
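A hardware granularity predictor of this flavor can be approximated with per-region saturating counters trained on observed spatial reuse; the table organization and thresholds below are assumptions for illustration, not DGMS’s actual mechanism.

    REGION_BITS = 12
    counters = {}   # region -> 2-bit saturating counter, initially weakly coarse

    def train(region, neighbors_touched):
        c = counters.get(region, 2)
        counters[region] = min(3, c + 1) if neighbors_touched else max(0, c - 1)

    def predict_coarse(addr):
        return counters.get(addr >> REGION_BITS, 2) >= 2

    train(0x5, neighbors_touched=False)
    train(0x5, neighbors_touched=False)
    print(predict_coarse(0x5 << REGION_BITS))   # False -> fetch fine-grained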