Determining fault tolerance of XOR-based erasure codes efficiently
In Proceedings of the 2007 International Conference on Dependable Systems and Networks (DSN), 2007
Cited by 19 (1 self)
We propose a new fault tolerance metric for XOR-based erasure codes: the minimal erasures list (MEL). A minimal erasure is a set of erasures that leads to irrecoverable data loss and in which every erasure is necessary and sufficient for this to be so. The MEL is the enumeration of all minimal erasures. An XOR-based erasure code has an irregular structure that may permit it to tolerate faults at and beyond its Hamming distance. The MEL completely describes the fault tolerance of an XOR-based erasure code at and beyond its Hamming distance; it is therefore a useful metric for comparing the fault tolerance of such codes. We also propose an algorithm that uses the structure of an erasure code to determine its MEL efficiently. We show that, in practice, the number of minimal erasures for a given code is much smaller than the total number of sets of erasures that lead to data loss: in our empirical results for one corpus of codes, there were over 80 times fewer minimal erasures. We use the proposed algorithm to identify the most fault-tolerant XOR-based erasure code among all possible systematic erasure codes with up to seven data symbols and up to seven parity symbols.
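To make the MEL definition concrete, the following is a minimal brute-force sketch, not the efficient algorithm the paper proposes. It assumes a flat systematic XOR code in which each parity symbol is the XOR of a subset of data symbols, written as a bitmask over the data symbols; the names `gf2_rank` and `minimal_erasures` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def gf2_rank(vectors):
    """Rank over GF(2) of a list of integer bitmask vectors."""
    basis = {}  # leading bit position -> basis vector with that leading bit
    rank = 0
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                rank += 1
                break
            v ^= basis[lead]  # cancel the leading bit and keep reducing
    return rank


def minimal_erasures(k, parities):
    """Enumerate the minimal erasures of a flat XOR code by exhaustive search.

    k        -- number of data symbols (symbol i has column 1 << i)
    parities -- one bitmask per parity symbol (assumed representation)
    Symbols are indexed 0..k-1 (data) followed by k.. (parity).
    """
    symbols = [1 << i for i in range(k)] + list(parities)
    n = len(symbols)
    mel = []
    # Visit erasure sets by increasing size, so any lossy set that contains
    # no previously found minimal erasure must itself be minimal.
    for size in range(1, n + 1):
        for erased in combinations(range(n), size):
            erased_set = frozenset(erased)
            if any(m <= erased_set for m in mel):
                continue  # superset of a known minimal erasure
            surviving = [symbols[i] for i in range(n) if i not in erased_set]
            if gf2_rank(surviving) < k:  # data cannot be reconstructed
                mel.append(erased_set)
    return mel
```

For a single-parity code with two data symbols, `minimal_erasures(2, [0b11])` yields the three pairs of symbols, since any two losses are fatal and any single loss is recoverable. The exhaustive search is exponential in the number of symbols, which is the cost an efficient MEL algorithm is meant to avoid.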
REO: A generic RAID Engine and Optimizer
Cited by 4 (0 self)
Present-day applications that require reliable data storage use one of five commonly available RAID levels to protect against data loss due to media or disk failures. With a marked rise in the quantity of stored data and no commensurate improvement in disk reliability, a greater variety is becoming necessary to contain costs. Adding new RAID codes to an implementation becomes cost-prohibitive, since they require significant development, testing, and tuning efforts. We suggest a novel solution to this problem: a generic RAID Engine and Optimizer (REO). It is generic in that it works for any XOR-based erasure (RAID) code and under any combination of sector or disk failures. REO can systematically deduce a least-cost reconstruction strategy for a read of lost pages, or a least-cost update strategy for a flush of dirty pages. Using trace-driven simulations, we show that REO can automatically tune I/O performance to be competitive with existing RAID implementations.
Flat XOR-based erasure codes in storage systems: Constructions, efficient recovery, and tradeoffs
Cited by 3 (0 self)
Large-scale storage systems require multi-disk fault-tolerant erasure codes. Replication and RAID extensions that protect against two- and three-disk failures offer a stark tradeoff between how much data must be stored and how much data must be read to recover a failed disk. Flat XOR-codes, erasure codes in which parity disks are calculated as the XOR of some subset of data disks, offer a tradeoff between these extremes. In this paper, we describe constructions of two novel flat XOR-codes: Stepped Combination and HD-Combination codes. We describe an algorithm for flat XOR-codes that enumerates recovery equations, i.e., sets of disks that can recover a failed disk. We also describe two algorithms for flat XOR-codes that generate recovery schedules, i.e., sets of recovery equations that can be used in concert to achieve efficient recovery. Finally, we analyze the key storage properties of many flat XOR-codes and of MDS codes such as replication and RAID 6 to show the cost-benefit tradeoff gap that flat XOR-codes can fill.
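The notion of a recovery equation can be illustrated with a small exhaustive sketch. This is not the enumeration algorithm from the paper; it assumes each parity disk is given as a bitmask over the data disks, and the function name `recovery_equations` is hypothetical.

```python
from functools import reduce
from itertools import combinations
from operator import xor


def recovery_equations(k, parities, failed):
    """List the sets of disks whose XOR reproduces the failed disk.

    k        -- number of data disks (disk i has column 1 << i)
    parities -- one bitmask per parity disk (assumed representation)
    failed   -- index of the failed disk (data disks first, then parity)
    """
    symbols = [1 << i for i in range(k)] + list(parities)
    others = [i for i in range(len(symbols)) if i != failed]
    target = symbols[failed]
    equations = []
    # Exhaustive search: a subset is a recovery equation when the XOR of
    # its columns equals the column of the failed disk.
    for size in range(1, len(others) + 1):
        for subset in combinations(others, size):
            if reduce(xor, (symbols[i] for i in subset), 0) == target:
                equations.append(subset)
    return equations
```

With two data disks and one parity disk covering both, `recovery_equations(2, [0b11], 0)` returns `[(1, 2)]`: the failed data disk is the XOR of the other data disk and the parity disk. Codes with more parity disks generally give several equations per failed disk, which is what makes scheduling them in concert worthwhile.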
Mean time to meaningless: MTTDL, Markov models, and storage system reliability
Mean Time To Data Loss (MTTDL) has been the standard reliability metric in storage systems for more than 20 years. MTTDL represents a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses. The MTTDL metric is often misused, with egregious examples relying on the MTTDL to generate reliability estimates that span centuries or millennia. Moving forward, the storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.
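The abstract does not state the formula itself. As a sketch, the widely used approximation for a single-parity array, MTTDL ≈ MTTF² / (N · (N − 1) · MTTR), is assumed here; the function name and the input values are hypothetical and chosen only to show how readily the metric produces very long time spans.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_single_parity(n_disks, mttf_hours, mttr_hours):
    """Approximate MTTDL in hours for an array that tolerates one disk failure.

    Assumes independent, exponentially distributed failures and repairs,
    with repair time much shorter than time to failure.
    """
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)


# Hypothetical inputs: 8 disks, 1,000,000-hour MTTF, 24-hour rebuild.
years = mttdl_single_parity(8, 1_000_000, 24) / HOURS_PER_YEAR
```

With these illustrative inputs the estimate comes out at tens of thousands of years, far longer than any storage system's service life.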
TPT-RAID: a High Performance Box-Fault Tolerant Storage System
TPT-RAID is a multi-box RAID wherein each ECC group comprises at most one block from any given storage box, and can thus tolerate a box failure. It extends the idea of an out-of-band SAN controller into the RAID: data is sent directly between hosts and targets and among targets, and the RAID controller supervises ECC calculation by the targets. By preventing a communication bottleneck in the controller, excellent scalability is achieved while retaining the simplicity of centralized control. TPT-RAID, whose controller can be a software module within an out-of-band SAN controller, moreover conforms to a conventional switched network architecture, whereas an in-band RAID controller would either constitute a communication bottleneck or would have to also be a full-fledged router. The design is validated in an InfiniBand-based prototype using iSCSI and iSER, and required changes to relevant protocols are introduced.
Hierarchical RAID: Organization, Operation, Reliability and Performance
We consider two level Hierarchical RAID (HRAID) arrays with erasure coding at both levels. The main advantage of HRAID is tolerating disk array controller failures in addition to disk failures, so that it is suitable for storage clouds based on bricks. We consider an HRAID with

