## A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems (1997)

Venue: | Software – Practice & Experience |

Citations: | 229 - 37 self |

### Citations

837 | A case for redundant arrays of inexpensive disks (RAID)
- Patterson, Gibson, et al.
- 1988
Citation Context ...ly new. It came to the fore with “Redundant Arrays of Inexpensive Disks” (RAID) where batteries of small, inexpensive disks combine high storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, ... |

577 |
Algebraic coding theory
- Berlekamp
- 1968
Citation Context ...h that if any … of … fail, then the contents of the failed devices can be reconstructed from the non-failed devices. Introduction: Error-correcting codes have been around for decades [1, 2, 3]. However, the technique of distributing data among multiple storage devices to achieve high-bandwidth input and output, and using one or more error-correcting devices for failure recovery is relative... |

551 | Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance
- Rabin
- 1989
Citation Context ...on RAID-like systems. However, the technique itself is harder to come by. The technique has an interesting history. It was first presented in terms of secret sharing by Karnin [17], and then by Rabin [18] in terms of information dispersal. Preparata [19] then showed the relationship between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solomon coding. The technique... |

492 | An introduction to disk drive modeling
- Ruemmler, Wilkes
- 1994
Citation Context ...: the Gaussian Elimination and the recalculation. (Footnote 1: We do not include any equations for the time to perform disk reads/writes because the complexity of disk operation precludes a simple encapsulation [25].) … Since at least … rows of … are identity rows, the Gaussian Elimination takes … steps. As … is likely to be small, this should be very fast (i.e. milliseconds). The subsequent recalcu... |

342 | RAID: High-performance, reliable secondary storage - Chen, Lee, et al. - 1994 |

330 |
A case for redundant arrays of inexpensive disks
- Patterson, Gibson, et al.
- 1988
Citation Context ...F(2^4): 3 × 7 = gfilog[gflog[3]+gflog[7]] = gfilog[4+10] = gfilog[14] = 9; 13 × 10 = gfilog[gflog[13]+gflog[10]] = gfilog[13+9] = gfilog[7] = 11; 13 ÷ 10 = gfilog[gflog[13]-gflog[10]] = gfilog[13-9] = gfilog[4] = 3; 3 ÷ 7 = gfilog[gflog[3]-gflog[7]] = gfilog[4-10] = gfilog[9] = 10. Therefore, a multiplication or division requires one conditional, three table lookups (two... #define NW (1 << w) /* In other words, NW equals 2 to the w-th p... |

302 | The Zebra striped network file system
- Hartman, Ousterhout
- 1993
Citation Context ... storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed... |

291 |
Introduction to Coding Theory
- van Lint
- 1999
Citation Context ...as Reed-Solomon coding. The technique has recently been discussed in varying levels of detail by Gibson [5], Schwarz [20] and Burkhard [13], with citations of standard texts on error correcting codes [1, 2, 3, 21, 22] for completeness. There is one problem with all the above discussions of this technique: they require the reader to have a thorough knowledge of algebra and coding theory. Any programmer with a bach... |

155 |
Redundant Disk Arrays: Reliable, Parallel Secondary Storage
- Gibson
- 1992
Citation Context ...ly new. It came to the fore with “Redundant Arrays of Inexpensive Disks” (RAID) where batteries of small, inexpensive disks combine high storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, ... |

145 | On secret sharing systems
- Karnin, Greene, et al.
- 1983
Citation Context ...ed in almost all papers on RAID-like systems. However, the technique itself is harder to come by. The technique has an interesting history. It was first presented in terms of secret sharing by Karnin [17], and then by Rabin [18] in terms of information dispersal. Preparata [19] then showed the relationship between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solom... |

133 |
Reed-Solomon Codes and Their Applications
- Wicker
- 1994
Citation Context ...as Reed-Solomon coding. The technique has recently been discussed in varying levels of detail by Gibson [5], Schwarz [20] and Burkhard [13], with citations of standard texts on error correcting codes [1, 2, 3, 21, 22] for completeness. There is one problem with all the above discussions of this technique: they require the reader to have a thorough knowledge of algebra and coding theory. Any programmer with a bach... |

111 | Some applications of Rabin’s fingerprinting method
- Broder
- 1993
Citation Context ...ficient software solution that is easy to implement and does not consume much physical memory. For larger values of …, other approaches (hardware or software) may be necessary. See References [2], [27] and [28] for examples of other approaches. Acknowledgements: The author thanks Joel Friedman, Kai Li, Michael Puening, Norman Ramsey, Brad Vander Zanden and Michael Vose for their valuable comments... |

94 | Disk array storage system reliability
- Burkhard, Menon
- 1993
Citation Context ...ration per write to any single device. Its main disadvantage is that it cannot recover from more than one simultaneous failure. As n grows, the ability to tolerate multiple failures becomes important [13]. Several techniques have been developed for this [13, 14, 15, 16], the concentration being small values of m. The most general technique for tolerating m simultaneous failures with exactly m checksum devices is a technique based on Reed-Solomon coding. This fact i... |

91 | The TickerTAIP Parallel RAID Architecture
- Cao, Lim, et al.
- 1993
Citation Context ... storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed... |

81 | EVENODD: An optimal scheme for tolerating double disk failures
- Blaum, Brady, et al.
- 1994
Citation Context ... minimal device overhead. In other words, there are some combinations of … device failures that the system cannot tolerate. An important coding technique for two device failures is EVENODD coding [15]. This technique tolerates all two device failures with just two checksum devices, and all coding operations are XORs. Thus, it too is faster than RS-Raid coding. To the author’s knowledge, there is ... |

45 |
Error-Correcting Codes, second edition
- Peterson, Weldon Jr.
- 1972
Citation Context ...h that if any … of … fail, then the contents of the failed devices can be reconstructed from the non-failed devices. Introduction: Error-correcting codes have been around for decades [1, 2, 3]. However, the technique of distributing data among multiple storage devices to achieve high-bandwidth input and output, and using one or more error-correcting devices for failure recovery is relative... |

29 | Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques
- Plank
- 1996
Citation Context ...st [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed among n devices, the chances of one of these devices failing becomes si... |

28 |
Failure correction techniques for large disk arrays
- Gibson, Hellerstein, et al.
- 1989
Citation Context ...antage is that it cannot recover from more than one simultaneous failure. As n grows, the ability to tolerate multiple failures becomes important [13]. Several techniques have been developed for this [13, 14, 15, 16], the concentration being small values of m. The most general technique for tolerating m simultaneous failures with exactly m checksum devices is a technique based on Reed-Solomon coding. This fact i... |

26 | RAID Organization and Performance
- Schwarz, Burkhard
- 1992
Citation Context ...between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solomon coding. The technique has recently been discussed in varying levels of detail by Gibson [5], Schwarz [20] and Burkhard [13], with citations of standard texts on error correcting codes [1, 2, 3, 21, 22] for completeness. There is one problem with all the above discussions of this technique: they require ... |

25 |
The Theory of Error-correcting Codes, Part I
- MacWilliams, Sloane
- 1977
Citation Context ... device using FD = C. For example, suppose the first word of D1 is 3, the first word of D2 is 13, and the first word of D3 is 9. Then we use F to calculate the first words of C1, C2, and C3: C1 = (1)(3) ⊕ (1)(13) ⊕ (1)(9) = 3 ⊕ 13 ⊕ 9 = 0011 ⊕ 1101 ⊕ 1001 = 0111 = 7; C2 = (1)(3) ⊕ (2)(13) ⊕ (3)(9) = 3 ⊕ 9 ⊕ 8 = 0011 ⊕ 1001 ⊕ 1000 = 0010 = 2; C3 = (1)(3) ⊕ (4)(13) ⊕ (5)(9) = 3 ⊕ 1 ⊕ 11 = 0011 ⊕ 0001 ⊕ 1011 = 1001 = 9. Suppose we change D2... |

24 | Algorithm-based diskless checkpointing for fault tolerant matrix operations, Fault-Tolerant Computing
- Plank, Kim, et al.
- 1995
Citation Context ...st [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed among n devices, the chances of one of these devices failing becomes si... |

24 |
Evaluation of Checkpoint Mechanisms for Massively Parallel Machines
- Chiueh, Deng
- 1996
Citation Context ...st [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed among n devices, the chances of one of these devices failing becomes si... |

20 |
Efficient placement of parity and data to tolerate two disk failures in disk array systems
- Park
- 1995
Citation Context ...antage is that it cannot recover from more than one simultaneous failure. As n grows, the ability to tolerate multiple failures becomes important [13]. Several techniques have been developed for this [13, 14, 15, 16], the concentration being small values of m. The most general technique for tolerating m simultaneous failures with exactly m checksum devices is a technique based on Reed-Solomon coding. This fact i... |

20 |
Holographic Dispersal and Recovery of Information
- Preparata
- 1989
Citation Context ...f is harder to come by. The technique has an interesting history. It was first presented in terms of secret sharing by Karnin [17], and then by Rabin [18] in terms of information dispersal. Preparata [19] then showed the relationship between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solomon coding. The technique has recently been discussed in varying levels of ... |

19 | Faster checkpointing with N + 1 parity - Plank, Li - 1994 |

15 |
Fast, on-line failure recovery in redundant disk arrays
- Holland, Gibson, et al.
- 1993
Citation Context ...isks, which is the minimum value for tolerating m failures. As in all RAID systems, the encoding information may be distributed among the … disks to avoid having the checksum disks become hot spots [5, 26]. The final operation of concern is recovery. Here, we assume that … failures have occurred and the system must recover the contents of the … disks. In the RS-Raid algorithm, recovery consists of ... |

10 | Maximal and near-maximal shift register sequences: efficient event counters and easy discrete logarithms
- Clark, Weng
- 1994
Citation Context ...oftware solution that is easy to implement and does not consume much physical memory. For larger values of …, other approaches (hardware or software) may be necessary. See References [2], [27] and [28] for examples of other approaches. Acknowledgements: The author thanks Joel Friedman, Kai Li, Michael Puening, Norman Ramsey, Brad Vander Zanden and Michael Vose for their valuable comments and disc... |

9 | On-Line Failure Recovery in Redundant Disk Arrays - Holland, Gibson, et al. - 1993 |

7 |
Faster Checkpointing with
- Plank, Li
- 1994
Citation Context ...g[14] = 9; 13 × 10 = gfilog[gflog[13]+gflog[10]] = gfilog[13+9] = gfilog[7] = 11; 13 ÷ 10 = gfilog[gflog[13]-gflog[10]] = gfilog[13-9] = gfilog[4] = 3; 3 ÷ 7 = gfilog[gflog[3]-gflog[7]] = gfilog[4-10] = gfilog[9] = 10. Therefore, a multiplication or division requires one conditional, three table lookups (two... #define NW (1 << w) /* In other words, NW equals 2 to the w-th p... |

5 |
Codes for Error Control and Synchronization
- Wiggert
- 1988
Citation Context ...ecognizes this shutting down. This is as opposed to an error, in which a device failure is manifested by storing and retrieving incorrect values that can only be recognized by some sort of embedded coding [2, 23]. The calculation of the contents of each checksum device Ci requires a function Fi applied to all the data devices. Figure 1 shows an example configuration using this technique (which we henceforth... |

2 |
Applied Parallel Research
- Plank, Li
- 1994
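
The GF(2^4) table arithmetic and the checksum calculation quoted in the citation contexts above can be checked with a short sketch. This is not the paper's code: it is a Python illustration that assumes the primitive polynomial x^4 + x + 1 (chosen because it reproduces the quoted values gflog[3] = 4 and gflog[7] = 10) and a coefficient matrix F whose entry in row i, column j is j^(i-1), which matches the rows (1, 1, 1), (1, 2, 3), (1, 4, 5) used in the quoted example. The names `build_tables`, `gf_mult`, `gf_div`, `gf_pow`, and `checksums` are hypothetical.

```python
W = 4                 # word size in bits
NW = 1 << W           # number of field elements, 2^w
PRIM_POLY = 0b10011   # x^4 + x + 1 (assumed; consistent with the quoted table values)


def build_tables():
    """Build log and inverse-log tables for GF(2^w)."""
    gflog = [None] * NW          # gflog[0] stays undefined
    gfilog = [None] * (NW - 1)
    b = 1
    for log in range(NW - 1):
        gflog[b] = log
        gfilog[log] = b
        b <<= 1                  # multiply by the generator 2
        if b & NW:               # overflowed w bits: reduce by the polynomial
            b ^= PRIM_POLY
    return gflog, gfilog


GFLOG, GFILOG = build_tables()


def gf_mult(a, b):
    """Multiply in GF(2^w): add logs modulo 2^w - 1."""
    if a == 0 or b == 0:
        return 0
    return GFILOG[(GFLOG[a] + GFLOG[b]) % (NW - 1)]


def gf_div(a, b):
    """Divide in GF(2^w): subtract logs modulo 2^w - 1."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^w)")
    if a == 0:
        return 0
    return GFILOG[(GFLOG[a] - GFLOG[b]) % (NW - 1)]


def gf_pow(a, e):
    """Raise a to a small non-negative power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mult(result, a)
    return result


def checksums(data, m):
    """Compute m checksum words; addition in GF(2^w) is XOR."""
    out = []
    for i in range(m):
        c = 0
        for j, d in enumerate(data, start=1):
            c ^= gf_mult(gf_pow(j, i), d)   # F[i][j] = j^i with i counted from 0
        out.append(c)
    return out


if __name__ == "__main__":
    print(gf_mult(3, 7), gf_mult(13, 10), gf_div(13, 10), gf_div(3, 7))
    print(checksums([3, 13, 9], 3))
```

Python's `%` returns a non-negative result for a positive modulus, so the negative difference 4 - 10 in the division case wraps to index 9 without an explicit conditional; a C version would need the conditional the quoted text mentions.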