Results 1 - 10 of 40
Ceph: A scalable, high-performance distributed file system
- In OSDI
, 2006
"... Abstract We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for hetero ..."
Abstract
-
Cited by 275 (32 self)
- Add to MetaCart
(Show Context)
Abstract: We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
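The core idea above, computing an object's location with a pseudo-random function instead of looking it up in an allocation table, can be pictured with a weighted rendezvous-hash toy. This is not CRUSH itself; every name here (place_object, the OSD ids, the weights) is an illustrative assumption.

```python
import hashlib
from math import log

def _score(oid: str, osd: str, weight: float) -> float:
    """Deterministic pseudo-random score for an (object, OSD) pair; highest scores win.
    Weighted rendezvous hashing: win probability is roughly proportional to weight."""
    h = hashlib.sha256(f"{oid}:{osd}".encode()).digest()
    u = (int.from_bytes(h[:6], "big") + 0.5) / 2**48    # uniform in (0, 1)
    return -weight / log(u)

def place_object(oid: str, osds: dict, replicas: int = 3) -> list:
    """Map an object to `replicas` OSDs with no allocation table: any client holding
    the cluster map (OSD ids and weights) recomputes exactly the same answer."""
    ranked = sorted(osds, key=lambda osd: _score(oid, osd, osds[osd]), reverse=True)
    return ranked[:replicas]

if __name__ == "__main__":
    cluster = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0}  # weight ~ capacity
    print(place_object("inode123.chunk7", cluster))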
LH*RS -- a high-availability scalable distributed data structure
"... (SDDS). An LH*RS file is hash partitioned over the distributed RAM of a multicomputer, e.g., a network of PCs, and supports the unavailability of any of its k ≥ 1 server nodes. The value of k transparently grows with the file to offset the reliability decline. Only the number of the storage nodes p ..."
Abstract
-
Cited by 59 (11 self)
- Add to MetaCart
Abstract: LH*RS is a high-availability scalable distributed data structure (SDDS). An LH*RS file is hash partitioned over the distributed RAM of a multicomputer, e.g., a network of PCs, and supports the unavailability of any of its k ≥ 1 server nodes. The value of k transparently grows with the file to offset the reliability decline. Only the number of storage nodes potentially limits the file growth. The high-availability management uses a novel parity calculus that we have developed, based on Reed-Solomon erasure-correcting coding. The resulting parity storage overhead is close to the minimum possible. The parity encoding and decoding are faster than for any other candidate coding we are aware of. We present our scheme and its performance analysis, including experiments with a prototype implementation on Wintel PCs. The capabilities of LH*RS offer new perspectives to data-intensive applications, including the emerging ones of grids and of P2P computing.
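The claim that k transparently grows with the file can be made concrete with a simple availability calculation. The binomial model below (independent node failures with probability p, group unavailable once more than k nodes are down) and the target numbers are my own simplifying assumptions, not the paper's parity calculus.

```python
from math import comb

def unavailability(n: int, k: int, p: float) -> float:
    """P(more than k of n independent server nodes are down), node-down probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

def required_k(n: int, p: float, target: float) -> int:
    """Smallest k that keeps the file's residual unavailability below `target`."""
    for k in range(n + 1):
        if unavailability(n, k, p) < target:
            return k
    return n

if __name__ == "__main__":
    # As the file spreads over more server nodes, k must grow to hold the same
    # availability target -- the reliability decline the abstract mentions.
    for n in (16, 64, 256, 1024):
        print(f"{n:5d} nodes -> k = {required_k(n, p=0.01, target=1e-6)}")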
PRO: A popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems
- In Proceedings of the 5th USENIX Conference on File and Storage Technologies. USENIX Association
, 2007
"... This paper proposes and evaluates a novel dynamic data reconstruction optimization algorithm, called popularity-based multi-threaded reconstruction optimization (PRO), which allows the reconstruction process in a RAID-structured storage system to rebuild the frequently accessed areas prior to rebuil ..."
Abstract
-
Cited by 38 (13 self)
- Add to MetaCart
(Show Context)
Abstract: This paper proposes and evaluates a novel dynamic data reconstruction optimization algorithm, called popularity-based multi-threaded reconstruction optimization (PRO), which allows the reconstruction process in a RAID-structured storage system to rebuild frequently accessed areas before infrequently accessed areas, thereby exploiting access locality. This approach has the salient advantage of simultaneously decreasing reconstruction time and alleviating user and system performance degradation. It can also be easily adopted in various conventional reconstruction approaches. In particular, we optimize the disk-oriented reconstruction (DOR) approach with PRO. The PRO-powered DOR is shown to induce a much earlier onset of response-time improvement and to sustain a longer span of such improvement than the original DOR. Our benchmark studies on read-only web workloads show that the PRO-powered DOR algorithm consistently outperforms the original DOR algorithm during failure recovery in terms of user response time, with a 3.6%–23.9% performance improvement and up to a 44.7% reconstruction-time improvement simultaneously.
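A minimal sketch of the popularity-driven rebuild order, assuming the failed disk is split into fixed rebuild zones and that user accesses observed during reconstruction are counted per zone; the class name and the single-threaded scheduler are illustrative, not PRO's actual multi-threaded design.

```python
from collections import Counter

class PopularityRebuilder:
    """Toy scheduler: rebuild the most frequently accessed zones of the failed disk first."""

    def __init__(self, num_zones: int):
        self.pending = set(range(num_zones))   # zones not yet reconstructed
        self.hits = Counter()                  # per-zone access popularity seen so far

    def record_access(self, zone: int) -> None:
        """Called on every user I/O that maps to the degraded disk."""
        if zone in self.pending:
            self.hits[zone] += 1

    def next_zone(self):
        """Hand the reconstruction thread the hottest pending zone (cold zones last)."""
        if not self.pending:
            return None
        zone = max(self.pending, key=lambda z: self.hits[z])
        self.pending.discard(zone)
        return zone

if __name__ == "__main__":
    rb = PopularityRebuilder(num_zones=6)
    for z in (3, 3, 3, 1, 1, 5):               # simulated user accesses during rebuild
        rb.record_access(z)
    print([rb.next_zone() for _ in range(6)])  # hot zones 3 and 1 are rebuilt first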
Optimal recovery of single disk failure in RDP code storage systems
- ACM SIGMETRICS Performance Evaluation Review
"... Modern storage systems use thousands of inexpensive disks to meet the storage requirement of applications. To enhance the data availability, some form of redundancy is used. For example, conventional RAID-5 systems provide data availability for single disk failure only, while recent advanced coding ..."
Abstract
-
Cited by 25 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Modern storage systems use thousands of inexpensive disks to meet the storage requirements of applications. To enhance data availability, some form of redundancy is used. For example, conventional RAID-5 systems provide data availability under a single disk failure only, while recent advanced coding techniques such as row-diagonal parity (RDP) provide data availability under up to two disk failures. To reduce the probability of data unavailability, whenever a single disk fails, disk recovery (or rebuild) is carried out. We show that the conventional recovery scheme of the RDP code for a single disk failure is inefficient and suboptimal. In this paper, we propose an optimal and efficient disk recovery scheme, Row-Diagonal Optimal Recovery (RDOR), for single disk failure of the RDP code that has the following properties: (1) it is read optimal in the sense that it issues the smallest number of disk reads to recover the failed disk; (2) it has the load-balancing property that all surviving disks are subjected to the same amount of additional workload in rebuilding the failed disk. We carefully explore the design space and theoretically show the optimality of RDOR. We carry out performance evaluation to quantify the merits of RDOR on some widely used disks.
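The read-optimality claim can be checked on a small array with a brute-force sketch. The code below models only the RDP geometry (prime p, p-1 data disks, a row-parity disk, a diagonal-parity disk, and the usual parity-less diagonal) and tries every row/diagonal choice per lost symbol; it is a counting toy under my own simplifications, not the RDOR algorithm itself.

```python
from itertools import product

def recovery_read_counts(p: int, failed: int):
    """Symbols read to rebuild data disk `failed` in one RDP stripe:
    rows 0..p-2, data cols 0..p-2, row parity col p-1, diagonal parity col p."""
    rows = range(p - 1)

    def row_reads(r):
        # Row equation: every surviving symbol of row r, including the row parity.
        return {(r, c) for c in range(p) if c != failed}

    def diag_reads(r):
        d = (r + failed) % p                    # diagonal number of the lost symbol
        if d == p - 1:
            return None                         # the diagonal that carries no parity
        cells = {((d - c) % p, c) for c in range(p)
                 if (d - c) % p != p - 1 and c != failed}
        return cells | {(d, p)}                 # surviving diagonal cells + diagonal parity

    conventional = set().union(*(row_reads(r) for r in rows))

    options = []
    for r in rows:
        alts = [row_reads(r)]
        diag = diag_reads(r)
        if diag is not None:
            alts.append(diag)
        options.append(alts)

    best = min(len(set().union(*choice)) for choice in product(*options))
    return len(conventional), best

if __name__ == "__main__":
    conv, best = recovery_read_counts(p=5, failed=0)
    print(f"conventional: {conv} reads, best row/diagonal mix: {best} reads")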
WorkOut: I/O Workload Outsourcing for Boosting RAID Reconstruction Performance
"... User I/O intensity can significantly impact the performance of on-line RAID reconstruction due to contention for the shared disk bandwidth. Based on this observation, this paper proposes a novel scheme, called WorkOut (I/O Workload Outsourcing), to significantly boost RAID reconstruction performance ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
(Show Context)
Abstract: User I/O intensity can significantly impact the performance of on-line RAID reconstruction due to contention for the shared disk bandwidth. Based on this observation, this paper proposes a novel scheme, called WorkOut (I/O Workload Outsourcing), to significantly boost RAID reconstruction performance. WorkOut effectively outsources all write requests and popular read requests originally targeted at the degraded RAID set to a surrogate RAID set during reconstruction. Our lightweight prototype implementation of WorkOut and extensive trace-driven and benchmark-driven experiments demonstrate that, compared with existing reconstruction approaches, WorkOut significantly reduces both the total reconstruction time and the average user response time. Importantly, WorkOut is orthogonal to, and can be easily incorporated into, any existing reconstruction algorithm. Furthermore, it can be extended to improve the performance of other background RAID tasks, such as re-synchronization and disk scrubbing.
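A minimal sketch of the outsourcing idea, assuming a block-level interface and a simple popularity threshold; DictStore, the threshold, and the reclaim step are illustrative stand-ins, not WorkOut's actual in-kernel mechanisms.

```python
from collections import Counter

class DictStore(dict):
    """Stand-in for a RAID set: block number -> data."""
    def write(self, block, data): self[block] = data
    def read(self, block): return self[block]

class WorkOutRedirector:
    """Toy router used only while the degraded RAID set is reconstructing:
    all writes, and reads that become popular, are served by a surrogate set."""

    def __init__(self, degraded, surrogate, popularity_threshold: int = 2):
        self.degraded, self.surrogate = degraded, surrogate
        self.threshold = popularity_threshold
        self.read_hits = Counter()
        self.redirected = set()            # blocks whose newest copy is on the surrogate

    def write(self, block, data):
        self.surrogate.write(block, data)  # keep user writes off the rebuilding disks
        self.redirected.add(block)

    def read(self, block):
        if block in self.redirected:
            return self.surrogate.read(block)
        self.read_hits[block] += 1
        if self.read_hits[block] >= self.threshold:     # popular: copy it out once
            data = self.degraded.read(block)
            self.surrogate.write(block, data)
            self.redirected.add(block)
            return data
        return self.degraded.read(block)

    def reclaim(self):
        """After reconstruction completes, migrate redirected blocks back."""
        for block in sorted(self.redirected):
            self.degraded.write(block, self.surrogate.read(block))
        self.redirected.clear()

if __name__ == "__main__":
    w = WorkOutRedirector(DictStore({1: b"old"}), DictStore())
    w.write(2, b"new")
    print(w.read(1), w.read(2))   # b'old' from the degraded set, b'new' from the surrogate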
Disk infant mortality in large storage systems
- In Proc. of MASCOTS ’05
, 2005
"... As disk drives have dropped in price relative to tape, the desire for the convenience and speed of online access to large data repositories has led to the deployment of petabyte-scale disk farms with thousands of disks. Unfortunately, the very large size of these repositories renders them vulnerable ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
(Show Context)
Abstract: As disk drives have dropped in price relative to tape, the desire for the convenience and speed of online access to large data repositories has led to the deployment of petabyte-scale disk farms with thousands of disks. Unfortunately, the very large size of these repositories renders them vulnerable to previously rare failure modes, such as multiple unrelated disk failures leading to data loss. While some business models, such as free email servers, may be able to tolerate some occurrence of data loss, others, including premium online services and storage of simulation results at a national laboratory, cannot. This paper describes the effect of infant mortality on the long-term failure rates of systems that must preserve their data for decades. Our failure models incorporate the well-known “bathtub curve,” which reflects the higher failure rates of new disk drives, a lower, constant failure rate during the remainder of the design life span, and increased failure rates as components wear out. Large systems are vulnerable to the “cohort effect” that occurs when many disks are simultaneously replaced by new disks. Our more accurate disk models and simulations have yielded predictions of system lifetimes that are more pessimistic than those of existing models that assume a constant disk failure rate. Thus, larger system scale requires designers to take disk infant mortality into account.
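The “bathtub curve” the abstract relies on is easy to picture with a toy hazard-rate function. The shape (decaying infant term, flat base rate, late wear-out ramp) follows the text, but every constant below is an invented placeholder, not a fitted model from the paper.

```python
from math import exp

def bathtub_hazard(age_hours: float,
                   infant_rate: float = 3e-5, infant_decay: float = 1 / 2000,
                   base_rate: float = 5e-6,
                   wearout_start: float = 43_800, wearout_slope: float = 1e-10) -> float:
    """Per-hour failure rate: elevated for young drives (infant mortality),
    roughly constant over the design life, rising again as drives wear out."""
    infant = infant_rate * exp(-infant_decay * age_hours)           # early failures fade out
    wearout = wearout_slope * max(0.0, age_hours - wearout_start)   # ramp after ~5 years
    return base_rate + infant + wearout

if __name__ == "__main__":
    # A cohort of freshly replaced disks sits at the left of the bathtub,
    # which is why batch replacements briefly inflate the system failure rate.
    for h in (0, 500, 2000, 10_000, 43_800, 70_000):
        print(f"{h:6d} h: {bathtub_hazard(h):.2e} /h")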
Providing high reliability in a minimum redundancy archival storage system
- In Proc. of the 14th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
, 2006
"... Inter-file compression techniques store files as sets of references to data objects or chunks that can be shared among many files. While these techniques can achieve much better compression ratios than conventional intra-file compression methods such as Lempel-Ziv compression, they also reduce the r ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
Abstract: Inter-file compression techniques store files as sets of references to data objects or chunks that can be shared among many files. While these techniques can achieve much better compression ratios than conventional intra-file compression methods such as Lempel-Ziv compression, they also reduce the reliability of the storage system because the loss of a few critical chunks can lead to the loss of many files. We show how to eliminate this problem by choosing for each chunk a replication level that is a function of the amount of data that would be lost if that chunk were lost. Experiments using actual archival data show that our technique can achieve significantly higher robustness than a conventional approach combining data mirroring and intra-file compression while requiring about half the storage space.
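A minimal sketch of the core idea, assuming a whole file becomes unrecoverable when any of its chunks is lost; the tiered thresholds, sizes, and function names are illustrative choices, not the paper's actual policy.

```python
from collections import defaultdict

def replication_levels(file_chunks: dict, file_sizes: dict) -> dict:
    """Give each chunk a replica count that grows with the amount of file data
    that would become unrecoverable if that single chunk were lost."""
    at_risk = defaultdict(int)
    for fname, chunks in file_chunks.items():
        for chunk in set(chunks):
            at_risk[chunk] += file_sizes[fname]   # losing the chunk loses the whole file

    def level(loss_bytes: int) -> int:            # tiered heuristic with placeholder values
        if loss_bytes >= 1_000_000_000: return 4
        if loss_bytes >= 100_000_000:   return 3
        if loss_bytes >= 10_000_000:    return 2
        return 1

    return {chunk: level(loss) for chunk, loss in at_risk.items()}

if __name__ == "__main__":
    files = {"a.tar": ["c1", "c2"], "b.tar": ["c2", "c3"], "c.tar": ["c3"]}
    sizes = {"a.tar": 80_000_000, "b.tar": 30_000_000, "c.tar": 1_000_000}
    print(replication_levels(files, sizes))   # the widely shared chunk c2 gets the most copies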
A Hybrid Approach to Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation
- ACM Trans. on Storage
"... The current parallel storage systemsuse thousandsof inexpensive disks to meet the storage requirement of applications.Dataredundancyand/orcodingareusedtoenhancedataavailability,e.g., Row-diagonalparity (RDP) and EVENODD codes, which are widely used in RAID-6 storage systems, provide data availabilit ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
Abstract: Current parallel storage systems use thousands of inexpensive disks to meet the storage requirements of applications. Data redundancy and/or coding are used to enhance data availability; e.g., row-diagonal parity (RDP) and EVENODD codes, which are widely used in RAID-6 storage systems, provide data availability with up to two disk failures. To reduce the probability of data unavailability, whenever a single disk fails, disk recovery will be carried out. We find that the conventional recovery schemes of RDP and EVENODD codes for a single failed disk only use one parity disk. However, there are two parity disks in the system, and both can be used for single disk failure recovery. In this paper, we propose a hybrid recovery approach which uses both parities for single disk failure recovery, and we design efficient recovery schemes for the RDP code (RDOR-RDP) and the EVENODD code (RDOR-EVENODD). Our recovery scheme has the following attractive properties: (1) “read optimality” in the sense that it issues the smallest number of disk reads to recover a single failed disk, reducing disk reads by approximately 1/4 compared with conventional schemes; (2) the “load balancing property” in that all surviving disks are subjected to the same (or almost the same) amount of additional workload in rebuilding the failed disk. We carry out performance evaluation to quantify the merits of RDOR-RDP and RDOR-EVENODD on some widely used disks with DiskSim. The off-line experimental results show that RDOR-RDP and RDOR-...
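The roughly one-quarter read saving can be put in concrete numbers, complementing the brute-force counter sketched for the RDOR abstract above. The conventional figure, (p-1)^2 symbols per stripe, follows from reading every surviving data and row-parity symbol; the hybrid column simply applies the ~1/4 reduction reported in the abstract rather than the paper's exact formula.

```python
def reads_per_stripe(p: int) -> tuple:
    """Disk reads to rebuild one data disk of an RDP stripe with prime p."""
    conventional = (p - 1) ** 2                  # read every surviving symbol in the stripe
    hybrid = conventional - conventional // 4    # assume the ~25% saving reported above
    return conventional, hybrid

if __name__ == "__main__":
    for p in (5, 7, 11, 13, 17):
        conv, hyb = reads_per_stripe(p)
        print(f"p={p:2d}: conventional {conv:3d} reads, hybrid ~{hyb:3d} reads")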
Improving the availability of supercomputer job input data using temporal replication
"... Supercomputers are stepping into the Peta-scale and Exascale era, wherein handling hundreds of concurrent system failures is an urgent challenge. In particular, storage system failures have been identified as a major source of service interruptions in supercomputers. RAID solutions alone cannot prov ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
(Show Context)
Abstract: Supercomputers are stepping into the petascale and exascale era, wherein handling hundreds of concurrent system failures is an urgent challenge. In particular, storage system failures have been identified as a major source of service interruptions in supercomputers. RAID solutions alone cannot provide sufficient storage protection because (1) average disk recovery time is projected to grow, making RAID groups increasingly vulnerable to additional failures during data reconstruction, and (2) disk-level data protection cannot mask higher-level faults, such as software/hardware failures of entire I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs, whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate “active” job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with both real-cluster experiments and trace-driven simulations. Our results show that temporal replication allows for fast online data reconstruction with a reasonably low overall space and I/O bandwidth overhead.
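A minimal sketch of the scheduler/file-system coordination, assuming a file system that exposes a per-file replica-count knob; FakeFS.set_replicas is an invented stand-in for illustration, not Lustre's actual interface.

```python
class FakeFS(dict):
    """Stand-in parallel file system: path -> current replica count."""
    def set_replicas(self, path: str, count: int) -> None:
        self[path] = count

class TemporalReplicator:
    """Toy coordinator: job input files carry an extra replica only while the
    job that reads them is queued or running, then the space is reclaimed."""

    def __init__(self, fs, extra: int = 1):
        self.fs, self.extra = fs, extra
        self.active = {}                       # job id -> its input file paths

    def on_job_scheduled(self, job_id: str, input_files: list) -> None:
        for path in input_files:
            self.fs.set_replicas(path, 1 + self.extra)   # protect "active" inputs
        self.active[job_id] = list(input_files)

    def on_job_finished(self, job_id: str) -> None:
        for path in self.active.pop(job_id, []):
            self.fs.set_replicas(path, 1)                # inputs are cold again

if __name__ == "__main__":
    fs = FakeFS({"/scratch/in.dat": 1})
    tr = TemporalReplicator(fs)
    tr.on_job_scheduled("job42", ["/scratch/in.dat"])
    print(fs)                                  # an extra copy while the job is active
    tr.on_job_finished("job42")
    print(fs)                                  # back to a single copy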
R-ADMAD: High Reliability Provision for Large-Scale De-duplication Archival Storage Systems
"... Data de-duplication has become a commodity component in dataintensive systems and it is required that these systems provide high reliability comparable to others. Unfortunately, by storing duplicate data chunks just once, de-duped system improves storage utilization at cost of error resilience or re ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
Abstract: Data de-duplication has become a commodity component in data-intensive systems, and these systems are required to provide high reliability comparable to others. Unfortunately, by storing duplicate data chunks just once, a de-duplicated system improves storage utilization at the cost of error resilience and reliability. In this paper, we propose R-ADMAD, a high-reliability provision mechanism. It packs variable-length data chunks into fixed-size objects, exploits ECC codes to encode the objects, and distributes them among the storage nodes of a redundancy group, which is dynamically generated according to the current status and actual failure domains. Upon failures, R-ADMAD performs a distributed and dynamic recovery process. Experimental results show that R-ADMAD can provide the same storage utilization as RAID-like schemes, but reliability comparable to replication-based schemes that use much more redundancy. The average recovery time of R-ADMAD-based configurations is about 2 to 6 times shorter than that of RAID-like schemes. Moreover, R-ADMAD can provide dynamic load balancing even without the involvement of the overloaded storage nodes.
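A minimal sketch of the packing step only, under the assumption that chunks never exceed the fixed object size; the index format is invented for illustration, and the subsequent ECC encoding and redundancy-group placement that R-ADMAD performs are not shown.

```python
def pack_chunks(chunks: list, object_size: int):
    """Pack variable-length chunks into fixed-size, zero-padded objects and
    record where each chunk landed, so the objects can then be erasure-coded."""
    objects, index = [], {}
    current = bytearray()
    for cid, chunk in enumerate(chunks):
        if len(chunk) > object_size:
            raise ValueError("chunk larger than the fixed object size")
        if len(current) + len(chunk) > object_size:            # current object is full
            objects.append(bytes(current).ljust(object_size, b"\0"))
            current = bytearray()
        index[cid] = (len(objects), len(current), len(chunk))  # (object, offset, length)
        current += chunk
    if current:
        objects.append(bytes(current).ljust(object_size, b"\0"))
    return objects, index

if __name__ == "__main__":
    objs, idx = pack_chunks([b"alpha", b"bb", b"a longer chunk", b"xyz"], object_size=16)
    print(len(objs), idx)   # each object is exactly 16 bytes; idx maps chunk id -> placement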