Results 1 - 10 of 103
PLFS: A checkpoint filesystem for parallel applications, 2009
"... Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, the si ..."
Cited by 87 (21 self)
Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, the sizes of writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log-structured file system, PLFS. PLFS remaps an application’s preferred data layout into one which is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
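The core PLFS idea is a layer of indirection: each process's writes to the shared logical file become appends to a private log plus an index record, so the underlying file system only ever sees large, sequential, non-shared writes. A minimal sketch of that remapping follows; the container layout, index format, and names are our own illustrative assumptions, not PLFS's actual on-disk structures.

```python
# Illustrative log-structured remapping in the spirit of PLFS; the
# container layout and index format are assumptions, not PLFS's format.
import os

class LogStructuredWriter:
    def __init__(self, container, rank):
        os.makedirs(container, exist_ok=True)
        self.data = open(os.path.join(container, f"data.{rank}"), "wb")
        self.index = open(os.path.join(container, f"index.{rank}"), "w")

    def write(self, logical_offset, payload):
        # Small, unaligned writes become large sequential appends.
        log_offset = self.data.tell()
        self.data.write(payload)
        self.index.write(f"{logical_offset} {len(payload)} {log_offset}\n")
        self.data.flush(); self.index.flush()

def read(container, logical_offset, length):
    """Resolve a logical range by scanning the per-process indexes."""
    for name in os.listdir(container):
        if not name.startswith("index."):
            continue
        rank = name.split(".", 1)[1]
        with open(os.path.join(container, name)) as idx:
            for line in idx:
                lo, ln, log_off = map(int, line.split())
                if lo <= logical_offset and logical_offset + length <= lo + ln:
                    with open(os.path.join(container, f"data.{rank}"), "rb") as d:
                        d.seek(log_off + (logical_offset - lo))
                        return d.read(length)
    return None
```

Reads reverse the mapping by consulting the per-process indexes; real PLFS additionally resolves overlapping writes (e.g., by timestamp), which this sketch ignores.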
Evaluating the viability of process replication reliability for exascale systems, in Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage, and Analysis, 2011
"... ABSTRACT As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predi ..."
Cited by 71 (7 self)
As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean time to failures, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
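The trade-off the paper quantifies can be seen in a back-of-the-envelope model: checkpoint/restart wastes time on checkpoints and rework as platform MTBF shrinks with node count, while dual replication halves throughput but stretches the effective MTBF. The sketch below uses the standard Young/Daly interval and a birthday-problem estimate of failures tolerated by replica pairs; these are common first-order approximations, not the paper's far more detailed model or parameters.

```python
# First-order comparison of checkpoint/restart vs. dual replication.
# Young/Daly interval and birthday-problem MTBF estimate are standard
# approximations, not the paper's model. All times are in hours.
import math

def cr_time(work, node_mtbf, nodes, ckpt):
    M = node_mtbf / nodes                      # platform MTBF (exponential)
    t = math.sqrt(2 * ckpt * M)                # Young/Daly optimal interval
    waste = ckpt / t + t / (2 * M)             # checkpoint + rework fraction
    return work / (1 - waste) if waste < 1 else float("inf")

def dual_replication_time(work, node_mtbf, nodes, ckpt):
    pairs = nodes // 2                         # each logical rank on 2 nodes
    # Expected node failures before some pair loses both members
    # (birthday-problem estimate), stretching the effective MTBF.
    failures = math.sqrt(math.pi / 2 * pairs)
    M = failures * node_mtbf / nodes
    t = math.sqrt(2 * ckpt * M)
    waste = ckpt / t + t / (2 * M)
    # Half the nodes do redundant work, so the same job takes ~2x longer.
    return 2 * work / (1 - waste) if waste < 1 else float("inf")

for nodes in (10_000, 100_000, 1_000_000):
    args = (168.0, 5 * 365 * 24.0, nodes, 0.5)  # 1-week job, 5-year node MTBF
    print(nodes, round(cr_time(*args), 1),
          round(dual_replication_time(*args), 1))
```

Even this crude model reproduces the crossover: at small scale replication's 2x resource cost dominates, while at very large scale checkpoint/restart waste explodes and replication wins.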
Proactive Fault Tolerance Using Preemptive Migration, in Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2009
"... Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining ..."
Cited by 18 (6 self)
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.
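The basic pattern being architected is a feedback loop: monitor node health, predict impending failure, and move work to a spare before the failure lands. A toy, self-contained version of that loop is sketched below; the threshold, the sensor stub, and the swap-based "migration" are placeholders standing in for real health monitoring (IPMI, SMART, ...) and process or VM migration, not an API from the paper.

```python
# Toy proactive-FT step: flag at-risk nodes and move their work to
# spares before they fail. All names and thresholds are assumptions.
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    procs: list = field(default_factory=list)

    def health(self):
        # Stand-in for real sensors (IPMI, lm_sensors, SMART, ...).
        return {"temp_c": random.gauss(60, 15)}

def at_risk(node, temp_max=85):
    return node.health()["temp_c"] > temp_max

def proactive_ft_step(active, spares):
    for node in list(active):
        if at_risk(node) and spares:
            target = spares.pop()
            target.procs, node.procs = node.procs, []   # preemptive "migration"
            active.remove(node)
            active.append(target)

active = [Node(f"n{i}", procs=[i]) for i in range(8)]
spares = [Node(f"spare{i}") for i in range(2)]
proactive_ft_step(active, spares)
print([(n.name, n.procs) for n in active])
```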
Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, 2011
"... An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoid-ance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive check-pointing and preven ..."
Cited by 17 (11 self)
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance for fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the nature of the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our results show that each approach is likely to achieve poor utilization for large-scale platforms (e.g., 2^20 nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization in future large-scale platforms will require a combination of techniques.
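To see why utilization collapses at 2^20 nodes unless MTBF is large, consider plain periodic checkpointing with the classic Young/Daly first-order waste model (a simpler baseline than the paper's preventive-measure models, used here only for illustration; the checkpoint cost and MTBF values are assumptions):

```python
# Platform MTBF shrinks linearly with node count while checkpoint cost
# does not, so waste grows until no useful work gets done.
import math

def utilization(node_mtbf_years, nodes, ckpt_seconds):
    M = node_mtbf_years * 365 * 86400 / nodes   # platform MTBF, seconds
    if M <= 2 * ckpt_seconds:                   # checkpointing can't keep up
        return 0.0
    t = math.sqrt(2 * ckpt_seconds * M)         # Young/Daly optimal period
    return max(0.0, 1 - ckpt_seconds / t - t / (2 * M))

for mtbf_years in (5, 25, 125):                 # per-node MTBF
    print(mtbf_years, f"{utilization(mtbf_years, 2**20, 600):.2f}")
```

With a 10-minute checkpoint, a 2^20-node platform does essentially no useful work unless per-node MTBF reaches unrealistic levels, which is the regime where bounding job size or combining techniques becomes necessary.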
Combining partial redundancy and checkpointing for HPC, in International Conference on Distributed Computing Systems, 2012
"... Abstract Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10 15 floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core para ..."
Cited by 14 (3 self)
Today's largest High Performance Computing (HPC) systems exceed one petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is an additional overhead on communication and synchronization. This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resource utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of partial, dual and triple redundancy and show a close fit to the model. We show that combined C/R and redundancy results in shorter overall execution time even for medium-sized HPC applications with 4,000 processes and partial redundancy (a replica for every other process). At 60,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For other configurations, partial redundancy results in the lowest time. Partial redundancy further allows one to trade off additional resource requirements for redundancy against wallclock time, which provides a tuning knob for users to adapt to resource availabilities.
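A compressed version of the modeled trade-off: redundancy divides the ranks available for work but stretches the effective MTBF, which in turn lowers checkpoint/restart waste. The sketch below combines a birthday-style MTBF estimate for dual redundancy with Young/Daly checkpointing; the formulas are standard approximations, not the paper's exact model, and while the 60,000 processes and 128-hour job mirror the abstract's scenario, the other parameters are assumptions.

```python
# Combined C/R + redundancy model sketch; standard approximations only.
# Times in hours.
import math

def time_to_solution(work_h, nodes, node_mtbf_h, ckpt_h, dual):
    if dual:
        pairs = nodes // 2
        failures = math.sqrt(math.pi / 2 * pairs)   # failures until pair loss
        M = failures * node_mtbf_h / nodes          # stretched platform MTBF
        slowdown = 2                                # half the ranks do work
    else:
        M = node_mtbf_h / nodes
        slowdown = 1
    t = math.sqrt(2 * ckpt_h * M)                   # Young/Daly interval
    waste = ckpt_h / t + t / (2 * M)
    return slowdown * work_h / (1 - waste) if waste < 1 else float("inf")

for dual in (False, True):
    print(dual, round(time_to_solution(128, 60_000, 5 * 8760, 0.1, dual), 1))
```

Partial degrees (e.g., a replica for every other process) land between the two cases and trade resource use against wallclock time, the tuning knob the abstract describes.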
Checkpointing vs. Migration for Post-Petascale Machines, 2009
"... We craft a few scenarios for the execution of sequential and parallel jobs on future generation machines. Checkpointing or migration, which technique to choose? 1 ..."
Cited by 11 (2 self)
We craft a few scenarios for the execution of sequential and parallel jobs on future generation machines. Checkpointing or migration, which technique to choose?
Using group replication for resilience on exascale systems, 2012
"... High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers fr ..."
Cited by 11 (7 self)
High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even using an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. Using replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using solely checkpoint-recovery at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.
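The simulation methodology can be miniaturized as a Monte Carlo experiment: g groups each run a full application instance, a checkpoint is taken after each work segment, and a segment counts as done if at least one group survives it. Everything below (Weibull failure parameters, segment bookkeeping) is an illustrative assumption in the spirit of the paper's simulations, not its actual simulator.

```python
# Monte Carlo sketch of group replication with checkpointing. Failed
# groups restart from the last checkpoint; progress is made whenever
# any group survives a segment. Parameters are illustrative.
import random

def run_once(work, period, ckpt, mtbf, groups, shape=0.7):
    def ttf():
        return random.weibullvariate(mtbf, shape)   # shape=1 -> exponential
    clock, done = 0.0, 0.0
    next_fail = [ttf() for _ in range(groups)]
    while done < work:
        seg = min(period, work - done)
        survivors = [i for i in range(groups)
                     if next_fail[i] > clock + seg + ckpt]
        clock += seg + ckpt
        if survivors:                               # someone checkpointed it
            done += seg
        for i in range(groups):
            if next_fail[i] <= clock:               # restart failed group
                next_fail[i] = clock + ttf()
    return clock

for g in (1, 2, 3):
    avg = sum(run_once(1000, 50, 5, 300, g) for _ in range(200)) / 200
    print(g, round(avg, 1))
```

Note the fairness caveat: g groups consume g times the processors, so a real comparison (as in the paper) charges replication for those resources rather than only measuring wallclock time.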
Transparent redundant computing with MPI, in EuroMPI '10: Proceedings of the 17th European MPI Users' Group Meeting, 2010
"... Abstract. Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We descri ..."
Cited by 11 (2 self)
Abstract. Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that provide the ability to tolerate a range of failure scenarios, including loss of application processes and connectivity. We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message passing performance degradation. We propose several enhancements that could lower the overhead of providing resiliency through redundancy.
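One way to make redundancy transparent at the MPI layer is to partition the physical ranks into primaries and replicas and duplicate traffic between the replica sets. The sketch below (using mpi4py for brevity) shows that flavor of approach; the rank mapping and wrapper class are our own assumptions, and the paper's implementations additionally handle failure detection, duplicate suppression, and connectivity loss, which this sketch only gestures at in comments.

```python
# Sketch of transparent duplication at the MPI layer. With r=2,
# physical ranks 0..n-1 are primaries and n..2n-1 their replicas;
# sends are duplicated to every replica of the destination.
from mpi4py import MPI

R = 2  # replication degree (assumption)

class RedundantComm:
    def __init__(self, comm=MPI.COMM_WORLD):
        self.comm = comm
        self.n = comm.Get_size() // R          # number of logical ranks
        self.logical_rank = comm.Get_rank() % self.n

    def replicas(self, logical):
        """Physical ranks backing one logical rank."""
        return [logical + k * self.n for k in range(R)]

    def send(self, obj, dest, tag=0):
        for phys in self.replicas(dest):       # duplicate to all replicas
            self.comm.send(obj, dest=phys, tag=tag)

    def recv(self, source, tag=0):
        # Accept the primary's copy; a real protocol must detect a dead
        # sender, fall back to its replica, and drain duplicate copies.
        return self.comm.recv(source=self.replicas(source)[0], tag=tag)
```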
Evaluation of simple causal message logging for large-scale fault tolerant HPC systems, in 16th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2011), 2011
"... Abstract—The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to few tens of minut ..."
Cited by 9 (8 self)
The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to a few tens of minutes, making the crash of a processor the common case, instead of a rarity. Parallel applications running on those large machines will need to simultaneously survive crashes and maintain high productivity. To achieve that, fault tolerance techniques will have to go beyond checkpoint/restart, which requires all processors to roll back in case of a failure. Incorporating some form of message logging will provide a framework where only a subset of processors are rolled back after a crash. In this paper, we discuss why a simple causal message logging protocol seems a promising alternative to provide fault tolerance in large supercomputers. As opposed to pessimistic message logging, it has low latency overhead, especially in collective communication operations. Besides, it saves messages when more than one thread is running per processor. Finally, we demonstrate that a simple causal message logging protocol has a faster recovery and a low performance penalty when compared to checkpoint/restart. Running NAS Parallel Benchmarks (CG, MG, BT and DT) on 1024 processors, simple causal message logging has a latency overhead below 5%.
Keywords: causal message logging; pessimistic message logging; migratable objects; parallel applications.
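The mechanism that lets only a subset of processors roll back is the determinant: a record of which message was delivered in which order. In causal logging, determinants not yet known to be stable are piggybacked on outgoing messages so that surviving peers can replay a crashed process deterministically. A toy illustration follows; the structures and names are ours, not the paper's protocol.

```python
# Toy determinant piggybacking, the core of causal message logging.
# Determinants not yet stable (checkpointed/acknowledged) ride on
# outgoing messages so peers can replay a crashed process.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Determinant:
    receiver: int
    sender: int
    ssn: int            # sender's sequence number
    rsn: int            # receiver's delivery order

@dataclass
class Process:
    rank: int
    rsn: int = 0
    pending: set = field(default_factory=set)   # own determinants to piggyback
    held: set = field(default_factory=set)      # peers' determinants we store

    def send(self, payload):
        return (self.rank, payload, frozenset(self.pending))

    def deliver(self, msg, ssn):
        sender, payload, piggyback = msg
        self.held |= piggyback                  # causally log peers' determinants
        self.rsn += 1
        self.pending.add(Determinant(self.rank, sender, ssn, self.rsn))
        return payload

p0, p1 = Process(0), Process(1)
p1.deliver(p0.send("work"), ssn=0)
p0.deliver(p1.send("reply"), ssn=0)   # p0 now holds p1's receive determinant
assert any(d.receiver == 1 for d in p0.held)
```

Unlike pessimistic logging, nothing blocks on stable storage before a send, which is where the low latency overhead comes from.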
Byzantine fault-tolerant MapReduce: Faults are not just crashes, in Proceedings of the 3rd IEEE International Conference on Cloud Computing Technology and Science
"... MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can probably corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. ..."
Cited by 9 (7 self)
MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can probably corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. We present a MapReduce algorithm and prototype that tolerate these faults. An experimental evaluation shows that the execution of a job with our algorithms uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would be achieved with the direct application of common Byzantine fault-tolerance paradigms. We believe this cost is acceptable for critical applications that require that level of fault tolerance.
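The "2x instead of 3-4x" result comes from deferring extra executions: run two replicas of each task, compare result digests, and launch additional replicas only on disagreement. A self-contained sketch of that voting loop is below; the digest scheme and executor interface are stand-ins, not Hadoop's or the paper's actual mechanism.

```python
# Deferred-verification voting: two matching replicas suffice in the
# common fault-free case; a third runs only on disagreement.
import hashlib

def digest(result):
    return hashlib.sha256(repr(result).encode()).hexdigest()

def run_task_bft(task, inputs, executors, max_replicas=4):
    """task: deterministic function; executors: callables, possibly faulty."""
    votes = {}
    for k in range(max_replicas):
        out = executors[k % len(executors)](task, inputs)
        d = digest(out)
        votes[d] = votes.get(d, 0) + 1
        if k >= 1 and votes[d] >= 2:   # two matching replicas suffice (f=1)
            return out
    raise RuntimeError("no two replicas agreed")

honest = lambda f, x: f(x)
faulty = lambda f, x: f(x) + 1         # silently corrupts the result
print(run_task_bft(sum, [1, 2, 3], [honest, faulty, honest]))  # -> 6
```

Since most executions are fault-free, the expected cost stays near two replicas per task rather than the 3f+1 of classical BFT replication.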