Results 1 - 4 of 4
Failure Detectors Encapsulate Fairness, 2010
Abstract - Cited by 4 (1 self)
Failure detectors have long been viewed as abstractions for the synchronism present in distributed system models. However, investigations into the exact amount of synchronism encapsulated by a given failure detector have met with limited success. The reason for this is that traditionally, models of partial synchrony are specified with respect to real time, but failure detectors do not encapsulate real time. Instead, we argue that failure detectors encapsulate the fairness in computation and communication. Fairness is a measure of the number of steps executed by one process relative either to the number of steps taken by another process or to the duration for which a message is in transit. We argue that partially synchronous systems are perhaps better specified with fairness constraints (rather than real-time constraints) on computation and communication. We demonstrate the utility of this approach by specifying the weakest system models to implement failure detectors in the Chandra-Toueg hierarchy.
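The fairness view described in this abstract can be illustrated with a toy detector that suspects a peer after the local process takes K steps without observing any step (message) from that peer; no real-time clock is consulted. This is only a minimal sketch of the idea, not the paper's construction, and the class name, the bound K, and the peer names are all hypothetical.

```python
# Hypothetical sketch of a fairness-based detector: suspicion is driven by
# relative step counts, not wall-clock timeouts. The bound k is an assumed
# fairness constraint, not a value from the paper.

class StepFairnessDetector:
    def __init__(self, peers, k):
        self.k = k
        # local steps taken since last hearing from each peer
        self.silent_steps = {p: 0 for p in peers}
        self.suspected = set()

    def on_step(self):
        """Called each time the local process takes a step."""
        for p in self.silent_steps:
            self.silent_steps[p] += 1
            if self.silent_steps[p] > self.k:
                self.suspected.add(p)

    def on_message(self, sender):
        """Hearing from a peer resets its counter and clears suspicion."""
        if sender in self.silent_steps:
            self.silent_steps[sender] = 0
            self.suspected.discard(sender)

d = StepFairnessDetector(["p1", "p2"], k=2)
for _ in range(3):
    d.on_step()          # 3 silent local steps > k=2: suspect both peers
print(sorted(d.suspected))  # ['p1', 'p2']
d.on_message("p1")          # p1 proves it is taking steps
print(sorted(d.suspected))  # ['p2']
```

Because suspicion depends only on step ratios, the same logic works unchanged whether processes run fast or slow in real time, which is the intuition behind specifying partial synchrony via fairness.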
Assessing HPC Failure Detectors for MPI Jobs
Abstract
Reliability is one of the challenges faced by exascale computing. Components are poised to fail during large-scale executions given current mean time between failure (MTBF) projections. To cope with failures, resilience methods have been proposed as explicit or transparent techniques. For the latter techniques, this paper studies the challenge of fault detection. This work contributes a study on generic fault detection capabilities at the MPI level and beyond. The objective is to assess different detectors, which ultimately may or may not be implemented within the application’s runtime layer. A first approach utilizes a periodic liveness check while a second method promotes sporadic checks upon communication activities. The contributions of this paper are two-fold: (a) We provide generic interposing of MPI applications for fault detection. (b) We experimentally compare periodic and sporadic methods for liveness checking. We show that the sporadic approach, even though it imposes lower bandwidth requirements and utilizes lower frequency checking, results in equal or worse application performance than a periodic liveness test for larger numbers of nodes. We further show that performing liveness checks in separation from MPI applications results in lower overhead than interpositioning, as demonstrated by our prototypes. Hence, we promote separate periodic fault detection as the superior approach for fault detection.
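The two detection styles this abstract compares can be sketched side by side: a periodic check probes every peer on a fixed schedule in the background, while a sporadic check tests liveness only when the application communicates. This is a simplified illustration under assumptions, with simulated peers standing in for MPI ranks; none of these names come from the paper.

```python
# Illustrative contrast of periodic vs. sporadic liveness checking.
# Peer, ping(), and the rank names are made-up stand-ins for MPI-level probes.

import threading
import time

class Peer:
    def __init__(self, name):
        self.name = name
        self.alive = True
    def ping(self):
        return self.alive  # stand-in for a real liveness probe

def periodic_check(peers, failed, interval=0.01, rounds=3):
    """Background loop: probe every peer at a fixed interval."""
    for _ in range(rounds):
        for p in peers:
            if not p.ping():
                failed.add(p.name)
        time.sleep(interval)

def sporadic_send(peer, payload, failed):
    """Check liveness only when the application communicates."""
    if not peer.ping():
        failed.add(peer.name)
        return False
    return True  # payload would be delivered here

peers = [Peer("rank1"), Peer("rank2")]
peers[1].alive = False

# Periodic: detection happens with no application traffic at all.
failed = set()
t = threading.Thread(target=periodic_check, args=(peers, failed))
t.start()
t.join()
print(failed)  # {'rank2'}

# Sporadic: detection happens only because we tried to talk to rank2.
failed2 = set()
sporadic_send(peers[0], b"data", failed2)
sporadic_send(peers[1], b"data", failed2)
print(failed2)  # {'rank2'}
```

The trade-off the paper measures follows from this structure: the periodic probe adds constant background traffic but detects failures independently of the application's communication pattern, whereas the sporadic check adds latency on the application's critical communication path.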
Failure Detection within MPI Jobs: Periodic Outperforms Sporadic
Abstract
Reliability is one of the challenges faced by exascale computing. Components are poised to fail during large-scale executions given current mean time between failure (MTBF) projections. To cope with failures, resilience methods have been proposed as explicit or transparent techniques. For the latter techniques, this paper studies the challenge of fault detection. This work contributes generic fault detection capabilities at the MPI level and beyond. A first approach utilizes a periodic liveness check while a second method promotes sporadic checks upon communication activities. The contributions of this paper are two-fold: (a) We provide generic interposing of MPI applications for fault detection. (b) We experimentally compare periodic and sporadic methods for liveness checking. We show that the sporadic approach, even though it imposes lower bandwidth requirements and utilizes lower frequency checking, results in equal or worse application performance than a periodic liveness test for larger numbers of nodes. We further show that performing liveness checks in separation from MPI applications results in lower overhead than interpositioning. Hence, we promote separate periodic fault detection as the superior approach for fault detection.
Failure detection and partial redundancy in HPC, 2011
Abstract
To support the ever-increasing demand of scientific computations, today’s High Performance Computing (HPC) systems have large numbers of computing elements running in parallel. Petascale computers, which are capable of reaching a performance in excess of one PetaFLOPS (10^15 floating point operations per second), are successfully deployed and used at a number of places. Exascale computers with one thousand times the scale and computing power are projected to become available in less than 10 years. Reliability is one of the major challenges faced by exascale computing. With hundreds of thousands of cores, the mean time to failure is measured in minutes or hours instead of days or months. Failures are bound to happen during execution of HPC applications. Current fault recovery techniques focus on reactive ways to mitigate faults. Central to any kind of fault recovery method is the challenge of detecting faults and propagating this knowledge. The first half of this thesis work contributes to fault detection capabilities at the MPI level. We propose two principal types of fault detection mechanisms: the first one uses periodic liveness checks while the second one makes on-demand liveness checks. These two techniques are experimentally compared for the overhead imposed on MPI applications. Checkpoint and restart (CR) recovery is one of the fault recovery methods which is used to …