Fault Tolerance, Checkpoints
Abstract: Many scientific problems benefit from computations that are parallel at a coarse grain. Collections of loosely coupled, heterogeneous computers are increasingly being applied to these problems. While individual computers are designed to be relatively reliable, a collection of several autonomous machines necessarily has a greater rate of failure. As data networks improve and larger multicomputers are used, rates of failure will increase. PVM (Parallel Virtual Machine) [Sun90, GS92] is a...
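The abstract's claim that a collection of machines fails more often than any single machine can be illustrated with a small calculation. The sketch below is not from the report: it assumes that each machine fails independently with the same probability p during a run, and the function name prob_any_failure is hypothetical. Under that assumption the chance that at least one of n machines fails is 1 - (1 - p)^n, which grows with n.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n machines fails during a run.

    Assumption (not from the report): machines fail independently,
    each with probability p over the duration of the run.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n machines survive with probability (1 - p)^n;
    # the complement is the chance of at least one failure.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Illustrative per-machine failure probability; chosen arbitrarily.
    p = 0.01
    for n in (1, 4, 16, 64):
        print(n, prob_any_failure(p, n))
```

Real failures are rarely fully independent (shared networks and power are common causes), so this is only a rough model of why larger virtual machines motivate checkpointing.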
Context of citations to this paper:
...recomputation that must be performed. To date, most checkpointing systems for long running distributed memory computations (e.g. [1, 5, 6, 18, 26, 29, 32]) are based on coordinated checkpointing [11]. At each checkpoint, the global state of all the processors is defined and...
...is itself responsible for making this data persistent. Several other facilities for user level checkpointing have been implemented [4, 8, 11, 13]. Typically, the fork() UNIX system call is used to periodically create a snapshot of the task's image, allowing the application...
Cited by:
Fault-Tolerant Execution of Computationally and Storage.. - Smith, Shrivastava (1995)
Asynchronous Checkpointing for PVM Requires Message-Logging - Kevin Skadron April
Transparent Orthogonal Checkpointing Through User-Level.. - Skoglund, Ceelen, Liedtke (2000)
Active bibliography (related documents):
0.5: Checkpointing with Multicast Communication - Lumpp, Jr., Dieter (1998)
0.5: An Application-Oriented Toolkit for Highly Available Distributed.. - Leon (1995)
0.1: Fault Tolerance and Scalability in DSM Coherence Protocols - A.. - Shah (1997)
Similar documents based on text:
0.1: Midway: Shared Memory Parallel Programming with Entry.. - Bershad, Zekauskas (1991)
0.1: Constructive Decomposition of Functions of Finite Central.. - Lakey Department
0.0: Automatic Mapping of Task and Data Parallel Programs for.. - Jaspal Subhlok (1993)
Related documents from co-citation:
13: PVM: A framework for parallel distributed computing - Sunderam - 1990
12: The Performance of Consistent Checkpointing - Elnozahy, Johnson et al. - 1992
10: MIST: PVM with Transparent Migration and Checkpointing - Casas, Clark et al. - 1995
BibTeX entry:
Juan Leon, Allan Fisher, and Peter Steenkiste. Fail-safe PVM: A portable package for distributed programming with transparent recovery. Technical Report CMU-CS-93-124, Carnegie Mellon University, February 1993. http://citeseer.ist.psu.edu/3289.html
@techreport{ juan93failsafe,
author = "Juan Le{\'o}n and Allan L. Fisher and Peter Steenkiste",
title = "{Fail-safe PVM: A Portable Package for Distributed Programming with Transparent Recovery}",
institution = "School of Computer Science, Carnegie Mellon University",
address = "Pittsburgh, Pennsylvania/U.S.A.",
number = "CMU-CS-93-124",
year = "1993",
month = feb,
url = "http://citeseer.ist.psu.edu/3289.html" }
Citations (may not include all citations):
2732: Communicating sequential processes - Hoare - 1978
917: Time, clocks, and the ordering of events in a distributed system - Lamport - 1978
587: PVM: A framework for parallel distributed computing - Sunderam - 1990
572: Distributed snapshots: Determining global states of distribu.. - Chandy, Lamport - 1985
566: Condor - a hunter of idle workstations - Litzkow, Livny et al. - 1988
293: System structure for software fault tolerance - Randell - 1975
185: Linda and friends - Ahuja, Carriero et al. - 1986
184: Checkpointing and rollback-recovery for distributed systems - Koo, Toueg - 1987
151: Network based concurrent computing on the PVM system - Geist, Sunderam - 1992
113: Midway: shared memory parallel programming with entry consis.. - Bershad, Zekauskas - 1991
101: Supporting checkpointing and process migration outside the u.. - Litzkow, Solomon - 1992
86: Experience with the Condor distributed batch system - Litzkow, Livny - 1990
78: Graphical development tools for network-based concurrent sup.. - Beguelin, Dongarra et al. - 1991
68: A nonstop kernel - Bartlett - 1981
68: ACM Transactions on Computer Systems - Borg, Blau et al. - 1989
50: Parallel programming in Linda - Gelernter, Carriero et al. - 1985
38: A parallelizing compiler for distributed memory parallel com.. - Tseng - 1990
35: The Clouds distributed operating system: Functional descript.. - Dasgupta, LeBlanc et al. - 1989
16: Munin: distributed shared memory using multiprotocol release.. - Bennett, Carter et al. - 1991
11: The performance of consistent checkpointing - Elnozahy, Johnson et al. - 1992
2: Concurrent robust checkpointing and recovery in distributed .. - Leu, Bhargava - 1988
2: No title - idea - 1992