
Fail-safe PVM: A portable package for distributed programming with transparent recovery (1993)  (39 citations)
Juan Leon
School of Computer Science, Carnegie Mellon University



Fault Tolerance, Checkpoints

Abstract: Many scientific problems benefit from computations that are parallel at a coarse grain. Collections of loosely coupled, heterogeneous computers are increasingly being applied to these problems. While individual computers are designed to be relatively reliable, a collection of several autonomous machines necessarily has a greater rate of failure. As data networks improve, and larger multicomputers are being used, rates of failure will increase. PVM (Parallel Virtual Machine) [Sun90, GS92] is a...

Context of citations to this paper:

...recomputation that must be performed. To date, most checkpointing systems for long-running distributed memory computations (e.g. [1, 5, 6, 18, 26, 29, 32]) are based on coordinated checkpointing [11]. At each checkpoint, the global state of all the processors is defined and...

...is itself responsible for making this data persistent. Several other facilities for user-level checkpointing have been implemented [4, 8, 11, 13]. Typically, the fork() UNIX system call is used to periodically create a snapshot of the task's image, allowing the application...

Cited by:
Fault-Tolerant Execution of Computationally and Storage.. - Smith, Shrivastava (1995)
Asynchronous Checkpointing for PVM Requires Message-Logging - Kevin Skadron April
Transparent Orthogonal Checkpointing Through User-Level.. - Skoglund, Ceelen, Liedtke (2000)

Active bibliography (related documents):
0.5:   Checkpointing with Multicast Communication - Lumpp, Jr., Dieter (1998)
0.5:   An Application-Oriented Toolkit for Highly Available Distributed.. - Leon (1995)
0.1:   Fault Tolerance and Scalability in DSM Coherence Protocols - A.. - Shah (1997)

Similar documents based on text:
0.1:   Midway: Shared Memory Parallel Programming with Entry.. - Bershad, Zekauskas (1991)
0.1:   Constructive Decomposition of Functions of Finite Central.. - Lakey Department
0.0:   Automatic Mapping of Task and Data Parallel Programs for.. - Jaspal Subhlok (1993)

Related documents from co-citation:
13:   PVM: A framework for parallel distributed computing - Sunderam - 1990
12:   The Performance of Consistent Checkpointing - Elnozahy, Johnson et al. - 1992
10:   MIST: PVM with Transparent Migration and Checkpointing - Casas, Clark et al. - 1995

BibTeX entry:

Juan Leon, Allan Fisher, and Peter Steenkiste. Fail-safe PVM: A portable package for distributed programming with transparent recovery. Technical Report CMU-CS-93-124, Carnegie Mellon University, February 1993. http://citeseer.ist.psu.edu/3289.html

@techreport{ juan93failsafe,
	author = "Juan Le{\'o}n and Allan L. Fisher and Peter Steenkiste",
	title = "{Fail-safe PVM: A Portable package for Distributed Programming with Transparent Recovery}",
	institution = "School of Computer Science, Carnegie Mellon University",
	address = "Pittsburgh, Pennsylvania/U.S.A.",
	number = "CMU-CS-93-124",
	year = "1993",
	month = feb,
	url = "http://citeseer.ist.psu.edu/3289.html" }
Citations (may not include all citations):
2732   Communicating sequential processes - Hoare - 1978
917   Time, clocks, and the ordering of events in a distributed system - Lamport - 1978
587   PVM: A framework for parallel distributed computing - Sunderam - 1990
572   Distributed snapshots: Determining global states of distribu.. - Chandy, Lamport - 1985
566   Condor - a hunter of idle workstations - Litzkow, Livny et al. - 1988
293   System structure for software fault tolerance - Randell - 1975
185   Linda and friends - Ahuja, Carriero et al. - 1986
184   Checkpointing and rollback-recovery for distributed systems - Koo, Toueg - 1987
151   Network based concurrent computing on the PVM system - Geist, Sunderam - 1992
113   Midway: shared memory parallel programming with entry consis.. - Bershad, Zekauskas - 1991
101   Supporting checkpointing and process migration outside the u.. - Litzkow, Solomon - 1992
86   Experience with the Condor distributed batch system - Litzkow, Livny - 1990
78   Graphical development tools for network-based concurrent sup.. - Beguelin, Dongarra et al. - 1991
68   A nonstop kernel - Bartlett - 1981
68   Fault tolerance under UNIX (ACM Transactions on Computer Systems) - Borg, Blau et al. - 1989
50   Parallel programming in Linda - Gelernter, Carriero et al. - 1985
38   A parallelizing compiler for distributed memory parallel com.. - Tseng - 1990
35   The Clouds distributed operating system: Functional descript.. - Dasgupta, LeBlanc et al. - 1989
16   Munin: distributed shared memory using multiprotocol release.. - Bennett, Carter et al. - 1991
11   The performance of consistent checkpointing - Elnozahy, Johnson et al. - 1992
2   Concurrent robust checkpointing and recovery in distributed .. - Leu, Bhargava - 1988
2   No title - idea - 1992







CiteSeer.IST - Copyright Penn State and NEC