
A Survey of Rollback-Recovery Protocols in Message-Passing Systems (1996)  (180 citations)
E.N. Elnozahy, D.B. Johnson, Y.M. Wang




 
View or download:
cmu.edu/~dbj/ftp/CMUCS96181.ps.gz
utah.edu/~cs606/pa...olerancesurvey.ps
arirang.snu.ac.kr/~woojeon...survey1.ps

Abstract: The problem of rollback-recovery in message-passing systems has undergone extensive study. In this survey, we review rollback-recovery techniques that do not require special language constructs, and classify them into two primary categories. Checkpoint-based rollback-recovery relies solely on checkpointed states for system state restoration. Depending on when checkpoints are taken, existing approaches can be divided into uncoordinated checkpointing, coordinated checkpointing and...

Cited by:
Unknown - Apport De Recherche
Can we contain Internet worms? - Manuel Costa Jon
Enhancing Software Reliability With Speculative Threads - And The Committee

Similar documents (at the sentence level):
29.3%:   A Survey of Rollback-Recovery Protocols in.. - Elnozahy, Alvisi.. (1996)

Active bibliography (related documents):
1.4:   Support for Software Interrupts in Log-Based Rollback-Recovery - Slye, Elnozahy (1997)
1.2:   On the Use and Implementation of Message Logging - Elnozahy (1994)
1.1:   Semantics of Recovery Lines for Backward Recovery in.. - Brzezinski, Helary.. (1995)

Similar documents based on text:
0.6:   Minimizing Timestamp Size for Completely Asynchronous.. - Smith, Johnson (1996)
0.5:   Guaranteed Deadlock Recovery: Deadlock Resolution with.. - Wang, Merritt.. (1995)
0.4:   Fault-Tolerance Using Cache-Coherent Distributed Shared.. - Hecht Kavi Gaede

Related documents from co-citation:
43:   Distributed snapshots: Determining global states of distributed systems - Chandy, Lamport - 1985
27:   System structure for software fault tolerance - Randell - 1975
25:   The Performance of Consistent Checkpointing - Elnozahy, Johnson et al. - 1992

BibTeX entry:

E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of rollback-recovery protocols in message-passing systems. Technical Report CMU-CS-96-181, Carnegie Mellon University, October 1996. http://citeseer.ist.psu.edu/article/elnozahy96survey.html

@misc{ elnozahy96survey,
  author = "E. Elnozahy and D. Johnson and Y. Wang",
  title = "A survey of rollback-recovery protocols in message-passing systems",
  text = "E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of rollback-recovery
    protocols in message-passing systems. Technical Report CMU-CS-96-181, Carnegie
    Mellon University, October 1996.",
  year = "1996",
  url = "citeseer.ist.psu.edu/article/elnozahy96survey.html" }
Citations (may not include all citations):
2732   Communicating sequential processes - Hoare - 1978
901   Transaction Processing: Concepts and Techniques - Gray, Reuter - 1993
572   Distributed snapshots: Determining global states of distribu.. - Chandy, Lamport - 1985
476   Implementing remote procedure calls - Birrell, Nelson - 1984
434   Parallel discrete event simulation - Fujimoto - 1990
373   Time, clocks, and the ordering of events in a distributed system - Lamport - 1978
352   Virtual time and global states of distributed systems - Mattern - 1988
293   System structure for software fault tolerance - Randell - 1975
217   Optimistic recovery in distributed systems - Strom, Yemini - 1985
214   Preserving and using context information in interprocess com.. - Peterson, Buchholz et al. - 1989
193   Distributed discrete-event simulation - Misra - 1986
184   Checkpointing and rollback-recovery for distributed systems - Koo, Toueg - 1987
177   Fail-stop processors: An approach to designing fault-toleran.. - Schlichting, Schneider - 1983
163   Debugging parallel programs with Instant Replay - LeBlanc, Mellor-Crummey - 1987
156   Recovery in distributed systems using optimistic message log.. - Johnson, Zwaenepoel - 1990
133   Manetho: Transparent rollback-recovery with low overhead - Elnozahy, Zwaenepoel - 1992
132   Timestamps in message-passing systems that preserve the parti.. - Fidge - 1988
123   Preemptable remote execution facilities in the V-system - Theimer, Lantz et al. - 1985
120   The performance of consistent checkpointing - Elnozahy, Johnson et al. - 1992
117   Libckpt: Transparent checkpointing under Unix - Plank, Beck et al. - 1995
110   Detecting causal relationships in distributed computations: .. - Schwarz, Mattern - 1994
109   Sender-based message logging - Johnson, Zwaenepoel - 1987
101   Supporting checkpointing and process migration outside the u.. - Litzkow, Solomon - 1992
98   Fault Tolerance Principles and Practice - Lee, Anderson - 1990
98   A message system supporting fault-tolerance - Borg, Baumbach et al. - 1983
83   Efficient distributed recovery using message logging - Sistla, Welch - 1989
83   Fault Tolerance in Distributed Systems - Jalote - 1994
74   Publishing: A reliable broadcast communication mechanism - Powell, Presotto - 1983
69   Fault tolerance in concurrent object-oriented software throu.. - Xu, Randell et al. - 1995
68   A NonStop Kernel - Bartlett - 1981
67   Consistent global states of distributed systems: Fundamental.. - Babaoglu, Marzullo - 1993
60   Independent checkpointing and concurrent rollback for recove.. - Bhargava, Lian - 1988
58   Nonblocking and orphan-free message logging protocols - Alvisi, Hoppe et al. - 1993
58   Rollback mechanisms for optimistic distributed simulation sy.. - Gafni - 1988
58   Crash recovery in a distributed data storage system - Lampson, Sturgis - 1979
57   Distributed reset - Arora, Gouda - 1994
57   Software implemented fault tolerance: Technologies and exper.. - Huang, Kintala - 1993
56   Checkpointing and its applications - Wang, Huang et al. - 1995
54   Checkpointing distributed applications on mobile computers - Acharya, Badrinath - 1994
53   Necessary and sufficient conditions for consistent global sn.. - Netzer, Xu - 1995
51   Optimal tracing and replay for debugging message-passing para.. - Netzer, Miller - 1992
48   Recoverable distributed shared virtual memory - Wu, Fuchs - 1990
46   A distributed domino-effect free recovery algorithm - Briatico, Ciuffoletti et al. - 1984
46   Causal distributed breakpoints - Fowler, Zwaenepoel - 1990
46   Volatile logging in n-fault-tolerant distributed systems - Strom, Bacon et al. - 1988
45   On the use and implementation of message logging - Elnozahy, Zwaenepoel - 1994
45   State restoration in systems of communicating processes - Russell - 1980
44   Error recovery in asynchronous systems - Campbell, Randell - 1986
44   A timestamp-based checkpointing protocol for long-lived dist.. - Cristian, Jahanian - 1991
43   Using logging and asynchronous checkpointing to implement re.. - Singhal - 1993
40   Distributed system fault tolerance using message logging and.. - Johnson - 1989
39   Fail-safe PVM: A portable package for distributed programmin.. - Leon, Fisher et al. - 1993
39   On distributed snapshots (Information Processing Letters) - Lai, Yang - 1987
38   Efficient transparent optimistic rollback recovery for distr.. - Johnson - 1993
38   Crash recovery with little overhead - Juang, Venkatesan - 1991
37   Lazy checkpoint coordination for bounding rollback propagati.. - Wang, Fuchs - 1993
36   Checkpointing multicomputer applications - Li, Naughton et al. - 1991
34   Global checkpointing for distributed programs - Silva, Silva - 1992
34   Efficient distributed snapshots - Spezialetti, Kearns - 1986
34   Consistent global checkpoints that contain a given set of lo.. - Wang
33   Reduced overhead logging for rollback recovery in distribute.. - Suri, Janssens et al. - 1995
32   Error recovery in multicomputers using global checkpoints - Tamir, Sequin - 1984
32   Transparent fault-tolerance in parallel Orca programs - Kaashoek, Michiels et al. - 1991
32   Optimistic message logging for independent checkpointing in .. - Wang, Fuchs - 1992
32   An efficient implementation of vector clocks - Singhal, Kshemkalyani - 1992
32   Coordinated checkpointing-rollback error recovery for distri.. - Janakiraman, Tamir - 1994
31   Optimal checkpointing and local recording for domino-free ro.. - Venkatesh, Radhakrishnan et al. - 1987
30   Relaxing consistency in recoverable distributed shared memor.. - Janssens, Fuchs - 1993
29   Advanced Concepts in Operating Systems - Singhal, Shivaratri - 1994
28   Using message semantics to reduce rollback in optimistic mes.. - Leong, Agrawal - 1994
28   A low-overhead recovery technique using quasisynchronous che.. - Manivannan, Singhal - 1996
28   Message logging: Pessimistic, optimistic, and causal - Alvisi, Marzullo - 1995
27   The evolution of the recovery block concept - Randell, Xu - 1995
27   Internet Request For Comments RFC - Postel - 1981
27   How to recover efficiently and asynchronously when optimism .. - Damani, Garg - 1996
27   Efficient Checkpointing on MIMD Architectures - Plank - 1993
26   Fault tolerance under UNIX - Borg, Blau et al. - 1989
26   Approaches to mechanization of the conversation scheme based.. - Kim - 1982
25   Virtual time (ACM Trans. on Programming Languages and Systems) - Jefferson - 1985
23   A scheme for coordinated execution of independently designed.. - Kim, You et al. - 1986
23   Fast recovery in distributed shared virtual memory systems - Tam, Hsu - 1990
22   A software instruction counter - Mellor-Crummey, LeBlanc - 1989
22   Adaptive independent checkpointing for reducing rollback pro.. - Xu, Netzer - 1993
22   Transparent recovery of Mach applications - Goldberg, Gopal et al. - 1990
22   Scheduling message processing for reducing rollback propagat.. - Wang, Fuchs - 1992
20   Application-transparent setting of recovery points - Barigazzi, Strigini - 1983
20   Consistent checkpoints of PVM applications - Stellner - 1994
20   CATCH: Compiler-assisted techniques for checkpointing - Li, Fuchs - 1990
20   Manetho: Fault Tolerance in Distributed Systems Using Rollba.. - Elnozahy - 1993
20   Progressive retry for software error recovery in distributed.. - Wang, Huang et al. - 1993
19   Filter: An algorithm for reducing cascaded rollbacks in opti.. - Prakash, Subramanian - 1991
18   Cheap hardware support for software debugging and profiling - Cargill, Locanthi - 1987
18   An efficient checkpointing method for multicomputers with wo.. - Li, Naughton et al. - 1992
18   Converting a swap-based system to do paging in an architectu.. - Babaoglu, Joy - 1981
18   Replay for concurrent nondeterministic shared memory applica.. - Russinovich, Cogswell et al.
18   An efficient protocol for checkpointing recovery in distribu.. - Kim, Park - 1993
18   Error recovery in shared memory multiprocessors using privat.. - Wu, Fuchs et al. - 1990
18   WOLF: A rollback algorithm for optimistic distributed simula.. - Madisetti, Walrand et al. - 1988
17   Supporting nondeterministic execution in fault-tolerant syst.. - Slye, Elnozahy - 1996
17   Some optimal algorithms for decomposed partially ordered set.. - Garg - 1992
16   Experimental evaluation of concurrent checkpointing and roll.. - Bhargava, Lian et al. - 1990
16   Compiler-assisted checkpointing - Beck, Plank et al. - 1994
15   Compiler-assisted static checkpoint insertion - Long, Fuchs et al. - 1992
15   Restoring consistent global states of distributed computatio.. - Goldberg, Gopal et al. - 1991
14   Faster checkpointing with n+1 parity - Plank, Li - 1994
13   Replicated distributed processes in Manetho - Elnozahy, Zwaenepoel - 1992
13   Processor- and memory-based checkpoint and rollback recovery - Bowen, Pradhan - 1993
13   Checkpointing and rollback-recovery in distributed object ba.. - Lin, Ahamad - 1990
13   Network multicomputing using recoverable distributed shared .. - Carter, Cox et al. - 1993
12   On the provision of backward error recovery in production progr.. - Gregory, Knight - 1989
12   Fault tolerant processes - Jalote - 1989
12   Why optimistic message logging has not been used in telecomm.. - Huang, Wang - 1995
12   Cache-aided rollback error recovery - Ahmed, Frazier et al. - 1990
12   Orphan detection - Liskov, Scheifler et al. - 1987
12   Optimistic failure recovery for very large networks - Lowry, Russell et al. - 1991
12   Global states of a distributed system - Fischer, Griffeth et al. - 1982
12   Integrating coherency and recovery in distributed systems - Feeley, Chase et al. - 1994
12   Consistent global checkpoints based on direct dependency tra.. - Wang, Lowry et al. - 1994
12   Checkpointing and rollback recovery in a distributed system .. - Ramanathan, Shin - 1988
11   Quasi-synchronous checkpointing: Models - Manivannan, Singhal - 1996
11   A new linguistic approach to backward error recovery - Gregory, Knight - 1985
11   Using checkpoints to localize the effects of faults in distr.. - Ahamad, Lin - 1989
11   An architecture for tolerating processor failures in shared-.. - Banatre, Gefflaut et al. - 1993
10   Application transparent fault management in fault-tolerant m.. - Russinovich, Segall et al. - 1993
10   Checkpoint/rollback in a distributed system using coarse-grained dataflo.. - Cummings, Alkalaj - 1994
10   A decentralized recovery control protocol - Wood - 1981
10   Compiler-assisted memory exclusion for fast checkpointing - Plank, Beck et al. - 1995
10   Adaptive message logging for incremental program replay - Netzer, Xu - 1993
10   The maximum and minimum consistent global checkpoints and th.. - Wang - 1995
10   Fault-tolerant computing based on Mach - Babaoglu - 1990
10   Experimental evaluation of multiprocessor cache-based error .. - Janssens, Fuchs - 1991
10   Message-optimal incremental snapshots - Venkatesan - 1989
10   Reducing interprocessor dependence in recoverable distribute.. - Janssens, Fuchs - 1994
10   Rollback recovery in distributed systems using loosely synch.. - Tong, Kain et al. - 1992
9   Recording distributed snapshots based on causal order of mes.. - Acharya, Badrinath - 1992
9   Ensuring data security and integrity with a fast stable stor.. - Banatre, Banatre et al. - 1988
9   Katholieke Universiteit Leuven - Deconinck, Vounckx et al. - 1993
9   Kitlog: A generic logging service - Ruffin - 1992
8   State restoration in distributed systems - Merlin, Randell - 1978
8   Consistent checkpointing in message passing distributed syst.. - Baldoni, Helary et al. - 1995
8   Survey of backward error recovery techniques for multicomput.. - Deconinck, Vounckx et al. - 1993
8   A non-intrusive checkpointing protocol - Israel, Morris - 1989
8   Atomic actions for fault-tolerance using CSP - Jalote, Campbell - 1986
8   Use of common time base for checkpointing and rollback recov.. - Ramanathan, Shin - 1993
8   Trade-offs in implementing causal message logging protocols - Alvisi, Marzullo - 1996
8   A recoverable object store - Strom, Yemini et al. - 1988
7   File system measurements and their application to the design.. - Bacon - 1991
7   Programmer-transparent coordination of recovering concurrent.. - Kim - 1988
7   Job and process recovery in a UNIX-based operating system - Kingsbury, Kline - 1989
7   Checkpoint space reclamation for uncoordinated checkpointing.. - Wang, Chung et al. - 1995
7   Dynamic recovery schemes for distributed processes - Tsuruoka, Kaneko et al. - 1981
7   Application-transparent error-recovery techniques for multic.. - Frazier, Tamir - 1989
6   Active replication in Delta - Chereque, Powell et al. - 1992
6   When piecewise determinism is almost true - Cohen, Wang et al. - 1995
6   Virtual checkpoints: Architecture and performance - Bowen, Pradhan - 1992
5   An implementation and performance measurement of the progres.. - Suri, Huang et al. - 1995
5   Transparent recovery in distributed systems - Bacon - 1991
5   Implementing a general error recovery mechanism in a distrib.. - Nett, Kroger et al. - 1986
5   On modeling consistent checkpoints and the domino effect in .. - Baldoni, Helary et al. - 1995
5   Application-transparent process-level error recovery for mul.. - Tamir, Frazier - 1989
4   An algorithm for minimizing roll back cost - Hadzilacos - 1982
4   A software fault tolerance platform - Huang, Kintala - 1995
4   Consistent logical checkpointing - Vaidya - 1994
4   Space Reclamation for Uncoordinated Checkpointing in Message.. - Wang - 1993
4   The distributed recovery block scheme - Kim - 1995
4   Sheaved memory: Architectural support for state saving and r.. - Staknis - 1989
4   Real-time, concurrent checkpointing for parallel programs - Li, Naughton et al. - 1990
4   The inhibition spectrum and the achievement of causal consis.. - Critchlow, Taylor - 1990
4   Error-recovery in multicomputers using asynchronous coordina.. - Tamir, Frazier - 1991
4   Recovery control of communicating processes in a distributed.. - Wood - 1985
3   IEEE Parallel and Distributed Technology - Groselj, minimum - 1993
3   Tight upper bound on useful distributed system checkpoints - Wang, Chung et al. - 1995
3   A highly decentralized implementation model for the Programm.. - Kim, You - 1990
3   Backward error recovery in a UNIX environment - Taylor, Wright - 1986
3   Timestamp-based orphan elimination - Herlihy, McKendry - 1989
3   A discussion of checkpoint restart - Jasper - 1969
3   Repeated global snapshots in asynchronous distributed system.. - Ahuja - 1989
3   Nested dynamic actions - How to solve the fault containment .. - Nett, Weiler - 1994
3   Efficient rollback-recovery technique in distributed computi.. - Chiu, Young - 1996
3   Guaranteed deadlock recovery: Deadlock resolution with rollb.. - Wang, Merritt et al. - 1995
3   The Delta-4 extra performance architecture XPA - Barrett, Hilborne et al. - 1990
3   Transparent optimistic rollback recovery - Johnson, Zwaenepoel - 1991
2   Architecture of fault-tolerant multiprocessor workstations - Banatre, Banatre et al. - 1989
2   An efficient coordinated checkpointing scheme for multicompu.. - Sharma, Pradhan - 1994
2   Fault Tolerance for Clusters of Workstations - Elnozahy - 1994
2   IEEE Technical Committee on Operating Systems Newsletter - Smith, Ioannidis et al. - 1989
2   Rollback based on vector time - Peterson, Kearns - 1993
2   An abstract model of rollback recovery control in distribute.. - Cao, Wang - 1992
2   Completely asynchronous optimistic recovery with minimal roll.. - Smith, Johnson et al. - 1995
1   Minimizing timestamp size for completely asynchronous optimi.. - Smith, Johnson - 1996
1   Department of Computer Science - Appel, system et al. - 1989
1   Some problems with optimistic recovery and their solutions - Lowry, Strom - 1992
1   Fault tolerant distributed computing using atomic send receiv.. - andB, ojcik - 1990
1   High-level fault tolerance in distributed programs - Seligman, Beguelin - 1994
1   CoCheck: Checkpointing and process migration for MPI - Stellner - 1996
1   Survey of checkpoint and rollback recovery techniques - Bowen, Pradhan - 1991
1   Performance of consistent checkpointing in a modular operatin.. - Muller, Hue et al. - 1994
1   A checkpointing protocol for an entry consistent shared memo.. - Neves, Castro et al. - 1994
1   On the correctness of orphan management algorithms - Herlihy, Lynch et al. - 1992





Documents on the same site (http://www.cs.cmu.edu/~dbj/ft.html):
Output-Driven Distributed Optimistic Message Logging and.. - Johnson, Zwaenepoel (1990)
Distributed System Fault Tolerance Using Message Logging and.. - Johnson (1989)
Network Multicomputing Using Recoverable Distributed .. - Carter, Cox.. (1993)
