Abstract: The problem of rollback-recovery in message-passing systems has undergone extensive
study. In this survey, we review rollback-recovery techniques that do not require special
language constructs, and classify them into two primary categories. Checkpoint-based
rollback-recovery relies solely on checkpointed states for system state restoration.
Depending on when checkpoints are taken, existing approaches can be divided into
uncoordinated checkpointing, coordinated checkpointing and...
Cited by:
Unknown - Apport De Recherche
Can we contain Internet worms? - Manuel Costa Jon
Enhancing Software Reliability With Speculative Threads - And The Committee
Similar documents (at the sentence level):
29.3%: A Survey of Rollback-Recovery Protocols in.. - Elnozahy, Alvisi.. (1996)
Active bibliography (related documents):
1.4: Support for Software Interrupts in Log-Based Rollback-Recovery - Slye, Elnozahy (1997)
1.2: On the Use and Implementation of Message Logging - Elnozahy (1994)
1.1: Semantics of Recovery Lines for Backward Recovery in.. - Brzezinski, Helary.. (1995)
Similar documents based on text:
0.6: Minimizing Timestamp Size for Completely Asynchronous.. - Smith, Johnson (1996)
0.5: Guaranteed Deadlock Recovery: Deadlock Resolution with.. - Wang, Merritt.. (1995)
0.4: Fault-Tolerance Using Cache-Coherent Distributed Shared.. - Hecht Kavi Gaede
Related documents from co-citation:
43: Distributed snapshots: Determining global states of distributed systems - Chandy, Lamport - 1985
27: System structure for software fault tolerance - Randell - 1975
25: The Performance of Consistent Checkpointing - Elnozahy, Johnson et al. - 1992
BibTeX entry:
E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of rollback-recovery protocols in message-passing systems. Technical Report CMU-CS-96-181, Carnegie Mellon University, October 1996. http://citeseer.ist.psu.edu/article/elnozahy96survey.html
@misc{ elnozahy96survey,
author = "E. Elnozahy and D. Johnson and Y. Wang",
title = "A survey of rollback-recovery protocols in message-passing systems",
text = "E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of rollback-recovery
protocols in message-passing systems. Technical Report CMU-CS-96-181, Carnegie
Mellon University, October 1996.",
year = "1996",
url = "citeseer.ist.psu.edu/article/elnozahy96survey.html" }
Citations (may not include all citations):
2732: Communicating sequential processes - Hoare - 1978
901: Transaction Processing: Concepts and Techniques - Gray, Reuter - 1993
572: Distributed snapshots: Determining global states of distribu.. - Chandy, Lamport - 1985
476: Implementing remote procedure calls - Birrell, Nelson - 1984
434: Parallel discrete event simulation - Fujimoto - 1990
373: Time, clocks, and the ordering of events in a distributed system - Lamport - 1978
352: Virtual time and global states of distributed systems - Mattern - 1988
293: System structure for software fault tolerance - Randell - 1975
217: Optimistic recovery in distributed systems - Strom, Yemini - 1985
214: Preserving and using context information in interprocess com.. - Peterson, Buchholz et al. - 1989
193: Distributed discrete-event simulation - Misra - 1986
184: Checkpointing and rollback-recovery for distributed systems - Koo, Toueg - 1987
177: Fail-stop processors: An approach to designing fault-toleran.. - Schlichting, Schneider - 1983
163: Debugging parallel programs with Instant Replay - LeBlanc, Mellor-Crummey - 1987
156: Recovery in distributed systems using optimistic message log.. - Johnson, Zwaenepoel - 1990
133: Manetho: Transparent rollback-recovery with low overhead - Elnozahy, Zwaenepoel - 1992
132: Timestamps in message-passing systems that preserve the parti.. - Fidge - 1988
123: Preemptable remote execution facilities in the V-system - Theimer, Lantz et al. - 1985
120: The performance of consistent checkpointing - Elnozahy, Johnson et al. - 1992
117: Libckpt: Transparent checkpointing under Unix - Plank, Beck et al. - 1995
110: Detecting causal relationships in distributed computations: .. - Schwarz, Mattern - 1994
109: Sender-based message logging - Johnson, Zwaenepoel - 1987
101: Supporting checkpointing and process migration outside the u.. - Litzkow, Solomon - 1992
98: Fault Tolerance Principles and Practice - Lee, Anderson - 1990
98: A message system supporting fault-tolerance - Borg, Baumbach et al. - 1983
83: Efficient distributed recovery using message logging - Sistla, Welch - 1989
83: Fault Tolerance in Distributed Systems - Jalote - 1994
74: Publishing: A reliable broadcast communication mechanism - Powell, Presotto - 1983
69: Fault tolerance in concurrent object-oriented software throu.. - Xu, Randell et al. - 1995
68: A NonStop Kernel - Bartlett - 1981
67: Consistent global states of distributed systems: Fundamental.. - Babaoglu, Marzullo - 1993
60: Independent checkpointing and concurrent rollback for recove.. - Bhargava, Lian - 1988
58: Nonblocking and orphan-free message logging protocols - Alvisi, Hoppe et al. - 1993
58: Rollback mechanisms for optimistic distributed simulation sy.. - Gafni - 1988
58: Crash recovery in a distributed data storage system - Lampson, Sturgis - 1979
57: Distributed reset - Arora, Gouda - 1994
57: Software implemented fault tolerance: Technologies and exper.. - Huang, Kintala - 1993
56: Checkpointing and its applications - Wang, Huang et al. - 1995
54: Checkpointing distributed applications on mobile computers - Acharya, Badrinath - 1994
53: Necessary and sufficient conditions for consistent global sn.. - Netzer, Xu - 1995
51: Optimal tracing and replay for debugging message-passing para.. - Netzer, Miller - 1992
48: Recoverable distributed shared virtual memory - Wu, Fuchs - 1990
46: A distributed domino-effect free recovery algorithm - Briatico, Ciuffoletti et al. - 1984
46: Causal distributed breakpoints - Fowler, Zwaenepoel - 1990
46: fault-tolerant distributed systems - Strom, Bacon et al. - 1988
45: On the use and implementation of message logging - Elnozahy, Zwaenepoel - 1994
45: State restoration in systems of communicating processes - Russell - 1980
44: Error recovery in asynchronous systems - Campbell, Randell - 1986
44: A timestamp-based checkpointing protocol for long-lived dist.. - Cristian, Jahanian - 1991
43: Using logging and asynchronous checkpointing to implement re.. - Singhal - 1993
40: Distributed system fault tolerance using message logging and.. - Johnson - 1989
39: Fail-safe PVM: A portable package for distributed programmin.. - Leon, Fisher et al. - 1993
39: Information Processing Letters - Lai, Yang et al. - 1987
38: Efficient transparent optimistic rollback recovery for distr.. - Johnson - 1993
38: Crash recovery with little overhead - Juang, Venkatesan - 1991
37: Lazy checkpoint coordination for bounding rollback propagati.. - Wang, Fuchs - 1993
36: Checkpointing multicomputer applications - Li, Naughton et al. - 1991
34: Global checkpointing for distributed programs - Silva, Silva - 1992
34: Efficient distributed snapshots - Spezialetti, Kearns - 1986
34: Consistent global checkpoints that contain a given set of lo.. - Wang
33: Reduced overhead logging for rollback recovery in distribute.. - Suri, Janssens et al. - 1995
32: Error recovery in multicomputers using global checkpoints - Tamir, Sequin - 1984
32: Transparent fault-tolerance in parallel Orca programs - Kaashoek, Michiels et al. - 1991
32: Optimistic message logging for independent checkpointing in .. - Wang, Fuchs - 1992
32: An efficient implementation of vector clocks - Singhal, Kshemkalyani - 1992
32: Coordinated checkpointing-rollback error recovery for distri.. - Janakiraman, Tamir - 1994
31: Optimal checkpointing and local recording for domino-free ro.. - Venkatesh, Radhakrishnan et al. - 1987
30: Relaxing consistency in recoverable distributed shared memor.. - Janssens, Fuchs - 1993
29: Advanced Concepts in Operating Systems - Singhal, Shivaratri - 1994
28: Using message semantics to reduce rollback in optimistic mes.. - Leong, Agrawal - 1994
28: A low-overhead recovery technique using quasi-synchronous che.. - Manivannan, Singhal - 1996
28: Message logging: Pessimistic - Alvisi, Marzullo - 1995
27: The evolution of the recovery block concept - Randell, Xu - 1995
27: Internet Request For Comments RFC - Postel - 1981
27: How to recover efficiently and asynchronously when optimism .. - Damani, Garg - 1996
27: Efficient Checkpointing on MIMD Architectures - Plank - 1993
26: Fault tolerance under UNIX - Borg, Blau et al. - 1989
26: Approaches to mechanization of the conversation scheme based.. - Kim - 1982
25: on Programming Languages and Systems - Jefferson, Trans - 1985
23: A scheme for coordinated execution of independently designed.. - Kim, You et al. - 1986
23: Fast recovery in distributed shared virtual memory systems - Tam, Hsu - 1990
22: A software instruction counter - Mellor-Crummey, LeBlanc - 1989
22: Adaptive independent checkpointing for reducing rollback pro.. - Xu, Netzer - 1993
22: Transparent recovery of Mach applications - Goldberg, Gopal et al. - 1990
22: Scheduling message processing for reducing rollback propagat.. - Wang, Fuchs - 1992
20: Application-transparent setting of recovery points - Barigazzi, Strigini - 1983
20: Consistent checkpoints of PVM applications - Stellner - 1994
20: CATCH: Compiler-assisted techniques for checkpointing - Li, Fuchs - 1990
20: Manetho: Fault Tolerance in Distributed Systems Using Rollba.. - Elnozahy - 1993
20: Progressive retry for software error recovery in distributed.. - Wang, Huang et al. - 1993
19: Filter: An algorithm for reducing cascaded rollbacks in opti.. - Prakash, Subramanian - 1991
18: Cheap hardware support for software debugging and profiling - Cargill, Locanthi - 1987
18: An efficient checkpointing method for multicomputers with wo.. - Li, Naughton et al. - 1992
18: Converting a swap-based system to do paging in an architectu.. - Babaoglu, Joy - 1981
18: Replay for concurrent nondeterministic shared memory applica.. - Russinovich, Cogswell et al.
18: An efficient protocol for checkpointing recovery in distribu.. - Kim, Park - 1993
18: Error recovery in shared memory multiprocessors using privat.. - Wu, Fuchs et al. - 1990
18: WOLF: A rollback algorithm for optimistic distributed simula.. - Madisetti, Walrand et al. - 1988
17: Supporting nondeterministic execution in fault-tolerant syst.. - Slye, Elnozahy - 1996
17: Some optimal algorithms for decomposed partially ordered set.. - Garg - 1992
16: Experimental evaluation of concurrent checkpointing and roll.. - Bhargava, Lian et al. - 1990
16: Compiler-assisted checkpointing - Beck, Plank et al. - 1994
15: Compiler-assisted static checkpoint insertion - Long, Fuchs et al. - 1992
15: Restoring consistent global states of distributed computatio.. - Goldberg, Gopal et al. - 1991
14: Faster checkpointing with n+1 parity - Plank, Li - 1994
13: Replicated distributed processes in Manetho - Elnozahy, Zwaenepoel - 1992
13: Processor- and memory-based checkpoint and rollback recovery - Bowen, Pradhan - 1993
13: Checkpointing and rollback-recovery in distributed object ba.. - Lin, Ahamad - 1990
13: Network multicomputing using recoverable distributed shared .. - Carter, Cox et al. - 1993
12: On the provision of backward error recovery in production progr.. - Gregory, Knight - 1989
12: Fault tolerant processes - Jalote - 1989
12: Why optimistic message logging has not been used in telecomm.. - Huang, Wang - 1995
12: Cache-aided rollback error recovery - Ahmed, Frazier et al. - 1990
12: Orphan detection - Liskov, Scheifler et al. - 1987
12: Optimistic failure recovery for very large networks - Lowry, Russell et al. - 1991
12: Global states of a distributed system - Fischer, Griffeth et al. - 1982
12: Integrating coherency and recovery in distributed systems - Feeley, Chase et al. - 1994
12: Consistent global checkpoints based on direct dependency tra.. - Wang, Lowry et al. - 1994
12: Checkpointing and rollback recovery in a distributed system .. - Ramanathan, Shin - 1988
11: Quasi-synchronous checkpointing: Models - Manivannan, Singhal - 1996
11: A new linguistic approach to backward error recovery - Gregory, Knight - 1985
11: Using checkpoints to localize the effects of faults in distr.. - Ahamad, Lin - 1989
11: An architecture for tolerating processor failures in shared-.. - Banatre, Gefflaut et al. - 1993
10: Application transparent fault management in fault-tolerant m.. - Russinovich, Segall et al. - 1993
10: rollback in a distributed system using coarse-grained dataflo.. - Cummings, Alkalaj - 1994
10: A decentralized recovery control protocol - Wood - 1981
10: Compiler-assisted memory exclusion for fast checkpointing - Plank, Beck et al. - 1995
10: Adaptive message logging for incremental program replay - Netzer, Xu - 1993
10: The maximum and minimum consistent global checkpoints and th.. - Wang - 1995
10: Fault-tolerant computing based on Mach - Babaoglu - 1990
10: Experimental evaluation of multiprocessor cache-based error .. - Janssens, Fuchs - 1991
10: Message-optimal incremental snapshots - Venkatesan - 1989
10: Reducing interprocessor dependence in recoverable distribute.. - Janssens, Fuchs - 1994
10: Rollback recovery in distributed systems using loosely synch.. - Tong, Kain et al. - 1992
9: Recording distributed snapshots based on causal order of mes.. - Acharya, Badrinath - 1992
9: Ensuring data security and integrity with a fast stable stor.. - Banatre, Banatre et al. - 1988
9: Katholieke Universiteit Leuven - Deconinck, Vounckx et al. - 1993
9: Kitlog: A generic logging service - Ruffin - 1992
8: State restoration in distributed systems - Merlin, Randell - 1978
8: Consistent checkpointing in message passing distributed syst.. - Baldoni, Helary et al. - 1995
8: Survey of backward error recovery techniques for multicomput.. - Deconinck, Vounckx et al. - 1993
8: A non-intrusive checkpointing protocol - Israel, Morris - 1989
8: Atomic actions for fault-tolerance using CSP - Jalote, Campbell - 1986
8: Use of common time base for checkpointing and rollback recov.. - Ramanathan, Shin - 1993
8: Trade-offs in implementing causal message logging protocols - Alvisi, Marzullo - 1996
8: A recoverable object store - Strom, Yemini et al. - 1988
7: File system measurements and their application to the design.. - Bacon - 1991
7: Programmer-transparent coordination of recovering concurrent.. - Kim - 1988
7: Job and process recovery in a UNIX-based operating system - Kingsbury, Kline - 1989
7: Checkpoint space reclamation for uncoordinated checkpointing.. - Wang, Chung et al. - 1995
7: Dynamic recovery schemes for distributed processes - Tsuruoka, Kaneko et al. - 1981
7: Application-transparent error-recovery techniques for multic.. - Frazier, Tamir - 1989
6: Active replication in Delta - Chereque, Powell et al. - 1992
6: When piecewise determinism is almost true - Cohen, Wang et al. - 1995
6: Virtual checkpoints: Architecture and performance - Bowen, Pradhan - 1992
5: An implementation and performance measurement of the progres.. - Suri, Huang et al. - 1995
5: Transparent recovery in distributed systems - Bacon - 1991
5: Implementing a general error recovery mechanism in a distrib.. - Nett, Kroger et al. - 1986
5: On modeling consistent checkpoints and the domino effect in .. - Baldoni, Helary et al. - 1995
5: Application-transparent process-level error recovery for mul.. - Tamir, Frazier - 1989
4: An algorithm for minimizing roll back cost - Hadzilacos - 1982
4: A software fault tolerance platform - Huang, Kintala - 1995
4: Consistent logical checkpointing - Vaidya - 1994
4: Space Reclamation for Uncoordinated Checkpointing in Message.. - Wang - 1993
4: The distributed recovery block scheme - Kim - 1995
4: Sheaved memory: Architectural support for state saving and r.. - Staknis - 1989
4: concurrent checkpointing for parallel programs - Li, Naughton et al. - 1990
4: The inhibition spectrum and the achievement of causal consis.. - Critchlow, Taylor - 1990
4: Error-recovery in multicomputers using asynchronous coordina.. - Tamir, Frazier - 1991
4: Recovery control of communicating processes in a distributed.. - Wood - 1985
3: IEEE Parallel and Distributed Technology - Groselj, minimum - 1993
3: Tight upper bound on useful distributed system checkpoints - Wang, Chung et al. - 1995
3: A highly decentralized implementation model for the Programm.. - Kim, You - 1990
3: Backward error recovery in a UNIX environment - Taylor, Wright - 1986
3: Timestamp-based orphan elimination - Herlihy, McKendry - 1989
3: A discussion of checkpoint restart - Jasper - 1969
3: Repeated global snapshots in asynchronous distributed system.. - Ahuja - 1989
3: Nested dynamic actions - How to solve the fault containment .. - Nett, Weiler - 1994
3: Efficient rollback-recovery technique in distributed computi.. - Chiu, Young - 1996
3: Guaranteed deadlock recovery: Deadlock resolution with rollb.. - Wang, Merritt et al. - 1995
3: The Delta-4 extra performance architecture XPA - Barrett, Hilborne et al. - 1990
3: Transparent optimistic rollback recovery - Johnson, Zwaenepoel - 1991
2: Architecture of fault-tolerant multiprocessor workstations - Banatre, Banatre et al. - 1989
2: An efficient coordinated checkpointing scheme for multicompu.. - Sharma, Pradhan - 1994
2: Fault Tolerance for Clusters of Workstations - Elnozahy - 1994
2: IEEE Technical Committee on Operating Systems Newsletter - Smith, Ioannidis et al. - 1989
2: Rollback based on vector time - Peterson, Kearns - 1993
2: An abstract model of rollback recovery control in distribute.. - Cao, Wang - 1992
2: Completely asynchronous optimistic recovery with minimal roll.. - Smith, Johnson et al. - 1995
1: Minimizing timestamp size for completely asynchronous optimi.. - Smith, Johnson - 1996
1: Department of Computer Science - Appel, system et al. - 1989
1: Some problems with optimistic recovery and their solutions - Lowry, Strom - 1992
1: Fault tolerant distributed computing using atomic send receiv.. - Wojcik, Wojcik - 1990
1: High-level fault tolerance in distributed programs - Seligman, Beguelin - 1994
1: CoCheck: Checkpointing and process migration for MPI - Stellner - 1996
1: Survey of checkpoint and rollback recovery techniques - Bowen, Pradhan - 1991
1: Performance of consistent checkpointing in a modular operatin.. - Muller, Hue et al. - 1994
1: A checkpointing protocol for an entry consistent shared memo.. - Neves, Castro et al. - 1994
1: On the correctness of orphan management algorithms - Herlihy, Lynch et al. - 1992
Documents on the same site (http://www.cs.cmu.edu/~dbj/ft.html):
Output-Driven Distributed Optimistic Message Logging and.. - Johnson, Zwaenepoel (1990)
Distributed System Fault Tolerance Using Message Logging and.. - Johnson (1989)
Network Multicomputing Using Recoverable Distributed .. - Carter, Cox.. (1993)