Transparent process rollback recovery: some new techniques and a portable implementation
Date
1995
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Texas A&M University
Abstract
Processes in a distributed system can be made transparently recoverable through the use of process checkpointing, which periodically introduces a relatively large but temporary overhead, or message logging, which introduces a smaller overhead for every message sent, and requires process execution to be deterministic. Research prototypes have shown promising performance results [1, 21, but an implementation has not been readily available. To address that deficiency, this thesis describes the design, implementation, and failure-free performance of a new transparent recovery system for standard Unix workstations, which provides a basis for future experimental work. The system incorporates both coordinated checkpointing and family-based message logging with the logging site technique for reducing message logging overhead when some processes share a common memory address space (recently suggested independently by Vaidya [3) and Alvisi and Marzullo [4]). Performance measurements show the overhead of the consistent checkpointing and family-based message logging implementations to be reasonably small for a representative distributed application. This thesis also presents a new approach for efficient output commit and recovery when some processes are intermittently nondeterrninistic. The approach requires the application program to explicitly mark the beginning and end of each period of nondeterminism. Message logging is used during deterministic execution, but is disabled and replaced with optimistic checkpointing during periods of nondeterminism. The commit algorithm avoids communication with deterministic processes by using causal dependency information that makes a distinction between nondeterministic and deterministic state intervals. Finally, the concept of reactive replication for message logging is introduced. During failure free operation, reactive replication uses a low-overhead protocol that can tolerate only a single simultaneous failure, such as family-based message logging. If a failure occurs, the logged data necessary to recover the failed process is immediately copied to stable storage or to the volatile storage of another processor; a second failure can be tolerated only after the completion of that copy operation. This technique assumes that such a copy operation is faster than the recovery protocol.
Description
Due to the character of the original source materials and the nature of batch digitization, quality control issues may be present in this document. Please report any quality issues you encounter to [email protected], referencing the URI of the item.
Includes bibliographical references.
Issued also on microfiche from Lange Micrographics.
Includes bibliographical references.
Issued also on microfiche from Lange Micrographics.
Keywords
computer science., Major computer science.