Abstract
In the world of distributed processing it is desirable that the failure of one or more processes does not affect all the other processes. For long-running applications, it is necessary to provide fault-tolerance, to complete the application in a reasonable amount of time, in spite of failures. We would like to be able to deal with failures with minimum loss. This is achieved by rollback recovery which in turn is made possible by checkpointing. The purpose of this thesis is to come up with an efficient algorithm for taking checkpoints with lower overheads. A new checkpointing scheme, called Consistent Logical Checkpointing, is imple mented and compared to Chandy and Lamport Algorithm. Consistent Logical Checkpointing increases efficiency by staggering the checkpoints so that the contention for writing to the stable storage is minimized. The first phase of the algorithm consists of individual physical checkpoints by all processes. Message logging is done in the second phase to achieve a consistent recovery line. The reason why we obtain a consistent global state by logging messages is that we assume the processes to be deterministic. A deterministic process is one whose state at any point in time depends only upon its initial state and the messages received by it. The major result of this thesis is that Consistent Logical Checkpointing is shown to give better checkpoint overheads as compared to Chandy and Lamport algorithm for the different types of applications presented. A few variations to improve the scheme are discussed. And adaptive algorithm to choose from the different variation, using heuristics, is presented. This algorithm depends upon various factors such as the checkpoint size, the number of messages to be logged, etc., since they formulate the heuristics for the application.
Kaul, Surbhi (1995). Evaluation of Consistent Logical Checkpointing. Master's thesis, Texas A&M University. Available electronically from
https : / /hdl .handle .net /1969 .1 /ETD -TAMU -1995 -THESIS -K38.