Abstract
Distributed Shared Memory (DSM) is a model for interprocess communication implemented on top of message-passing systems. In this model, processes running on separate hosts access a shared, coherent memory address space, provided by the underlying DSM system, through ordinary read and write operations. Because it avoids the programming complexities of message passing, DSM has become a convenient model to work with. It is a natural extension of parallel programming on uniprocessors to distributed environments. As the number of processors in the system and the running time of the applications executing on it increase, so does the likelihood of processor failure due to machine malfunction, power failure, user error, and similar causes. The benefits of these systems can be achieved only if the system as a whole behaves like a failure-free system. Many algorithms proposed for implementing a reliable DSM require processes to take a checkpoint whenever data is transferred, resulting in high overhead during failure-free execution. We propose a new recoverable DSM algorithm that tolerates multiple node failures and in which the checkpointing interval can be tailored to balance the cost of checkpointing against the recovery time saved by checkpointing often. The technique uses independent checkpointing and tracks dependencies by logging writes together with some additional information about the occurrence of reads. Unlike previous recovery techniques, this one reduces both the message and the logging overheads.
Kanthadai, Sundarrajan S (1996). Recoverable distributed shared memory. Master's thesis, Texas A&M University. Available electronically from
https://hdl.handle.net/1969.1/ETD-TAMU-1996-THESIS-K35.