Abstract
Large multicomputer systems are inherently unreliable because of their enormous complexity. This has a direct impact on distributed computations performed on these systems. As the size and execution time of these distributed computations grow, so does the probability that a hardware failure will cause the computations to fail. This thesis presents a novel architecture for a software-implemented fault-tolerance layer designed to enhance the reliability of distributed computations performed on large multicomputer systems, such as massively parallel computers and distributed computing systems. The objective of this research is to develop the conceptual framework for a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelops user applications, which allows the layer to perform fault-tolerance-related actions apart from, and transparently to, the applications. Its modular design enables dynamic run-time selection of the most appropriate fault-tolerant algorithm, so the layer is not restricted to one particular fault-tolerant method. Performance and coverage measurements of a minimal implementation of the proposed layer are presented; they indicate that user-level software-implemented fault tolerance can be reasonably efficient and effective.
Holland, Gavin D (1995). Software implemented fault-tolerance on distributed-memory MIMD architectures. Master's thesis, Texas A&M University. Available electronically from
https://hdl.handle.net/1969.1/ETD-TAMU-1995-THESIS-H643.