Abstract
Recently, a cost-effective and feasible approach to supercomputing by connecting a large number of processors has been offered by hypercube multicomputers. However, a major drawback of such systems is that a single processor failure may destroy the whole network. As the probability of any one or more processors failing in such a complex system is great, building some fault-tolerance feature into them becomes extremely important. In this dissertation, a new effective approach is proposed for achieving fault tolerance in hypercubes. This approach will enable the hypercube to continue working in case of failure in a short time. In particular, three algorithms are presented for reconfiguring faulty hypercubes using spanning trees to achieve communication between the unfaulty processors. The first algorithm uses the completely unbalanced spanning tree (CUST), the second one uses the balanced spanning tree (BST), and the third algorithm is based on using general spanning trees (GST) for reconfiguring the hypercube to avoid faulty nodes. The parallel algorithm developed for reconfiguring faulty hypercubes using GST can be generalized to work for reconfiguring multicomputer networks in general in the presence of faults. These algorithms use, at most, one used link and one unused link for every reconstructed path in the reconfigured hypercube. These reconstruction algorithms can reconfigure around any single fault in 0(l) reconfiguration time with no extra-dilation and will increase the congestion of a link by, at most, one. The algorithms are optimal, in terms of the reconfiguration time, and can be used for reconfiguration of spanning trees, which have been used extensively for communications and broadcasting, in a hypercube in the presence of node failures. Simulation results for the algorithms under multiple faults also are presented. Results show that the algorithms provide close to 100% fault coverage of double and triple faults for hypercubes having a dimension of 10 or more. Fault coverage and congestion results for up to 60 faults having different cube sizes are presented. We also evaluate the improvement of reliability in hypercubes based on these reconfiguration algorithms. With these algorithms, the system's recovery is shown to be faster, improving hypercube reliability. A dependability evaluation study, based on a Markov chain model for a cluster of four hypercube nodes, is presented.
Al-Tawil, K. M. (1994). Reconfiguration algorithms for fault tolerance in hypercubes. Texas A&M University. Texas A&M University. Libraries. Available electronically from
https : / /hdl .handle .net /1969 .1 /200901.