Detecting Failures in an Asynchronous System That Never Stops Changing
Abstract
This thesis presents an algorithm for detecting failures in dynamic asynchronous distributed systems: environments in which new participants may continually join and old participants may continually leave (a phenomenon called churn), and in which active participants may fail.
Many modern networks exhibit such behavior. In peer-to-peer networks, devices continually join and leave, and peers often remain only long enough to retrieve the data they require. In mobile networks, devices are constantly on the move, so the set of participants changes continually. In both cases, the set of participants is rarely stable for long.
Many problems are unsolvable if the fraction of crashed participants is too large, yet participants will continue to leave the system or crash. To avoid crossing that threshold, it is essential to detect crashed participants and remove them from the system.
The problem of detecting failures has been solved in static and synchronous distributed systems. However, because processes in an asynchronous dynamic distributed system have no global clock, synchronized logical clocks, or other timing information, detecting failures in such systems is hard.
We propose a failure detector for an asynchronous system with churn that exploits the dynamic nature of the system to estimate elapsed time. We design an algorithm to detect failed processes in such a system and prove that if our failure detector identifies a process as crashed, it has indeed failed. We also prove that if the churn continues forever, then under certain circumstances every failed process is eventually identified as crashed.
Keywords
failure detection, distributed systems, asynchronous systems, churn