The increased dependence on data processing systems for business, social, economic and political purposes has created a demand for more reliable data processing systems. There are two basic approaches to the design of reliable systems: fault avoidance and fault tolerance. Fault avoidance is achieved by utilizing highly reliable components, thereby reducing the probability of a failure. However, no matter how reliable the components are, a component failure will eventually occur and cause a system failure. Moreover, the cost of such fault avoidance systems increases exponentially with the level of component reliability, making this approach uneconomical for many applications.
A reliable system can also be designed to be fault tolerant, i.e., able to operate normally even in the presence of faults. Assuming the inevitability of failures, sufficient redundancy is built into the system to provide the extra resources needed to keep the system operating after a failure. Many strategies are known for utilizing these redundant resources in order to improve reliability. For example, a failed component can be replaced by a redundant one. Even in the absence of redundant components, fault recovery actions such as retries, attempts to correct the failure, reconfiguration and rollback can be attempted. All of these fault recovery strategies, however, depend upon the ability to accurately detect faults or failures in the system.
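By way of illustration only, the retry and rollback actions mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the names `FaultDetected`, `run_with_recovery` and `make_flaky_operation`, the dictionary-based state and the retry limit are all hypothetical. The sketch shows that recovery begins only when a fault is actually reported, which is why accurate detection is a precondition for every such strategy.

```python
import copy


class FaultDetected(Exception):
    """Hypothetical signal raised when a fault in an operation is detected."""


def run_with_recovery(operation, state, max_retries=3):
    """Apply `operation` to a copy of `state`, retrying on detected faults.

    Returns (new_state, True) on success. If every attempt faults, the saved
    checkpoint is returned unchanged as (checkpoint, False), i.e. a rollback.
    """
    checkpoint = copy.deepcopy(state)  # saved copy used for rollback
    for _ in range(max_retries):
        working = copy.deepcopy(checkpoint)  # each retry starts from clean state
        try:
            operation(working)
            return working, True
        except FaultDetected:
            # Recovery is only possible because the fault was detected.
            continue
    return checkpoint, False


def make_flaky_operation(failures_before_success):
    """Build a hypothetical operation that faults a fixed number of times."""
    remaining = {"count": failures_before_success}

    def operation(state):
        state["balance"] -= 10  # partial update that a fault may leave half-done
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise FaultDetected("transient fault")
        state["committed"] = True

    return operation
```

In this sketch, an operation that faults twice succeeds on the third attempt, whereas one that keeps faulting exhausts the retries and the caller receives the original state back. A fault that raised no `FaultDetected` signal would bypass both actions entirely.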
The problem of detecting failures in a data processing system is greatly magnified when it is desired to detect software failures as well as hardware failures. While monitoring hardware for signs of failure is a well-known technique, there is no comparable prior art for monitoring software components for failures. There is therefore a great need for a reliable and dynamic strategy for detecting failures in a plurality of software processes, whether executing on the same processor or executing on multiple processors in a distributed processing system.