Generally, it is difficult to determine the health and status of software processes in a distributed software system due to the complex inter-relationships and distributed nature of the software environment. Typically, fault detection mechanisms in this environment are either very fine-grained for specific errors (e.g., bus errors) or very coarse-grained for general errors (e.g., protocol timeouts). There are many other fault conditions (e.g., system hangs, priority inversion, scheduler thrashing, and over-burdened queue depths) which are also detrimental to proper system functionality but which are difficult to detect and isolate in distributed software systems.
Typically, a redundant software system is employed to increase the overall availability of the system. When a software fault is detected in one system, the redundant system takes control of the system operations. Generally, three redundancy models are used that vary in cost and complexity. A first model, depicted in FIG. 1, includes four processor domains 100a, 100b, 100c, 100d (referred to generally as 100), grouped into two pairs. Each pair communicates with its own non-redundant comparator function 110a, 110b (referred to generally as 110) that checks the output from each pair separately in a synchronous fashion. Each processor element in each of the processor domains 100 of the pair should generate the same result (the same software is operating with the same data). When a comparator function 110 detects a mismatch in any result, the other pair of processor domains 100 takes over. If the comparator function 110 fails, the other pair of processor domains 100 likewise takes over. Thus both the active processor domains 100 and the comparator function 110 are protected from single points of failure.
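The comparison and takeover behavior of the first model can be illustrated with a minimal sketch. It assumes each processor domain can be modeled as a function of its input data; the names `compare_pair` and `dual_pair_step` are hypothetical and are not taken from FIG. 1.

```python
from typing import Callable, Tuple

# Assumption: a processor domain is modeled as a pure function of its input.
Domain = Callable[[int], int]
Pair = Tuple[Domain, Domain]


def compare_pair(pair: Pair, data: int) -> Tuple[bool, int]:
    """Comparator for one pair: both domains must produce the same result."""
    first = pair[0](data)
    second = pair[1](data)
    return first == second, first


def dual_pair_step(active: Pair, standby: Pair, data: int) -> Tuple[str, int]:
    """Run one synchronous step; on a mismatch the other pair takes over."""
    agreed, result = compare_pair(active, data)
    if agreed:
        return "active", result
    agreed, result = compare_pair(standby, data)
    if agreed:
        return "standby", result
    raise RuntimeError("both pairs reported a mismatch")


# Hypothetical domains: one healthy, one producing a corrupted result.
healthy: Domain = lambda x: x * 2
faulty: Domain = lambda x: x * 2 + 1
```

In this sketch a comparator can only report that its pair disagrees; it cannot tell which of the two domains is wrong, so the whole pair is replaced.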
With reference to FIG. 2, a second redundancy model includes three processor domains 200a, 200b and 200c (referred to generally as 200). The model runs as a single lockstep entity (i.e., each processor domain runs the same code and receives the same data). A comparator function 210 compares the output of all three processor domains 200. If the results of one of the processor domains 200 disagree with the results of the other two processor domains 200, that processor domain 200 is declared faulty and is taken out of service. If the comparator function 210 fails, then one processor domain 200 is taken out of service, but the other two processor domains 200 remain in service.
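The majority comparison of the second model can be sketched as follows. The sketch assumes the outputs of the three processor domains are collected into a dictionary keyed by a domain label; the function name `vote` and the labels are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def vote(outputs: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return the majority result and the domains that disagree with it."""
    counts = Counter(outputs.values())
    value, count = counts.most_common(1)[0]
    if count < 2:
        # All three domains disagree, so no majority exists.
        raise RuntimeError("no two processor domains agree")
    faulty = [name for name, out in outputs.items() if out != value]
    return value, faulty
```

For example, `vote({"200a": 6, "200b": 6, "200c": 7})` returns the agreed result 6 and identifies `"200c"` as the domain to take out of service.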
The third typical redundancy model includes two processor domains, one active and one stand-by. The processor domains may be running in lockstep, or the stand-by processor domain may be continuously updated by state messages from the active processor domain. There is no comparator function because, with only two processor domains, a mismatch would not indicate which processor domain is functioning correctly. Thus, failure is “self-determined” within a processor domain by running a low-level “heartbeat” function or relying on system traps (e.g., bus error timeouts). This model is generally less expensive than the other redundancy models mentioned above. However, the ability to isolate faults is reduced because of the lack of hardware comparator redundancy.
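A self-determined heartbeat check of the kind used in the third model might look like the following sketch. It assumes timestamps are supplied by the caller and that a fixed timeout defines a missed heartbeat; the class `Domain`, its fields, and `select_active` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    """Hypothetical processor domain that tracks its own heartbeat."""
    name: str
    timeout: float          # assumed maximum gap between heartbeats, in seconds
    last_beat: float = 0.0  # time of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat at time `now`."""
        self.last_beat = now

    def self_check(self, now: float) -> bool:
        """Return True while the heartbeat is within the timeout."""
        return (now - self.last_beat) <= self.timeout


def select_active(active: Domain, standby: Domain, now: float) -> Domain:
    """Keep the active domain unless its own heartbeat check fails."""
    return active if active.self_check(now) else standby
```

A check of this kind only notices that the heartbeat itself has stopped. A process that hangs while the low-level heartbeat continues to run would pass `self_check`, which illustrates the reduced fault isolation of this model.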
What is needed is a redundancy scheme capable of providing high availability with an increased sensitivity to process faults within a processor domain.