Many present-day computer systems require high-availability service for particular applications. Examples of applications in which such service is typically required include hospital systems, telecommunications systems, certain computing applications, certain control applications, and so forth. Further, in applications where high availability is important, redundant elements are often utilized to provide backup capabilities in the event that one element of the system becomes non-operational. For example, it is common to utilize redundant arrangements of central processing units (CPUs) and, in addition, most such arrangements utilize a peer-to-peer relationship, i.e., the CPUs share common peripherals such as, for example, disk storage and communication bus structures.
One such redundancy arrangement which is well known in the art involves the use of a standby processor which is operational but which is not used to provide processing capabilities until a primary or active processor, i.e., the currently operating processor, becomes non-operational. In a system which utilizes such a redundancy arrangement, when the primary or active processor fails, the system switches control to the standby processor and the system continues to operate. The faulty processor is then serviced, either by restarting it to correct an error caused by a transient fault or by replacing it to correct a permanent fault. More specifically, in operation, such a redundancy arrangement requires the system to utilize the standby, or backup, processor if the primary or active processor experiences a loss of processing capability for some predetermined reason. Thus, the switch in control to the backup processor occurs after the system has detected a loss of processing capability, i.e., has detected a problem. Further, to meet predetermined system requirements, the redundancy arrangement must switch from the primary or active processor to the backup processor with minimal, if any, loss of service. A typical redundant arrangement of this kind utilizes "on-line" or "warm" duplexed processors. In such an arrangement, databases associated with the backup processor are constantly updated to ensure readiness for immediate operation whenever a switchover occurs.
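By way of illustration only, the warm-standby arrangement described above may be sketched as follows. The sketch is not drawn from any particular system; the names Processor, DuplexSystem, write and check_and_switch are hypothetical, and the mirroring of every database update to the standby processor is an assumption made to model the "constantly updated" databases.

```python
class Processor:
    """Hypothetical model of one processor in a duplexed pair."""

    def __init__(self, name):
        self.name = name
        self.operational = True
        self.database = {}


class DuplexSystem:
    """Hypothetical warm-standby pair: one active processor, one standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def write(self, key, value):
        # Apply the update on the active side and mirror it to the standby
        # so that the standby is ready for immediate operation.
        self.active.database[key] = value
        self.standby.database[key] = value

    def check_and_switch(self):
        # Control is switched only after a loss of processing capability
        # has been detected on the active processor.
        if not self.active.operational and self.standby.operational:
            self.active, self.standby = self.standby, self.active
            return True
        return False


system = DuplexSystem(Processor("A"), Processor("B"))
system.write("call-17", "connected")
system.active.operational = False  # simulate a fault on processor A
switched = system.check_and_switch()
```

After the simulated fault, processor B is the active processor and still holds the mirrored record, so service continues without loss of the stored data.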
Several fault control schemes presently exist in the art. For example, U.S. Pat. No. 4,371,754 discloses a hierarchical fault recovery system for a telecommunication switching apparatus. The disclosed fault recovery system takes progressively more pervasive steps in an effort to rectify a problem. Included in such steps are: (a) rewriting active memory units from standby memory units; (b) switching between active and standby memory units; and (c) switching CPUs. Additionally, the disclosed fault recovery system will, if required, reload all or part of the source program from disk.
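The general idea of taking progressively more pervasive recovery steps may be sketched as follows. This is a minimal illustration and not the method of the cited patent; the function recover, the helper actions, and the placement of the program reload as the final step are assumptions made for the example.

```python
def recover(steps, fault_cleared):
    """Run recovery steps from least to most pervasive.

    Stops as soon as fault_cleared() reports that the problem is rectified
    and returns the names of the steps that were attempted.
    """
    attempted = []
    for name, action in steps:
        action()
        attempted.append(name)
        if fault_cleared():
            break
    return attempted


# Hypothetical fault that only a CPU switch (or anything stronger) clears.
state = {"faulty": True}


def no_effect():
    pass


def clear_fault():
    state["faulty"] = False


steps = [
    ("rewrite active memory from standby memory", no_effect),
    ("switch active and standby memory units", no_effect),
    ("switch CPUs", clear_fault),
    ("reload program from disk", clear_fault),  # assumed to be the last resort
]

attempted = recover(steps, lambda: not state["faulty"])
```

In this example the first two steps do not clear the fault, the CPU switch does, and the reload from disk is never attempted.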
Further, U.S. Pat. No. 4,635,258 discloses another example of a fault detection apparatus. The disclosed apparatus contains circuitry which, after detecting potential faults, causes a system to reset itself. Additional circuitry limits the number of resets which are permitted to occur within a predetermined time interval.
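Limiting the number of resets permitted within a predetermined time interval may be illustrated in software as follows. The cited apparatus performs this function in circuitry; the sliding-window approach, the class name ResetLimiter, and the values of three resets per sixty seconds are assumptions chosen only for the example.

```python
from collections import deque


class ResetLimiter:
    """Permit at most max_resets resets within any window_seconds interval."""

    def __init__(self, max_resets, window_seconds):
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self._reset_times = deque()

    def request_reset(self, now):
        # Forget resets that have aged out of the interval.
        while self._reset_times and now - self._reset_times[0] >= self.window_seconds:
            self._reset_times.popleft()
        # Refuse the reset if the limit has already been reached.
        if len(self._reset_times) >= self.max_resets:
            return False
        self._reset_times.append(now)
        return True


limiter = ResetLimiter(max_resets=3, window_seconds=60.0)
# Reset requests arriving at these times, in seconds.
decisions = [limiter.request_reset(t) for t in (0.0, 10.0, 20.0, 30.0, 61.0)]
# decisions == [True, True, True, False, True]
```

The fourth request is refused because three resets have already occurred within the interval; the fifth is permitted because the first reset has by then aged out of the interval.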
In addition to the above-mentioned systems disclosing fault isolation and handling, there is the scenario of simultaneous duplex CPU failure and, more specifically, the difficulty of dealing with such an occurrence. In particular, if both CPUs in a redundant computer system become corrupted simultaneously and irreconcilably, certain such computer systems will begin convulsive switchovers, an effect which is referred to in the art as "ping-pong."
However, there is little protection for the failure situation in which an endless loop within the code causes endless switchovers from one processor to the other. Such an endless loop can be caused by corrupted code, or it may result from a defect which was originally in the code and which becomes apparent only when a certain function is performed.
As a result of the above, there is a need for a method and apparatus for preventing endless switchover attempts, i.e., "ping-pong" between redundant processors. In particular, there is a need for such a method and apparatus for use within a redundant telephony switching apparatus wherein "ping-pong" may be caused by corrupted or defective code: (a) which may have resulted from a defect which is produced during program execution; (b) which may be an inherent, original program defect; or (c) which may be corrupted during feature invocation and, as a result, will fail in a code path which is totally disjoint from that of system initialization.