Telecommunication system providers are driven by user demands for extremely reliable systems that experience little downtime and that can take automatic corrective action without human intervention. The time required to take corrective action for a system fault is typically much longer when a human is involved than when the fault is detected and responded to automatically by the system itself. In fact, faults may not even be readily observable by a human operator. They can range, for example, from a system component or device element that stops working outright or produces a processor interrupt, to something so minor that it is hard to ascertain that anything is wrong at all. Among the latter there is an even more difficult subset in which the device appears to be operating but is not operating correctly.
For example, consider the case where a processor begins to produce incorrect addition results with some non-zero frequency. Now assume that a message or packet is built by the processor. The processor does this by inserting a message code and associated data into the packet: it adds an offset to a logical pointer to the start of the message packet and writes the desired data at the resulting address. The incorrect addition causes a fault that is not apparent and is hard to detect. Instead of being written at the start of the message plus the offset for the message code, the data goes somewhere else. Some other structure is corrupted, and the receiver gets the value that was previously in the intended location rather than the value that was meant to be written. An alternative fault could write the wrong value to the correct address in the message packet. If the message was supposed to do something like report a system fault, the message actually sent is not the correct and intended one. Depending on where the data was sent and what was actually sent, the results can vary from simply dropping a communication session, e.g., a call, to resetting the entire system. As such, one cannot depend on the processor to simply know that a fault has occurred and remedy the situation by not sending the message. Further, even if the processor is aware of the problem, the rest of the system needs to be notified quickly so that back-up hardware can be activated.
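The addressing fault described above can be illustrated with a minimal Python sketch. It is an illustration only, not a model of any particular processor: the names `faulty_add`, `write_message_code`, and `demo`, the memory size, the offset, and the byte values are all hypothetical choices made for the example.

```python
MSG_CODE_OFFSET = 4  # hypothetical offset of the message code within a packet


def faulty_add(a, b, error=0):
    """Stand-in for the processor's adder; a nonzero `error` models a bad sum."""
    return a + b + error


def write_message_code(memory, packet_base, code, error=0):
    """Write `code` at packet_base + MSG_CODE_OFFSET using the (possibly faulty) adder."""
    address = faulty_add(packet_base, MSG_CODE_OFFSET, error)
    memory[address] = code
    return address


def demo(error):
    """Return (message code the receiver sees, value of an unrelated structure)."""
    memory = bytearray(32)                        # hypothetical shared memory
    packet_base = 8
    memory[packet_base + MSG_CODE_OFFSET] = 0x11  # stale value left by an earlier message
    memory[20] = 0xAA                             # unrelated structure elsewhere in memory
    write_message_code(memory, packet_base, 0x7F, error)
    return memory[packet_base + MSG_CODE_OFFSET], memory[20]


correct = demo(error=0)  # adder works: code lands in the packet, neighbor untouched
faulty = demo(error=8)   # adder is off by 8: neighbor is overwritten, packet keeps stale code
```

With a working adder the receiver sees the intended code. With the assumed error of 8, the write lands on the unrelated structure, and the packet goes out carrying the stale value, without the processor raising any interrupt or exception.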
When a fault occurs, it is desirable that the fault be contained within the malfunctioning device as quickly as possible to prevent “contamination” of other devices within the system. It is further desirable that the fault be repaired without breaking this containment. For example, a telecommunication device such as a blade-based switch may experience a failure in which it continuously transmits data packets to remote devices, thereby consuming network transmission resources, i.e., link resources, as well as processing resources at the switch at the other end of the link or at the final destination of the message. In this case, it is desirable to contain the fault to the malfunctioning node and discontinue transmission to the destination node as quickly as possible, so that the destination node does not fail due to overloading or to receiving incorrect messages.
It is certainly desirable for a system to recover from a fault as quickly as possible in order to restore service. Toward that end, recovery can be accomplished by replacing the function of the faulted element with an operational element. Such replacement should not violate the containment of the fault, or else the integrity of the system is unduly put at risk. As such, the architecture of the system should provide a way to quickly determine the presence of the fault without violating the containment. For example, merely sending a message to an external device to notify the system of the fault is not appropriate, because it violates fault containment and could actually spread the failure condition from the faulty node to the rest of the system. It is therefore also desirable to be able to notify other system elements in a manner that does not adversely impact fault containment, so that a back-up blade server can be activated.
In other words, the demands on system operators and equipment designers, especially in the telecommunications equipment industry where compliance with the Advanced Telecommunications Computing Architecture (“ATCA”) can constrain designers, mean that the system has to find out immediately that an element has failed, yet the element cannot transmit anything because of the risk that such a transmission may take the system down. However, current ATCA devices can take three to nine seconds to report a fault from the failed board to the system after the fault has been detected on the card. This delay can add an average of six seconds to the recovery time per board failure. It is therefore a general desire that the system architecture provide some method to firewall the fault and a notification method that does not violate the firewalls.
An example of a system designed to do this implemented an interface on the communication links based on a protocol that sent idle codes between messages. Each message began with a start-of-message code, followed by the message in its defined format, after which the link returned to idle. The protocol was modified to include two idle codes, namely a normal idle code and a fault idle code. The interface chip had a special input pin connected to the circuit board fault detection tree, so that the pin was active when no fault was detected and inactive when there was a fault. When this input was active, the idle code was the normal code and the interface chip would accept new messages. When the input was inactive, any messages in progress were halted and the chip returned to generating the fault idle code. The rest of the system had detectors looking for fault idle codes and maintained two states for each link. Each state was associated with a set of operating characteristics, programmed so that, in essence, everything was blocked in the faulty state, while in the correct state everything was allowed, subject to normal routing and permission states.
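The transmit-side behavior of such an interface chip can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual chip logic: the class `LinkInterface`, the `Symbol` names, and the symbol-by-symbol queue are hypothetical, and the real device operated in hardware on link codes rather than Python objects.

```python
from enum import Enum


class Symbol(Enum):
    """Hypothetical link symbols; the original code values are not specified."""
    NORMAL_IDLE = "normal_idle"
    FAULT_IDLE = "fault_idle"
    START = "start_of_message"
    DATA = "data"


class LinkInterface:
    """Transmit side of the interface, driven by a 'no fault' input pin."""

    def __init__(self):
        self.no_fault_pin = True  # active (True) while the fault tree reports no fault
        self.pending = []         # symbols of the message currently being sent

    def set_no_fault_pin(self, active):
        self.no_fault_pin = active
        if not active:
            self.pending.clear()  # halt any message in progress

    def send(self, payload):
        """Queue a message; refused while the fault input is inactive."""
        if not self.no_fault_pin:
            return False
        self.pending.append((Symbol.START, None))
        self.pending.extend((Symbol.DATA, byte) for byte in payload)
        return True

    def next_symbol(self):
        """Return the next symbol to place on the link."""
        if not self.no_fault_pin:
            return (Symbol.FAULT_IDLE, None)
        if self.pending:
            return self.pending.pop(0)
        return (Symbol.NORMAL_IDLE, None)
```

In this sketch, once the pin goes inactive the faulted element can emit nothing but the fault idle code, so the notification itself carries no data that could spread the fault.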
When a fault idle code was detected on a link, the state machine for that link changed to fault mode and the link was shut down. Once any fault had been detected, only system maintenance software could change the state back to normal mode and bring the element back into service. This arrangement also required extensive fault detection capability on each element. Together, the on-element detectors fed the fault detection tree, and the fault idle code signaled the rest of the system without any violation of the containment. While workable, such an arrangement is expensive and requires the use of customized hardware. With a push toward building reliable communication systems out of stock hardware, the above-mentioned solution is not desirable.
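The receive-side link state machine described above can be sketched in Python as a latch. This is a minimal sketch under stated assumptions: the class `LinkMonitor`, its method names, and the string symbols `"fault_idle"` and `"normal_idle"` are hypothetical, and the per-state operating characteristics are reduced to a single allow-or-block decision.

```python
class LinkMonitor:
    """Per-link detector that latches into fault mode on a fault idle code."""

    NORMAL = "normal"
    FAULT = "fault"

    def __init__(self):
        self.state = self.NORMAL

    def observe(self, symbol):
        """Process one received symbol; a fault idle code latches the fault state."""
        if symbol == "fault_idle":
            self.state = self.FAULT
        # No other symbol, including a normal idle, clears the latch.

    def allows_traffic(self):
        """In the fault state everything on the link is blocked."""
        return self.state == self.NORMAL

    def maintenance_reset(self):
        """Only system maintenance software is assumed to call this."""
        self.state = self.NORMAL


monitor = LinkMonitor()
monitor.observe("fault_idle")    # link is shut down from this point on
monitor.observe("normal_idle")   # the faulted element cannot bring itself back
still_blocked = not monitor.allows_traffic()
```

The latch is the design point: a faulted element that resumes sending normal idle codes stays isolated until maintenance software explicitly restores the link.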
It is desirable to have a system and method that contains faults within an element reliably, provides quick system notification, and allows for rapid resolution of the failure through, for example, the activation of a back-up element.