In high-availability routing switch platforms, a protection system must be able to detect switching errors and perform a switchover to a redundant component, a redundant switch fabric for instance. One widely used specification requires that a switchover occur within 60 ms, including 10 ms for detection and 50 ms to accomplish the actual switchover. In order to meet these requirements, the detection and switching functions are often hardware-driven for certain types of faults and when both redundant components are “healthy”.
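By way of illustration only, the timing budget described above can be expressed as a simple check. The following is a minimal sketch: the 10 ms, 50 ms and 60 ms figures come from the specification mentioned above, while the function name, the timestamp parameters and the returned field names are hypothetical and do not correspond to any particular platform.

```python
# Budget figures taken from the specification described in the text.
DETECTION_BUDGET_MS = 10
SWITCHOVER_BUDGET_MS = 50
TOTAL_BUDGET_MS = DETECTION_BUDGET_MS + SWITCHOVER_BUDGET_MS  # 60 ms overall


def within_budget(fault_ms, detected_ms, switched_ms):
    """Check one protection switching event against the timing budget.

    All three arguments are hypothetical timestamps in milliseconds:
    when the fault occurred, when it was detected, and when the
    switchover to the redundant component completed.
    """
    detection = detected_ms - fault_ms
    switchover = switched_ms - detected_ms
    return {
        "detection_ok": detection <= DETECTION_BUDGET_MS,
        "switchover_ok": switchover <= SWITCHOVER_BUDGET_MS,
        "total_ok": (switched_ms - fault_ms) <= TOTAL_BUDGET_MS,
    }
```

In this sketch, an event may meet the overall 60 ms limit while still exceeding the 10 ms detection portion, which is why the two portions are checked separately.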
The health of switch fabrics in communication equipment may be assessed by a software-driven demerit system, such as described in U.S. patent application Ser. No. 09/963,520, published on Jun. 19, 2003 as Publication No. 2003/0112746, entitled “SYSTEM AND METHOD FOR PROVIDING DETECTION OF FAULTS AND SWITCHING OF FABRICS IN A REDUNDANT-ARCHITECTURE COMMUNICATION SYSTEM”, and incorporated in its entirety herein by reference. This earlier application describes an error analysis and correlation (EAC) system which manages the demerit system. For fault and fabric health conditions not covered by hardware-driven functions, a fabric redundancy system (FRED) initiates fabric switchovers based on input from the error detection sub-systems such as FabMon and EAC.
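For illustration, a software-driven health comparison of the general kind described above might be sketched as follows. This is not the method of the referenced application: the class name, the fault labels, the point values and the rule of switching whenever the active fabric has accumulated more demerits than the standby fabric are all assumptions made solely for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FabricHealth:
    """Hypothetical per-fabric demerit record (illustrative only)."""
    name: str
    # Maps an assumed fault label to the demerit points charged for it.
    demerits: dict = field(default_factory=dict)

    def total(self) -> int:
        # Overall health score: a higher total means a less healthy fabric.
        return sum(self.demerits.values())


def should_switch(active: FabricHealth, standby: FabricHealth) -> bool:
    """Assumed rule: switch only if the standby fabric is strictly healthier."""
    return active.total() > standby.total()
```

Under this assumed rule, a tie in demerit totals leaves the active fabric in place, which avoids needless switchovers between equally healthy fabrics.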
Since fabric switchovers, and generally other protection switching operations in communication equipment, can be initiated by user-driven, hardware-driven and software-driven functions, it can be difficult for an operator to determine the cause of a switchover when multiple events that could have caused the switchover occur closely in time. For example, because a hardware-driven switchover occurs so quickly, events which are not reported until after the switchover has occurred must be considered in determining the cause. Also, because of the many factors that affect the health of a switch fabric or other components in a protection system, a seemingly harmless operator action could affect the health of an active fabric or components and result in a software-driven switchover.
Typically, alarms are raised by a protection system when a protection switching operation has occurred. These alarms alert an operator to the protection switching operation and provide an indication of the effect of the protection switching operation (e.g., that a particular switch fabric is now active), a time at which the protection switching operation occurred, and sometimes a general cause of the protection switching operation (i.e., whether the protection switching operation was user-, hardware- or software-driven).
Alarms generally do not provide sufficient information to allow an operator to accurately determine the reason for or the conditions which led to a protection switching operation. Although further information might be available through a maintenance interface to the protection system or communication equipment in which the protection system is implemented, this further information is normally updated, and thus lost, as the state of redundant components in the protection system changes. These difficulties in determining the cause of protection switchovers often necessitate the involvement of design personnel for product support.
Simply adding more alarms to detail protection switchovers would not necessarily overcome the above challenges. Due to the real-time nature of alarm reporting and switchovers, for example, all of the information about a switchover might not be available at the time of the switchover.