In a communications network, there is a need to provide a high level of service availability for data traffic travelling on a datapath in the network. If there is a problem with a network element, such as a node or a link, the data traffic is re-routed onto an alternate datapath. Because the service availability of each node and link may affect the overall service availability of the network, it is necessary to monitor each node and link for faults in order to maintain a high level of service availability for those nodes and links.
For example, a node comprising a routing switch may be monitored for faults so that its service availability can be maintained at a high level. While providing redundant datapaths within the routing switch partially addresses the issue of maintaining high service availability, it is also desirable to be able to isolate a fault, and to repair or replace any faulty components within the routing switch, so that the redundancy built into the routing switch continues to be fully functional. In the event of faults occurring in both redundant datapaths, the requirement for isolating and replacing a faulty component becomes more urgent.
A fault occurring within a device, such as a routing switch, may not be severe enough to cause the device, or an adjacent link, to fail completely. Rather, the fault may be of such a severity that performance of the device is noticeably or significantly degraded. In such a situation, it is desirable to isolate, and to repair or replace, any failing component or components so that performance of the device is fully restored, and so that more severe faults can be preemptively avoided.
In the prior art, various solutions have been proposed for isolating a datapath fault. One such solution involves a loop-back test in which a test signal is used to test whether a “looped-back” datapath provided within the routing switch is able to successfully complete a transmission of the test signal. A successful test suggests that the datapath is functioning normally. A failed test indicates that the datapath has a fault. However, depending on the configuration of the datapath, it is often not clear which component in the datapath is failing. It may then be necessary to proceed by trial and error, replacing a component and retesting the datapath to see if the fault has been corrected by the replacement. While the source of the fault may eventually be identified through this trial-and-error method, the method can be tedious and time-consuming, potentially resulting in poor service availability. Furthermore, if the fault is intermittent, a retest may pass even though the faulty component is still in place, so replacing each component in turn may not identify the faulty component on the first pass. Thus, the trial-and-error process may need to be repeated.
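By way of illustration only, the limitation of the prior art loop-back test described above can be sketched in a few lines of Python. The sketch is not taken from any actual routing switch. The `Component` class, the function names and the component names ("ingress card", "switching fabric", "egress card") are all hypothetical, and the sketch assumes a single, persistent fault. It shows that the test yields only one pass/fail result for the whole datapath, so that locating the faulty component requires one additional test per replacement.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical datapath component; `healthy` is hidden from the tester."""
    name: str
    healthy: bool = True


def loopback_test(datapath):
    """Return True only if the test signal traverses every component.

    The result is a single pass/fail value: it does not say which
    component dropped or corrupted the test signal.
    """
    return all(c.healthy for c in datapath)


def isolate_by_trial_and_error(datapath):
    """Replace components one at a time, retesting after each replacement.

    Assumes a single, persistent fault. Returns the name of the component
    whose replacement made the test pass (None if no single replacement
    did, or if the datapath already passed), and the number of tests run.
    """
    tests = 1
    if loopback_test(datapath):
        return None, tests
    for i, original in enumerate(datapath):
        datapath[i] = Component(original.name)  # swap in a known-good spare
        tests += 1
        if loopback_test(datapath):
            return original.name, tests
        datapath[i] = original  # swap did not help; restore and continue
    return None, tests


# Hypothetical three-component datapath with a fault in the middle.
path = [
    Component("ingress card"),
    Component("switching fabric", healthy=False),
    Component("egress card"),
]
print(isolate_by_trial_and_error(path))  # ('switching fabric', 3)
```

In this simplified model the number of tests grows with the position of the faulty component in the datapath. An intermittent fault, which the model does not represent, could cause a retest to pass while the faulty component is still installed.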
In another aspect, in devices having redundant datapaths, prior art solutions generally do not provide the capability to test the inactive datapath for faults using a loop-back test when a fault occurs in the active datapath. Thus, if a datapath switchover is being contemplated due to faults occurring in the active datapath, it may not be possible to determine whether the switchover is desirable, since the inactive datapath may be in a worse condition than the active datapath.
Thus, there is a need for an improved system and method of isolating a fault within a device, such as a routing switch, so that the fault can be corrected quickly and service availability of the device can be improved.