The present invention relates in general to the field of telecommunications switching equipment. More particularly, the invention relates to a system and method for managing faults in a data transmission system.
In any type of data transmission system, the ability to reliably transmit data without interruption is of the utmost importance. Data transmission, however, is always subject to error or faults due to signal integrity problems and/or failure of the physical devices or elements that form the data transmission path. To address these inevitable faults, most data transmission systems will include a subsystem or process by which data and device faults are detected and corrected. Such “fault management” systems are intended to locate and correct system faults in the most efficient manner so that service disruptions are minimized.
Adequate fault management systems must not only be able to detect faults, but also to determine the cause of the fault in order to ensure that the same type of fault does not continue to occur, and also to ensure that it does not cause other types of faults to subsequently occur. To do this the fault must be “isolated” so that the physical device or element responsible for causing the fault can be identified, and the proper steps taken to ensure that the faulty device is repaired and returned to operation. Fault isolation is often achieved by providing fault detection at various points along the data transmission path. For example, if data passes through three separate processing circuits, which are coupled together by separate communication links, fault detection may be provided at each of the three circuits, or may even be provided at multiple points along each of the three circuits. In this manner, when a fault is detected the system can readily determine which circuit or communication link is faulty.
In known fault management systems, each time a fault is detected anywhere in the data transmission system a fault report identifying the fault is generated and forwarded to a centralized fault management node. This central node will then attempt to isolate the fault, and perform the steps necessary to correct the problem. Thus, every fault that is detected is reported, and is individually addressed by this centralized fault management node.
Individually addressing each and every fault, however, is inefficient and has many drawbacks that adversely affect system performance. The system is inefficient because not all faults need to be reported and addressed. Often a single initial fault will spawn many subsequent faults, but if the underlying fault is isolated and corrected, the subsequently spawned faults will correct themselves. For example, a timing fault caused by a defective timing circuit may appear as a data integrity fault at various places along the data transmission path, and be detected as such at each of these places. Thus, one timing fault leads to multiple subsequent faults. Of these detected faults, however, only the very first generated fault report is helpful in isolating and correcting the source of the problem. It is only the initial fault that is critical to isolate and address, and once it is corrected the subsequent resulting faults will be eliminated automatically. Thus, under many circumstances, subsequent fault reports are superfluous, and the processing of these superfluous reports consumes resources of the fault management system that could be better used to address more urgent or more critical fault reports. Accordingly, known fault management systems unnecessarily address each and every fault, and therefore do not provide the most effective manner by which to manage faults.
Accordingly, a need currently exists for a method for managing faults in a data transmission system that is more efficient in managing faults, and that reduces the burden on the fault management system of addressing each fault that is detected.
In accordance with the present invention, a system and method for managing faults is provided in a data transmission system having a data path for transmitting signals containing data, and a plurality of application cards along the data path for processing the signals. The method includes the steps of detecting the occurrence of a first fault of a particular type by one of the application cards, and in response to detecting this fault, generating a fault report for the purpose of identifying the cause of the fault. Next, the generation of subsequent fault reports by the application card that relate to that particular type of fault is prevented until a signal is received that indicates that fault report generation may be reenabled. Subsequent steps may include receiving this signal and reenabling fault report generation, and generating a subsequent fault report in response to detecting a subsequent fault of that particular type. Further steps may also include, in response to detecting the first fault of the particular type, setting a fault status indicator associated with the application card that represents the particular fault type, and, in response to receiving the signal indicating that fault report generation may be reenabled, clearing the fault status indicator.
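By way of illustration only, the method steps described above may be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the class name `ApplicationCard`, the method names `detect_fault` and `reenable`, the fault type strings, and the use of a set as the fault status indicator are all hypothetical and do not appear in the original text.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationCard:
    """Hypothetical model of one application card's fault reporting state."""

    card_id: str
    # Fault status indicators: holds each fault type that is currently latched.
    fault_status: set = field(default_factory=set)
    # Fault reports that would be forwarded for fault isolation.
    sent_reports: list = field(default_factory=list)

    def detect_fault(self, fault_type: str) -> bool:
        """Report the first fault of a type; suppress repeats until reenabled."""
        if fault_type in self.fault_status:
            return False  # reporting for this fault type is suppressed
        self.fault_status.add(fault_type)  # set the fault status indicator
        self.sent_reports.append((self.card_id, fault_type))
        return True

    def reenable(self, fault_type: str) -> None:
        """Handle the signal indicating that report generation may resume."""
        self.fault_status.discard(fault_type)  # clear the fault status indicator


card = ApplicationCard("card-1")
first = card.detect_fault("timing")    # first fault of this type: reported
repeat = card.detect_fault("timing")   # same type again: suppressed
other = card.detect_fault("parity")    # different type: reported
card.reenable("timing")                # reenable signal received
after = card.detect_fault("timing")    # reported again after reenabling
```

In this sketch suppression is tracked per fault type, so a fault of a different type is still reported while the first type remains suppressed.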
In an alternate embodiment of the present invention, the method includes detecting the occurrence of a first fault of a particular type by one of the plurality of application cards and determining a priority level of the detected fault in response to its detection. The application card generates a fault report for the purpose of identifying the cause of the detected fault. Subsequently, the application card prevents the generation of subsequent fault reports relating to faults of the determined priority level and lower until receiving a signal indicating that fault report generation may be reenabled. Subsequent steps may include receiving this signal and reenabling fault report generation, and generating a subsequent fault report in response to detecting a subsequent fault of the determined priority level or lower. Further steps may also include, in response to detecting the first fault of the particular type, setting a fault status indicator associated with the application card that represents the particular fault type, and, in response to receiving the signal indicating that fault report generation may be reenabled, clearing the fault status indicator.
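The priority-based embodiment may likewise be illustrated by a short sketch. Several details here are assumptions made only for the example: the priority table `FAULT_PRIORITY` and its values, the convention that a larger number means a higher priority, the class and method names, and the choice that a single reenable signal clears all suppression. The text above does not specify any of these.

```python
from typing import Optional

# Hypothetical priority table; a larger number means a higher priority (assumed).
FAULT_PRIORITY = {"data_integrity": 1, "link": 2, "timing": 3}


class PriorityCard:
    """Hypothetical application card that suppresses reports by priority level."""

    def __init__(self) -> None:
        # Highest priority level reported so far, or None if nothing is suppressed.
        self.suppressed_up_to: Optional[int] = None
        self.sent_reports: list = []

    def detect_fault(self, fault_type: str) -> bool:
        """Report a fault unless its priority is at or below the suppressed level."""
        priority = FAULT_PRIORITY[fault_type]
        if self.suppressed_up_to is not None and priority <= self.suppressed_up_to:
            return False  # same or lower priority: report suppressed
        self.sent_reports.append((fault_type, priority))
        self.suppressed_up_to = priority
        return True

    def reenable(self) -> None:
        """Handle the signal indicating that report generation may resume."""
        self.suppressed_up_to = None


pcard = PriorityCard()
r_link = pcard.detect_fault("link")             # first fault: reported
r_lower = pcard.detect_fault("data_integrity")  # lower priority: suppressed
r_same = pcard.detect_fault("link")             # same priority: suppressed
r_higher = pcard.detect_fault("timing")         # higher priority: reported
pcard.reenable()                                # reenable signal received
r_after = pcard.detect_fault("data_integrity")  # reported after reenabling
```

Under these assumptions a fault of higher priority than the one already reported is still reported, and it raises the suppression level to its own priority.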
In another embodiment according to the present invention, a fault management system for managing faults in a data transmission system is provided, where the data transmission system includes a data transmission path for transmitting signals containing data, a plurality of application cards along the data path for processing the signals, at least one unit controller for controlling the application cards, and at least one system manager for controlling the at least one unit controller. The system includes application card software residing on the plurality of application cards, unit controller software residing on the at least one unit controller, and system manager software residing on the at least one system manager. The application card software is capable of generating a first fault report in response to detecting that a first fault of a particular type has occurred in the data transmission system, and also of suppressing the generation of subsequent fault reports relating to faults of that particular type until receiving a signal indicating that fault report generation may be reenabled. In one embodiment the fault report is sent to a fault management subroutine within the unit controller software, and the signal indicating that fault report generation may be reenabled is received from the fault manager. In an alternate embodiment the fault report is sent to a fault management subroutine within the system manager software, and the signal indicating that fault report generation is to be reenabled is received from the fault manager.
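The interaction between the application card software and a fault management subroutine in the unit controller software may be sketched as follows. This is an illustrative sketch only. The class names `Card` and `UnitController`, the method names, and the first-in, first-out handling of pending reports are hypothetical, and the isolation and correction steps are omitted because the text above does not describe them.

```python
class UnitController:
    """Hypothetical unit controller with a simple fault management routine."""

    def __init__(self) -> None:
        self.pending: list = []  # fault reports awaiting handling

    def receive_report(self, card: "Card", fault_type: str) -> None:
        self.pending.append((card, fault_type))

    def resolve_next(self) -> tuple:
        """Handle the oldest report, then signal the card to reenable reporting."""
        card, fault_type = self.pending.pop(0)
        # Fault isolation and correction would take place here (omitted).
        card.on_reenable(fault_type)
        return card.card_id, fault_type


class Card:
    """Hypothetical application card that reports to one unit controller."""

    def __init__(self, card_id: str, controller: UnitController) -> None:
        self.card_id = card_id
        self.controller = controller
        self.suppressed: set = set()  # fault types with reporting suppressed

    def on_fault(self, fault_type: str) -> None:
        if fault_type in self.suppressed:
            return  # suppress further reports of this fault type
        self.suppressed.add(fault_type)
        self.controller.receive_report(self, fault_type)

    def on_reenable(self, fault_type: str) -> None:
        self.suppressed.discard(fault_type)


controller = UnitController()
card_a = Card("A", controller)
card_b = Card("B", controller)
card_a.on_fault("timing")
card_a.on_fault("timing")   # suppressed at the card, never reaches the controller
card_b.on_fault("timing")
queued = len(controller.pending)
handled = controller.resolve_next()
```

The same structure would apply if the fault management subroutine resided in the system manager software instead; only the recipient of the report would change.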