The present invention relates generally to the field of fault management, and, more particularly, to a system and method for filtering redundant alarm messages or alarm state transitions that do not convey useful or necessary fault information.
Communication networks, which comprise a complex combination of electronic hardware systems and software programs, can be vulnerable to faults in equipment and transport media. A fault can generally be defined as a persistent condition in a component (e.g., hardware and/or software) that prevents the component from performing its function. These faults include hardware malfunctions as well as program and data errors. To cope with these faults or failure events when they occur, communication networks include a fault management subsystem that is responsible for the preservation and restoration of service in the presence of faults.
One aspect of fault management is known as alarm surveillance. An alarm is an adverse event that signifies a detected failure or fault in some aspect of the communication network. The alarm may be brought to the attention of a person responsible for taking remedial action, or may trigger an automated diagnostic or maintenance capability to run a test on the faulty component or take the faulty component out of service. Alternatively, the alarm may be simply recorded for analysis at a later time.
Alarms can be used to signal a variety of types of failure events. For example, one classification of failure events is known as functional failure events. These types of faults are specific to an externally visible feature and include such examples as loss of a line signal and protocol errors between two units remote from one another. A second classification of failure events corresponds to hardware faults, which are generally detected by specific circuit checks. A third classification of failure events corresponds to software faults. Examples of software faults include detection of illegal commands, process timeouts due to lack of response from another process or unit, audit errors due to database inconsistencies, and assertions resulting from defensive program checks.
Alarm messages are processed by the fault management subsystem in two ways. The first, referred to as alarm correlation, has the goal of identifying the root cause of each fault. The second, referred to as alarm validation, has the goal of ensuring that the alarm message truly indicates some fault in the system. At first, it would seem that alarm validation should be straightforward: when an alarm message is received, the alarm is automatically validated by running some type of diagnostic on the component that triggered the alarm. While this approach is thorough in ensuring that any fault-generating component is immediately attended to, it is also highly inefficient. Components frequently incur sporadic faults that, while worthy of review at some point in time, do not require immediate attention. If affirmative action is taken for every alarm message in a complex communication network, the performance of the network could be severely degraded as processor time becomes dominated by diagnostic and maintenance activity. Moreover, critical faults could be overshadowed by large numbers of redundant alarms.
As part of alarm validation, faults are typically divided into three groups according to their duration: permanent, intermittent, and transient. Permanent faults are those faults that exist in the system until some remedial action is taken. Intermittent faults are those faults that occur in a discontinuous and periodic way causing service degradation or interruption as a result. Transient faults are those faults that momentarily cause a minor degradation in service. Permanent faults typically do not generate an abundance of redundant alarm messages and are therefore relatively easy to validate. On the other hand, intermittent and transient faults can generate numerous alarm messages, many of which are redundant and should be ignored. In addition, intermittent and transient faults may generate a small number of alarm messages indicating only a minor service interruption that does not require any diagnostic or maintenance attention.
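By way of illustration only, the notion of ignoring redundant alarm messages from intermittent or transient faults can be sketched as a simple hold-off filter. This sketch is not the system or method sought herein; the names `Alarm`, `suppress_redundant`, and `holdoff_s`, and the choice of keying on a component and fault type, are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    component: str    # hypothetical identifier of the reporting component
    fault_type: str   # hypothetical label, e.g. "loss_of_signal"
    timestamp: float  # seconds on any consistent clock


def suppress_redundant(alarms: List[Alarm], holdoff_s: float) -> List[Alarm]:
    """Forward an alarm only if no alarm with the same (component, fault_type)
    key was forwarded within the preceding holdoff_s seconds (assumed policy)."""
    last_forwarded: Dict[Tuple[str, str], float] = {}
    forwarded: List[Alarm] = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.component, alarm.fault_type)
        previous = last_forwarded.get(key)
        if previous is None or alarm.timestamp - previous >= holdoff_s:
            forwarded.append(alarm)
            last_forwarded[key] = alarm.timestamp
        # Otherwise the alarm is treated as redundant and dropped.
    return forwarded


if __name__ == "__main__":
    burst = [
        Alarm("line-1", "loss_of_signal", 0.0),
        Alarm("line-1", "loss_of_signal", 1.0),
        Alarm("line-1", "loss_of_signal", 2.0),
        Alarm("line-2", "loss_of_signal", 1.5),
        Alarm("line-1", "loss_of_signal", 12.0),
    ]
    for alarm in suppress_redundant(burst, holdoff_s=10.0):
        print(alarm.component, alarm.timestamp)
```

A fixed hold-off window of this kind is easy to reason about, but it cannot tell a minor service interruption from a recurring one, which is part of why validating intermittent and transient alarms is harder than validating permanent ones.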
Accordingly, what is sought is an improved system and method for validating intermittent and transient alarms that filters out redundant alarm messages or alarm state transitions that do not convey useful or necessary fault information to thereby improve overall system performance.