Data transmission is always subject to errors or failures caused by transmission medium impairments, signal integrity problems and/or failure of equipment along the transmission path. Since the ability to transmit data reliably is of utmost importance, transmission systems are equipped with fault managers, which detect, locate and correct faults so that service disruptions are minimized.
Current fault managers generate a fault report whenever a fault is detected anywhere in the data transmission system, identifying the location and type of the fault and sometimes providing other information about its nature. These reports are called fault logs. The fault manager then processes the logs; the basic processing includes sorting, storage, retrieval and other functions necessary to analyze the logs and isolate the cause of the fault.
As data transmission networks increase in size and capacity, the rate at which these observable events occur also increases, making fault management more complex. As an example, a fault in the network may cause many active calls to clear for the same reason, generating a log for each affected call. Also, if a call attempt fails, new failure logs due to the same problem are generated each time the originator re-attempts to set up the call. In some cases, many thousands of identical failure logs related to the same fault could be generated. Still further, as the fault rate increases with the number of nodes, the traffic generated by faults also increases. In turn, fault propagation may generate still more events.
To deal with the increase in the number and complexity of failure logs resulting from failed calls, modern management systems enable automatic collection and reporting of failures, thereby reducing the load on human operators or programs. However, current methods of storing failure logs require large storage space and involve very long failure log queues. Large queues are undesirable because they consume large amounts of memory. The queue can also overflow when large numbers of failure logs are created at a high rate, resulting in lost failure information. Unless the failure logs are correlated with the event that produced them, a single problem in a single subsystem can result in multiple, uncoordinated corrective actions. This leads to resources wasted on duplicate efforts and to inconsistent corrective actions, which in turn escalate the problems.
There is a need for a method of reducing the number of records that pertain to the same failed connections (calls), while maintaining the integrity of the fault information. Reducing the number of failure records to be analyzed makes it easier to accurately determine the number and identity of the discrete problems that must be rectified.
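The kind of reduction described above can be illustrated with a minimal sketch: failure logs that share the same location and fault type are collapsed into a single record that carries a repeat count and the time span of the occurrences, so no fault information is lost. The record fields and key choice here are illustrative assumptions, not the method of any particular system.

```python
from dataclasses import dataclass

# Hypothetical failure-log record; the field names are illustrative
# and not taken from any particular transmission system.
@dataclass
class FailureLog:
    timestamp: float   # time the failure was detected
    location: str      # node or link where the fault was observed
    fault_type: str    # e.g. "LINK_DOWN", "SIGNAL_DEGRADE"

def aggregate_logs(logs):
    """Collapse failure logs sharing the same location and fault type
    into one record with a repeat count and first/last occurrence
    times, preserving the fault information in condensed form."""
    records = {}
    for log in logs:
        key = (log.location, log.fault_type)
        rec = records.get(key)
        if rec is None:
            records[key] = {
                "location": log.location,
                "fault_type": log.fault_type,
                "first_seen": log.timestamp,
                "last_seen": log.timestamp,
                "count": 1,
            }
        else:
            rec["count"] += 1
            rec["first_seen"] = min(rec["first_seen"], log.timestamp)
            rec["last_seen"] = max(rec["last_seen"], log.timestamp)
    return list(records.values())
```

With this scheme, a thousand identical logs caused by repeated call re-attempts reduce to a single record whose count field indicates how many times the failure occurred.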
To preserve the integrity of the fault information, the information in the failure logs must be processed sequentially. Therefore, the managed system must send, and the management system must process, the failure logs in time sequence, and the failure records must carry this timing information.
There is a need to maintain the timing information associated with the failure logs resulting from failed calls, to enable accurate processing and investigation of the failure records.
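One simple way to honor the time-sequence requirement, sketched below under the assumption that logs may arrive slightly out of order due to transport delays, is to buffer them in a priority queue keyed on their timestamps, so the management system releases them for processing strictly in time order. The class and method names are hypothetical.

```python
import heapq

class SequencedLogQueue:
    """Releases buffered failure logs in strict timestamp order,
    regardless of the order in which they arrived."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: preserves arrival order for equal timestamps

    def push(self, timestamp, payload):
        # The counter prevents heapq from ever comparing payloads directly.
        heapq.heappush(self._heap, (timestamp, self._counter, payload))
        self._counter += 1

    def pop(self):
        """Return the earliest pending log as (timestamp, payload)."""
        timestamp, _, payload = heapq.heappop(self._heap)
        return timestamp, payload

    def __len__(self):
        return len(self._heap)
```

Because each record carries its own timestamp, the same ordering can be reconstructed downstream even if an intermediate queue is drained and refilled.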