This invention relates to a method and apparatus for time-based event correlation using logical event triggers for fault management of distributed processing elements. While the invention is particularly directed to the art of telecommunications, and will thus be described with specific reference thereto, it will be appreciated that the invention may be useful in other fields and applications.
By way of background, a major contributor to unplanned downtime in the field of telecommunications is a lack of fault coverage. The ability to isolate and recover from faults is a customer need and a major differentiator in the market. While standards address interfaces, they do not address implementation. As MSC-based and ISP-based networks evolve, elimination of unplanned downtime will be required. Integration of third party hardware and software will increase in order to drive costs down. Event driven fault management between such commercial elements in a running system is therefore essential to meet the unplanned downtime requirements of the end user.
Previously, platforms used single fault events to alarm a fault and control the recovery on a processing element. This was first applied at the chassis system level or in a host processor in the chassis, where faults can be received, typically via a heartbeat mechanism or over a bus on the backplane. This approach relies on a single input event to determine a fault, even though that event may be only one part of a fault. Later prior art modified this approach by having a single central function collect, count and threshold events to perform recovery. Neither approach uses an event correlation window (a time-based window), nor does either allow parallel time-based event correlation functions to determine the appropriate fault isolation and recovery of components in the system. This is because the prior art does not separate the fault detection time needed to trigger application recovery from the time needed for fault isolation, alarming and self-healing (auto repair) operations in the same system.
What is needed, therefore, are event correlation functions that utilize an event correlation window to collect and analyze a larger set of input events from multiple sources over a time period, in order to perform fault isolation and self-healing (auto repair) in the system and thereby maximize system availability.
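By way of illustration only, the general idea of an event correlation window may be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the names `Event`, `CorrelationWindow` and `correlated_fault`, the element names, the event type `heartbeat_loss`, the five-second window and the two-source rule are all hypothetical and do not appear in the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds, supplied by the caller
    source: str       # hypothetical element name, e.g. "card-1"
    kind: str         # hypothetical event type, e.g. "heartbeat_loss"


class CorrelationWindow:
    """Keeps only the events observed during the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()

    def add(self, event: Event) -> None:
        # Assumption: events arrive in non-decreasing timestamp order.
        self._events.append(event)
        cutoff = event.timestamp - self.window_seconds
        # Discard events that have fallen out of the time window.
        while self._events and self._events[0].timestamp < cutoff:
            self._events.popleft()

    def sources_reporting(self, kind: str) -> set:
        """Distinct sources that reported `kind` inside the window."""
        return {e.source for e in self._events if e.kind == kind}


def correlated_fault(window: CorrelationWindow, kind: str, min_sources: int) -> bool:
    # Illustrative rule only: declare a fault when enough distinct
    # sources report the same kind of event within the window.
    return len(window.sources_reporting(kind)) >= min_sources


window = CorrelationWindow(window_seconds=5.0)

window.add(Event(0.0, "card-1", "heartbeat_loss"))
single = correlated_fault(window, "heartbeat_loss", 2)   # one source only

window.add(Event(2.0, "card-2", "heartbeat_loss"))
pair = correlated_fault(window, "heartbeat_loss", 2)     # two sources in window

window.add(Event(9.0, "card-1", "heartbeat_loss"))       # earlier events expire
expired = correlated_fault(window, "heartbeat_loss", 2)
```

In this sketch a single event does not by itself indicate a fault; only events from several sources that fall inside the same time window do. Several such windows of different lengths could in principle be run side by side, for example a short one for triggering recovery and a longer one for isolation, although that arrangement is likewise only an assumption here.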