It is known that in most processing systems, such as communications systems and computing systems, next-generation products tend to be more complex than products developed from predecessor technologies. Furthermore, in any such system, there are typically a number of dissimilar processes running on a multitude of dissimilar platforms. When the system works, the results can be impressive. However, when there are capacity issues or failures that cannot be attributed to a particular processing device in the system, it is increasingly difficult to quickly determine when a portion of the system starts to fail to correctly handle traffic, especially when traffic is being distributed across a number of dissimilar processing devices.
As an example, assume a processing system composed of collections of different types of processing devices. Further, assume that the system has 20 processing devices of type A performing a critical function for the completion of traffic. Further, assume that one of these processing devices continues to send a heartbeat signal to a monitor (e.g., system administrator console or central monitoring device) of the system, but is not correctly handling all of the traffic that it is expected to handle. Still further, assume that the other 19 processing devices are performing correctly.
As is known, the sending of a heartbeat signal indicates to the receiver that the sender is still “alive,” that is, operating in at least some capacity (e.g., able to handle at least some traffic). Conversely, the failure of the sender to send a heartbeat signal indicates to the expectant receiver that the sender is no longer “alive,” that is, no longer operating in any capacity (e.g., not able to handle any traffic).
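As a minimal sketch of the heartbeat convention just described, a monitor might treat a sender as alive only while its most recent heartbeat is sufficiently fresh. The class name `HeartbeatMonitor`, the `timeout_s` parameter, and the device identifier `A-07` are hypothetical and chosen only for illustration; no particular existing system is implied.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per device (illustrative only)."""

    def __init__(self, timeout_s):
        # Hypothetical parameter: longest silence tolerated before a
        # sender is presumed to be no longer alive.
        self.timeout_s = timeout_s
        self.last_seen = {}

    def record(self, device_id, now):
        # Called whenever a heartbeat signal arrives from device_id.
        self.last_seen[device_id] = now

    def is_alive(self, device_id, now):
        # Alive means only "a heartbeat arrived recently"; it says
        # nothing about whether traffic is being handled correctly.
        last = self.last_seen.get(device_id)
        return last is not None and (now - last) <= self.timeout_s


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.record("A-07", now=100.0)
alive_soon_after = monitor.is_alive("A-07", now=103.0)   # within timeout
alive_much_later = monitor.is_alive("A-07", now=106.0)   # timeout exceeded
never_heard_from = monitor.is_alive("A-08", now=106.0)   # no heartbeat seen
```

Under these assumptions, a device that keeps sending heartbeats is always reported as alive, regardless of how much of its traffic it actually completes.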
In the above case, assuming the traffic is distributed evenly across the 20 processing devices, a system administrator observing the system would see a system handling at least 95% of the offered traffic (the share carried by the 19 correctly performing devices) and may not easily detect either the loss of up to 5% or that this loss can be attributed to one processing device that is not performing well. Even a short period of time in this condition can adversely affect the ability of the processing system to be considered a highly reliable system. Reliability of a processing system is one of the critical criteria that service providers agree to with customers in service level agreements.
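The arithmetic behind the 95% figure can be sketched as follows. The sketch assumes that each of the 20 type A devices is offered the same amount of traffic; the per-device load of 100 units, the degraded device's count of 40 handled units, and the device identifiers are hypothetical values chosen only for illustration.

```python
def handled_fraction(offered, handled):
    """Fraction of all offered traffic that was handled system-wide."""
    return sum(handled.values()) / sum(offered.values())


# Assumption: 20 devices, each offered 100 units of traffic.
offered = {f"A-{i:02d}": 100 for i in range(1, 21)}

# 19 devices handle everything; one degraded device handles only part
# of its share (40 of 100 units is a made-up figure).
handled = dict(offered)
handled["A-20"] = 40

aggregate = handled_fraction(offered, handled)            # 1940 / 2000
per_device = {d: handled[d] / offered[d] for d in offered}
worst_device = min(per_device, key=per_device.get)

# Lower bound on the aggregate: the degraded device handles nothing.
handled_none = dict(offered)
handled_none["A-20"] = 0
floor = handled_fraction(offered, handled_none)           # 1900 / 2000
```

With these figures the system-wide view shows 97% of traffic handled, and it can never drop below 95% from one device alone, while the per-device view shows the degraded device completing only 40% of its share.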
The schemes used by existing processing systems measure and report on traffic periodically, for example, in 15-minute periods. By examining the records from a few periods, manually or by automatic means, a system administrator is often able to detect a problem, but only after a significant amount of time has passed and only if the system administrator is looking carefully at the data. While heartbeat signals tell the system administrator when a processing device ceases to function, what is missing is the ability to detect, in near real time, a processing device that is still alive but is not correctly handling traffic.