1. Field of the Invention
This invention pertains generally to communications systems. In particular, this invention discloses a method and apparatus for detecting a failure condition between two communications nodes with greater reliability (fewer false failures) than previous detection methods.
2. The Prior Art
Packetized methods of communicating between sites or nodes on a network are well known. A relatively recent application of packetized communications is Voice over Internet Protocol (VoIP), where voice communications, which are typically transmitted using multiplexed analog-based technologies, are instead transmitted in a packetized fashion. Used in this context, the packets are commonly referred to as datagrams.
When using VoIP, each communication site or node on the VoIP network sends datagrams to other communication sites or nodes. When data integrity is important, packets can be sent utilizing reliable protocols, that is, protocols that use error control, acknowledgements, and other techniques that improve the reliability of transmission of each datagram.
However, because of the real-time constraints of delivering a realistic voice message (when compared to other applications such as text transmittal), there is often not enough time to detect failures in the transmission path, such as missing packets or a downed intermediate node that adds transmission time for a set of datagrams. Normally the damaged or missing datagrams are rejected or ignored by the receiving node, as there is no time to send a request for a retransmit and to wait for a response from the source node.
This leads to a situation where the two end nodes (or communication sites) either do not detect a break in the transmission until a significant amount of time has passed or do not detect the break at all, and the link is lost.
There have been recent attempts to correct this situation, with the most apropos solution found in U.S. Pat. No. 6,134,221, issued Oct. 17, 2000, to Stewart et al. Stewart discloses a method in which two end nodes that are communicating using datagrams each have two counters and two thresholds, one threshold for each counter. The two counters at each node consist of one “messages sent” counter and one “messages received” counter. Basically, one counter is used to determine whether the local node has sent too many messages without getting a response, from which it concludes that the communication path is down. The other counter handles the case where a communication may be one-sided for specific intervals during the life of the overall communication, that is, where one would expect to send a large number of datagrams without receiving any. In that case, the other counter allows the receiving node to know when to send a periodic equivalent of an “I'm alive and receiving” message to the sender. The actual datagram sent for this case is typically the null datagram. For further details, see U.S. Pat. No. 6,134,221.
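By way of illustration only, the two-counter scheme as summarized above might be sketched as follows. This sketch is based solely on the summary given here and is not taken from U.S. Pat. No. 6,134,221. The class name, the method names, the threshold values, and the rule that each counter is reset by traffic in the opposite direction are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LinkMonitor:
    """Hypothetical per-node monitor with two counters and two thresholds."""

    sent_threshold: int = 5      # assumed: unanswered sends tolerated before failure
    received_threshold: int = 3  # assumed: receives tolerated before a reply is due
    sent_count: int = 0          # "messages sent" counter
    received_count: int = 0      # "messages received" counter
    link_failed: bool = False

    def on_send(self) -> None:
        """Record an outgoing datagram and check the sent threshold."""
        self.sent_count += 1
        # Assumption: anything we send tells the peer we are alive,
        # so the count of unanswered receives starts over.
        self.received_count = 0
        if self.sent_count > self.sent_threshold:
            # Too many sends without a response: conclude the path is down.
            self.link_failed = True

    def on_receive(self) -> bool:
        """Record an incoming datagram.

        Returns True when the caller should send an "I'm alive" datagram
        (typically a null datagram) back to the peer.
        """
        # Assumption: any received datagram proves the path is up.
        self.sent_count = 0
        self.received_count += 1
        return self.received_count >= self.received_threshold
```

Under these assumptions, the receiving node's received threshold would need to be smaller than the sending node's sent threshold, so that an “I'm alive” datagram has a chance to arrive before the sender concludes that the path is down.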
The method disclosed in Stewart et al. has some serious shortcomings when put to actual use. A primary failure is its inability to deal with the “bursty” nature of the transmission media. That is, if either of the two end nodes, or an intermediate routing node, is temporarily subjected to a high peak workload such that the process handling the transmission in question is swapped out, it may appear to the node where the process is still active that the transmission path has failed. In addition, even if the swapped process becomes active in time to send before the receiving process decides it has lost a connection, the newly active process will tend to send a large burst of traffic (a relatively large number of datagrams) all at once. This causes the sent counter to increment faster than a datagram can be received from the target node, which the sender misreads as a failure of the receiving node (a false failure). The false failure occurs because the quick burst drives the sending node's sent counter past the count at which an “I'm alive” datagram would ordinarily have been expected, before that datagram can arrive from the target node.
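The false failure described above can be illustrated with a small discrete-time simulation. This is a sketch under stated assumptions, not a model of any actual implementation: the function name, the tick-based clock, and the parameters `reply_every`, `round_trip_ticks`, and `sends_per_tick` are all hypothetical and chosen only for the example.

```python
def sender_declares_failure(
    total_datagrams: int,
    sent_threshold: int,
    reply_every: int,
    round_trip_ticks: int,
    sends_per_tick: int,
) -> bool:
    """Return True if the sender's sent counter passes its threshold.

    Assumptions: the target node answers every `reply_every`-th datagram
    with an "I'm alive" datagram, and that answer reaches the sender
    `round_trip_ticks` ticks after the triggering datagram was sent.
    """
    sent_count = 0       # the sender's "messages sent" counter
    sent_total = 0       # datagrams sent so far in this run
    reply_arrivals = []  # ticks at which "I'm alive" datagrams arrive
    tick = 0
    while sent_total < total_datagrams:
        # Any reply that has arrived by now resets the sent counter.
        if any(t <= tick for t in reply_arrivals):
            reply_arrivals = [t for t in reply_arrivals if t > tick]
            sent_count = 0
        # Send this tick's share of datagrams.
        for _ in range(min(sends_per_tick, total_datagrams - sent_total)):
            sent_total += 1
            sent_count += 1
            if sent_total % reply_every == 0:
                reply_arrivals.append(tick + round_trip_ticks)
            if sent_count > sent_threshold:
                return True  # sender concludes the path is down
        tick += 1
    return False


# Paced sending: replies arrive before the threshold is passed.
paced = sender_declares_failure(20, 5, 3, 2, sends_per_tick=1)
# The same 20 datagrams sent as one burst after the process is swapped back in.
burst = sender_declares_failure(20, 5, 3, 2, sends_per_tick=20)
```

With these example values, `paced` is False and `burst` is True, even though the target node behaves identically in both runs. The only difference is that the burst moves the sent counter past its threshold before any reply has had time to complete the round trip.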
Thus, a need exists for a more reliable method and apparatus for evaluating a communication link between nodes using packets, datagrams, or any packetized protocol.