The detection of communication failures within a computer network is necessary to prevent the loss of data, to recover lost data packets, to reorganize the network configuration to maintain functionality, and to correct the failure.
Detecting network communication failures is not, by itself, technically difficult. However, most communication failure detection schemes add overhead in both CPU consumption and network traffic. Keeping this overhead reasonably low without sacrificing detection accuracy could improve product acceptance and customer satisfaction. High performance, in the sense of low overhead, is thus the first goal in the design of a network communication failure detection system.
One known method for detecting network communication failures uses test messages sent between a server and its clients to test the server, the clients, the network connections, and the server and client interfaces. The server generates an echo request message packet and dispatches it to each client; that is, the server "pings" each client. On receiving this message, each client must respond to the server with an echo reply message; that is, the clients "pong" the server. Failure to receive an echo reply from a client within a predetermined time interval indicates a possible communication failure between that client and the server.
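The ping/pong scheme can be sketched as a simple simulation. This is an illustration under stated assumptions, not the method of any particular product. The transport is abstracted as a hypothetical callable `send_echo(client)` that returns the round-trip time in seconds, or `None` when no echo reply arrives. The names `ping_all`, `suspected_failures`, and `timeout_s` (standing in for the predetermined time interval) are likewise illustrative. Real ICMP echo traffic would require raw sockets and is deliberately left out.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical transport: sends an echo request ("ping") to a client and returns
# the round-trip time of its echo reply ("pong") in seconds, or None if no reply.
EchoFn = Callable[[str], Optional[float]]


def ping_all(clients: List[str], send_echo: EchoFn, timeout_s: float) -> Dict[str, bool]:
    """Ping every client once; True means its pong arrived within timeout_s."""
    status: Dict[str, bool] = {}
    for client in clients:
        rtt = send_echo(client)
        # A missing reply, or one slower than the predetermined interval,
        # is treated as a possible communication failure.
        status[client] = rtt is not None and rtt <= timeout_s
    return status


def suspected_failures(status: Dict[str, bool]) -> List[str]:
    """Return the clients whose echo reply was missing or late."""
    return sorted(client for client, ok in status.items() if not ok)


if __name__ == "__main__":
    # Simulated network of nine clients (assumed round-trip times).
    simulated_rtt: Dict[str, Optional[float]] = {f"C{i}": 0.01 for i in range(1, 10)}
    simulated_rtt["C4"] = None  # no echo reply at all
    simulated_rtt["C7"] = 2.5   # reply arrives after the timeout
    status = ping_all(list(simulated_rtt), simulated_rtt.get, timeout_s=1.0)
    print(suspected_failures(status))  # ['C4', 'C7']
```

A late reply is treated the same as a missing one. From the server's point of view, both mean that no pong arrived within the predetermined interval.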
One problem with this approach is the overhead it introduces. Consider, for instance, the network shown in FIG. 1, which depicts a portion of a computer network including a network server S₁, a network bus 11, and several clients, C₁ through Cₓ, attached to the network bus. Server S₁ includes at least three network interfaces, identified by reference numerals en₁ through en₃. The network is configured so that each server interface is responsible for communication with a different group of clients. If the system includes nine clients and the communication failure detection scheme pings each of them every second, then eighteen additional data packets (nine pings and nine pongs) are added to the network traffic every second. Processing these eighteen packets consumes the server's CPU time. Moreover, each client must process an incoming ping and an outgoing pong every second, slowing down the client's applications. Note that all of this additional CPU consumption and network traffic is present even when the server's CPU load is high, when network traffic is heavy, and when no communication failures exist.
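The overhead arithmetic in the FIG. 1 example can be stated as a small function. This is a sketch that assumes exactly one ping and one pong per client per detection round. The names `ping_pong_overhead` and `PingOverhead` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingOverhead:
    network_packets_per_s: float  # pings plus pongs carried by the network
    server_packets_per_s: float   # packets the server must send or receive
    client_packets_per_s: float   # packets each individual client must handle


def ping_pong_overhead(num_clients: int, pings_per_s: float) -> PingOverhead:
    """Packet overhead of pinging every client at a fixed rate."""
    pings = num_clients * pings_per_s  # one echo request per client per round
    pongs = num_clients * pings_per_s  # one echo reply per client per round
    return PingOverhead(
        network_packets_per_s=pings + pongs,
        # The server sends every ping and receives every pong.
        server_packets_per_s=pings + pongs,
        # Each client handles one incoming ping and one outgoing pong per round.
        client_packets_per_s=2 * pings_per_s,
    )


if __name__ == "__main__":
    overhead = ping_pong_overhead(num_clients=9, pings_per_s=1.0)
    print(overhead.network_packets_per_s)  # 18.0
    print(overhead.client_packets_per_s)   # 2.0
```

The function takes no input for CPU load, traffic level, or failure state. This mirrors the point made above: the overhead grows linearly with the number of clients and the ping rate, and it is the same whether or not any failure is present.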