This invention relates generally to inter-computer communications, and more particularly to providing a method, system and computer program product for diagnosing communications between computer systems.
Diagnosing errors in inter-computer communication links is often difficult because the site where an error is detected is often not the site where the error actually occurs. For example, one side of a communication link may send a message to the other side and expect a reply, but never receive one. The error is likely to be on the receiving side, yet it is the sending side that detects it.
The situation is often complicated by several factors. One such factor is the difficulty of correlating log information from both sides of the communication link, because the separate logs are based on time clocks that are not synchronized. Another factor is that the two sides can be physically distant from one another, anywhere from across a room to kilometers apart. When the sides are kilometers apart in particular, the time delay in communication substantially increases the difficulty of correlating error indications on the two sides of the communication link. The root issue, however, is that the error is detected on one side of a communication link while its cause lies on the other.
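By way of illustration only, the log-correlation difficulty described above can be sketched as follows. The node names, the 2.0-second clock offset, the event descriptions, and the helper merge_by_local_time are hypothetical and chosen solely for this example; they are not taken from any particular system. The sketch shows that naively interleaving two logs by each node's own timestamps can place the receipt of a message before its transmission.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    node: str          # which side of the link wrote the entry
    local_time: float  # seconds, as read from that node's own clock
    event: str


def merge_by_local_time(*logs):
    """Naively interleave logs using each node's own (unsynchronized) timestamps."""
    return sorted((e for log in logs for e in log), key=lambda e: e.local_time)


# Assumption for this example: node B's clock runs 2.0 s behind node A's.
B_CLOCK_OFFSET = -2.0

# True order of events: A sends (t=10.0), B receives (t=10.5),
# B drops the message (t=10.6), A times out (t=15.0).
log_a = [
    LogEntry("A", 10.0, "send request"),
    LogEntry("A", 15.0, "timeout waiting for reply"),
]
log_b = [
    LogEntry("B", 10.5 + B_CLOCK_OFFSET, "receive request"),
    LogEntry("B", 10.6 + B_CLOCK_OFFSET, "drop request"),
]

merged = merge_by_local_time(log_a, log_b)
merged_events = [(e.node, e.event) for e in merged]

for entry in merged:
    print(f"{entry.local_time:6.1f}  {entry.node}  {entry.event}")
```

In the merged listing, node B appears to receive and drop the request before node A sends it, so the combined log cannot be relied upon to establish which event caused which unless the clock offset is known.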
In addition, cases are regularly encountered that are not adequately anticipated or covered by the responses and diagnostics preprogrammed or built into the communications system. Exemplary situations that typically cause problems include a silent (e.g., unlogged) drop of a message by the recipient node for unknown reasons, a message received with header information that is incorrect or unanticipated, and a message exchange sequence that exhibits unanticipated delays or hangs. Such inadequately anticipated events often elicit only standard responses, such as logging of status or error information, and in practice these responses have not always proven sufficient for diagnosis.
While many existing systems do not address this problem at all, some attempt to solve it by deliberately injecting an error into the communication link. The injected error forces the receiving node to produce an error log entry, giving some information about what is going on at the receiving node. However, this approach has two disadvantages. First, injecting the error brings down the communication link, which at best causes unnecessary recovery actions and at worst, particularly if the link has no redundant alternative, can cause the system to go down. Second, the information gathered on the receiving node will be appropriate to the error that was induced, but may not be appropriate to the kind of error originally detected.
It would be desirable to improve the diagnosis of communication errors between nodes.