This invention relates generally to inter-computer communications, and more particularly to providing a method, system and computer program product for communication of offline status between computer systems.
Diagnosing errors in inter-computer communication links is often difficult because the site where an error is detected is frequently not the site where the error actually occurs. For example, one side of a communication link may send a message to the other and expect a reply, but never receive one. The error is likely to be on the other side, but it is the sending side that detects it.
The situation is often complicated by several factors. One such factor is the difficulty of correlating log information from the two sides of the communication link, because the separate logs are based on time clocks that are not synchronized. Another factor is that the two sides can be physically distant from one another, anywhere from across a room to kilometers apart. In the latter case in particular, the time delay in communication substantially increases the difficulty of correlating error indications on the two sides of the communication link. A root issue, however, is that error detection is on one side of a communication link while the error cause is on the other.
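The log-correlation difficulty described above can be illustrated with a minimal sketch. It is not taken from any particular system: the `LogEntry` type, the `candidate_matches` function, and the `max_skew` and `max_delay` bounds are hypothetical names and assumptions introduced only for illustration. The sketch assumes that the remote clock may differ from the local clock by at most `max_skew` seconds and that a message takes at most `max_delay` seconds to arrive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    timestamp: float  # seconds, as read from the local (unsynchronized) clock
    text: str


def candidate_matches(sent, remote_log, max_skew, max_delay):
    """Return the remote log entries that could correspond to `sent`.

    Assumed model (hypothetical): remote_ts = sent_ts + delay + offset,
    with 0 <= delay <= max_delay and abs(offset) <= max_skew.
    """
    earliest = sent.timestamp - max_skew
    latest = sent.timestamp + max_delay + max_skew
    return [e for e in remote_log if earliest <= e.timestamp <= latest]


# Sender-side observation: a request was sent and no reply arrived.
sent = LogEntry(100.0, "request sent, no reply")

# Receiver-side log, stamped by a different clock.
remote_log = [
    LogEntry(98.5, "message dropped"),
    LogEntry(100.2, "message received"),
    LogEntry(101.9, "queue overflow"),
    LogEntry(105.0, "heartbeat"),
]

# With clocks that may disagree by up to 2 s, several entries are plausible.
matches = candidate_matches(sent, remote_log, max_skew=2.0, max_delay=0.5)

# With perfectly synchronized clocks, the match would be unambiguous.
exact = candidate_matches(sent, remote_log, max_skew=0.0, max_delay=0.5)
```

Under these assumed bounds, three of the four remote entries fall inside the plausible window, so the sender-side event cannot be attributed to a single remote event from timestamps alone. With zero skew only one entry remains.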
In addition, cases are regularly encountered that are not adequately anticipated or covered by the responses and diagnostics preprogrammed or built into the communications system. Exemplary situations that typically cause problems include a silent (e.g., unlogged) drop of a message by the recipient node for reasons unknown, and a message exchange sequence that exhibits unanticipated delays or hangs.
Communication links between active computer systems must regularly be taken out of service at scheduled times for a variety of reasons, such as physical rerouting, replacement with new cabling, upgrading of the hardware to which they are attached, firmware or software modifications, etc. Simply unplugging the link produces a link failure that is most often detected only implicitly, by timeouts, and results in re-establishment of communication using different paths (assuming the communications fabric is sufficiently redundant).
However, the scope of error recovery may have to extend beyond low-level infrastructure into the application itself, particularly because of the delays intrinsically involved in timeouts. Many applications normally lack such recovery, resulting in application failure. This is clearly undesirable. If applications fail, a customer can rightly consider a system “down” even if its communications subsystem or operating system remains operational. Such situations are even less desirable when the change, and resultant errors, were planned. For example, the change may be an upgrade or other modification of the system, planned months in advance. As is well known, planned downtime is currently a significant fraction of all system downtime.
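The delay inherent in timeout-based detection of an unplugged link, as discussed above, can be sketched as follows. This is an illustrative model only, under stated assumptions: the function `detection_delay` and its parameters are hypothetical, replies are assumed to arrive instantly while the link is up, and the sender is assumed to declare the link failed only after the first unanswered request and all of its retries have each waited one full timeout.

```python
def detection_delay(unplug_time, send_times, timeout, retries=0):
    """Seconds between the link being unplugged and the sender noticing.

    Hypothetical model: requests sent before `unplug_time` complete normally;
    the first request sent at or after it goes unanswered, and failure is
    declared after (retries + 1) consecutive timeouts.
    Returns None if no request is sent after the unplug.
    """
    for t in sorted(send_times):
        if t >= unplug_time:
            detected_at = t + timeout * (retries + 1)
            return detected_at - unplug_time
    return None


# Link unplugged at t = 10 s; the sender issues requests at these times.
delay = detection_delay(
    unplug_time=10.0,
    send_times=[0.0, 5.0, 12.0, 17.0],
    timeout=3.0,
    retries=2,
)
```

In this assumed scenario the sender does not conclude that the link has failed until 11 seconds after it was unplugged, and it learns nothing at all if it happens to send no traffic. An application layered on such a link sees only the stall, which is why recovery may need to extend into the application itself.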
It would be desirable to be able to improve the communication of status between computer systems.