The present invention relates in general to the operation of computerized data communication networks, and more particularly, to the recovery of communication network operations after a failure of one of the network components.
Computer data communication networks are used to transmit information between geographically dispersed computers and between user devices such as computer terminals or workstations and host computer applications. A variety of communication architectures exist. Two such data communication architectures are the IBM Systems Network Architecture (SNA) and the International Standards Organization's (ISO) Open System Interconnection (OSI) architecture. One embodiment of IBM's Systems Network Architecture is described in a co-pending, commonly assigned U.S. patent application, Ser. No. 08/245,053, entitled "Virtual Route Resynchronization", the entirety of which is hereby incorporated herein by reference.
High Performance Routing (HPR) is a recent enhancement to the IBM Systems Network Architecture. HPR uses rapid transport protocol (RTP), and the logical connection between two HPR-capable nodes is called an RTP connection. The ends of the connection are referred to as the RTP endpoints, while any intermediate nodes along the RTP connection route are called the automatic network routing (ANR) nodes. Error recovery on an RTP connection is done end-to-end rather than node-to-node, meaning that only the RTP endpoints are involved.
Many end-user sessions can flow on a given RTP connection. Also, data messages sent on an RTP connection can get lost in the network or might arrive out of order at the destination RTP endpoint. Each message that flows on an RTP connection is assigned a byte sequence number (BSN), which enables the destination node to determine when data is lost or arrives out of order. It is critical that the origin RTP endpoint fill in the correct BSN when sending out a message; otherwise, the RTP connection will fail, causing all of the end-user sessions on that connection to fail as well.
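The role of the BSN at the destination RTP endpoint can be illustrated with a minimal sketch. It assumes, for illustration only, that a message's BSN identifies the first byte of its payload and that the next expected BSN advances by the payload length; sequence number wraparound is ignored. The names `RtpReceiver` and `classify` are hypothetical and are not taken from HPR or TPF.

```python
from dataclasses import dataclass


@dataclass
class RtpReceiver:
    """Hypothetical receiver state for one RTP connection."""

    # Assumption: BSN of the next byte this endpoint expects to receive.
    next_expected_bsn: int = 0

    def classify(self, bsn: int, payload: bytes) -> str:
        """Classify an arriving message by comparing its BSN to the expected one."""
        if bsn == self.next_expected_bsn:
            # In sequence: accept the data and advance past the payload bytes.
            self.next_expected_bsn += len(payload)
            return "in_order"
        if bsn < self.next_expected_bsn:
            # Older than expected: already received, so it can be discarded.
            return "old"
        # Newer than expected: earlier data was lost or is arriving out of order.
        return "gap"
```

Under these assumptions, an origin endpoint that restarts with a stale BSN would have its messages classified as "old" and discarded, which is why the correct BSN must be recovered before traffic resumes.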
Because of the need to maintain the sequence of messages between the data host and other components, communications with a failing unit can only be restarted if the sequence number information is known or if the entire communications network is reinitialized. Reinitialization of a large network is highly undesirable because of the considerable time required. This lost time can be costly to a business that is dependent upon transaction processing for its operations. Various schemes have been proposed for retaining sequence information so that the network can be restarted without reinitialization. However, data host failure may occur unpredictably and may not afford an opportunity to save the necessary sequencing information. In these situations, a network reinitialization is required. There is therefore a need to have a system or method for resynchronizing data communications without reinitializing the network.
The present invention addresses the technical problems of recovering synchronization information lost during a network component failure. It is also directed to the problem of resynchronizing message traffic between adjacent communication components following a component failure.
Briefly summarized, this invention comprises in one aspect a system for resynchronizing message traffic between a first data processing system and a second data processing system connected by a data communications network. The message traffic travels over a logical connection linking the first and second data processing systems, and each message in the message traffic includes a SYNC number and a byte sequence number (BSN). A recipient of each message tests whether the message carries the next expected byte sequence number and discards any message whose byte sequence number is older than the next expected byte sequence number. The system includes means for retrieving, after the failure of the first data processing system, a stored SYNC number and BSN from external memory, as well as means for incrementing the SYNC number by a predetermined amount to obtain a new SYNC number, the predetermined amount being sufficient to ensure that the new SYNC number comprises a current SYNC number. Means for sending a status request message from the first data processing system to the second data processing system are also provided, wherein the status request message includes the new SYNC number and the BSN read from the external memory. The first data processing system includes means for waiting for receipt of a response message to the status request message, wherein the response message contains the BSN of the next piece of data that the second data processing system is expecting. The system also includes means for updating, upon receipt of the response message, logical connection control information at the first data processing system with the BSN of the next piece of data expected by the second data processing system.
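The sequence summarized above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the record types, the `exchange` callable standing in for the send-and-wait step, and the `SYNC_INCREMENT` value are all hypothetical, and the summary says only that the increment is a predetermined amount large enough to make the new SYNC number current.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical value; the text requires only a predetermined amount that is
# sufficient to make the new SYNC number current.
SYNC_INCREMENT = 1000


@dataclass
class Checkpoint:
    """SYNC number and BSN as last saved to external memory."""
    sync: int
    bsn: int


@dataclass
class StatusRequest:
    sync: int
    bsn: int


@dataclass
class StatusResponse:
    next_expected_bsn: int


@dataclass
class ConnectionControl:
    """Logical connection control information kept by the restarted system."""
    sync: int
    next_send_bsn: int


def resynchronize(
    checkpoint: Checkpoint,
    exchange: Callable[[StatusRequest], StatusResponse],
) -> ConnectionControl:
    """Recover connection state after a failure of the first system."""
    # Step 1: increment the stored SYNC number by the predetermined amount.
    new_sync = checkpoint.sync + SYNC_INCREMENT
    # Step 2: build the status request with the new SYNC number and stored BSN.
    request = StatusRequest(sync=new_sync, bsn=checkpoint.bsn)
    # Step 3: send the request and wait for the partner's response.
    response = exchange(request)
    # Step 4: adopt the BSN the partner expects next as the next BSN to send.
    return ConnectionControl(sync=new_sync, next_send_bsn=response.next_expected_bsn)
```

Passing the transport as a callable keeps the sketch independent of any particular network interface; in a test it can be replaced by a function that returns a fixed `StatusResponse`.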
To restate, provided herein is a technique for rapidly resynchronizing and recovering virtual network routes without reinitializing the communications network upon startup from a component failure. Further, the process described herein achieves resynchronization of message traffic quickly with low system processing overhead. The solution is described herein with reference to IBM's Transaction Processing Facility (TPF) operating system; however, it is applicable to various systems, as will be understood by those in the data communications art.