1. Field of the Invention
The present invention relates to the operation of computerized data communication networks, and more particularly, to the recovery of communication network operations after a failure of one of the network components.
2. Background of the Invention
Computer data communication networks are used to transmit information between geographically dispersed computers and between user devices such as computer terminals or workstations and host computer applications. A variety of communication architectures exist. Data communication architectures such as the IBM Systems Network Architecture (SNA) and the International Organization for Standardization (ISO) Open Systems Interconnection (OSI) Architecture define an architecture in terms of functional layers. The SNA and OSI layered architecture models are shown in FIG. 1. Each architecture defines functional layers with well-defined interfaces between the layers. This allows the components of a single layer to be interchanged without impacting the overall operation of the network.
The two lowest layers of the SNA architecture are Data Link Control (DLC) and Path Control. Path Control is in turn divided into Transmission Group Control, Explicit Route Control and Virtual Route Control. At a higher level, Data Flow Control implements session control. Session control defines the end-to-end communication between a particular terminal and a system application. For example, a workstation or terminal is operated by a computer user to enter and display data from a host application. A host application, such as a banking system for tracking customer accounts or an airline reservation system for tracking seat reservations, accepts requests for data and returns the desired information to the requestor. The session defines the connection between a terminal and an application and is unconcerned with the physical details of that interconnection.
An SNA network is defined in terms of Network Addressable Units (NAUs). Each NAU has a unique address and can send and receive messages. NAUs are divided into two classes: peripheral nodes and subarea nodes. A subarea node has an ability to route messages, while a peripheral node can only receive and transmit messages from and to the subarea to which it is attached. A subarea is defined as a subarea node and all attached peripheral nodes, if any.
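The distinction between subarea nodes and peripheral nodes can be illustrated with a minimal Python sketch. The class names, method names, and addresses below are hypothetical and are chosen only for illustration; they are not part of SNA or of any IBM product interface.

```python
from dataclasses import dataclass, field


@dataclass
class PeripheralNode:
    """A node that can only exchange messages with its own subarea node."""
    address: str


@dataclass
class SubareaNode:
    """A node that can route messages and own attached peripheral nodes."""
    address: str
    peripherals: dict = field(default_factory=dict)  # address -> PeripheralNode

    def attach(self, node: PeripheralNode) -> None:
        self.peripherals[node.address] = node

    def subarea(self) -> list:
        # A subarea is the subarea node plus all attached peripheral nodes.
        return [self.address] + sorted(self.peripherals)

    def can_deliver_locally(self, address: str) -> bool:
        # True when the destination lies inside this subarea, so no
        # routing to another subarea node is needed.
        return address == self.address or address in self.peripherals
```

In this sketch a subarea node with no attached peripheral nodes still forms a subarea by itself, matching the "if any" qualification above.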
SNA specifies three types of Network Addressable Units: System Services Control Points (SSCP); Logical Units (LU); and Physical Units (PU). The SSCP controls the physical configuration of the resources in the domain, including establishing communications paths and testing resources. Physical units are software entities for managing the physical connection of the network. Logical units perform user or application tasks.
The physical interconnection of a network is defined by the Data Link Control layer. This layer defines addressable devices and the physical paths connecting those devices. The physical definition may include particular communication control units and host processing systems that are to communicate.
Transmission Group Control defines one or more bidirectional logical connections between adjacent subarea nodes. Explicit route control defines bidirectional physical connections between end-to-end subarea nodes. Each explicit route (ER) consists of a fixed set of transmission groups, but can use at most one transmission group between two adjacent subareas. An explicit route can include one or more intermediate subarea nodes.
The virtual route layer defines an end-to-end logical connection between subareas supporting one or more sessions. Each virtual route is associated with an underlying Explicit Route. An Explicit Route can be associated with multiple Virtual Routes. Data flow over the virtual route is controlled by data pacing which defines a minimum and maximum window size and establishes a protocol for controlling the flow.
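The relationships among transmission groups, explicit routes and virtual routes described above can be sketched as a small data model. This is an illustrative sketch only: the class names, the node labels (borrowed from the reference numerals of FIG. 2), and the assumption that transmission groups are listed in path order from the origin are not taken from the SNA specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransmissionGroup:
    """A bidirectional logical connection between two adjacent subarea nodes."""
    number: int
    a: str
    b: str

    def pair(self) -> frozenset:
        return frozenset((self.a, self.b))


@dataclass
class ExplicitRoute:
    """A fixed, ordered set of transmission groups between end subareas."""
    number: int
    groups: list

    def __post_init__(self):
        # At most one transmission group may be used between any two
        # adjacent subareas on a given explicit route.
        pairs = [tg.pair() for tg in self.groups]
        if len(pairs) != len(set(pairs)):
            raise ValueError("more than one transmission group between adjacent subareas")

    def path(self, origin: str) -> list:
        # Assumes the groups are listed in order starting from origin.
        nodes = [origin]
        for tg in self.groups:
            here = nodes[-1]
            if here == tg.a:
                nodes.append(tg.b)
            elif here == tg.b:
                nodes.append(tg.a)
            else:
                raise ValueError(f"TG{tg.number} does not touch {here}")
        return nodes


@dataclass
class VirtualRoute:
    """An end-to-end logical connection mapped onto exactly one explicit route."""
    number: int
    explicit_route: ExplicitRoute
    sessions: list = field(default_factory=list)
```

Several `VirtualRoute` objects may reference the same `ExplicitRoute`, while each `VirtualRoute` references only one, which mirrors the one-to-many association stated above.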
An example of a network is shown in FIG. 2. The network shown is an example of the use of the IBM Transaction Processing Facility (TPF) operating system to support transaction processing in, for example, a bank or airline application. Application processing occurs in host processor 20, frequently referred to as the "data host". The host processor can be one of a number of computer processors such as the IBM 3090 system. Communications from processor 20 are routed through a communication control unit (CCU) 22, such as the IBM 3745 or 3725. CCU 22 is connected over communication lines 24 to a compatible communication control unit 26 at the remote location. Communication control unit 26 has a plurality of communication lines 28 connected to control units 30. Each of these control units, in turn, can have a plurality of terminal devices 32 including computer terminals, workstations, or printers.
A communication network configuration is typically controlled by a communications management configurator (CMC) 34, which can be an IBM 3090 or similar system. The CMC 34 provides the System Services Control Point (SSCP) for the illustrated network and operates to establish the resource configuration. Additional communication control units 36 can be connected to host 20 and communications management configurator 34. These, in turn, are connected to remote CCUs and devices.
The application host 20, communications management configurator 34, communication control units 22, 26, and 36 are each subarea network addressable devices or nodes. The interconnection of these subareas defines a subarea network 40. The interconnections between subarea nodes are defined in physical terms as explicit routes connecting subareas. For example, Explicit Route 1 (ER1) is shown at 42 and 43, Explicit Route 3 (ER3) at 44 and 46 and Explicit Route 4 (ER4) at 48.
Virtual routes are defined as endpoint-to-endpoint connections between subareas. For example, Virtual Route 1 (VR1) is shown at 50 connecting data host 20 to CCU 22 and connecting CCU 22 to CCU 26.
A session is created between a terminal or logical unit 52 and data host 20. The session requires data to flow from terminal 52 through control unit 31, CCU 26 and via virtual route 1 50 to data host 20. Virtual route 1 50 carries all session traffic between CCU 26 and data host 20. This is shown figuratively in FIG. 3 where a number of applications A, B, C, and so on, have sessions that are routed through a virtual route to terminals 1, 2, 3, . . . N. The virtual route thus acts as a pipeline for messages between the terminals and the applications.
The communications between a data host 20 and communication control unit 26 over a virtual route can be shown, in simplified form, in FIG. 4. Virtual Route 1 (VR1) carries the two-way communication of messages between the communication control unit and the data host. A data communication architecture requires that some control be exercised over the messages passing between the two devices. Errors in communication or interference introduced on the communication lines may corrupt messages or cause the loss of messages between units. Network integrity requires that the communication over the virtual route be monitored to ensure that no messages are lost.
The SNA architecture controls message integrity over a virtual route by having the sending node assign a sequence number to each message and by verifying at the recipient (data host 20 or communication control unit 26) that each message is received in sequence. The architecture also provides data pacing as a method for controlling data flow so that the recipient is not overwhelmed by a number of messages that it cannot process or store. Data pacing defines a variable window as the number of messages that will be sent before waiting for a response from the recipient. The message sender will first send a "Pacing Response Request" asking that the recipient respond when it can accept additional messages. When the recipient has the capacity to receive another window of messages, it generates a "Pacing Response" to the sender.
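The sender side of this pacing scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the SNA pacing protocol itself: the window size is held fixed rather than varied between a minimum and a maximum, sequence numbers never wrap, and the class name `PacedSender` and the dictionary message layout are hypothetical.

```python
class PacedSender:
    """Assigns sequence numbers and stops sending when the window is used up."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.remaining = window_size      # messages left in the current window
        self.next_window_granted = False  # set when a pacing response arrives
        self.next_seq = 1

    def send(self, payload):
        """Return the message to transmit, or None if the sender must wait."""
        if self.remaining == 0:
            if not self.next_window_granted:
                return None  # window exhausted and no pacing response yet
            self.remaining = self.window_size
            self.next_window_granted = False
        message = {
            "seq": self.next_seq,
            # The first message of each window carries the pacing request.
            "pacing_request": self.remaining == self.window_size,
            "payload": payload,
        }
        self.next_seq += 1
        self.remaining -= 1
        return message

    def on_pacing_response(self) -> None:
        """Record that the recipient can accept another window of messages."""
        self.next_window_granted = True
```

With a window of three, the sender in this sketch emits messages 1 through 3, returns `None` for a fourth until `on_pacing_response()` is called, and then resumes at sequence number 4 with a new pacing request.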
A series of messages between data host 20 and communication control unit 26 is shown in FIG. 5. CCU 26 generates the first message over virtual route 1 (VR1) with a sequence number of 1 and a Request for a Pacing Response (RPR). Additional messages with sequence numbers 2, 3, and so forth, are next sent by CCU 26. After a period of time, data host 20 responds to the pacing request with a pacing response (PRSP) and begins sending messages with consecutive sequence numbers as shown. The recipient monitors the sequence numbers as they are received and discards any message arriving out of sequence.
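The recipient's sequence check can be illustrated with a short sketch. The class name `SequenceChecker` and its fields are hypothetical, and the sketch assumes that the first expected sequence number is 1 and that sequence numbers do not wrap.

```python
class SequenceChecker:
    """Accepts messages only when they arrive with the expected sequence number."""

    def __init__(self, expected: int = 1):
        self.expected = expected  # next sequence number the recipient will accept
        self.accepted = []        # payloads received in sequence
        self.discarded = []       # sequence numbers that arrived out of order

    def receive(self, seq: int, payload) -> bool:
        if seq != self.expected:
            # Out-of-sequence message: discard it and keep waiting.
            self.discarded.append(seq)
            return False
        self.accepted.append(payload)
        self.expected += 1
        return True
```

The only state the recipient keeps in this sketch is the `expected` counter. A sender that does not know the value of that counter cannot produce a message the recipient will accept.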
Initial startup and configuration of a large communications network (e.g., 20,000 terminals) by the SSCP can take an hour or more. The communications setup initializes all transmission sequences, determines configuration information, and sends the necessary setup and test messages to ensure that all elements in the network are ready to respond. Once established, the network can operate indefinitely without reinitialization.
The failure of the data host 20 or other component can cause the entire communication network to cease operation. Because of the need to maintain the sequence of messages between the data host 20 and other components, communications with the failing unit can only be restarted if the sequence number information is known or if the entire communications network is reinitialized. Reinitialization of a large network is highly undesirable because of the considerable time required. This lost time can be costly to a business that is dependent upon transaction processing for its operations. Various schemes have been proposed for retaining sequence information so that the network can be restarted without reinitialization. However, data host failure may occur unpredictably and may not afford an opportunity to save the necessary sequencing information. In these situations, a network reinitialization is required. There is therefore a need to have a system or method for resynchronizing data communications without reinitializing the network.
The present invention addresses the technical problem of recovering synchronization information lost during a network component failure. It is also directed to the problem of resynchronizing message traffic between adjacent communication components following a component failure.