The present disclosure relates generally to connection state recovery and, in particular, to a method, system, and computer program product for quick recovery of existing connections in a network of computer systems after one or more faults occur in a node or link of that network.
Existing protocols and implementations provide ways for channels to exchange data with control units in a Fibre Channel network and to recover from a wide range of local and system faults. However, the current architecture does not provide a way to recover from certain types of control unit faults without invoking higher levels of recovery to recreate the local, connection-specific operating state.
Channels and control units are implemented by specialized computer programs. These programs can hold large amounts of data that are used to maintain the proper flow of data between the programs and devices. State information that is kept local to the control unit is considered unreliable as a basis for restoration after a fault, since the fault may have corrupted that state information. Even if well-known memory error correction techniques are employed, the fault may have altered the contents of local memory in such a way that later references to it detect no memory error.
Current techniques for safely saving state information require a storage medium that is remote from the hardware and computer program that may encounter faults. These techniques require additional hardware and therefore additional cost. The remote nature of this storage medium also adds latency and computing time, both when state information is saved as it changes and when it is restored after a fault.
The channel is typically physically located with a host computer system, and the host system provides mechanisms that allow the channel to save and restore its operational data. Control units are traditionally stand-alone devices that are relatively stateless. Control units do not take actions on their own; rather, their operations are directed by commands sent from channels.
Currently, there is no general, architected mechanism to save and restore operational data in a control unit. This is not normally a problem, as control units operate in a relatively simple, deterministic manner. The essential data of most control units can be reconstructed by knowing which devices are attached to the control unit and how the control unit is attached to the data network. These pieces of information can be learned from the attached components.
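The deterministic case can be illustrated with a minimal Python sketch. It is not taken from any actual control unit implementation; the function name `rebuild_control_unit_state`, the two discovery callbacks, and the sample device and link names are hypothetical and serve only to show that such state is a pure function of what is currently attached.

```python
def rebuild_control_unit_state(discover_devices, discover_links):
    """Rebuild essential state purely from what is currently attached.

    Both arguments are hypothetical callbacks that query the attached
    components; nothing is read from (possibly corrupted) local memory.
    """
    return {
        "devices": sorted(discover_devices()),
        "links": sorted(discover_links()),
    }


# Hypothetical attached components, used only for illustration.
def attached_devices():
    return ["disk_02", "disk_01"]


def attached_links():
    return ["port_b", "port_a"]


state_before_fault = rebuild_control_unit_state(attached_devices, attached_links)
state_after_fault = rebuild_control_unit_state(attached_devices, attached_links)
```

Because the inputs are rediscovered rather than remembered, the state rebuilt after a fault matches the state held before it, and nothing needs to be saved in advance.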
Some control units, in particular channel-to-channel (CTC) control units, do not operate in a simple, reproducible way. As an example, CTC control units have specialized load-balancing facilities that place the workload in a way that maximizes overall system performance. These load-balancing decisions are made when a communications path between two CTC-capable channels is established. The load-balancing mechanism uses a snapshot of a subset of the system-wide resource information, so different load-balancing decisions may be made depending on details that vary over time. Since the exact conditions at the moment of the decision cannot be reproduced, the data that describe the results of the load-balancing decision must be preserved across local and system faults in order to preserve the ability of the particular connection to operate.
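The following Python sketch illustrates why such a decision cannot simply be recomputed after a fault. It is an assumption-laden illustration, not the CTC load-balancing algorithm: the least-loaded selection rule, the function `choose_resource`, and the resource names and load figures are all hypothetical.

```python
def choose_resource(load_snapshot):
    """Pick the least-loaded resource from a point-in-time snapshot.

    The least-loaded rule is a stand-in for whatever policy the real
    facility applies; ties are broken by resource name.
    """
    return min(sorted(load_snapshot), key=lambda name: load_snapshot[name])


# Snapshot taken when the communications path is established (hypothetical values).
snapshot_at_connect = {"resource_a": 10, "resource_b": 40}

# Snapshot taken after a fault; system-wide load has shifted in the meantime.
snapshot_after_fault = {"resource_a": 55, "resource_b": 40}

original_decision = choose_resource(snapshot_at_connect)
recomputed_decision = choose_resource(snapshot_after_fault)

# Only a preserved record of the original decision keeps the connection consistent.
preserved_state = {"connection_1": original_decision}
```

Here the recomputed decision selects a different resource than the original one, so the peer channel, which still expects the original placement, would see errors unless the preserved record is used instead of a fresh computation.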
The loss of the state information within one CTC control unit, due to a fault within the channel hardware or computer program, is detected by, and causes errors in, the other CTC channels that had been communicating with the affected CTC control unit. The other affected channels may be within the same physical computing system as the channel containing the CTC control unit or within a different one. Depending upon the state of the CTC channel at the time of the fault, the application software that is using the CTC connection(s) between the channel that had the fault and the other channels may not be able to recover and will cease using that connection.
What is needed, therefore, is a way to preserve operational data needed by integrated control units, such as a CTC control unit, thereby maintaining the load balance and preventing the loss of the communication path.