A cluster of host computing systems attached to a global shared storage unit is called a System Complex, or Sysplex. With each host computing system running a single operating system, the Sysplex as a whole has a multisystem operating system. Each host computing system, consisting of processors and their channel I/O devices, is referred to as a Central Processor Complex (CPC).
One kind of sysplex system has been described in the referenced application assigned to the International Business Machines Corporation, U.S. Ser. No. 07/754,815 filed Sep. 4, 1991 by B. Glendening, entitled "Method and Apparatus for Timer Synchronization in a Logically Partitioned Data Processing System." A sysplex timer is used to provide synchronization among multiple hosts.
Closely-coupled systems with a global shared storage communicate and synchronize the operations of the total system through the shared memory. A highly available closely-coupled system requires that all critical system components be designed to tolerate faults. The global shared data is a critical system component on which all the connected CPCs depend; it is therefore imperative that the global shared storage controller and its data be highly available.
The shared storage controller (SSC) is used not only as a repository of customers' data for critical database applications, but also to provide the control information used to manage multiple systems. Loss of the shared data in an SSC is a disaster that a customer can ill afford. With this shared controller design, the primary and backup SSCs can be physically separated so that a failure of one SSC is unlikely to propagate to the other controller.
The prior art approach to a highly available shared storage controller used either a fault-tolerant design with multiple processors operating in lock-step synchronism or a dual-copy system controlled by a CPC. The lock-step approach requires unduly complex hardware and costly development to maintain synchronization and to compare or vote on results. Processors with extensive built-in error-detection circuits do not need to depend on a lock-step design to detect faults: faults occurring during normal operation are detected quickly by the error-detecting hardware.
Many previous dual-copy designs provide a duplicate copy of data by writing the data to both a primary device and a secondary device, a straightforward way of maintaining two copies of data in two devices. One kind of dual copy function has been described in the IBM TDB by J. T. Robinson in the article called "Method for Scheduling Writes in a Duplexed DASD Subsystem", TDB vol. 29 Oct. 5, 1986, pp 2102-2107. Another dual copy function has been described by B. H. Berger in IBM docket TU986013 titled "Maintaining duplex paired devices by means of a dual copy function". The current invention not only provides a dual copy of shared data but also has the SSC maintain coherency of the shared data for all CPCs.
To achieve the dual copy function for shared data, the current invention employs parallel execution of message commands in duplexed shared storage controllers (SSCs), with synchronization performed between the primary and secondary controllers instead of by the originating CPC. Command execution is sequenced and synchronized at the SSCs using timestamp values from a tightly synchronized time-of-day (TOD) clock, which is transmitted to all the CPCs and SSCs.
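The timestamp-based sequencing described above can be illustrated with a minimal sketch. It is not the implementation of the invention: the names `Command`, `DuplexedController`, and `run_interval`, the use of a CPC identifier as a tie-breaker, and the comparison of executed command lists as the synchronization check are all assumptions made for illustration. The point shown is only that two controllers receiving the same commands in different arrival orders reach the same state when each executes them in timestamp order.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Command:
    timestamp: int                      # TOD value stamped on the message command
    cpc_id: int                         # assumed tie-breaker for equal timestamps
    key: str = field(compare=False)     # hypothetical shared-data name
    value: str = field(compare=False)   # hypothetical data to store


class DuplexedController:
    """Hypothetical model of one SSC of a duplexed pair."""

    def __init__(self, name):
        self.name = name
        self.pending = []               # min-heap ordered by (timestamp, cpc_id)
        self.storage = {}

    def receive(self, cmd):
        heapq.heappush(self.pending, cmd)

    def execute_until(self, tod_limit):
        """Execute, in timestamp order, every queued command stamped at or before tod_limit."""
        executed = []
        while self.pending and self.pending[0].timestamp <= tod_limit:
            cmd = heapq.heappop(self.pending)
            self.storage[cmd.key] = cmd.value
            executed.append((cmd.timestamp, cmd.cpc_id))
        return executed


def run_interval(primary, secondary, tod_limit):
    """Run one synchronization interval; return False if the controllers are out of sync."""
    return primary.execute_until(tod_limit) == secondary.execute_until(tod_limit)
```

In this sketch the originating CPC takes no part in the comparison; the check is made between the two controllers, in line with the text above.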
One of the major roadblocks in designing a synchronized duplexed controller or processor is the difficulty of determining which processor is faulty after a failure. An "out of sync" condition or a timeout of an operation in either controller is usually not sufficient to identify the faulty processor, especially for complex mainframe processors. There are obscure error conditions that may cause the loosely synchronized operation to lose synchronization, and they cannot be easily detected by the processor itself. This invention has the novel feature of using an integrated support processor (SP) in each CPC and SSC to monitor and diagnose abnormal conditions of processor operation in each synchronization interval. That monitoring information significantly enhances the SSC's diagnosis of the faulty processor when an "out of sync" condition is detected during a message operation.
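As a rough illustration of how per-interval monitoring information could narrow down the faulty unit, the sketch below keeps a log of abnormal conditions for each synchronization interval and consults both logs when an "out of sync" condition is reported. The text does not specify which conditions the SP records or how they are weighed, so the `SupportProcessor` class, the `diagnose` function, and the decision rule (blame the unit whose SP alone logged an anomaly in that interval) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SupportProcessor:
    """Hypothetical SP model: logs abnormal conditions per synchronization interval."""
    unit: str
    log: dict = field(default_factory=dict)   # interval number -> list of anomaly notes

    def record(self, interval, anomaly):
        self.log.setdefault(interval, []).append(anomaly)

    def anomalies(self, interval):
        return self.log.get(interval, [])


def diagnose(interval, primary_sp, secondary_sp):
    """Return the unit suspected of the fault, or None if the logs are inconclusive.

    Assumed rule: a unit is suspected only if its SP logged an anomaly in the
    interval and the other unit's SP did not.
    """
    primary_bad = bool(primary_sp.anomalies(interval))
    secondary_bad = bool(secondary_sp.anomalies(interval))
    if primary_bad and not secondary_bad:
        return primary_sp.unit
    if secondary_bad and not primary_bad:
        return secondary_sp.unit
    return None
```

When neither SP or both SPs report an anomaly, the sketch returns `None` rather than guessing; further diagnosis would be needed in that case.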
It is also an object of the present invention to repair a failing page of shared storage in a controller. A corrupted storage page is repaired by copying good data from the same virtual address in the other controller. The recovery action is transparent to the programs of both controllers and of the connected CPCs.
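The page repair can be sketched as follows. This is an illustrative model only: the `PagedController` class and the `repair_page` function are hypothetical, and the use of a CRC-32 checksum to detect a corrupted page is an assumption, since the text does not state how corruption is detected. The sketch shows the one step the text does describe, namely copying the good page from the same virtual address in the other controller.

```python
import zlib


class PagedController:
    """Hypothetical model of one SSC's shared storage, addressed by virtual page address."""

    def __init__(self):
        self.pages = {}        # virtual address -> bytearray holding the page
        self.checksums = {}    # virtual address -> CRC-32 recorded at write time (assumed)

    def write_page(self, vaddr, data):
        self.pages[vaddr] = bytearray(data)
        self.checksums[vaddr] = zlib.crc32(data)

    def is_corrupt(self, vaddr):
        return zlib.crc32(bytes(self.pages[vaddr])) != self.checksums[vaddr]


def repair_page(failing, healthy, vaddr):
    """Copy the page at vaddr from the healthy controller into the failing one.

    Returns False without copying if the other controller's copy is also bad.
    """
    if healthy.is_corrupt(vaddr):
        return False
    failing.write_page(vaddr, bytes(healthy.pages[vaddr]))
    return True
```

Because the repair touches only the controllers' own page stores, nothing in this sketch involves the connected CPCs, which matches the stated goal of a transparent recovery.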