Businesses employing large-scale data processing systems must maintain hardware and software to ensure that critical business data is not lost in the event of a disaster. Disasters range from catastrophic events such as fire, flood, or terrorist attack to relatively minor occurrences such as local power outages. A commonly used method of ensuring that no critical business data is lost in the event of a disaster is to maintain separate high-reliability, disk-based data storage facilities at separate locations. Often the separate data storage facilities are located miles apart, so that a single disaster cannot compromise both facilities.
When multiple storage facilities are employed, the data must be synchronized between the facilities. One protocol for synchronizing the data between separate storage facilities is peer-to-peer remote copy (PPRC). PPRC is a hardware-based disaster recovery and workload migration solution that maintains a synchronous copy (always up to date with the primary copy) of data at the remote location. The backup copy of data can be used to recover quickly from a failure in the primary system without losing any transactions. Typically, a host computer such as an IBM® System/390® communicates with a first storage facility such as an IBM Enterprise Storage Server® (ESS). The first storage facility is typically designated as the primary storage facility. Communication between the host computer and the primary storage facility typically occurs over a dedicated data link such as an optical ESCON® (Enterprise System Connection Architecture®) link. A second data storage facility completes the fundamental PPRC-based data storage system. The second data storage facility is typically designated the secondary data storage facility and is connected to the primary data storage facility via a communication link similar to the one connecting the host computer to the primary. The PPRC protocol maintains, on the secondary, a synchronous copy of all data stored to the primary by the host computer. To achieve additional safety and reliability, multiple storage facilities can be cascaded in a manner similar to the primary and secondary arrangement.
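The synchronous write path described above can be sketched as follows. This is a minimal, hypothetical illustration of the PPRC write ordering (the primary acknowledges the host only after the secondary confirms its copy); the class and method names are illustrative and are not part of any IBM API.

```python
# Hypothetical sketch of a synchronous remote-copy (PPRC-style) write
# path. All names here are illustrative, not a real storage API.

class Volume:
    """A trivially simple in-memory stand-in for a storage volume."""
    def __init__(self):
        self.tracks = {}

    def write(self, track, data):
        self.tracks[track] = data

class SecondaryFacility:
    def __init__(self):
        self.volume = Volume()

    def remote_write(self, track, data):
        # The secondary commits the update, then acknowledges.
        self.volume.write(track, data)
        return "ack"

class PrimaryFacility:
    def __init__(self, secondary):
        self.volume = Volume()
        self.secondary = secondary

    def host_write(self, track, data):
        # Commit locally, then synchronously mirror to the peer.
        self.volume.write(track, data)
        response = self.secondary.remote_write(track, data)
        if response != "ack":
            raise IOError("remote copy failed; host write incomplete")
        # Only now is the host's write considered complete, which is
        # why the secondary is always an up-to-date copy.
        return "complete"

secondary = SecondaryFacility()
primary = PrimaryFacility(secondary)
status = primary.host_write(track=7, data=b"payroll-record")
assert status == "complete"
assert secondary.volume.tracks[7] == b"payroll-record"
```

Because the host's write does not complete until the secondary acknowledges, no acknowledged transaction can be lost if the primary subsequently fails.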
Storage facilities such as the IBM ESS are inherently reliable and self-healing; they are capable of detecting and correcting a range of both software and hardware errors. Various recovery processes are known in the art. The process used on the IBM ESS to perform a recovery is referred to as a “warmstart”. A warmstart is an accelerated method of accomplishing a system reboot: it typically does not involve every re-initialization step of a full reboot. Warmstarts are typically initiated either by simple debug commands or by a server upon itself when the server detects an internal error. In the case of the IBM ESS, a device-specific control function such as the IOCTL (warmstart) command is used to initiate the warmstart. In addition to performing a system recovery, a data storage facility will typically, prior to or upon execution of a warmstart command, save the state of the data storage facility and a continuous event-log buffer to disk. This information can later be reviewed by a system developer to facilitate root-cause problem analysis.
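The warmstart sequence described above (dump state and event log, then perform an accelerated re-initialization) can be sketched as follows. This is a hypothetical illustration only; the file layout, field names, and handler structure are assumptions, not the actual ESS implementation.

```python
# Hypothetical sketch of a warmstart handler: before the accelerated
# re-initialization, the facility persists its state and event-log
# buffer so a developer can later perform root-cause analysis.
# All names and file formats are illustrative.

import json
import os
import tempfile

class StorageFacility:
    def __init__(self, dump_dir):
        self.dump_dir = dump_dir
        self.event_log = []                     # continuous event-log buffer
        self.state = {"mode": "normal", "open_tracks": 0}

    def log(self, event):
        self.event_log.append(event)

    def warmstart(self, reason):
        # 1. Save state and event log to disk for root-cause analysis.
        dump_path = os.path.join(self.dump_dir, "warmstart_dump.json")
        with open(dump_path, "w") as f:
            json.dump({"reason": reason,
                       "state": self.state,
                       "event_log": self.event_log}, f)
        # 2. Accelerated reboot: re-initialize only volatile control
        #    structures, skipping the steps of a full power-on reboot.
        self.state = {"mode": "normal", "open_tracks": 0}
        self.event_log.clear()
        return dump_path

facility = StorageFacility(tempfile.mkdtemp())
facility.log("internal error detected on track 7")
dump = facility.warmstart("IOCTL (warmstart) received")
assert os.path.exists(dump)          # root-cause data saved to disk
assert facility.event_log == []      # volatile structures reset
```

The saved dump is what a developer would later inspect to determine the root cause, while the facility itself resumes service after the accelerated re-initialization.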
One of the problems historically experienced with a data storage system implemented with PPRC protocols is that one peer may occasionally send erroneous data to the other. The problem can arise either when the primary sends erroneous data to the secondary or, conversely, when the secondary sends an erroneous response back to the primary. When the primary sends erroneous data to the secondary, the secondary may detect an error with the data and commence a warmstart recovery process upon itself, along with storage of root-cause data. Unfortunately, the problem actually lies with the primary or the data link, so initiating a recovery process on the secondary does not address the problem, and no useful data is collected. In cases where the error is caused by a hardware or software problem associated with either a single peer storage system or the data link between the peers, and the problem is recognized by the other peer, there is no mechanism known in the art to invoke a warmstart and cause data collection on the peer causing the error. In summary, the problem may exist only on the primary, yet the secondary is the storage facility able to detect the error. Conversely, the primary may be the only peer able to detect an error on the secondary. For example, the primary may attempt to send an “update write” command to the secondary, but the format of the data track on the secondary differs from that on the primary, for instance in record length. In such a case, it would be highly desirable to invoke the warmstart process on both the primary and the secondary and to collect root-cause data from both storage facilities. Alternatively, the primary may receive an unexpected response from the secondary, for example an unexpected unit check. In such a case, it is desirable to have the primary force a warmstart with data collection upon the secondary.
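The cross-peer recovery described above as desirable can be sketched as follows. This is a hypothetical illustration, not a mechanism known in the art or an existing product feature: the detecting peer warmstarts itself and also forces a warmstart with data collection on the other peer, so that root-cause data is captured on both sides. All names are illustrative.

```python
# Hypothetical sketch of cross-peer recovery: when one peer detects an
# error that may originate on the other peer (or on the link between
# them), it invokes a warmstart with data collection on both peers.
# All names are illustrative.

class Peer:
    def __init__(self, name):
        self.name = name
        self.peer = None          # the other storage facility
        self.warmstarts = []      # reasons recorded at each warmstart

    def warmstart(self, reason):
        # Collect root-cause data, then perform accelerated recovery.
        self.warmstarts.append(reason)

    def handle_unexpected_response(self, response):
        # The fault may lie on either peer or on the data link, so
        # recover and collect data on both sides.
        reason = "unexpected response: " + response
        self.warmstart(reason)
        self.peer.warmstart("forced by " + self.name + ": " + reason)

primary = Peer("primary")
secondary = Peer("secondary")
primary.peer, secondary.peer = secondary, primary

# The primary receives, e.g., an unexpected unit check from the
# secondary and forces data collection on both facilities.
primary.handle_unexpected_response("unit check")
assert len(primary.warmstarts) == 1
assert len(secondary.warmstarts) == 1
```

With root-cause data collected on both peers, a developer can determine whether the fault lay with the primary, the secondary, or the link between them.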
In addition, the communication link between the primary and secondary storage facilities may itself be the cause of the data error. Therefore, it is desirable to use an out-of-band communication path to invoke the error recovery and data collection operations on the peers.
The present invention is directed to overcoming one or more of the problems discussed above.