The present invention relates to a hot stand-by method, and to a computer system for implementing the hot stand-by method, which are preferably adapted to improve reliability against software faults.
The hot stand-by method is a method by which a spare device is kept in a hot state as a precaution against the occurrence of a fault in a computer system. This method is generally employed to enhance the fault tolerance of a computer system.
As a representative prior art, a well-known hot stand-by method is disclosed in "Architecture of Fault-Tolerant Computers", IEEE COMPUTER, Vol. 17, No. 8 (August 1984), pp. 9-17 (first prior art). The first prior art includes two systems, each having a central processing unit, a main storage unit, and an I/O processor; one of the systems is prepared for normal use and the other for stand-by use. The first prior art operates such that the normal system performs the requested processing and the stand-by system takes over the processing if a fault takes place in the hardware or software of the normal system. In the first prior art, the two systems include corresponding hardware, each with a program pre-loaded on it. Hence, whether a fault takes place in the hardware or in the software, the normal system is switched to the stand-by system by the same routine without having to load the program. The resources required for the taken-over processing, such as data, are however stored in a single secondary storage unit, such as a magnetic disk unit, which is shared by the normal and stand-by systems by connecting an I/O channel from the secondary storage unit to each of the systems.
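The switching behavior of the first prior art can be illustrated by the following minimal sketch. It is not taken from the cited article; the class names `System` and `HotStandbyPair`, the `failed` flag, and the use of a dictionary to stand in for the shared secondary storage unit are all assumptions made for illustration only.

```python
class System:
    """One complete system: CPU, main storage unit and I/O processor (hypothetical model)."""

    def __init__(self, name, shared_disk):
        self.name = name
        self.program_loaded = True      # the program is pre-loaded on both systems
        self.shared_disk = shared_disk  # single secondary storage unit reached via an I/O channel
        self.failed = False             # set to True to simulate a hardware or software fault

    def process(self, key, value):
        if self.failed:
            raise RuntimeError(f"{self.name} system has failed")
        self.shared_disk[key] = value   # results go to the shared secondary storage unit
        return value


class HotStandbyPair:
    """Normal system plus stand-by system sharing one secondary storage unit."""

    def __init__(self):
        self.shared_disk = {}
        self.normal = System("normal", self.shared_disk)
        self.standby = System("stand-by", self.shared_disk)
        self.active = self.normal

    def request(self, key, value):
        try:
            return self.active.process(key, value)
        except RuntimeError:
            # The same switching routine is used for hardware and software faults;
            # no program load is needed because the stand-by system is already hot.
            self.active = self.standby
            return self.active.process(key, value)


pair = HotStandbyPair()
pair.request("a", 1)
pair.normal.failed = True   # simulated fault in the normal system
pair.request("b", 2)        # transparently served by the stand-by system
```

In this sketch the stand-by system can continue only because the data lives on the shared secondary storage unit; nothing held in the main storage unit of the failed system is visible to it.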
Another representative prior art is disclosed in JP-A-57-111900 (second prior art). The second prior art provides two main storage units, each pre-loaded with a program, one for normal use and one for stand-by use. In the normal condition, data from a central processing unit and an I/O channel is written to both main storage units. If a fault takes place in the hardware of the normal main storage unit, the normal main storage unit is separated from the central processing unit, and processing continues by accessing only the stand-by main storage unit. Secondary storage units such as a magnetic disk unit are shared by both main storage units, being connected to the main storage units through the I/O channels.
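The duplicated writing of the second prior art can be sketched as follows. This is an illustrative model only, not the disclosed circuit; the class name `MirroredMainStorage`, its methods, and the use of byte arrays for the two main storage units are assumptions.

```python
class MirroredMainStorage:
    """Normal and stand-by main storage units written in parallel (hypothetical model)."""

    def __init__(self, size):
        self.normal = bytearray(size)
        self.standby = bytearray(size)
        self.normal_attached = True

    def write(self, addr, value):
        # In the normal condition every write goes to both main storage units.
        if self.normal_attached:
            self.normal[addr] = value
        self.standby[addr] = value

    def read(self, addr):
        source = self.normal if self.normal_attached else self.standby
        return source[addr]

    def detach_normal(self):
        # Called when a hardware fault is detected in the normal main storage unit.
        self.normal_attached = False


memory = MirroredMainStorage(4)
memory.write(0, 7)
memory.detach_normal()            # simulated hardware fault in the normal unit
value_after_fault = memory.read(0)
memory.write(1, 9)                # now reaches the stand-by unit only
```

Because the stand-by unit already holds every value written so far, processing can continue immediately after the normal unit is separated.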
The first prior art, which provides two systems, is advantageously capable of overcoming faults occurring in both hardware and software and of operating the stand-by system without waiting for a program to be newly loaded. However, the first prior art is disadvantageous in terms of cost, since it duplicates the hardware. The duplicated hardware is of no benefit against faults occurring in the software, which account for almost all of the faults that actually occur. Further, in this prior art, the stand-by system cannot directly access the main storage unit included in the normal system. For fault analysis, this prior art has to take the steps of loading a program in a dump-dedicated system included in the normal system, dumping the content of the main storage unit of the normal system into the secondary storage unit, and then either loading a program in the normal system again and analyzing the dumped data, or starting up the stand-by system and analyzing the dumped data. If the system includes a large-capacity main storage unit, a considerably long time is needed for dumping the data, and therefore for the fault analysis.
On the other hand, the second prior art, which provides two main storage units, is less costly than the first prior art, which provides two complete systems, but cannot properly overcome software faults. That is, the second prior art is designed so that data is written to both the normal and the stand-by main storage units. Therefore, if the software contains a bug that destroys data on the main storage unit (that is, a software fault), the data or program stored in the stand-by main storage unit may be destroyed as well, making it practically impossible to take over the processing using the stand-by main storage unit. Further, the second prior art discloses nothing about fault analysis.
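Why duplicated writing does not protect against a software fault can be seen in the following minimal sketch. The memory contents, the function names `mirrored_write` and `buggy_clear`, and the particular bug (clearing too large a range) are hypothetical and chosen only to illustrate the point.

```python
def mirrored_write(normal, standby, addr, value):
    """Write one byte to both main storage units, as in the normal condition."""
    normal[addr] = value
    standby[addr] = value


# Both units start with identical contents: a 4-byte "program" followed by 4 bytes of data.
normal = bytearray(b"PROGDATA")
standby = bytearray(normal)


def buggy_clear(start, length):
    """Hypothetical bug: meant to clear only the data area, but called with the wrong range."""
    for addr in range(start, start + length):
        mirrored_write(normal, standby, addr, 0)


buggy_clear(0, 8)  # should have been buggy_clear(4, 4); the program area is overwritten too

# The erroneous writes were duplicated faithfully, so the stand-by copy is no better.
standby_intact = standby == bytearray(b"PROGDATA")
```

Here `standby_intact` is false: the stand-by main storage unit holds the same destroyed contents as the normal one, so switching to it does not allow the processing to be taken over.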