Field of the Invention
The present invention relates in general to multi-computer systems comprising a main computer and a backup computer, and relates in particular to multi-computer systems equipped with a common disk unit for use as an external memory device by both the main computer and the backup computer.
Generally, a multi-computer system is utilized when high reliability is required of personal computers, which are inferior to large-scale computers in this respect, for applications such as railroad management, industrial plant operations or control of utility power supply systems. The multi-computer system makes use of a backup or slave personal computer, preferably in a fault-tolerant configuration in which the backup computer takes over the processing performed by the main computer when a breakdown or malfunction has occurred in the main computer.
Some examples of multi-computer systems utilizing personal computers in this way are the Netware SFT III system by the Novell Company (USA) and the Standby Recovery Server by the Compaq Company (USA).
In the Netware SFT III system, the main computer and the backup computer are each connected to an expansion port, and the expansion ports are linked by an optical fiber network. The expansion ports of both computers work in a mutually coordinated manner over the optical fiber network to periodically copy the memory of the main computer into the backup or slave computer. Further, each computer monitors the other so that if the signals from the main computer are cut off, the backup or slave computer takes over the processing by utilizing the data copied from the main computer.
In a system such as the Netware SFT III system, the data contents of the disk unit of the main computer are copied into the disk unit of the backup or slave computer, so the main computer and the backup computer do not share a common disk unit.
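The hot standby behavior described above can be illustrated with a minimal sketch. It is not the Netware SFT III implementation: the class name, the `timeout` parameter and the use of each memory copy as the monitored signal are assumptions made for illustration, and only the backup's view of the main computer is modeled rather than the full mutual monitoring.

```python
from dataclasses import dataclass, field


@dataclass
class HotStandbyBackup:
    """Backup computer holding a periodically refreshed copy of the main computer's memory."""
    timeout: float                       # hypothetical: seconds of silence tolerated
    mirrored_memory: dict = field(default_factory=dict)
    last_signal_time: float = 0.0
    active: bool = False

    def receive_copy(self, memory: dict, now: float) -> None:
        # Assumption: each periodic memory copy also serves as the main computer's signal.
        self.mirrored_memory = dict(memory)
        self.last_signal_time = now

    def check_main(self, now: float) -> bool:
        # Take over once the main computer has been silent for longer than the timeout.
        if not self.active and now - self.last_signal_time > self.timeout:
            self.active = True
        return self.active


backup = HotStandbyBackup(timeout=2.0)
backup.receive_copy({"job": "step 1"}, now=0.0)
backup.receive_copy({"job": "step 2"}, now=1.0)
took_over_early = backup.check_main(now=2.5)   # 1.5 s of silence: keeps waiting
took_over_late = backup.check_main(now=3.5)    # 2.5 s of silence: takes over
```

Because the backup already holds the most recent copy of the memory, it can continue from that copy as soon as the signal is judged to be lost.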
The above-mentioned Standby Recovery Server utilizes a method generally referred to as "cold standby". This cold standby system differs from "hot standby" systems such as the Netware SFT III in that the backup computer cannot instantly take over the processing from the main computer.
In the Standby Recovery Server, the main computer performs the processing, while the OS (operating system) of the backup computer is in standby.
Data is carried over between the main computer and the backup or slave computer by means of a common disk unit utilized by both the main computer and the backup computer.
In the Standby Recovery Server, the backup computer is in a state equivalent to standby and is not started, and the common disk unit can be utilized by only one computer at a time, so the main computer is stopped when a breakdown occurs.
More specifically, the two computers monitor each other, and when the signal from the main computer is cut off, the backup (or slave) computer first switches the I/O (input/output) of the common disk unit from the main computer to the backup computer. Next, the backup computer loads the OS (operating system) and installs a file system into the common disk unit. Finally, the backup computer loads and runs the application program needed for the processing that had been performed by the main computer. In this way the Standby Recovery Server carries over the data by switching the input/output of the common disk unit from the main computer to the backup computer when a malfunction occurs in the main computer.
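The takeover sequence just described can be sketched as an ordered list of steps. This is an illustrative model only: the names `Step`, `SharedDisk` and `cold_standby_takeover` are hypothetical, and loading the OS, installing the file system and starting the application are represented as log entries rather than real operations.

```python
from enum import Enum, auto
from typing import List


class Step(Enum):
    SWITCH_DISK_IO = auto()
    LOAD_OS = auto()
    INSTALL_FILE_SYSTEM = auto()
    RUN_APPLICATION = auto()


class SharedDisk:
    """Common disk unit whose I/O is connected to one computer at a time."""

    def __init__(self, owner: str) -> None:
        self.owner = owner

    def switch_to(self, computer: str) -> None:
        self.owner = computer


def cold_standby_takeover(disk: SharedDisk, backup_name: str) -> List[Step]:
    """Perform the takeover steps in the order given in the text and return a log."""
    log: List[Step] = []
    # Step 1: switch the disk I/O from the main computer to the backup computer.
    disk.switch_to(backup_name)
    log.append(Step.SWITCH_DISK_IO)
    # Steps 2 to 4 are placeholders here; a real system would boot the OS,
    # install the file system and launch the application at these points.
    for step in (Step.LOAD_OS, Step.INSTALL_FILE_SYSTEM, Step.RUN_APPLICATION):
        log.append(step)
    return log


disk = SharedDisk(owner="main")
takeover_log = cold_standby_takeover(disk, "backup")
```

The steps are strictly sequential, so the application cannot resume until the OS has been loaded and the file system is in place.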
Technology for mutual monitoring in a multi-computer system is described in Japanese Patent Laid-Open No. Sho 58-214952. A technology is also known in which a plurality of service processors periodically send one another signals indicating correct operation and, when the main service processor no longer sends a signal indicating correct operation, another main service processor is selected from among the subordinate service processors.
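The selection of a new main service processor can be sketched as follows. The text does not state how the replacement is chosen, so the rule used here (the alphabetically first processor that is still signaling) is purely an assumption, as are the function name `select_main` and the `timeout` parameter.

```python
from typing import Dict, Optional


def select_main(current_main: str,
                last_signal: Dict[str, float],
                now: float,
                timeout: float) -> Optional[str]:
    """Return the processor that should act as main service processor.

    last_signal maps each service processor name to the time of its most
    recent signal indicating correct operation.
    """
    # A processor counts as operating correctly if it has signaled within the timeout.
    alive = sorted(name for name, t in last_signal.items() if now - t <= timeout)
    if current_main in alive:
        return current_main
    # Assumed selection rule: first name in sorted order among those still signaling.
    return alive[0] if alive else None


signals = {"sp0": 0.0, "sp1": 9.5, "sp2": 9.0}
new_main = select_main("sp0", signals, now=10.0, timeout=1.5)
```

In this example `sp0` last signaled 10 seconds ago, so it is treated as failed and one of the subordinate processors that is still signaling is selected in its place.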
Also, apart from fault-tolerant systems, a multi-processor system that performs processing with a plurality of processors arranged in parallel is known from Japanese Patent Laid-Open No. Hei-4-2483. In the technology proposed there, when a malfunction occurs in a processor, a processor selected beforehand is used as the control processor and monitors the status of each processor. A technique for resetting a processor when an error occurs is also described.
In the Netware SFT III system described above, a large-capacity transmission path between the expansion ports is necessary because the contents of the main computer must be copied onto the backup computer. Optical fiber is therefore used as the transmission path between the expansion ports of the main computer and the backup computer. However, the use of a large-capacity transmission path such as optical fiber greatly increases the cost of a multi-computer system.
In the Standby Recovery Server described above, on the other hand, a common disk unit is utilized to carry over data from the main computer to the backup computer, so there is no need for the large-capacity transmission path used in Netware SFT III.
However, for the backup computer to utilize the common disk unit when the main computer malfunctions in the Standby Recovery Server, a file system has to be installed in the backup computer. With a typical personal computer operating system such as Windows NT or Windows 95, the operating system must be restarted in order to install this file system.
Therefore, in the Standby Recovery Server, the backup computer is in standby with the OS (operating system) not yet loaded. When a breakdown or malfunction occurs in the main computer, the OS is loaded and the common disk unit is incorporated into the file system. With this kind of method, however, several minutes are required from the time of the malfunction until the backup computer can take over the processing tasks.
Further, computer malfunctions such as power supply errors, thermal runaway, fan malfunctions or parity errors differ in their circumstances and degree of severity, and in their likelihood of recovery. Accordingly, the design of a multi-computer system should take into account the extent of these malfunctions and provide suitable countermeasures.
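One way to picture malfunction-dependent countermeasures is a lookup from malfunction type to response. The sketch below is hypothetical throughout: the text names the malfunction types but not the countermeasures, so the `Countermeasure` values, the default and the example policy are illustrative assumptions and are not taken from any described system.

```python
from enum import Enum
from typing import Dict


class Malfunction(Enum):
    # Malfunction types named in the text.
    POWER_SUPPLY_ERROR = "power supply error"
    THERMAL_RUNAWAY = "heat (thermal) runaway"
    FAN_MALFUNCTION = "fan malfunction"
    PARITY_ERROR = "parity error"


class Countermeasure(Enum):
    # Hypothetical responses, chosen only for illustration.
    LOG_AND_CONTINUE = "log and continue"
    RESET_MAIN = "reset the main computer"
    SWITCH_TO_BACKUP = "switch processing to the backup computer"


def choose_countermeasure(
    malfunction: Malfunction,
    policy: Dict[Malfunction, Countermeasure],
    default: Countermeasure = Countermeasure.SWITCH_TO_BACKUP,
) -> Countermeasure:
    """Look up the response for a malfunction, falling back to the default."""
    return policy.get(malfunction, default)


# Example policy; the assignments are assumptions, not recommendations.
example_policy = {
    Malfunction.PARITY_ERROR: Countermeasure.RESET_MAIN,
    Malfunction.FAN_MALFUNCTION: Countermeasure.LOG_AND_CONTINUE,
}
chosen = choose_countermeasure(Malfunction.POWER_SUPPLY_ERROR, example_policy)
```

Keeping the policy as data separate from the lookup allows the response to each malfunction to be adjusted without changing the monitoring logic.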