The present invention relates to an error recovery technique in a virtual computer system, and in particular to a technique that is effective when applied to error recovery in a cache memory.
The virtual computer technique is a conventional technique that makes it possible for OSs (operating systems) that previously ran separately on a plurality of physical computers, together with the software programs running on those OSs, to run on a single physical computer.
In the virtual computer technique, for example, a virtual computer control program called a hypervisor logically divides one physical computer into a plurality of logical partitions. The virtual computer control program assigns computer resources (a CPU (central processing unit), a main storage, and an I/O (input/output) device) to each of the logical partitions obtained by the division. An OS (guest OS) operates on each logical partition under the control of the virtual computer control program.
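The division into logical partitions described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the names `LogicalPartition` and `divide` are hypothetical, and the sketch assumes that each CPU, memory range, and I/O device is dedicated to exactly one partition, whereas an actual virtual computer control program may also share resources among partitions.

```python
from dataclasses import dataclass


@dataclass
class LogicalPartition:
    """Hypothetical record of the resources assigned to one guest OS."""
    name: str
    cpus: list            # physical CPU numbers dedicated to this partition
    memory_range: tuple   # (start, end) of main storage in MiB, end exclusive
    io_devices: list      # names of the I/O devices dedicated to this partition


def divide(total_cpus, total_memory_mib, io_devices, requests):
    """Assign disjoint resources to each requested partition.

    requests is a list of (name, cpu_count, memory_mib, device_names).
    Raises ValueError if the physical computer cannot satisfy a request.
    """
    partitions = []
    next_cpu = 0
    next_mem = 0
    free_io = set(io_devices)
    for name, cpu_count, memory_mib, device_names in requests:
        if next_cpu + cpu_count > total_cpus:
            raise ValueError("not enough CPUs for " + name)
        if next_mem + memory_mib > total_memory_mib:
            raise ValueError("not enough main storage for " + name)
        if not set(device_names) <= free_io:
            raise ValueError("I/O device unavailable for " + name)
        partitions.append(LogicalPartition(
            name=name,
            cpus=list(range(next_cpu, next_cpu + cpu_count)),
            memory_range=(next_mem, next_mem + memory_mib),
            io_devices=list(device_names),
        ))
        next_cpu += cpu_count
        next_mem += memory_mib
        free_io -= set(device_names)
    return partitions


if __name__ == "__main__":
    for lpar in divide(4, 8192, ["nic0", "disk0"],
                       [("lpar1", 2, 4096, ["nic0"]),
                        ("lpar2", 2, 4096, ["disk0"])]):
        print(lpar)
```

Assigning resources in disjoint slices keeps the sketch short; it is not meant to describe how any particular hypervisor schedules CPUs or maps memory.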
This virtual computer technique has heretofore been used in large-sized computers such as general purpose computers (mainframes). Owing to the performance improvement of microprocessors in recent years, however, it has begun to be applied to low-end PC servers as well. Using such low-end PC servers as mission-critical servers for enterprise business or the like is advantageous in reducing cost, and the demand for doing so is great.
On the other hand, against the background of the internationalization of enterprise business and the globalization of computer networks, as represented by the Internet, the need for long-time continuous operation (24 hours a day, 365 days a year) of computer systems is growing. As a matter of course, this need also applies to a virtual computer system that uses a low-end PC server.
In conventional computer systems, the main storage has been the principal large-capacity memory, and the probability of main storage errors rises in proportion to its capacity. In recent years, however, as the capacity of the cache memory used to speed up CPU access to main storage data has grown, the probability of cache memory errors has also tended to rise.
As the capacity of the cache memory becomes large, data is more likely to stay in the cache memory for a long time, and cases where the latest data exists only in the cache memory also increase. For implementing long-time continuous operation in the virtual computer system, therefore, a technique for keeping the system running not only when a main storage error occurs but also when a cache memory error occurs becomes very important.
Various techniques have heretofore been proposed for error recovery in memory. For example, according to a technique disclosed in JP-A-6-52049 (Patent Document 1), data accessed from the start to the end of the processing being executed in the processor is managed in the cache memory as being in an intermediate state. Before a block is rewritten, its contents are written back to the main memory; when a block already in the intermediate state is rewritten, only the copy stored in the cache memory is rewritten; and when the processing being executed is suspended, only the rewritten blocks in the cache memory are invalidated. The contents of the main memory are recovered in this way.
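One way to read the behavior attributed to Patent Document 1 is sketched below in Python. This is an interpretation for illustration only, not the implementation disclosed in that document: the class `RollbackCache` and its methods are hypothetical, blocks are modeled as single values keyed by address, and the end of processing is modeled by an assumed `commit` call.

```python
class RollbackCache:
    """Toy write-back cache that can undo the writes of a suspended task."""

    def __init__(self, main_memory):
        self.main = main_memory      # address -> value (main memory)
        self.lines = {}              # address -> cached value
        self.dirty = set()           # cached blocks newer than main memory
        self.intermediate = set()    # blocks rewritten by the running task

    def read(self, addr):
        # Fill the cache from main memory on a miss.
        if addr not in self.lines:
            self.lines[addr] = self.main[addr]
        return self.lines[addr]

    def write(self, addr, value):
        if addr not in self.intermediate:
            # First rewrite in this task: if the cache holds the only
            # up-to-date copy, write it back so main memory keeps the
            # contents from before the rewrite.
            if addr in self.dirty:
                self.main[addr] = self.lines[addr]
            self.intermediate.add(addr)
        # Blocks already in the intermediate state are rewritten in the
        # cache only.
        self.lines[addr] = value
        self.dirty.add(addr)

    def commit(self):
        # Processing finished normally: the blocks stay dirty in the cache
        # but leave the intermediate state.
        self.intermediate.clear()

    def abort(self):
        # Processing suspended: invalidate only the rewritten blocks, so
        # later reads see the pre-task contents kept in main memory.
        for addr in self.intermediate:
            self.lines.pop(addr, None)
            self.dirty.discard(addr)
        self.intermediate.clear()


if __name__ == "__main__":
    cache = RollbackCache({0: 10, 1: 20})
    cache.write(0, 11)
    cache.commit()        # value 11 now exists only in the cache
    cache.write(0, 12)    # 11 is written back before the rewrite
    cache.write(1, 21)
    cache.abort()
    print(cache.read(0), cache.read(1))
```

The point of the sketch is that main memory always holds the contents from before the current processing started, so invalidating the rewritten blocks is enough to recover.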
As another memory error recovery technique, an apparatus has been proposed that periodically conducts an error check on all data stored in the memory, independently of memory accesses by the processor. In other words, a memory scrubbing method is used in which all data in a RAM (Random Access Memory) chip is error-checked sequentially and periodically, apart from memory accesses from the processor. A technique relating to the memory scrubbing method is described in, for example, JP-A-8-194648 (Patent Document 2).
If this error check finds an error in data, the data codes at all addresses of the line on which the error has occurred are read from the RAM chip one by one and subjected to an ECC (Error Correcting Code) check. If an error can be corrected, it is corrected. Related techniques are disclosed in, for example, JP-A-1-112599 (Patent Document 3) and JP-A-63-269233 (Patent Document 4).
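The scrubbing pass with ECC correction can be illustrated by the following minimal Python sketch. It rests on assumptions that are not taken from the cited documents: the memory is modeled as a list of 7-bit words, and a Hamming(7,4) code stands in for the ECC, so only single-bit errors per word are corrected (a double-bit error would be miscorrected by this simple code). The function names `encode`, `decode`, `syndrome`, and `scrub` are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based bit positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based bit positions holding parity bits


def syndrome(codeword):
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming(7,4) codeword."""
    codeword = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            codeword |= 1 << (pos - 1)
    s = syndrome(codeword)
    # Set the parity bits so that the syndrome of the full codeword is 0.
    for pos in PARITY_POSITIONS:
        if s & pos:
            codeword |= 1 << (pos - 1)
    return codeword


def decode(codeword):
    """Extract the 4 data bits from a codeword (no correction here)."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (codeword >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble


def scrub(memory):
    """Check every word in sequence, correct single-bit errors in place,
    and return the list of addresses that were corrected."""
    corrected = []
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            # For a single-bit error the syndrome is the failing position.
            memory[addr] = word ^ (1 << (s - 1))
            corrected.append(addr)
    return corrected


if __name__ == "__main__":
    memory = [encode(n) for n in range(16)]
    memory[5] ^= 1 << 2              # inject a single-bit error
    print(scrub(memory))             # addresses repaired by the pass
    print([decode(w) for w in memory] == list(range(16)))
```

In a real system such a pass would run periodically and independently of processor accesses; here it is invoked once by hand to keep the example self-contained.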