1. Field of the Invention
The present invention relates to a technique by which a system controller, realized by system software that runs a plurality of programs using computer resources, detects a memory fault.
2. Background Art
As the performance and functionality of open servers equipped with x86 CPUs have improved, hypervisors, which provide server virtualization, have come into wide use as a way of making effective use of the CPU cores installed in a server. A hypervisor is system software that creates a plurality of virtual machines from computer resources such as the CPU, memory, and I/O devices installed in one physical server, and runs an OS and applications on each virtual machine.
As multicore CPUs have become widespread, the number of virtual machines created on one physical server has tended to increase, and so has the memory capacity installed in the physical server. To provide this larger capacity, the memory elements included in memory modules are being made smaller.
In general, as semiconductor feature sizes shrink, stored data becomes more liable to be garbled by a disturbance such as cosmic radiation or by a failure of the memory element. To prevent malfunctions caused by garbled data, a coding technique using an ECC (error correcting code), disclosed in U.S. Pat. No. 6,480,982 and elsewhere, is applied in the memory controller within the server CPU.
With the use of the ECC, if a correctable error such as a 1-bit error occurs in the read data, the error is corrected at the time of error detection, and the operation of a program can be continued. However, if an uncorrectable error (UE) such as a 2-bit error occurs in the read data, the operation of the program is disturbed.
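The difference between a correctable error and a UE can be illustrated with a small single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming code over 4 data bits purely for illustration; it is not the code of the cited patent, server memory controllers use much wider codewords, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of bits).

    Positions 1..7 form a Hamming(7,4) code; position 0 holds overall parity.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a 1-bit error located at `syndrome`
        # (syndrome 0 means the overall parity bit itself was flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: a 2-bit error, i.e. a UE.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_bit = list(word); one_bit[5] ^= 1
two_bit = list(word); two_bit[5] ^= 1; two_bit[6] ^= 1
print(decode(word), decode(one_bit), decode(two_bit))
```

In this sketch a single flipped bit is repaired and the data is returned, whereas two flipped bits can only be reported, which corresponds to the UE case described above.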
In the related-art server having the x86 CPU, if the read data has the UE, a fault interrupt instructing a forced outage is transmitted to all CPU cores within the system. For that reason, all OSs and applications go down.
In contrast, Intel Corporation (“Intel” is a registered trademark) has modified the CPU fault-processing specification, as disclosed in Intel, “Intel 64 and IA-32 Architecture Software Developer's Manual June 2009, Volume 3A: System Programming Guide, Part 1”, Chapter 15.6, Chapter 15.9.3. The modified specification adds a class of fault notification called “SRAR” (software recoverable action required), which restricts the range of the forced outage.
With SRAR, the information transmitted at the time of the fault interrupt differs from one CPU core to another. The CPU core that read the UE receives the memory address at which the UE is held, together with information indicating that the execution state (command address) of its program has been lost. The other CPU cores receive the same memory address, together with information indicating that the execution state of their programs is valid. System software that receives this information can therefore force only the program whose execution state was lost to stop, and can continue running the other programs.
However, SRAR also has a drawback. When the system software itself reads the UE, the system software itself goes down, and the other programs run by the system software go down with it.
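The per-core handling described above, and its limitation, can be sketched as follows. This is a simplified model rather than the actual machine-check interface: the `FaultReport` fields, the `handle_srar` function, and the program name `"hypervisor"` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

SYSTEM_SOFTWARE = "hypervisor"   # hypothetical name for the system software


@dataclass
class FaultReport:
    core_id: int
    ue_address: int
    context_valid: bool          # False only on the core that read the UE


def handle_srar(reports, running):
    """Decide which programs stop and which continue.

    `running` maps core_id -> name of the program executing on that core.
    Returns (stopped, continued) as sets of program names.
    """
    lost = {running[r.core_id] for r in reports if not r.context_valid}
    everything = set(running.values())
    if SYSTEM_SOFTWARE in lost:
        # The system software read the UE, so everything it runs goes down too.
        return everything, set()
    return lost, everything - lost


running = {0: "hypervisor", 1: "vm1", 2: "vm2"}
guest_hit = [FaultReport(0, 0x1000, True),
             FaultReport(1, 0x1000, False),
             FaultReport(2, 0x1000, True)]
print(handle_srar(guest_hit, running))
```

When the report with the lost execution state belongs to a guest program, only that program is stopped; when it belongs to the system software, the model stops every program, which is the drawback noted above.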
In order to suppress the occurrence of the UE, the memory controller conducts scrubbing. “Scrubbing” is a function that corrects a correctable error, such as a 1-bit error, at the time the memory is accessed. Even when data is repeatedly garbled by a disturbance such as cosmic radiation, the occurrence of the UE can be suppressed if each error is corrected while it is still a 1-bit error.
However, this method is effective for a storage area that is referred to frequently, but ineffective for a storage area that is referred to rarely. For example, the hypervisor has processing that is executed infrequently, such as starting virtual machines and live migration. A storage area used only by such infrequently executed processing is seldom referred to, and the UE is therefore relatively liable to occur there.
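The dependence of access-time scrubbing on reference frequency can be illustrated with a toy model. The sketch assumes that each disturbance flips a different bit of one memory word and that one flipped bit is correctable while two are not; the class `Word`, the function `simulate`, and the tick-based timing are hypothetical.

```python
class Word:
    """One memory word, tracked only by how many of its bits are flipped."""

    def __init__(self):
        self.flipped_bits = 0

    def disturb(self):
        self.flipped_bits += 1        # assumption: every hit flips a new bit

    def read(self):
        if self.flipped_bits == 0:
            return "ok"
        if self.flipped_bits == 1:
            self.flipped_bits = 0     # scrub: corrected data is written back
            return "corrected"
        return "UE"


def simulate(access_period, disturbance_times, horizon):
    """Read the word every `access_period` ticks and record each read result."""
    word = Word()
    results = []
    for t in range(1, horizon + 1):
        if t in disturbance_times:
            word.disturb()
        if t % access_period == 0:
            results.append(word.read())
    return results


print(simulate(access_period=2, disturbance_times={3, 7}, horizon=10))
print(simulate(access_period=10, disturbance_times={3, 7}, horizon=10))
```

With the same two disturbances, the frequently read word is repaired after each hit, while the word read only once has accumulated two flipped bits by the time it is accessed and the read returns a UE.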
To scrub such rarely referenced storage areas, a patrol scrubbing technique is known in which hardware such as the memory controller cyclically inspects the entire memory region regardless of program execution.
However, when patrol scrubbing is conducted, software processing and patrol scrubbing compete with each other for access to a storage area, which degrades the execution performance of the program. In practice, therefore, program execution is prioritized by making the patrol inspection cycle sufficiently long. As a result, even when patrol scrubbing is conducted, suppression or advance detection of the UE may fail, and the system software may read the UE.
As another method for avoiding the UE, memory mirroring, disclosed in U.S. Pat. No. 7,328,315 and elsewhere, is also widely known. Memory mirroring is a redundancy technique in which the same data is held in both a memory module of a main system and a memory module of a sub-system.
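The effect of the patrol cycle length can be illustrated with a second toy model. It is only a sketch under stated assumptions: the patrol visits one word every `ticks_per_word` ticks in address order, a word with one flipped bit is repaired when visited, and a word that reaches two flipped bits holds a UE. The function `run_patrol` and its parameters are hypothetical.

```python
def run_patrol(num_words, ticks_per_word, disturbances, horizon):
    """Return the set of word indices that end up holding a UE.

    `disturbances` is a list of (tick, word_index) pairs, each flipping one bit.
    A full patrol cycle takes num_words * ticks_per_word ticks.
    """
    flipped = [0] * num_words
    ue_words = set()
    cursor = 0
    for t in range(1, horizon + 1):
        for when, idx in disturbances:
            if when == t:
                flipped[idx] += 1
                if flipped[idx] >= 2:
                    ue_words.add(idx)     # second flip before any repair
        if t % ticks_per_word == 0:
            if flipped[cursor] == 1:
                flipped[cursor] = 0       # patrol repairs a 1-bit error
            cursor = (cursor + 1) % num_words
    return ue_words


hits = [(2, 1), (12, 1)]                  # two disturbances on word 1
print(run_patrol(4, 1, hits, 16))         # short cycle: 4 ticks
print(run_patrol(4, 8, hits, 16))         # long cycle: 32 ticks
```

In this model the short cycle repairs word 1 between the two disturbances, whereas the long cycle does not reach the word in time, so the second disturbance produces a UE.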
Normally, data is read from the memory module of the main system. However, if the UE is present in the read result, the data is automatically reread from the memory module of the sub-system. For that reason, as long as the data stored in either the memory module of the main system or the memory module of the sub-system is intact, a program such as the system software can be prevented from going down.
However, because memory mirroring halves the available memory capacity, this technique is not suitable for a configuration in which a large number of virtual machines are created on a single server.
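The read path of memory mirroring and its capacity cost can be sketched as follows. This is a minimal model rather than the method of the cited patent: the classes `Module` and `MirroredMemory`, the `bad` set used to mark addresses holding a UE, and the exception name are hypothetical.

```python
class UncorrectableError(Exception):
    """Raised when a read hits an address that holds a UE."""


class Module:
    def __init__(self, size):
        self.data = [0] * size
        self.bad = set()                  # addresses currently holding a UE

    def write(self, addr, value):
        self.data[addr] = value

    def read(self, addr):
        if addr in self.bad:
            raise UncorrectableError(addr)
        return self.data[addr]


class MirroredMemory:
    def __init__(self, size):
        self.main = Module(size)          # main system
        self.sub = Module(size)           # sub-system holding the same data
        self.usable_words = size
        self.installed_words = 2 * size   # half of the installed capacity is usable

    def write(self, addr, value):
        self.main.write(addr, value)
        self.sub.write(addr, value)

    def read(self, addr):
        try:
            return self.main.read(addr)
        except UncorrectableError:
            # Reread from the sub-system; this raises only if both copies are bad.
            return self.sub.read(addr)


mem = MirroredMemory(8)
mem.write(3, 42)
mem.main.bad.add(3)                       # the main copy develops a UE
print(mem.read(3), mem.usable_words, mem.installed_words)
```

The read succeeds as long as one copy is intact, but the model needs twice as many installed words as it exposes, which is the capacity penalty described above.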