As the performance and functions of open servers have expanded, the comparatively inexpensive, high-performance Xeon (registered trademark, same hereafter) server, which contains the Xeon CPU made by Intel Corporation (registered trademark, same hereafter), has become mainstream in corporate IT systems. The Xeon CPU contains multiple CPU cores that boost the processing performance of the server as a whole, and by 2010 each CPU package is expected to include 8 cores.
Virtual server technology is widely used to operate the CPU cores within the Xeon server efficiently. In this technology, multiple virtual server environments (virtual machines, VMs) are generated on a single actual (hardware) Xeon server, and the OS and applications run in these VMs. In recent years, it has become common for users to operate ten to several dozen VMs on a standard Xeon server.
However, as more and more VMs are operated on a single actual hardware server, the risk of VM operation stopping due to a server component failure increases sharply. For example, data in the memory is encoded with ECC (Error Correcting Code), but if a UE (Uncorrectable Error) such as a 2-bit error occurs, the Xeon server of the related art treats it as a fatal error and the operation of all VMs on that server must be stopped.
In contrast, in the Xeon CPU (Nehalem-EX) scheduled for market shipment in 2010, failure management was redesigned at the architecture level (See for example, Intel® 64 and IA-32 Architectures Software Developer's Manual 3A Chapter 15.6 Recovery of Uncorrected Recoverable (UCR) Errors), and a mechanism based on Poisoning was added to trace error data and perform error correction. Here “Poisoning” is a function that generates error data (poison) assigned a specified flag or syndrome pattern (decoding symbol error pattern) when the hardware detects a UE, and that performs failure management at the point in time when the software reads the poison. If the poison has been eliminated by overwriting, the software can no longer read the poison, so no failure management is performed.
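The Poisoning behavior described above can be illustrated with a minimal software model. This is only a sketch under stated assumptions: the names Memory, PoisonConsumed, and mark_uncorrectable are hypothetical and chosen for illustration, and the model does not reproduce the actual hardware flag or syndrome pattern.

```python
class PoisonConsumed(Exception):
    """Raised when software reads a location that still holds poison."""


class Memory:
    """Toy memory model (hypothetical); each address holds one value."""

    def __init__(self, size):
        self.data = [0] * size
        self.poisoned = set()  # addresses currently marked as poison

    def mark_uncorrectable(self, addr):
        # Hardware detected a UE: mark the data as poison instead of
        # reporting a failure immediately.
        self.poisoned.add(addr)

    def write(self, addr, value):
        # Overwriting replaces the bad data, so the poison is eliminated.
        self.data[addr] = value
        self.poisoned.discard(addr)

    def read(self, addr):
        # Failure management is triggered only when poison is consumed.
        if addr in self.poisoned:
            raise PoisonConsumed(addr)
        return self.data[addr]
```

In this model, reading a poisoned address raises PoisonConsumed, while writing to that address first clears the poison so that a later read succeeds and no failure management is needed.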
To carry out UE detection and failure management, the Nehalem-EX contains a core section that executes instructions and an uncore section that exchanges data between the memory and the I/O. The core and uncore sections handle different UE levels within the Nehalem-EX.
(1) The core section detects UEs relating to memory readout caused by executing instructions. In this case the core section conveys a fatal error message to the software and system operation stops, causing all software on the applicable server to stop.
(2) The uncore section detects UEs relating to a scrubbing process that periodically reads out and rewrites the memory, and to write-back of data from the cache to the memory. In this case, after generating the poison, the uncore section conveys a recoverable error message to the software.
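The two cases above can be summarized as a small decision function. This is a hedged sketch, not the processor's actual logic: Detector, Severity, and classify_ue are hypothetical names introduced only to restate the core/uncore distinction.

```python
from enum import Enum


class Detector(Enum):
    CORE = "core"      # UE on a memory read caused by instruction execution
    UNCORE = "uncore"  # UE found by scrubbing or by cache write-back


class Severity(Enum):
    FATAL = "fatal"              # system operation stops
    RECOVERABLE = "recoverable"  # poison generated, software is notified


def classify_ue(detector):
    """Return (severity, poison_generated) for a UE, per cases (1) and (2)."""
    if detector is Detector.CORE:
        # Case (1): a fatal error is reported and all software stops.
        return Severity.FATAL, False
    # Case (2): poison is generated, then a recoverable error is reported.
    return Severity.RECOVERABLE, True
```

For example, classify_ue(Detector.UNCORE) returns a recoverable severity together with the indication that poison was generated.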
Therefore, when using the Nehalem-EX, all UEs detected by the core usually end in failure management processing such as system stoppage.
However, if the uncore detects a UE, the uncore conveys position information on the failed component to the software so that overall server system operation can continue. For example, the hypervisor controls the VMs when notified of a recoverable error, and at IDF 2009 (Intel Developer Forum) an application was announced that stops only the VM containing the failed component (See for example, Building IT Server Solutions on Intel Microarchitecture (Nehalem-EX)-based Platforms Featuring Windows Server 2008 R2 and Hyper-V. Intel Developer Forum 2009). This technology also applies to the OS, and an application was also announced in which the OS stops just the application using the failed memory under the same conditions.
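The VM-level containment described above can be sketched as follows, assuming the hypervisor keeps a table of which memory range is assigned to each VM. The function names, the table layout, and the example addresses are all hypothetical; the sketch does not represent the Hyper-V implementation, and it leaves open what happens when the failed address belongs to no VM.

```python
def vm_owning(address, vm_ranges):
    """Return the name of the VM whose memory range contains address."""
    for name, (start, end) in vm_ranges.items():
        if start <= address < end:  # range is [start, end)
            return name
    return None


def handle_recoverable_error(address, vm_ranges, running):
    """Return the set of VMs still running after a recoverable error.

    Only the VM that owns the failed address is stopped; the others
    continue. If no VM owns the address, the set is returned unchanged
    (an assumption of this sketch).
    """
    owner = vm_owning(address, vm_ranges)
    if owner is None:
        return set(running)
    return set(running) - {owner}


if __name__ == "__main__":
    ranges = {"vm0": (0x0000, 0x4000), "vm1": (0x4000, 0x8000)}
    print(handle_recoverable_error(0x4100, ranges, {"vm0", "vm1"}))
```

With the example table, a recoverable error reported at address 0x4100 stops vm1 only, and vm0 keeps running.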