1. Technical Field
This disclosure generally relates to multi-partition computer systems, and more specifically relates to transparent correctable error handling in a logically partitioned computer system.
2. Background Art
Computer systems typically include a combination of hardware and software. The combination of hardware and software on a particular computer system defines a computing environment. It has been recognized that different computing environments can be provided on the same physical computer system by logically partitioning the computer system resources into different computing environments. Logical partitioning allows multiple operating systems and processes to share the hardware resources of a host computer. The eServer computer system developed by International Business Machines Corporation (IBM) is an example of a computer system that supports logical partitioning. For logical partitioning on an eServer computer system, a firmware partition manager called a “hypervisor” allows defining different computing environments on the same platform. The hypervisor manages the logical partitions to assure that they can share needed resources in the computer system while maintaining the separate computing environments defined by the logical partitions.
Processes on computer systems today are generally at the mercy of an uncorrectable memory error. When such an error occurs, the process or the entire partition itself must be terminated since a load instruction cannot be completed. Furthermore, the frequency of such errors appears to be exacerbated by newer, denser memory chips with smaller dies and faster clocks. Prior solutions to address this situation usually involve identifying a bad or affected area of memory via a high frequency of correctable errors and attempting to deactivate the bad memory area the next time the partition is powered off. This solution can leave a critical system operating with a potentially fatal error until it can be shut down for maintenance. Alternatively, the operating system (OS) can try to dynamically free up the memory that is incurring the correctable errors, but the OS may not be able to free up the memory if it contains critical operating system processes or data. In any case, it is preferable to address the problem memory before the correctable error becomes an uncorrectable error and the process or partition must be terminated.
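By way of illustration only, the prior approach of flagging a memory area by its correctable error count and deferring deactivation until power-off might be sketched as follows. The class name, the region identifiers, and the fixed count threshold are all hypothetical and are not taken from any particular system; a real implementation would more likely track an error rate over time than a simple running count.

```python
from collections import Counter


class CorrectableErrorTracker:
    """Toy model: count correctable errors per memory region (hypothetical)."""

    def __init__(self, threshold):
        # Assumed policy: a region is suspect once it reaches this many errors.
        self.threshold = threshold
        self.counts = Counter()
        self.pending_deactivation = set()

    def record(self, region):
        # Called each time a correctable error is reported for a region.
        self.counts[region] += 1
        if self.counts[region] >= self.threshold:
            # The region stays in use; it is only marked for later removal.
            self.pending_deactivation.add(region)

    def power_off(self):
        # Deactivation happens only when the partition is powered off.
        deactivated = sorted(self.pending_deactivation)
        self.pending_deactivation.clear()
        return deactivated


tracker = CorrectableErrorTracker(threshold=3)
for _ in range(3):
    tracker.record("region_A")
tracker.record("region_B")
# region_A is now suspect but remains active until power_off() is called.
```

The sketch makes the weakness of this approach visible: between the moment a region is marked and the next power-off, the suspect memory continues to serve loads.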
In some systems, memory mirroring is used to overcome memory errors. Memory mirroring involves maintaining duplicate copies of memory contents in two different regions of memory. When an uncorrectable data error is detected in the first copy, the second copy is accessed, thus avoiding loss of data. A memory controller or equivalent device must be able to access the backup memory region when an error is detected in the first memory region. This type of access, in which a backup memory copy is retrieved in response to a detected error, is commonly referred to as a mirror failover read. See, for example, U.S. Pat. No. 7,328,315 to Hillier et al. While mirrored memory provides a more robust memory system, memory errors may still occur in the mirrored memory, or in both the main memory and the mirrored memory.
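As a minimal sketch of the mirror failover read described above, the following toy model keeps two copies of every stored value and falls back to the second copy when the first is marked bad. All names here are hypothetical, and uncorrectable errors are simulated with sets of bad addresses; this is not the mechanism of the cited patent or of any particular memory controller.

```python
class UncorrectableError(Exception):
    """Raised when a read hits data that neither copy can supply."""


class MirroredMemory:
    """Toy model of mirrored memory: every write goes to two regions."""

    def __init__(self, size):
        self.primary = [0] * size
        self.mirror = [0] * size
        # Addresses placed in these sets simulate uncorrectable errors.
        self.bad_primary = set()
        self.bad_mirror = set()
        self.failover_reads = 0

    def write(self, addr, value):
        # Keep both copies identical on every store.
        self.primary[addr] = value
        self.mirror[addr] = value

    def read(self, addr):
        if addr not in self.bad_primary:
            return self.primary[addr]
        # Mirror failover read: the primary copy is unusable, so use the backup.
        if addr not in self.bad_mirror:
            self.failover_reads += 1
            return self.mirror[addr]
        # Both copies are bad: mirroring alone cannot recover the data.
        raise UncorrectableError(f"address {addr} is bad in both regions")


mem = MirroredMemory(8)
mem.write(3, 42)
mem.bad_primary.add(3)   # simulate an uncorrectable error in the primary copy
value = mem.read(3)      # served from the mirror; value is 42
```

The final branch of `read` corresponds to the limitation noted above: when the same address fails in both regions, the mirrored copy offers no protection.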
Shutting down the computer system to prevent system failure from correctable and uncorrectable memory errors is a costly and inefficient solution. Without a way to transparently handle recurring correctable errors and uncorrectable errors, it will continue to be necessary to shut down complex computer systems to deal with correctable memory errors before the memory errors become uncorrectable and cause the system to fail.