1. Technical Field
This disclosure generally relates to multi-partition computer systems, and more specifically relates to a method and apparatus for transparent correctable error handling in a logically partitioned computer system.
2. Background Art
Computer systems typically include a combination of hardware and software. The combination of hardware and software on a particular computer system defines a computing environment. Different hardware platforms and different operating systems thus provide different computing environments. It was recognized that different computing environments can be provided on the same physical computer system by logically partitioning the computer system resources. Logical partitioning allows multiple operating systems and processes to share the hardware resources of a host computer. The eServer computer system developed by International Business Machines Corporation (IBM) is an example of a computer system that supports logical partitioning. For logical partitioning on an eServer computer system, a firmware partition manager called a “hypervisor” allows defining different computing environments on the same platform. The hypervisor manages the logical partitions to assure that they can share needed resources in the computer system while maintaining the separate computing environments defined by the logical partitions.
Processes on computer systems today are generally at the mercy of an uncorrectable memory error. When such an error occurs, the process or the entire partition itself must be terminated since a load instruction cannot be completed. Furthermore, the frequency of such errors appears to be exacerbated by newer, denser memory chips with smaller dies and faster clocks. Prior solutions to address this situation usually involve identifying a bad or affected area of memory by its high frequency of correctable errors and attempting to deactivate that area the next time the partition is powered off. This solution can leave a critical system operating with a potentially fatal error until it can be shut down for maintenance. Alternatively, the OS can try to dynamically free up the memory that is incurring the correctable errors, but the OS may not be able to free up that memory if it contains critical operating system processes or data. In either case, it is preferable to address the problem memory before the correctable error becomes an uncorrectable error and the process or partition must be terminated.
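The prior approach described above, in which a memory area is flagged by its frequency of correctable errors and deactivated only at the next power-off, can be illustrated with a minimal sketch. This is not taken from any particular system: the class name, the per-page granularity, the sliding time window, and the threshold of three errors are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class CorrectableErrorTracker:
    """Hypothetical per-page tracker; threshold and window are illustrative."""

    def __init__(self, threshold=3, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)   # page -> timestamps of recent errors
        self.pending_deactivation = set()   # pages to retire at next power-off

    def record_error(self, page, timestamp):
        """Record one correctable error; return True if the page is flagged."""
        events = self._events[page]
        events.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        if len(events) >= self.threshold:
            self.pending_deactivation.add(page)
        return page in self.pending_deactivation

    def power_off(self):
        """Simulate the maintenance window: return and clear flagged pages."""
        retired = sorted(self.pending_deactivation)
        for page in retired:
            self._events.pop(page, None)
        self.pending_deactivation.clear()
        return retired


tracker = CorrectableErrorTracker(threshold=3, window_seconds=60.0)
# Three errors within 60 seconds on one page: flagged on the third.
flags = [tracker.record_error(0x1000, t) for t in (0.0, 10.0, 20.0)]
# Three widely spaced errors on another page: never flagged.
sparse = [tracker.record_error(0x2000, t) for t in (0.0, 100.0, 200.0)]
```

In this sketch a flagged page remains in use until `power_off()` is called, which mirrors the exposure noted above: the system keeps running on memory already known to be degrading.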
Shutting down the computer system to prevent system failure from correctable and uncorrectable memory errors is a costly and inefficient solution. Without a way to transparently handle recurring correctable errors, it will continue to be necessary to shut down complex computer systems to deal with correctable memory errors before the memory errors become uncorrectable and cause the system to fail.