1. Technical Field
The present invention relates generally to the field of computer systems and, more specifically, to a system, method, and computer program product for preventing machine crashes due to hard errors in logically partitioned systems.
2. Description of Related Art
A logical partitioning option (LPAR) within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to run simultaneously on a single data processing system hardware platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's hardware resources. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented to the OS image by the partition's own open firmware device tree.
Each distinct OS or image of an OS running within the platform is protected from the others such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
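By way of illustration only, the disjoint-allocation property described above may be sketched as a check that no allocable resource is assigned to more than one partition. The partition names, the resource labels, and the function `overlapping_resources` are hypothetical and are not taken from any actual platform firmware.

```python
from itertools import combinations


def overlapping_resources(partitions):
    """Return {(name_a, name_b): shared} for each pair of partitions that share a resource.

    `partitions` maps a partition name to the set of resources allocated to it.
    An empty result means the allocations are disjoint.
    """
    conflicts = {}
    for name_a, name_b in combinations(sorted(partitions), 2):
        shared = set(partitions[name_a]) & set(partitions[name_b])
        if shared:
            conflicts[(name_a, name_b)] = shared
    return conflicts


# Hypothetical allocation: processors, memory regions, and I/O adapter slots.
partitions = {
    "lpar1": {"cpu0", "cpu1", "mem:0-4G", "slot3"},
    "lpar2": {"cpu2", "mem:4G-8G", "slot5"},
}

if __name__ == "__main__":
    print(overlapping_resources(partitions))
```

In this sketch an empty dictionary corresponds to a valid configuration, in which each OS image directly manages only its own resources.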
Hard errors sometimes occur in processors, however. Hard errors, or fatal errors, are those errors which cause the processor to crash. In logically partitioned systems that include multiple processors, a hard error may occur in only one processor, causing the crash of only one partition while the remaining processors, and thus their partitions, continue to operate. One type of hard error is an error that occurs in the address translation logic within the processor. For example, some processors include as part of their address translation logic a translation lookaside buffer (TLB), and may also include a data effective-to-real address translation (D-ERAT) buffer. An error may occur in either, or both, of these buffers.
In some logically partitioned systems, a hard error occurring in the address translation logic of a single processor can result in a crash of the entire logically partitioned system. When such an error is detected, a request is made to deconfigure the processor within which the error occurred. After this request is made, a request is made to reboot that partition. If the deconfiguration of the processor is not complete when the partition is rebooted, the entire machine will crash. Because the code path length for reboot code is significantly shorter than the path length for the deconfiguration code, the reboot code will be executed prior to the deconfiguration being completed.
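The timing problem described above may be sketched, purely as an illustration, by modeling each code path as a number of steps. The step counts and the function `reboot_outcome` are hypothetical. The sketch assumes that both requests are issued at essentially the same time and that each path advances at the same rate.

```python
def reboot_outcome(reboot_steps, deconfig_steps):
    """Model the race between the reboot path and the deconfiguration path.

    Both requests are assumed to start at step 0. If the reboot path finishes
    while deconfiguration is still in progress, the whole machine crashes.
    """
    reboot_done_at = reboot_steps
    deconfig_done_at = deconfig_steps
    if reboot_done_at < deconfig_done_at:
        return "machine crash"
    return "partition rebooted"


if __name__ == "__main__":
    # Hypothetical path lengths: the reboot path is much shorter.
    print(reboot_outcome(reboot_steps=3, deconfig_steps=10))
```

Because the reboot path is the shorter of the two, the first branch is the one taken in the situation described above.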
A processor is “deconfigured” when all execution streams have been removed from the processor and it has been successfully marked as unusable in subsequent reboots. Marking a processor as “bad” indicates that the processor can no longer be used.
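The two conditions in this definition may be expressed, as a minimal sketch only, by a predicate over a processor record. The `Processor` type and its field names are hypothetical and do not correspond to any actual firmware data structure.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    active_streams: int    # execution streams still running on the processor
    marked_unusable: bool  # True once the processor is marked "bad" for later reboots


def is_deconfigured(cpu):
    """A processor is deconfigured only when both conditions hold."""
    return cpu.active_streams == 0 and cpu.marked_unusable
```

A processor whose execution streams have been removed but which has not yet been marked as unusable is therefore not yet deconfigured in this sketch.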
Therefore, a need exists for a method, system, and product whereby machine crashes due to hard errors in logically partitioned systems are prevented.