1. Technical Field
The present invention relates generally to the management of multiple operating system partitions in a logically partitioned data processing system, and more specifically to the handling of errors and other events.
2. Description of Related Art
Logical partitioning (LPAR) functionality within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform-allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented to the OS image by the platform's firmware.
Each distinct OS or image of an OS running within the platform is protected from the others, such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that no image can control any resources that have not been allocated to it. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
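The disjoint allocation described above can be pictured with a minimal sketch. It is illustrative only: the `Platform` class, the `AllocationError` exception, and the resource and partition names are hypothetical and do not describe any particular firmware interface.

```python
# Hypothetical model of disjoint resource allocation; all names are illustrative.

class AllocationError(Exception):
    """Raised when a resource is requested by a second partition."""


class Platform:
    def __init__(self, resources):
        # Map each allocable resource to its owning partition (None = free).
        self.owner = {r: None for r in resources}

    def allocate(self, partition, resource):
        # Refuse to hand a resource to a second partition, keeping sets disjoint.
        current = self.owner[resource]
        if current is not None and current != partition:
            raise AllocationError(f"{resource} already belongs to {current}")
        self.owner[resource] = partition

    def may_control(self, partition, resource):
        # An OS image may control only resources allocated to its partition.
        return self.owner.get(resource) == partition


platform = Platform(["cpu0", "cpu1", "mem0", "slot3"])
platform.allocate("LPAR1", "cpu0")
platform.allocate("LPAR2", "cpu1")
```

In this sketch a second request for "cpu0" by "LPAR2" raises `AllocationError`, which mirrors the requirement that no image can control a resource allocated to another.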
With respect to hardware resources in an LPAR system, these resources are disjointly shared among various partitions, themselves disjoint, each one seeming to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, memory DIMMs, non-volatile random access memory (NVRAM), and hard disk drives. Each partition within the LPAR system may be booted and shut down repeatedly without having to power-cycle the whole system.
In reality, some of the I/O devices that are disjointly shared among the partitions are themselves controlled by a common piece of hardware, such as a host Peripheral Component Interconnect (PCI) bridge, which may have many I/O adapters controlled by or located below the bridge. This bridge may be thought of as being shared by all of the partitions that are assigned to its slots. Hence, if the bridge becomes inoperable, it affects all of the partitions that share the devices below the bridge. Indeed, the problem may be so severe that the entire LPAR system will crash if any partition attempts to use the bridge further. The normal course of action is to terminate the running partitions that share the bridge, which keeps the system from crashing due to this failure.
What usually occurs is an I/O adapter failure that causes the bridge to assume a non-usable (error) state. At the time of occurrence, the I/O failure invokes a machine check interrupt (MCI) handler, which, in turn, will report the error and then terminate the appropriate partitions. This process is a “normal” solution that prevents the whole LPAR system from crashing due to this problem.
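The conventional behavior described above can be sketched as follows. This is a simplified illustration, not an actual handler: the function names, the bridge and slot identifiers, and the data structures are all hypothetical.

```python
# Hypothetical sketch of the conventional behavior; names are illustrative.

def partitions_sharing_bridge(slot_owner, bridge_slots, bridge):
    """Return the partitions that own at least one slot below the bridge."""
    return {slot_owner[s] for s in bridge_slots[bridge] if s in slot_owner}


def handle_machine_check(failed_slot, slot_owner, bridge_slots, running):
    """Report the error, then terminate every partition sharing the bridge."""
    # Find the bridge above the failed adapter slot.
    bridge = next(b for b, slots in bridge_slots.items() if failed_slot in slots)
    affected = partitions_sharing_bridge(slot_owner, bridge_slots, bridge)
    log = [f"error: {bridge} unusable after failure in {failed_slot}"]
    for partition in sorted(affected):
        if partition in running:
            running.discard(partition)  # stand-in for terminating the partition
            log.append(f"terminated {partition}")
    return log


bridge_slots = {"phb0": ["slot1", "slot2"], "phb1": ["slot3"]}
slot_owner = {"slot1": "LPAR1", "slot2": "LPAR2", "slot3": "LPAR3"}
running = {"LPAR1", "LPAR2", "LPAR3"}
log = handle_machine_check("slot1", slot_owner, bridge_slots, running)
```

Here a failure in "slot1" takes down both partitions with slots below the hypothetical bridge "phb0", while the partition on "phb1" keeps running.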
Depending on the particular operating system that is running in a given partition, however, some errors may be recoverable by the operating system and others not. If a particular operating system can recover from an error, the best course of action would be to notify the operating system of the error so that appropriate action can be taken. If an operating system does not have the capability to recover from the error, however, attempting to notify the operating system of the error will do no good; the operating system, not being able to interpret the error notification, will simply continue regular processing until a crash occurs.
What is needed, then, is a way to notify the operating systems that are capable of handling particular errors when those errors occur, and to terminate the operating systems that are not capable of handling them.
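The stated need can be pictured as a per-partition decision. The sketch below is only an illustration of that decision under assumed inputs, not the claimed implementation: the `dispatch_error` function, the `recoverable` capability table, and the error type name are hypothetical.

```python
# Hypothetical sketch; the capability table and names are assumptions.

def dispatch_error(error_type, partitions, recoverable):
    """Decide, per affected partition, whether to notify or terminate.

    recoverable maps a partition name to the set of error types its
    operating system is assumed to be able to recover from.
    """
    actions = {}
    for partition in partitions:
        if error_type in recoverable.get(partition, set()):
            actions[partition] = "notify"
        else:
            # Unknown or incapable operating systems are terminated.
            actions[partition] = "terminate"
    return actions


recoverable = {"LPAR1": {"io_bridge_error"}, "LPAR2": set()}
actions = dispatch_error(
    "io_bridge_error", ["LPAR1", "LPAR2", "LPAR3"], recoverable
)
```

A partition with no entry in the table is treated like one that cannot recover, which is the conservative choice given that an unnotifiable operating system would otherwise run until it crashes.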